Output of parse-csv() #1018

Closed
michaelhkay opened this issue Feb 11, 2024 · 3 comments
Labels
Editorial Minor typos, wording clarifications, example fixes, etc. XQFO An issue related to Functions and Operators

Comments

@michaelhkay
Contributor

I propose making some simplifications to the output of parse-csv() to make it more amenable to processing.

  1. Represent each row as a map, rather than as a structure with a data field and an accessor function. Note that implementations worried about memory usage can devise a custom map implementation optimised for the case where many maps have the same regular structure (cf. the recent thread about JavaScript "shapes").
  2. The key for a field in this map should be an integer if (i) column-names is set to false, or (ii) the column in question does not have a unique header name; in other cases it should be the name from the header.
  3. Replace the top-level columns record with a simple array of field names. It's easy enough to map names to positions using index-of.
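
As a rough illustration of the three points above, the proposed result can be mocked up by hand with ordinary XQuery 3.1 maps. This is only a sketch: the values are built manually rather than returned by parse-csv(), and it assumes that an integer key is the 1-based column position, which the proposal does not state explicitly.

```xquery
(: Hand-built stand-in for the proposed result; not produced by parse-csv(). :)
let $csv := map {
  'columns': ['name', 'city'],                        (: point 3: plain array of field names :)
  'rows': map { 'name': 'Bob', 'city': 'Berlin' }     (: points 1 and 2: row as a map keyed by header name :)
}
(: Assumption: without usable header names the keys are 1-based column positions. :)
let $positional-row := map { 1: 'Bob', 2: 'Berlin' }
return (
  index-of($csv?columns?*, 'city'),   (: position of the 'city' column :)
  $csv?rows[1]?city,                  (: access by column name :)
  $positional-row?2                   (: access by integer key :)
)
```

Note that index-of operates on a sequence, so the array of column names is unpacked with `?*` first.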

I also propose changing the name to csv-to-maps for consistency with csv-to-table and csv-to-arrays.

We should advocate use of csv-to-arrays where data is to be accessed positionally, and csv-to-maps where it is to be accessed by column names, and optimise the design accordingly.

Looking at a use case, the first example (§15.4.7.1) would be unnecessary if, as proposed, we change csv-to-xml to generate XHTML directly. But if it were needed, it would change from

let $csv := fn:parse-csv(`name,city{$crlf}Bob,Berlin`)
return <table>
   <thead>{
      for $column in $csv?columns?fields
         return <th>{ $column }</th>
   }</thead>
   <tbody>{
      for $row in $csv?rows return <tr>
         { for $field in $row?fields return <td>{ $field }</td> }
      </tr>
   }</tbody>
</table>

to

let $csv := fn:parse-csv(`name,city{$crlf}Bob,Berlin`)
return <table>
   <thead>{
      for $column in $csv?columns
         return <th>{ $column }</th>
   }</thead>
   <tbody>{
      for $row in $csv?rows return <tr>
         { for $column in $csv?columns return <td>{ $row?$column }</td> }
      </tr>
   }</tbody>
</table>

@michaelhkay
Contributor Author

Apart from being simpler, other benefits include:

(a) it is possible to compare results using deep-equal (not possible before because of the function items)
(b) It is possible to serialize array{$result?rows} nicely as JSON.
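
Both benefits can be tried out with a hand-built row map in XQuery 3.1. The sketch below assumes rows are plain maps of strings, as proposed; `$rows` is made up for illustration and does not come from parse-csv().

```xquery
(: Hand-built stand-in for $result?rows under the proposed design. :)
let $rows := map { 'name': 'Bob', 'city': 'Berlin' }
return (
  (: (a) maps compare by content, independent of key order :)
  deep-equal($rows, map { 'city': 'Berlin', 'name': 'Bob' }),
  (: (b) wrap the rows in an array and serialize it as JSON :)
  serialize(array { $rows }, map { 'method': 'json' })
)
```

The order of keys in the serialized JSON object is left to the implementation, so the exact output string may differ between processors.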

@ChristianGruen ChristianGruen added Editorial Minor typos, wording clarifications, example fixes, etc. XQFO An issue related to Functions and Operators labels Feb 11, 2024
@ChristianGruen
Contributor

  1. Represent each row as a map, rather than as a structure with a data field and an accessor function.
    […]

I like this. If I remember correctly, one incentive for the given structure was the prospect of converting CSV bidirectionally, with $input => parse-csv() => serialize-csv() giving us the original contents. We abandoned this idea at some point along the way. It certainly makes life easier, and we should choose a result format that’s easier to process.

  2. The key for a field in this map should be an integer if (i) column-names is set to false, or (ii) the column in question does not have a unique header name; in other cases it should be the name from the header.

Input with duplicate header names is quite common. I would suggest appending the data to the existing entry.

Here’s roughly the input that I was just confronted with. With the proposed solution, …

"key,value,value,value,value
id2342,5,343,1,78
id1342,800"
=> csv:parse(map { 'format': 'xquery' })

…would result in…

map { 'key': 'id2342', 'value': ('5', '343', '1', '78') },
map { 'key': 'id1342', 'value': ('800') }

We could also offer a duplicate option, i.e. the one from map:merge or a modified version.
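
To make the suggestion concrete, here is a minimal XQuery 3.1 sketch of how one row could be assembled with map:merge and its existing duplicates option. The variables `$header` and `$fields` are hypothetical names standing in for one parsed header line and one parsed data line; this is not how csv:parse or parse-csv() is actually implemented.

```xquery
declare namespace map = "http://www.w3.org/2005/xpath-functions/map";

(: Hypothetical, already-split header and data row from the sample input. :)
let $header := ('key', 'value', 'value', 'value', 'value')
let $fields := ('id2342', '5', '343', '1', '78')
return map:merge(
  (: one single-entry map per column :)
  for $name at $pos in $header
  return map:entry($name, $fields[$pos]),
  (: 'combine' concatenates the values of duplicate keys in order :)
  map { 'duplicates': 'combine' }
)
```

For the shorter second row, the missing fields are empty sequences, so combining them leaves 'value' with the single item '800'.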

  3. Replace the top-level columns record with a simple array of field names. It's easy enough to map names to positions using index-of.

A plain sequence would make it even easier.

I also propose changing the name to csv-to-maps for consistency with csv-to-table and csv-to-arrays.

fn:parse-csv would be my favorite (see #748). On the one hand, we should try to be consistent with the existing function set, which offers fn:parse-xml, fn:parse-json and fn:parse-html. Next…

We should advocate use of csv-to-arrays where data is to be accessed positionally, and csv-to-maps where it is to be accessed by column names, and optimise the design accordingly.

…I believe that we should offer a function that seems like a good default solution for most users. As the name of the other function, csv-to-arrays, implies, it gives you lower-level access to the input in case it’s needed (e.g. if you have nested structures, which cannot be addressed easily with regular column names).

Finally, we could use this default for fn:csv-doc.

@michaelhkay
Contributor Author

Closed in favour of #1052.
