Output of parse-csv() #1018

michaelhkay · 2024-02-11T07:01:28Z

I propose making some simplifications to the output of parse-csv() to make it more amenable to processing.

Represent each row as a map, rather than as a structure with a data field and an accessor function. Note that implementations worried about memory usage can devise a custom map implementation optimised for the case where many maps have the same regular structure (cf. the recent thread about JavaScript "shapes").
The key for a field in this map should be an integer if (i) column-names is set to false, or (ii) the column in question does not have a unique header name; in other cases it should be the name from the header.
Replace the top-level columns record with a simple array of field names. It's easy enough to map names to positions using index-of.
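
To make the keying rule concrete, here is a minimal sketch in plain XQuery 3.1. The helper local:row-to-map is hypothetical and exists only for illustration; it is not a proposed function, and it assumes the header and the row have already been split into strings.

```xquery
(: Hypothetical helper, for illustration only: builds one row map
   following the keying rule described above. :)
declare function local:row-to-map(
  $header as xs:string*,   (: empty sequence when column-names is false :)
  $fields as xs:string*
) as map(*) {
  map:merge(
    for $field at $pos in $fields
    let $name := $header[$pos]
    let $key :=
      (: use the header name only if it occurs exactly once;
         a missing name yields a count of 0 and so falls back to position :)
      if (count($header[. eq $name]) eq 1)
      then $name
      else $pos
    return map:entry($key, $field)
  )
};

(: "city" occurs twice, so both of those columns get integer keys :)
local:row-to-map(("name", "city", "city"), ("Bob", "Berlin", "Bonn"))
```

Under these assumptions the call returns a map equivalent to map { "name": "Bob", 2: "Berlin", 3: "Bonn" }. Passing an empty sequence as the header gives purely positional keys.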

I also propose changing the name to csv-to-maps for consistency with csv-to-table and csv-to-arrays.

We should advocate use of csv-to-arrays where data is to be accessed positionally, and csv-to-maps where it is to be accessed by column names, and optimise the design accordingly.

Looking at a use case: the first example (§15.4.7.1) would be unnecessary if, as proposed, we change csv-to-xml to generate XHTML directly. But if it were needed, it would change from

let $csv := fn:parse-csv(`name,city{$crlf}Bob,Berlin`)
return <table>
   <thead>{
      for $column in $csv?columns?fields
         return <th>{ $column }</th>
   }</thead>
   <tbody>{
      for $row in $csv?rows return <tr>
         { for $field in $row?fields return <td>{ $field }</td> }
      </tr>
   }</tbody>
</table>

to

let $csv := fn:parse-csv(`name,city{$crlf}Bob,Berlin`)
return <table>
   <thead>{
      for $column in $csv?columns
         return <th>{ $column }</th>
   }</thead>
   <tbody>{
      for $row in $csv?rows return <tr>
         { for $column in $csv?columns return <td>{ $row?$column }</td> }
      </tr>
   }</tbody>
</table>

michaelhkay · 2024-02-11T07:47:01Z

Apart from being simpler, other benefits include:

(a) It is possible to compare results using deep-equal (not possible before because of the function items).
(b) It is possible to serialize array{$result?rows} nicely as JSON.
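
Both points can be tried in XQuery 3.1 today. The sketch below uses hand-written row maps as a stand-in for the proposed output; the sample data is invented for illustration.

```xquery
(: Stand-in for the proposed result: one plain map per row, no function items. :)
let $rows := (
  map { "name": "Bob",   "city": "Berlin" },
  map { "name": "Alice", "city": "Paris" }
)
(: The same rows written with the entries in a different order. :)
let $same := (
  map { "city": "Berlin", "name": "Bob" },
  map { "city": "Paris",  "name": "Alice" }
)
return (
  (: (a) maps of atomic values can be compared with deep-equal :)
  deep-equal($rows, $same),
  (: (b) an array of such maps serializes directly as JSON :)
  serialize(array { $rows }, map { "method": "json" })
)
```

The first item is true, since entry order within a map is not significant. The second is a JSON array of two objects; the order of keys within each object is up to the processor.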

ChristianGruen · 2024-02-11T09:47:02Z

> Represent each row as a map, rather than as a structure with a data field and an accessor function.
> […]

I like this. If I remember correctly, one incentive for the given structure was the prospect of converting CSV bidirectionally, with $input => parse-csv() => serialize-csv() giving us the original contents. We abandoned this idea at some point along the way. Dropping it certainly makes life easier, and we should choose a result format that’s easier to process.

> The key for a field in this map should be an integer if (i) column-names is set to false, or (ii) the column in question does not have a unique header name; in other cases it should be the name from the header.

Input with duplicate header names is quite common. I would suggest appending the data to the existing entry.

Here’s approximate input that I was just confronted with. With the proposed solution, …

"key,value,value,value,value
id2342,5,343,1,78
id1342,800"
=> csv:parse(map { 'format': 'xquery' })

…would result in…

map { 'key': 'id2342', 'value': ('5', '343', '1', '78') },
map { 'key': 'id1342', 'value': ('800') }

We could also offer a duplicates option, i.e. the one from map:merge or a modified version.
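
For reference, the combining behaviour can be reproduced in XQuery 3.1 with map:merge and its "duplicates" option set to "combine". This is only a sketch: the header and the row are assumed to be already split into strings.

```xquery
let $header := ("key", "value", "value", "value", "value")
let $fields := ("id2342", "5", "343", "1", "78")
return map:merge(
  (
    (: one single-entry map per field, keyed by its header name :)
    for $field at $pos in $fields
    return map:entry($header[$pos], $field)
  ),
  (: values that share a key are concatenated in input order :)
  map { "duplicates": "combine" }
)
```

This yields a map equivalent to map { "key": "id2342", "value": ("5", "343", "1", "78") }. The shorter second row would simply give a single-item "value".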

> Replace the top-level columns record with a simple array of field names. It's easy enough to map names to positions using index-of.

A plain sequence would make it even easier.
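
A small XQuery 3.1 comparison of the two shapes, with invented column names: the array has to be unwrapped with ?* before index-of can be applied, while the sequence can be passed as is.

```xquery
let $as-array    := ["name", "city"]
let $as-sequence := ("name", "city")
return (
  index-of($as-array?*, "city"),   (: array members unwrapped first :)
  index-of($as-sequence, "city")   (: sequence used directly :)
)
```

Both expressions return 2.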

> I also propose changing the name to csv-to-maps for consistency with csv-to-table and csv-to-arrays.

fn:parse-csv would be my favorite (see #748). On the one hand, we should try to be consistent with the existing function set, which offers fn:parse-xml, fn:parse-json and fn:parse-html. Next…

> We should advocate use of csv-to-arrays where data is to be accessed positionally, and csv-to-maps where it is to be accessed by column names, and optimise the design accordingly.

…I believe that we should offer a function that seems like a good default solution for most users. As the name of the other function, csv-to-arrays, implies, it gives you lower-level access to the input in case it’s needed (e.g. if you have nested structures, which cannot be addressed easily with regular column names).

Finally, we could use this default for fn:csv-doc.

michaelhkay · 2024-02-29T10:11:47Z

Closed in favour of #1052.

ChristianGruen added the Editorial (minor typos, wording clarifications, example fixes, etc.) and XQFO (an issue related to Functions and Operators) labels Feb 11, 2024

michaelhkay closed this as completed Feb 29, 2024
