
413: Spec for CSV-related functions #719

Merged: 7 commits, Nov 14, 2023

Conversation

@fidothe (Contributor) commented Sep 21, 2023

This PR contains error fixes (typos, examples that contradicted the spec text), some (hopefully) improved language, and one breaking change.

The current draft uses the type map(xs:integer, xs:string) for the column-names option to fn:csv-to-xdm and fn:csv-to-xml. This PR flips that to map(xs:string, xs:integer). It turns out the examples were already using the flipped form, and it seems counterproductive to me for the names entry in the csv-columns-record record type to be the transposed version of the column-names option that creates it, rather than the same thing.

I can think of some cases (a CSV split into several chunks, with only the first chunk containing the headers) where being able to feed the names entry right back into another invocation of fn:csv-to-xdm would be useful. If nothing else, the current arrangement is confusing and not obvious; otherwise I wouldn't have messed up the examples, and somebody would have noticed during the review process...
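The round-tripping argument can be sketched with a toy Python model (this is an illustration of the discussed semantics, not the spec functions; names and behaviour here are assumptions):

```python
import csv
import io


def csv_to_rows(text, column_names=None):
    """Toy model of parsing a CSV chunk into records.

    column_names maps column name -> 1-based position, i.e. the
    map(xs:string, xs:integer) shape this PR proposes. If it is given,
    the chunk is treated as headerless; otherwise the first row
    supplies the names, and the same map shape is returned, so it can
    be fed straight back in for later chunks.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if column_names is None:
        header, rows = rows[0], rows[1:]
        column_names = {name: i + 1 for i, name in enumerate(header)}
    records = [
        {name: row[pos - 1] for name, pos in column_names.items()}
        for row in rows
    ]
    return column_names, records


# First chunk carries the header line...
names, first = csv_to_rows("id,city\n1,Munich\n2,Paris\n")
# ...and `names` can be reused verbatim for header-less chunks.
_, second = csv_to_rows("3,Oslo\n", column_names=names)
```

With the transposed map(xs:integer, xs:string) type, the `names` entry would need flipping before it could be reused, which is the friction the PR removes.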

@fidothe (Contributor, Author) commented Sep 27, 2023

@ChristianGruen just flagging this for your attention - there's a trivial-but-significant change in the column-names option for fn:csv-to-xdm and fn:csv-to-xml. The map type has its key/value types transposed now. (The examples already had that by mistake, see my first comment)

@ChristianGruen (Contributor):

> @ChristianGruen just flagging this for your attention - there's a trivial-but-significant change in the column-names option for fn:csv-to-xdm and fn:csv-to-xml. The map type has its key/value types transposed now. (The examples already had that by mistake, see my first comment)

Thank you for pointing this out.

@ChristianGruen (Contributor):

@fidothe I had another look at the current range of parsing functions in the spec (#748), and I’m feeling a bit uneasy that each function has its own semantics. In our implementation, we tried to offer a consistent approach for different input formats, and I believe it would be great if we could pursue something similar in the standard function set.

Maybe we have a chance to get fn:parse-csv further simplified, such that a single function call suffices to get a representation that can be further processed. One major requirement would probably be to add the column-names option to the function. Ideally (I believe/hope), we could thus reduce the CSV functions to 1 or 2. For JSON, two parsing functions exist (fn:parse-json, fn:json-to-xml), but I was frequently asked in the past why this decision has been taken, and I couldn’t give a convincing answer.

If fn:parse-csv can be simplified, we could eventually add an fn:csv-doc function.

@fidothe (Contributor, Author) commented Oct 16, 2023

@ChristianGruen, I completely agree that we should standardise the parsing functions as much as possible; I'm very happy to change the input parameters so that they share the same names and values.

There are issues with what the functions do and why they're named the way they are; I think the fn:parse-json / fn:json-to-xml confusion is a good example of the problem. The parse- prefix perhaps makes a function sound more universal than it really is. Effectively, for JSON there are json-to-xdm and json-to-xml functions, but we published the standard with parse-json, so we're stuck with that name. Here, I think a slightly different problem is being masked by the same function naming convention problem as with the JSON-related functions.

CSV is a much wilder format, and much more poorly defined, than JSON. fn:csv-to-xdm and fn:csv-to-xml both assume that you're dealing with the most common, and easiest to handle, variations of the format, and make that (I hope) very easy.

For anyone dealing with wilder variations (nested hierarchical data and the multiple-data-sets-separated-by-blank-columns examples I've dealt with before spring to mind) the output from those functions is not great, and will never be great without making the functions so complex that they're difficult for everyone to use, and a nightmare to maintain. That's what fn:parse-csv is for.

There is a bare minimum of parsing that is useful in all cases (handling the separators and quoting) and that is a pain to implement from scratch yourself. That bare minimum is what fn:parse-csv produces, and csv-to-xdm and csv-to-xml effectively piggyback on it. Building custom code to handle wilder data from that simple data structure is much more straightforward than working with the output from csv-to-xdm; XSLT post-processing on the output from csv-to-xml will only work for XSLT engines, and only for a subset of the weirder data variations.
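That bare minimum (rows of fields, with delimiters and quoting handled, and nothing else) can be modelled in a few lines of Python; this is a toy illustration of what the discussion describes, not the spec function:

```python
import csv
import io


def parse_csv_minimal(text, field_delimiter=",", quote_char='"'):
    """Split CSV text into a list of rows, each a list of field strings.

    Only delimiters and quoting are interpreted: no headers, no records,
    no type conversion. This is the 'pain to implement from scratch'
    layer everything else can build on.
    """
    reader = csv.reader(
        io.StringIO(text), delimiter=field_delimiter, quotechar=quote_char
    )
    return list(reader)


rows = parse_csv_minimal('a,"b,with,commas",c\nd,e,f\n')
```

Anything richer (named columns, row records, XML output) is then a transformation of this simple structure.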

Whether that should be called fn:parse-csv is a different question. If we're promoting convention following in the spec (which I can get behind), then perhaps what is fn:csv-to-xdm should be called fn:parse-csv.

I do think that the distinctions between the three CSV parsing functions are significant enough to keep them separate, though.

fn:csv-fetch-field-by-column is a different case. It's proposed so that someone dealing with complex CSV data can create structures that look and work like the default csv-to-xdm rows output without reinventing the wheel. That feels useful to me, but fairly niche and if someone were to argue that the extra clutter caused by having yet another csv- prefixed function outweighed its utility, I'd be sympathetic.

(This is a symptom of a larger issue with standard programming-language namespaces in XPath: the significant overhead of custom XML namespaces as a surrogate for Java/Python-like package namespaces means we have to cram a lot of undifferentiated names into a global namespace.)

And, of course, we're leaving the issue of needing to add CSV-string generation functions so that users can conveniently and consistently write as well as read CSV data.

Perhaps we could entertain the idea of using (in this case) the name prefix csv- for all CSV-related functions? csv-to-xml, csv-to-xdm, and csv-fetch-field-by-column are fine as they are. parse-csv could be renamed csv-basic-parse or something like that, both to move it out of the list of parse- functions (given that it's not intended as the primary function to use with CSV data) and to move it into the csv- list of functions.

@ChristianGruen (Contributor):

@fidothe Thanks for your feedback and your assessment.

It seems we both agree that it was unfortunate to have both fn:parse-json and fn:json-to-xml. Personally, I would have preferred a single fn:parse-json function that creates different outputs, based on the given options (but it’s certainly too late to question this in the spec). That’s basically what we did in our implementation, and my impression is that this has turned out to work perfectly well for most users.

I can see your reasoning for proposing more than just one CSV function. I noticed that all options of fn:parse-csv are also supported by the two other CSV functions, which feels a bit redundant. What about including an additional method option (possible values: parse, xdm, xml) to control the generated output, similar to fn:serialize? This would allow us to write:

parse-csv($input, map { 'method': 'xdm' })
csv-doc($url, map { 'column-names': true(), 'method': 'xml' })
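The method option could be modelled roughly as follows; a toy Python sketch of the dispatch idea only, where the option names, output shapes, and the naive comma-splitting are assumptions made for illustration:

```python
def parse_csv(text, options=None):
    """Hypothetical single entry point with a 'method' option.

    Splitting on bare commas stands in for real CSV parsing here;
    the point is only how one function could serve all three outputs.
    """
    options = dict(options or {})
    method = options.get("method", "parse")
    rows = [line.split(",") for line in text.strip().splitlines()]
    if method == "parse":
        # Minimal output: a sequence of field sequences.
        return rows
    if method == "xdm":
        # Richer map-based structure.
        return {"rows": rows, "number-of-rows": len(rows)}
    if method == "xml":
        # XML-ish serialization of the same data.
        cells = "".join(
            "<row>" + "".join(f"<field>{f}</field>" for f in row) + "</row>"
            for row in rows
        )
        return f"<csv>{cells}</csv>"
    raise ValueError(f"unknown method: {method}")
```

One function, three output shapes, selected by option rather than by function name.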

An alternative could be to use the result of parse-csv as input for the two more advanced functions.

We should certainly offer more than what 90% of users need, but I think we should make the basic functionality as simple as possible. Being able to process hierarchical CSV data is an interesting challenge, but after all, that’s something our users and customers never confronted us with.

Talking about csv-fetch-field-by-column, I would suggest making it self-contained by embedding it into the parsed-csv-structure-record record and giving it a much shorter name. You could then do things like:

let $csv := parse-csv($input, map { 'method': 'xdm' })
for $row in $csv?rows
for $field in $csv?field
return $csv?get($row, $field)
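The self-contained record idea above can be sketched in Python, with a closure standing in for the embedded XDM function (entry names like rows, columns, and get are assumptions taken from this discussion, not the spec):

```python
def parse_csv_xdm(text):
    """Return a parse result that embeds its own lookup function,
    in the spirit of fn:random-number-generator's record.

    Naive comma-splitting stands in for real CSV parsing."""
    lines = text.strip().splitlines()
    header = lines[0].split(",")
    data = [line.split(",") for line in lines[1:]]
    columns = {name: i for i, name in enumerate(header)}

    def get(row, column):
        # `column` may be a 1-based index or a column name; the closure
        # already knows the column map, so no extra argument is needed.
        idx = column - 1 if isinstance(column, int) else columns[column]
        return row[idx]

    return {"rows": data, "columns": list(columns), "get": get}


csv_map = parse_csv_xdm("id,city\n1,Munich\n2,Paris")
```

Because get closes over the column map, callers never have to pass the columns separately, which is exactly what makes the record self-contained.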

> And, of course, we're leaving the issue of needing to add CSV-string generation functions so that users can conveniently and consistently write as well as read CSV data.

The most consistent approach would be to introduce CSV as serialization method.

@ChristianGruen (Contributor):

Related (serialize functions): #760

@fidothe (Contributor, Author) commented Oct 24, 2023

Following up from the QTCG meeting of 2023-10-17, a summary of (and response to) the bits of the discussion that focussed more specifically on CSV parsing, instead of parsing other data formats in general. That more general topic has been covered in #748.

Parsing CSV in particular

There was still confusion around why parse-csv and csv-to-xdm are separate functions, and I think it stems from the question of what a parse-* function is expected to do, now that there are several of them that deal with more complex data formats.

parse-csv should return an immediately useful XDM result. Basically, csv-to-xdm should be called parse-csv and parse-csv should be called something else, perhaps fn:csv-rows-to-sequence.

There was also confusion over why fn:csv-fetch-field-by-column was there. The intent was that it should be possible to build a data structure like the one fn:csv-to-xdm returns from a complex CSV file that does not fit the simple 2D table structure the csv-to-* functions expect. This would be achieved by using the simple output from fn:parse-csv, which does nothing more than handle field and row delimiting, plus quote handling. fn:csv-fetch-field-by-column provides a way to look up values in a sequence by index or name, by referencing a supplied hash.

This is basic functionality provided by almost all languages in their CSV libraries. In the output from fn:csv-to-xdm the field key of a row record does exactly that, and is effectively fn:csv-fetch-field-by-column bound to the row record through partial function application.
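The partial-application relationship can be shown with a small Python analogue (function and entry names here are stand-ins modelled on the discussion, not the spec):

```python
from functools import partial


def fetch_field_by_column(columns, row, column):
    """Free-standing lookup, a stand-in for fn:csv-fetch-field-by-column.

    `columns` maps column names to 1-based positions; `column` may be
    a name or a 1-based index."""
    idx = column if isinstance(column, int) else columns[column]
    return row[idx - 1]


columns = {"id": 1, "city": 2}
row = ["1", "Munich"]

# Binding the column map and the row via partial application yields the
# per-row `field` accessor described above.
field = partial(fetch_field_by_column, columns, row)
```

The free-standing function is what lets users assemble row records for irregular data; the partially-applied form is what fn:csv-to-xdm hands out for free.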

This does feel like a worthwhile toolkit to provide, especially given the csv-row-record type, which expects such a function to be part of it. If that record type gains traction, being able to relatively easily produce things that conform to it has real value.

Finally, there was discussion about handling other delimiters that are used by some CSV files. The only example I have noted down was the use of # as a signal that a line in a CSV was a comment. This is covered in the Open Knowledge Foundation (OKFN)'s CSV dialect definition. That seems worth supporting. I have not yet had a chance to dig more deeply into BaseX's CSV support for things like this, which I will do before responding more fully to this specific point.
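Comment-line handling of the kind the OKFN dialect describes is a thin filter in front of the basic parser; a Python sketch of the idea (the function name and the filter-before-parse approach are assumptions for illustration):

```python
import csv
import io


def parse_csv_skipping_comments(text, comment_char="#"):
    """Drop any line whose first character is the comment marker,
    then parse what remains as ordinary CSV."""
    kept = "\n".join(
        line for line in text.splitlines()
        if not line.startswith(comment_char)
    )
    return list(csv.reader(io.StringIO(kept)))


rows = parse_csv_skipping_comments("# exported 2023\nid,city\n1,Munich")
```

A commentChar-style option on the parsing functions would amount to exactly this pre-filtering step.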

@ChristianGruen (Contributor):

Thanks again.

> parse-csv should return an immediately useful XDM result. Basically, csv-to-xdm should be called parse-csv and parse-csv should be called something else, perhaps fn:csv-rows-to-sequence to create this alternative output?

This would certainly be an option. Regarding fn:csv-rows-to-sequence, have you thought about adding a format/method option to parse-csv as an alternative?

> There was also confusion over why fn:csv-fetch-field-by-column was there.

I remember there was a suggestion to embed such a function in the result of the parse-csv function (similar to what we know from fn:random-number-generator). This way, there may be no need to supply the columns as an argument. Ideally, you could do something like $parsed-csv?get($row, $column). What was the reason this was dropped (sorry if this has already been communicated)?

> I have not yet had a chance to dig more deeply into BaseX's CSV support for things like this, which I will do before responding more fully to this specific point.

BaseX can’t provide a solution for that. Instead, our experience is that…

  1. 80% of CSV files are regular, usually exported from Excel (thus with varying field separators, depending on the regional settings of the Excel installation),
  2. 15% vary in how backslashes and quotes need to be interpreted, and
  3. the remaining 5% are irregular and cannot be processed automatically, but instead need to be partitioned manually/heuristically after the contents have been checked. Such a data.csv file can look as follows:

     Here are the cities:

     id,city
     1,Munich
     2,Paris

     Here are the countries:

     1,Germany
…so I was wondering how much need there is to support input that is not two-dimensional. Personally, I haven’t come across OKFN files yet.
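The manual/heuristic partitioning that such irregular files need can be sketched in Python; the heuristics here (blank lines delimit sections, comma-less lines are prose) are assumptions chosen to fit the data.csv example above, not a general solution:

```python
def partition_sections(text):
    """Heuristically split an irregular file into candidate CSV sections.

    Blank lines end a section; lines containing no comma are treated as
    prose and dropped. Each section becomes a list of field lists."""
    sections, current = [], []
    for line in text.splitlines():
        if not line.strip():
            if current:
                sections.append(current)
                current = []
        elif "," in line:
            current.append(line.split(","))
    if current:
        sections.append(current)
    return sections


sections = partition_sections(
    "Here are the cities:\n\nid,city\n1,Munich\n2,Paris\n\n"
    "Here are the countries:\n\n1,Germany\n"
)
```

Such code is easy to write on top of a bare rows-of-fields parse, and much harder on top of output that has already been forced into a single 2D table.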

These are simple errors that happened when the functions were changed
around.
There were some things like bad syntax and using ${var} instead of
{$var} in string value templates that I didn't spot until running them
against an implementation.
Particularly:
* Bring csv-to-xdm and csv-to-xml options into line with each other.
* Ensure things are called -delimiter instead of sometimes -separator.
* Fix examples which had `column-names` as a map(xs:string, xs:integer)
  rather than map(xs:integer, xs:string)...
* Add missing error codes.

Added several examples to csv-to-xdm.  (Most of these really need
pulling into qt4tests.)
Some old key names were still there, and there were some formatting
issues.

Moved a number of examples to the qt4test test suite
Change type of column-names map option to map(xs:string, xs:integer)
from map(xs:integer, xs:string). It turns out I had used this format for
all the examples, and the more I thought about it, the more it seemed
unhelpful to have a map like `map { 1: "a", 2: "b" }` produce a
`csv-columns-record` whose `names` entry was `map { "a": 1, "b": 2 }`.

Not least, that prevented the `names` entry from a `csv-columns-record`
being used as the `column-names` option in another invocation, which
might be desirable if a user has one CSV with headers and subsequent
CSVs of the same schema but without the header line (transaction logs
split for size, perhaps).
Rename parse-csv to csv-to-simple-rows
Rename csv-to-xdm to parse-csv

Remove csv-fetch-field-by-column
@fidothe (Contributor, Author) commented Nov 9, 2023

I have made the changes discussed in the meeting on 2023-11-07: fn:csv-fetch-field-by-column is deleted, csv-to-xdm has been renamed to parse-csv, and the old parse-csv is now csv-to-simple-rows.

@fidothe fidothe marked this pull request as ready for review November 9, 2023 09:35
@ndw ndw added the Propose Merge without Discussion Change is editorial or minor label Nov 13, 2023
@ndw (Contributor) commented Nov 14, 2023

The group agreed to merge this PR at meeting 054

@ndw ndw merged commit 370dae6 into qt4cg:master Nov 14, 2023
2 checks passed
@ChristianGruen ChristianGruen removed the Propose Merge without Discussion Change is editorial or minor label Jan 2, 2024
@michaelhkay michaelhkay added the Overtaken PR was accepted but has no effect on current spec label Mar 19, 2024