413: Spec for CSV-related functions #719
@ChristianGruen just flagging this for your attention - there's a trivial-but-significant change in the
Thank you for pointing this out.
@fidothe I had another look at the current range of parsing functions in the spec (#748), and I’m feeling a bit uneasy that each function has its own semantics. In our implementation, we tried to offer a consistent approach for different input formats, and I believe it would be great if we could pursue something similar in the standard function set. Maybe we have a chance to get If
@ChristianGruen, I completely agree that we should standardise the parsing functions as much as possible - I'm very happy to change the input parameters to have the name There are issues with what the functions do, and why they're named the way they are - I think CSV is a much wilder format, and with much poorer definition, than JSON. For anyone dealing with wilder variations (nested hierarchical data and the multiple-data-sets-separated-by-blank-columns examples I've dealt with before spring to mind) the output from those functions is not great, and will never be great without making the functions so complex that they're difficult for everyone to use, and a nightmare to maintain. That's what There is a bare minimum of parsing that is useful in all cases - handling the separators and quoting - and is a pain to implement from scratch yourself. That bare minimum is what Whether that should be called I do think that the distinctions between the three CSV parsing functions are significant enough to keep them separate, though.
(This is a symptom of a larger issue with standard-programming-language-namespaces in XPath, because the significant overhead of custom XML namespaces as a surrogate for java/python-like package namespaces means we have to cram a lot of undifferentiated names into a global namespace.) And, of course, we're leaving the issue of needing to add CSV-string generation functions so that users can conveniently and consistently write as well as read CSV data. Perhaps we could entertain the idea of using (in this case) the name prefix
@fidothe Thanks for your feedback and your assessment. It seems we both agree that it was unfortunate to have both I can see what your reasoning was for proposing more than just one CSV function. I noticed that all options of
parse-csv($input, map { 'method': 'xdm' })
csv-doc($url, map { 'column-names': true(), 'method': 'xml' })
An alternative could be to use the result of We should certainly offer more than what 90% of users need, but I think we should make the basic functionality as simple as possible. Being able to process hierarchical CSV data is an interesting challenge, but after all, that’s something our users and customers never confronted us with. Talking about
let $csv := parse-csv($input, map { 'method': 'xdm' })
for $row in $csv?rows
for $field in $csv?field
return $csv?get($row, $field)
The most consistent approach would be to introduce CSV as a serialization method.
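To make the `$csv?get($row, $field)` idea in the comment above more concrete, here is a minimal sketch of a parse result that carries its own accessor function. It uses only XPath 3.1 maps, arrays, and inline functions. The record layout (`rows`, `get`), the sample data, and the `$names` lookup map are all assumptions made for illustration. They are not the result format proposed in this PR or implemented anywhere.

```xquery
(: Hypothetical column-name-to-position map and hand-written rows;
   a real parser would derive both from the CSV input. :)
let $names := map { "name": 1, "city": 2 }
let $rows := ( ["Alice", "Berlin"], ["Bob", "Paris"] )

(: Hypothetical result record: the data plus a function-valued entry
   that resolves a column name to a field within a row. :)
let $csv := map {
  "rows": $rows,
  "get": function($row as array(*), $field as xs:string) {
    $row($names($field))
  }
}

(: Look up the function in the map and call it for each row. :)
for $row in $csv?rows
return $csv?get($row, "city")
(: yields "Berlin", "Paris" :)
```

The accessor closes over `$names`, so callers never deal with column positions directly. That seems to be the convenience an embedded function would offer.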
Related (serialize functions): #760
Following up from the QTCG meeting of 2023-10-17, here is a summary of (and response to) the bits of the discussion that focussed more specifically on CSV parsing, instead of parsing other data formats in general. That more general topic has been covered in #748.
Parsing CSV in particular
There was still confusion around why
There was also confusion over why This is basic functionality provided by almost all languages in their CSV libraries. In the output from This does feel like a worthwhile toolkit to provide, especially given the Finally, there was discussion about handling other delimiters that are used by some CSV files. The only example I have noted down was the use of
Thanks again.
This would certainly be an option. Regarding
I remember there was a suggestion to embed such a function in the result of the parse-csv function (similar to what we know from
BaseX can’t provide a solution for that. Instead, our experience is that…
…so I was wondering how much need there is to support input that is not two-dimensional. Personally, I haven’t come across OKFN files yet.
These are simple errors that happened when the functions were changed around.
There were some things like bad syntax and using ${var} instead of {$var} in string value templates that I didn't spot until running them against an implementation.
Particularly:
* bring csv-to-xdm and csv-to-xml options into line with each other.
* ensure things are called -delimiter instead of sometimes -separator.
* Fix examples which had `column-names` as a map(xs:string, xs:integer) rather than map(xs:integer, xs:string)...
* Add missing error codes
Added several examples to csv-to-xdm. (Most of these really need pulling into qt4tests.)
Some old key names were still there, and there were some formatting issues. Moved a number of examples to the qt4tests test suite.
Change type of column-names map option to map(xs:string, xs:integer) from map(xs:integer, xs:string).
It turns out I had used this format for all the examples, and the more I thought about it, the more it seemed unhelpful to have a map like `map { 1: "a", 2: "b" }` produce a `csv-columns-record` whose `names` entry was `map { "a": 1, "b": 2 }`. Not least, that prevented the `names` entry from a `csv-columns-record` being used as the `column-names` option in another invocation, which might be desirable if a user has one CSV with headers and subsequent CSVs of the same schema but without the header line (transaction logs split for size, perhaps).
Rename parse-csv to csv-to-simple-rows
Rename csv-to-xdm to parse-csv
Remove csv-fetch-field-by-column
I have made the changes discussed in the meeting on 2023-11-07:
The group agreed to merge this PR at meeting 054
This PR contains error fixes (typos, examples that contradicted the spec text), some (hopefully) improved language and one breaking change.
The current draft uses the type `map(xs:integer, xs:string)` for the `column-names` option to `fn:csv-to-xdm` and `fn:csv-to-xml`. This PR flips that to `map(xs:string, xs:integer)`. It turns out that the examples were already using this, and it seems to me that having the `names` entry in the `csv-columns-record` record type be the transposed version of the `column-names` option that creates it, rather than be the same thing, is counterproductive.

I can think of some examples (a CSV split into several chunks, with only the first containing the headers) where being able to feed the `names` entry right back into another invocation of `fn:csv-to-xdm` would be useful. If nothing else it's confusing and not obvious, or I wouldn't have messed up the examples, and somebody would have noticed during the review process...
would be useful. If nothing else it's confusing and not obvious, or I wouldn't have messed up the examples, and somebody would have noticed during the review process...