Defining delimiter inside CSV files to import #111

Open
nick-rv opened this issue Nov 7, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@nick-rv

nick-rv commented Nov 7, 2024

Following issue #110, I was able to import my dataset successfully using the default comma delimiter.

However, several thousand records could not be indexed because some of their fields contain commas.
To get correct results, the solution seems to be allowing an arbitrary character to be chosen as the delimiter.

Attempted Solutions

I tried changing the delimiter of my original CSV file from ";" to ",", and the import job then ran to completion.
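
To illustrate why the converted file loses records, here is a minimal sketch. The sample record is hypothetical (it is not taken from the actual dataset), and it assumes the fields are not quoted:

```javascript
// Hypothetical semicolon-delimited record whose second field contains a comma.
const original = '12;Rue de la Paix, Bat. B;Paris';

// Naively switch the delimiter by replacing every ";" with ",".
const converted = original.replace(/;/g, ',');

// Splitting on the original delimiter yields the intended three fields.
const before = original.split(';');

// Splitting the converted line on "," yields four fields: the embedded
// comma can no longer be told apart from a delimiter.
const after = converted.split(',');

console.log(before.length, after.length);
```

With a delimiter that never occurs in the data, the field count stays stable, which is the motivation for the proposal below.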

Proposal

One idea would be to allow a custom delimiter to be defined in the pelias.json config file:

{
  ...
  "imports": {
    "adminLookup": {
      "enabled": false
    },
    "csv": {
      "delimiter": "§",
      "datapath": "/data",
      "files": ["adresses-france.csv"]
    }
  }
}

This character would be used as the value of the delimiter option of the csv-parse parser: https://csv.js.org/parse/options/delimiter/
Applying this setting to all imported CSV files seems fine from my point of view.
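
As a rough sketch of how the importer might consume such a setting: the helper names `resolveDelimiter` and `buildParserOptions` are hypothetical and not part of the existing importer or of pelias/config. The only thing taken from the csv-parse documentation linked above is that the parser accepts a `delimiter` option.

```javascript
// Default used when the config does not specify a delimiter.
const DEFAULT_DELIMITER = ',';

// Hypothetical helper: read `imports.csv.delimiter` from a pelias.json-style
// config object, falling back to the comma default when it is absent.
function resolveDelimiter(config) {
  const csv = (config && config.imports && config.imports.csv) || {};
  const delimiter = csv.delimiter;
  if (delimiter === undefined) return DEFAULT_DELIMITER;
  if (typeof delimiter !== 'string' || delimiter.length === 0) {
    throw new TypeError('imports.csv.delimiter must be a non-empty string');
  }
  return delimiter;
}

// Hypothetical helper: build the options object that would be handed to the
// CSV parser (assumption: the same options are used for every file).
function buildParserOptions(config) {
  return { delimiter: resolveDelimiter(config) };
}

const config = {
  imports: {
    csv: { delimiter: '§', datapath: '/data', files: ['adresses-france.csv'] }
  }
};

console.log(buildParserOptions(config));
```

Configs without the new key keep the current comma behaviour, so the change would be backwards compatible.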

References

#110

Thanks!

@missinglink
Member

missinglink commented Nov 8, 2024

If we add this option, it should be per file rather than global, IMO.

The solution suggested in the issue description would apply the provided delimiter to all files listed in the array. I suspect this would become a problem when the list contains a mix of comma-delimited and other-delimited files.

As the files field is of type Array<string> we could consider prefixing a parsing hint string such as:

"files": ["tsv://adresses-france.csv"]

The other parser options are also worth considering here. I could see someone asking to modify some of the other rules at a later date, and this string prefix method doesn't scale well in that regard.

Additionally, I don't recall whether we support compressed files such as .csv.gz. If so, we'd need to consider the impact of these hints on that, as well as removing the prefix in the right places before attempting to download or decompress the file.
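
A minimal sketch of the prefix approach, to make the "remove the prefix in the right places" concern concrete. The function name `parseFileHint` and the hint table are hypothetical; only the "tsv://" prefix appears in this thread.

```javascript
// Hypothetical mapping from hint prefix to delimiter. Only "tsv" appears in
// the discussion; "csv" is added here purely for symmetry.
const HINT_DELIMITERS = { tsv: '\t', csv: ',' };

// Split an entry such as "tsv://adresses-france.csv" into the delimiter to
// parse with and the bare path to use for download or decompression.
function parseFileHint(entry) {
  const match = /^([a-z]+):\/\/(.+)$/.exec(entry);
  if (match && Object.prototype.hasOwnProperty.call(HINT_DELIMITERS, match[1])) {
    return { delimiter: HINT_DELIMITERS[match[1]], path: match[2] };
  }
  // No recognised hint: keep the entry untouched and use the comma default.
  return { delimiter: ',', path: entry };
}

console.log(parseFileHint('tsv://adresses-france.csv'));
console.log(parseFileHint('adresses-france.csv.gz'));
```

Everything downstream (download, decompression, parsing) would then need to use the returned `path` rather than the raw entry, which is where the extra care comes in.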

An alternative would be to change the type of the files field to Array<string|object>, which is a little messier but more extensible.
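
For comparison, a sketch of how mixed string/object entries could be normalised. The object shape (`path`, `delimiter`), the function name `normalizeFileEntry`, and the file name `places.csv` are assumptions for illustration; no schema has been agreed on.

```javascript
// Hypothetical normaliser: turn each `files` entry into an object with a
// `path` and a `delimiter`, defaulting the delimiter to a comma.
function normalizeFileEntry(entry) {
  if (typeof entry === 'string') {
    return { path: entry, delimiter: ',' };
  }
  if (entry && typeof entry === 'object' && typeof entry.path === 'string') {
    // Explicit properties on the entry override the default delimiter.
    return { delimiter: ',', ...entry };
  }
  throw new TypeError('each files entry must be a string or an object with a "path"');
}

// Plain strings keep working; objects can carry per-file parser settings.
const files = ['places.csv', { path: 'adresses-france.csv', delimiter: ';' }];
const normalized = files.map(normalizeFileEntry);

console.log(normalized);
```

Further per-file parser options could later be added as extra object properties without touching the string form.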

Finally, one option is to simply state that this library only supports commas, document that, and expect users to convert their data to meet that requirement.

@missinglink
Member

missinglink commented Nov 8, 2024

It is also worth mentioning that the csv-parse library we use has an open issue about automatically discovering delimiters.

We could simply wait for that to land and avoid introducing any changes to pelias/config which would later become obsolete.

adaltas/node-csv#400
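
To give an idea of what delimiter discovery means, here is a deliberately naive heuristic. This is not how csv-parse does or will implement it; the candidate list and the function name `guessDelimiter` are assumptions for illustration.

```javascript
// Candidate delimiters to test (an assumption, not an exhaustive list).
const CANDIDATES = [',', ';', '\t', '|'];

// Guess the delimiter by counting how often each candidate occurs in the
// header line and picking the most frequent one. Falls back to a comma when
// no candidate occurs at all.
function guessDelimiter(headerLine) {
  let best = ',';
  let bestCount = 0;
  for (const candidate of CANDIDATES) {
    const count = headerLine.split(candidate).length - 1;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}

console.log(JSON.stringify(guessDelimiter('id;name;lat;lon')));
```

A real implementation would also have to cope with quoted fields and with delimiters that appear inside the data, which is a good reason to leave this to the parser library.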

@missinglink
Member

My preference would be to wait for the linked PR to land and then enable the auto-discover option.

@nick-rv
Author

nick-rv commented Dec 5, 2024

This looks like a good pragmatic approach.


2 participants