Skip and log malformed rows during COPY #1872

Closed
ray6080 opened this issue Jul 31, 2023 · 6 comments · Fixed by #4184
Labels: data-import-export (issues related to data importing or exporting, such as copy to/from statements), high-priority, usability (issues related to better usability experience, including bad error messages)

Comments

@ray6080
Contributor

ray6080 commented Jul 31, 2023

A fully cleaned dataset is not always easy to come by, which makes data loading tedious: users have to rerun COPY and fix their dataset each time we throw an error.
To make COPY smoother, it would be good to add an option to the COPY statement that skips malformed rows, either silently or with logging.
This should cover copying of both node and rel tables.

Notes:

@ray6080
Contributor Author

ray6080 commented Jan 9, 2024

The "show warnings" feature in MySQL can be a reference. https://dev.mysql.com/doc/refman/8.0/en/show-warnings.html

@semihsalihoglu-uw added the usability and data-import-export labels on Jan 9, 2024
@semihsalihoglu-uw
Contributor

We can have an IGNORE_ERRORS=true parameter to COPY FROM and LOAD FROM statements.

@ray6080 assigned royi-luo and unassigned andyfengHKU on Aug 13, 2024
@ray6080
Contributor Author

ray6080 commented Aug 13, 2024

To get this feature fully done, we need to support several things:

  • Allow ignoring errors inside the CSV reader, which supports LOAD FROM with option IGNORE_ERRORS=true.
  • Allow ignoring errors inside casting functions. This needs more work to push casting inside readers.
  • Allow ignoring errors like duplicated primary key, missing primary keys from node tables, etc., inside the batch insert pipelines (NodeBatchInsert, IndexLookup).
  • Report skipped rows at the end of execution in the query result, including file name, row idx and content of the errored row. (Need a separate fTable inside ResultCollector to collect warning information).
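
As a rough illustration of the first and last points, here is a minimal Python sketch of a reader that skips malformed rows and records the file name, row idx and content of each one. It is not the actual reader or ResultCollector code; `RowWarning`, `load_rows` and the column-count check are assumptions made up for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class RowWarning:
    """Hypothetical warning record: where the bad row was and what it held."""
    file_name: str
    row_idx: int   # 1-based index among data rows (header excluded)
    content: str   # fields of the skipped row, re-joined with commas
    message: str


def load_rows(file_name, stream, num_columns, ignore_errors=False):
    """Return (rows, warnings). Rows with the wrong column count are either
    skipped and logged (ignore_errors=True) or raised as errors."""
    rows, warnings = [], []
    reader = csv.reader(stream)
    next(reader, None)  # skip the header line
    for row_idx, fields in enumerate(reader, start=1):
        if len(fields) != num_columns:
            message = f"expected {num_columns} columns, found {len(fields)}"
            if not ignore_errors:
                raise ValueError(f"{file_name}:{row_idx}: {message}")
            warnings.append(
                RowWarning(file_name, row_idx, ",".join(fields), message)
            )
            continue
        rows.append(fields)
    return rows, warnings


SAMPLE = "id,name\n1,Alice\n2\n3,Carol,extra\n4,Dan\n"
rows, warnings = load_rows(
    "user.csv", io.StringIO(SAMPLE), num_columns=2, ignore_errors=True
)
for w in warnings:
    print(f"{w.file_name} row {w.row_idx}: {w.message} ({w.content})")
```

With `ignore_errors=False` the same call raises on the first malformed row, which matches the behaviour described at the top of this issue.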

@royi-luo
Collaborator

royi-luo commented Aug 20, 2024

#4067 (comment)

  • A TODO: we should add a local cache for warnings that we periodically flush to the shared warning cache. This will reduce the amount of lock contention, improving performance.
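
A minimal Python sketch of that idea, assuming each worker thread owns a small buffer that it flushes into the shared cache once it reaches a threshold. The class names, the threshold of 64 and the `lock_acquisitions` counter are all hypothetical and only there to make the effect visible.

```python
import threading


class SharedWarningCache:
    """Hypothetical shared cache; every write takes the lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._warnings = []
        self.lock_acquisitions = 0  # counts write-side lock acquisitions

    def extend(self, batch):
        with self._lock:
            self.lock_acquisitions += 1
            self._warnings.extend(batch)

    def snapshot(self):
        with self._lock:
            return list(self._warnings)


class LocalWarningBuffer:
    """Per-thread buffer that flushes to the shared cache in batches."""

    def __init__(self, shared, flush_threshold=64):
        self._shared = shared
        self._flush_threshold = flush_threshold
        self._pending = []

    def add(self, warning):
        self._pending.append(warning)
        if len(self._pending) >= self._flush_threshold:
            self.flush()

    def flush(self):
        if self._pending:
            self._shared.extend(self._pending)
            self._pending = []


def worker(shared, worker_id, num_warnings):
    local = LocalWarningBuffer(shared)
    for i in range(num_warnings):
        local.add((worker_id, i))
    local.flush()  # push whatever is left when the worker finishes


shared = SharedWarningCache()
threads = [
    threading.Thread(target=worker, args=(shared, w, 200)) for w in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch, 4 workers emitting 200 warnings each take the shared lock 16 times in total instead of 800, at the cost of warnings becoming visible only after a flush.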

@royi-luo
Collaborator

To get this feature fully done, we need to support several things:

* [x] Allow ignoring errors inside the CSV reader, which supports `LOAD FROM` with option `IGNORE_ERRORS=true`.

* [ ] Allow ignoring errors inside casting functions. This needs more work to push casting inside readers.

* [ ] Allow ignoring errors like duplicated primary key, missing primary keys from node tables, etc., inside the batch insert pipelines (NodeBatchInsert, IndexLookup).

* [x] Report skipped rows at the end of execution in the query result, including file name, row idx and content of the errored row. (Need a separate fTable inside ResultCollector to collect warning information).

Update: a `show_warnings()` function was added that reports all cached warnings.
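
For the two unchecked items, the intended behaviour could look roughly like the Python sketch below: a cast failure or a duplicate primary key turns into a warning and the row is skipped instead of aborting the whole insert. This is only an illustration under assumed names (`RowError`, `cast_key`, `batch_insert`); it does not reflect how NodeBatchInsert or IndexLookup are implemented.

```python
class RowError(Exception):
    """Hypothetical per-row error that may be skipped when errors are ignored."""


def cast_key(raw_id):
    """Cast the raw primary-key text to an integer, or raise RowError."""
    try:
        return int(raw_id)
    except ValueError:
        raise RowError(f"cannot cast {raw_id!r} to INT64") from None


def batch_insert(rows, ignore_errors=False):
    """rows: iterable of (raw_id, name). Returns (table, warnings)."""
    table = {}     # primary key -> name
    warnings = []  # (row_idx, message) for every skipped row
    for row_idx, (raw_id, name) in enumerate(rows, start=1):
        try:
            key = cast_key(raw_id)
            if key in table:
                raise RowError(f"duplicate primary key {key}")
        except RowError as exc:
            if not ignore_errors:
                raise
            warnings.append((row_idx, str(exc)))
            continue
        table[key] = name
    return table, warnings


table, warnings = batch_insert(
    [("1", "Alice"), ("x", "Bob"), ("1", "Carol"), ("2", "Dan")],
    ignore_errors=True,
)
```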

@ray6080
Contributor Author

ray6080 commented Dec 2, 2024

See the remaining TODOs in #4579.

@ray6080 closed this as completed on Dec 2, 2024