Integrating with MegaLinter #1239
Comments
Thank you for your interest! There is an open issue to change csvclean's behavior somewhat, so you might have different ideas on how to integrate with that proposal (or you might have new proposals): #195 (comment) I think this issue would simplify your integration - let me know what you think and/or if any other changes would help. |
@jpmckinney Very nice! Yes, those would be terrific. I'm reluctant to add more work to your plate, and the idea of changing an interface that others may rely on makes me nervous. That's why I was thinking of a wrapper as opposed to changing the interface. But yeah, absolutely, those changes would be huge. |
Well, #195 changes the interface, so it will require a major version – if there are any other interface changes we might as well do them at the same time, as I won't be making another major version in a long time, I suspect. |
I think https://avatars.githubusercontent.com/u/17111824 is all we have. Thank you :) |
Okay, 2.0.0 is ready to be released. You can read the changes at https://csvkit.readthedocs.io/en/latest/changelog.html#unreleased If no other suggestions, I'll date the changelog and release. |
Absolutely perfect. Thank you! 😄 |
from the (un)release notes:
So I understand -- and this is me being dense and is not a reflection on you -- the following will be true:
Again, so I understand correctly... A. if modifying options other than filename="/path/to/the/file.csv"
csvclean "$filename" --join-short-rows > "$filename" I'm thinking that it would be safer to do something like: filename="/path/to/the/file.csv"
tmp_filename="$(mktemp)"
csvclean "$filename" --join-short-rows > "$tmp_filename"
mv -f "$tmp_filename" "$filename" (so the new file writes don't clobber the original file if it hasn't been completely read) C. will running If so, does it still use the previous format? (sample follows)
(yes, I know you know what the output looked like.. I'm putting this more for my own reference and for anyone who comes along after wanting to see the history of the change) D. follow-up, can I still use this regex to capture errors?:
( |
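For anyone who wants to reuse the temp-file approach from point B, here is a minimal sketch. The helper name rewrite_in_place is hypothetical and not part of csvkit. It assumes the command takes the input file as its last argument and writes its result to standard output. Creating the temporary file in the same directory as the target is a design choice, so that mv is a rename on the same filesystem rather than a copy.

```shell
# Sketch: rewrite a file "in place" by writing to a temporary file first.
# rewrite_in_place is a hypothetical helper, not a csvkit command.
# Usage: rewrite_in_place FILE COMMAND [ARGS...]
rewrite_in_place() {
  local file="$1"
  shift
  local tmp

  # Create the temporary file next to the target so the final mv is a
  # rename on the same filesystem.
  tmp="$(mktemp "$(dirname "$file")/.tmp.XXXXXX")" || return 1

  # Run the command with the file as its last argument. The original is
  # only replaced if the command exits successfully.
  if "$@" "$file" > "$tmp"; then
    mv -f "$tmp" "$file"
  else
    local status=$?
    rm -f "$tmp"
    return "$status"
  fi
}
```

With the example from point B, this would be called as rewrite_in_place "$filename" csvclean --join-short-rows. Note that mktemp creates the file with restrictive permissions, so the replaced file may end up with different permissions than the original.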
I was going to ask question E but decided not to because it can be done outside of filename="/path/to/filename.csv"
csvclean --join-short-rows "$filename" | sed -Ene "s|^Line[[:space:]]*([[:digit:]]+)[^:]*:[[:space:]]*(.*)|Line \1 $filename : \2|p" ..just in case anyone else's IDE gets confused when parsing warnings.. |
... or when
Yes, all correct. This brings csvclean in line with other csvkit tools, which all write to standard output.
Those also only write to standard output. The only thing that writes to a file is when using in2csv with
Tools read line by line wherever possible. The tools (or scenarios) that do or don't are listed here: https://csvkit.readthedocs.io/en/latest/contributing.html#streaming-versus-buffering
Yes, that would be problematic.
Yes, I think you have to do it that way. There are a bunch of options at https://backreference.org/2011/01/29/in-place-editing-of-files/ https://serverfault.com/q/135507 and https://unix.stackexchange.com/a/204378 but I think the simple
It will write out line length errors, like before.
No, it uses the format that was used for the
No, you'll have to adjust to something like |
I can't get your command to run ( csvstack, for example, has |
yeah, sorry about that.
Rock on. That would be nice. Thank you.
Cool. I got the
Yeah, I suspected that as well (hence the
Yes, please. That would be most excellent.
Cool. Got it. Thank you.
Yuppers. |
I made some other changes. Please see the updated draft release notes. |
Thank you so, so much. Your dedication to csvkit and open source is admirable and exemplary. |
@jpmckinney I just read through the 2.0.0 release notes and I'm very excited. If someone had told me when I was a little kid that I would get this excited about a release of a tool that cleans up files on a computer that represent data as a series of newline-terminated rows with individual values separated into columns by commas... |
Yay :D 2.0.0 is released! 🎉 |
I think this issue is complete? |
Hello!
First off, a great many thanks for your work with csvkit! I appreciate you and your contributions to making this world a better place! (seriously!!)
I'm working on adding csvclean into MegaLinter as a new linter. MegaLinter already includes a bunch of linters for various programming languages, tools, and file formats, including JSON, TOML, XML, and YAML. The effort is being tracked in oxsecurity/megalinter#3493
Typically, when MegaLinter runs, it'll run linters to find issues and report the findings. For linters that support it, it can also run with APPLY_FIXES where a tool (e.g., shfmt for reformatting Bash scripts) will update the source file and push the results back to the repository either as a commit on a branch or in a new PR.
I'm planning on incorporating that functionality with the use of --dry-run when not fixing things (e.g., when APPLY_FIXES is false, use csvclean --dry-run; when true, use csvclean without the extra flag).
When csvclean is run with --dry-run on a file with known issues, it looks like this:
(all sent to STDOUT)
When run without --dry-run, it looks like this:
with the errors sent to acronyms_err.csv and the updated file written to acronyms_out.csv. This is all expected given the documentation.
When MegaLinter runs, it can look for errors with a regex (e.g., (?i)^line[[:space:]]*([[:digit:]]*):). I could use a regex that would match either (e.g., (?i)^(line[[:space:]]*([[:digit:]]*):|([[:digit:]]*)[[:space:]]*errors?[[:space:]]logged)) on STDOUT (i.e., with or without --dry-run, it would still match if errors were found). In all cases, regardless of whether there were errors or not, csvclean returns a 0 response.
I can wrap csvclean with some extra shell magic:
(return non-zero if csvclean throws an error; if it's successful, return non-zero when the _err.csv file exists (so, if no _err.csv file exists, return 0))
But really where I'm focusing is having a unified output (i.e., the same regardless of the presence of --dry-run) that I could easily parse on the other side. Something like:
(I just put that together in this issue -- I don't know if it's syntactically correct or if it'll even run, but hopefully that's sufficient to get the point across)
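For readers following along, the exit-code wrapper described in words above (non-zero if csvclean itself fails, otherwise non-zero only when an _err.csv file exists) could look roughly like the sketch below. This is an illustration under stated assumptions, not the snippet from the original issue: the function name lint_csv is hypothetical, the input file name is assumed to end in .csv, and it relies on the pre-2.0 behaviour described in this issue, where the error file is written next to the input.

```shell
# Sketch only: assumes the pre-2.0 behaviour described in this issue, where
# running csvclean on NAME.csv writes NAME_err.csv next to the input when
# problems are found. lint_csv is a hypothetical name.
lint_csv() {
  local file="$1"
  # Assumption: the input file name ends in .csv.
  local err_file="${file%.csv}_err.csv"

  # Remove a stale error file from an earlier run so it cannot cause a
  # false positive.
  rm -f "$err_file"

  # If csvclean itself fails, pass its exit status through.
  csvclean "$file" || return $?

  # csvclean exited 0: report failure only if an error file was written.
  if [ -e "$err_file" ]; then
    return 1
  fi
  return 0
}
```

A caller would then run something like lint_csv acronyms.csv and check the exit status instead of parsing STDOUT.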
So... my questions...