Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Added BOM capability for output files (#1267) #1274

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

alvaro-osvaldo-tm
Copy link

  • Added the '--add-bom' parameter for almost utilities

- Added the '--add-bom' parameter for almost utilities

Signed-off-by: Álvaro Osvaldo <[email protected]>
@alvaro-osvaldo-tm alvaro-osvaldo-tm changed the title feat: Added BOM capability for output files (1267) feat: Added BOM capability for output files (#1267) Feb 9, 2025
@alvaro-osvaldo-tm
Copy link
Author

alvaro-osvaldo-tm commented Feb 9, 2025

Implementation

Implemented the feature to optionality add UTF-8 Byte Order Mark (BOM) into output content in all utilities,
except csvpy and sql2csv

Solution

  • The UTF-8 BOM only will be added if the parameter '--add-bom' is specified, otherwise is ignored.
  • The parameter configuration and execution was implemented in the file csvkit/features/AddBOM.py ,
    I used a 'feature' pattern to avoid 'spaghetti code', no problem if the code need to be put into CSVKitUtility class.
    • The advantage of this approach is the code is more clear.
    • But the CSVToolKit is not prepared for it as seen in 'argument' method. Also, a few more CPU cycles will be perceived, if the user process a HUGE amount of files.

Tests

  • A attached a end-to-end test script

  • No unit test was made because the tests use 'StringIO' as 'input file', but the BOM need to be added as bytes using 'TextIOWrapper'.

    • If you want, I can implement a conversion in 'CSVToolkit' and 'LazyFile' to enable the tests.
  • All PyTests and end-to-end tests passed in the following versions:

    • Python 3.8.20
    • Python 3.9.21
    • Python 3.10.16
    • Python 3.11.11
    • Python 3.12.8
      • Except csvgrep and csvcut due CSVToolkitbug.

Checklist

  • Unit Testing
  • End-to-end Testing

Considerations

References

Signed-off-by: Álvaro Osvaldo <[email protected]>
@@ -245,6 +248,8 @@ def _init_common_parser(self):
help='Insert a column of line numbers at the front of the output. Useful when piping to grep or as a '
'simple primary key.')

AddBOM.argument(self.argparser,self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please conform to the existing code, instead of introducing an entirely different code organization pattern.

@@ -134,6 +135,8 @@ def run(self):
if 'f' not in self.override_flags:
self.input_file = self._open_input_file(self.args.input_path)

AddBOM.run(self.output_file, self.args)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just inline the 2 lines of code here, instead of creating a new 54-line file:

if getattr(self.args, 'add_bom', False):
    self.output_file.buffer.write(BOM_UTF8)

And of course do from codecs import BOM_UTF8 with the other imports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants