feat: Added BOM capability for output files (#1267) #1274

alvaro-osvaldo-tm · 2025-02-09T20:57:14Z

Added the '--add-bom' parameter for almost all utilities

- Added the '--add-bom' parameter for almost all utilities

Signed-off-by: Álvaro Osvaldo <[email protected]>

alvaro-osvaldo-tm · 2025-02-09T21:02:22Z

Implementation

Implemented the feature to optionally add a UTF-8 Byte Order Mark (BOM) to the output of all utilities
except csvpy and sql2csv.

Solution

The UTF-8 BOM is added only if the '--add-bom' parameter is specified; otherwise nothing is written.
The parameter configuration and execution are implemented in the file csvkit/features/AddBOM.py.
I used a 'feature' pattern to avoid 'spaghetti code'; it is no problem if the code needs to be moved into the CSVKitUtility class.
- The advantage of this approach is that the code is clearer.
- However, CSVToolkit is not prepared for it, as seen in the 'argument' method. Also, a few more CPU cycles will be spent, which may be noticeable if the user processes a HUGE number of files.
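
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names build_parser, write_output and render are hypothetical, and an in-memory BytesIO stands in for the real output stream. It assumes the output file is a text wrapper that exposes a binary .buffer, as sys.stdout normally does.

```python
import argparse
import io
from codecs import BOM_UTF8


def build_parser():
    """Hypothetical parser that defines only the flag discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--add-bom', dest='add_bom', action='store_true',
        help='Add the UTF-8 byte order mark (BOM) to the output.')
    return parser


def write_output(args, output_file, text):
    """Write text, preceded by the BOM only when --add-bom was given."""
    if args.add_bom:
        # The BOM is raw bytes, so it goes to the underlying binary buffer
        # before any text has been written through the wrapper.
        output_file.buffer.write(BOM_UTF8)
    output_file.write(text)
    output_file.flush()


def render(argv, text='a,b\n'):
    """Run the sketch against an in-memory stream and return the bytes."""
    raw = io.BytesIO()
    output_file = io.TextIOWrapper(raw, encoding='utf-8', newline='')
    write_output(build_parser().parse_args(argv), output_file, text)
    return raw.getvalue()
```

Calling render(['--add-bom']) yields the BOM followed by the encoded text, while render([]) yields the encoded text alone.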

Tests

I attached an end-to-end test script:
- test-1267.txt
No unit tests were written because the tests use 'StringIO' as the 'input file', but the BOM needs to be written as bytes through a 'TextIOWrapper'.
- If you want, I can implement a conversion in 'CSVToolkit' and 'LazyFile' to enable the tests.
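
One possible shape for such a unit test, as a sketch only: it wraps a BytesIO in a TextIOWrapper so the output has a binary buffer to inspect. The helper make_output and the test names are hypothetical and do not come from the csvkit test suite.

```python
import io
from codecs import BOM_UTF8


def make_output():
    """Return (raw, text): a byte sink and a text wrapper around it."""
    raw = io.BytesIO()
    text = io.TextIOWrapper(raw, encoding='utf-8', newline='')
    return raw, text


def test_bom_is_written_as_bytes():
    raw, text = make_output()
    # Write the BOM to the binary layer first, then the CSV text.
    text.buffer.write(BOM_UTF8)
    text.write('a,b\n')
    text.flush()
    assert raw.getvalue() == BOM_UTF8 + b'a,b\n'
    # Decoding with 'utf-8-sig' strips the BOM again.
    assert raw.getvalue().decode('utf-8-sig') == 'a,b\n'


def test_stringio_has_no_binary_buffer():
    # A plain StringIO holds str only, so there is nowhere to write bytes.
    assert not hasattr(io.StringIO(), 'buffer')
```

The second test illustrates why a 'StringIO' stand-in is not enough on its own for this feature.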
All PyTests and end-to-end tests passed in the following versions:
- Python 3.8.20
- Python 3.9.21
- Python 3.10.16
- Python 3.11.11
- Python 3.12.8
  - Except csvgrep and csvcut, due to a CSVToolkit bug.

Checklist

Unit Testing
End-to-end Testing

Considerations

I found some bugs related to CSVToolkit that affect this implementation. I reported each one: csvgrep: Catastrophic failure in Python 3.12 #1275, csvcut: Catastrophic failure in Python 3.12 #1276, and Ignoring PythonIOEnconing #1277.
The CSVToolkit architecture always outputs UTF-8, even if the PYTHONIOENCODING environment variable is set;
therefore, only the UTF-8 BOM is supported.
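
To illustrate why the BOM is tied to the output encoding, here is a small sketch using the standard library's codecs constants. The BOMS table and the has_matching_bom helper are hypothetical names used only for this example.

```python
import codecs

# Byte order marks differ per encoding; the constants come from the stdlib.
BOMS = {
    'utf-8': codecs.BOM_UTF8,
    'utf-16-le': codecs.BOM_UTF16_LE,
    'utf-16-be': codecs.BOM_UTF16_BE,
}


def has_matching_bom(data, encoding):
    """Return True if data starts with the BOM that belongs to encoding."""
    return data.startswith(BOMS[encoding])


# A UTF-8 BOM in front of UTF-8 text is consistent.
utf8_output = codecs.BOM_UTF8 + 'é,b\n'.encode('utf-8')
# A UTF-8 BOM in front of UTF-16 text would mislabel the content.
mixed_output = codecs.BOM_UTF8 + 'é,b\n'.encode('utf-16-le')
```

Since the output is always UTF-8, a single hard-coded BOM_UTF8 is sufficient; other output encodings would each need their own mark.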

References

Feature Pattern: https://martinfowler.com/articles/feature-toggles.html

Signed-off-by: Álvaro Osvaldo <[email protected]>

jpmckinney · 2025-02-13T21:32:20Z

csvkit/cli.py

@@ -245,6 +248,8 @@ def _init_common_parser(self):
                help='Insert a column of line numbers at the front of the output. Useful when piping to grep or as a '
                     'simple primary key.')

+        AddBOM.argument(self.argparser,self)


Please conform to the existing code, instead of introducing an entirely different code organization pattern.

jpmckinney · 2025-02-13T21:35:49Z

csvkit/cli.py

@@ -134,6 +135,8 @@ def run(self):
        if 'f' not in self.override_flags:
            self.input_file = self._open_input_file(self.args.input_path)

+        AddBOM.run(self.output_file, self.args)


Just inline the 2 lines of code here, instead of creating a new 54-line file:

    if getattr(self.args, 'add_bom', False):
        self.output_file.buffer.write(BOM_UTF8)

And of course do 'from codecs import BOM_UTF8' with the other imports.

feat: Added BOM capability for output files (1267)

e9c8aa1

- Added the '--add-bom' parameter for almost all utilities

Signed-off-by: Álvaro Osvaldo <[email protected]>

alvaro-osvaldo-tm changed the title ~~feat: Added BOM capability for output files (1267)~~ feat: Added BOM capability for output files (#1267) Feb 9, 2025

alvaro-osvaldo-tm mentioned this pull request Feb 9, 2025

Can in2csv add a byte order mark (BOM) so that when opening csv in Excel it correctly formats unicode text? #1267

Open

chore: Fixed type in method

aee71d6

Signed-off-by: Álvaro Osvaldo <[email protected]>

jpmckinney reviewed Feb 13, 2025
