The current output favors preserving as much information as possible from the original JSON, but there is some duplication, and a bunch of columns can be removed as they're rarely useful.
The new --optimized mode will generate CSVs that drop a bunch of columns to save space (exact list to be revised later).
The dropped columns are the ones that are most commonly empty or duplicated, where the missing data can either be inferred from the remaining columns or, in the case of cashtags, hashtags, and mentions, re-extracted from the tweet text with twitter-text, for example.
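For illustration, a minimal sketch of what that re-extraction could look like (`extract_entities` is a hypothetical helper, not part of this project, and the regexes are simplified stand-ins for twitter-text's far more thorough, Unicode-aware rules):

```python
import re

# Simplified stand-ins for twitter-text's entity extraction; they skip many
# Unicode edge cases the real library handles, but show how dropped entity
# columns could be rebuilt from the tweet text alone.
HASHTAG = re.compile(r"(?<!\w)#(\w+)")
MENTION = re.compile(r"(?<!\w)@(\w{1,15})")
CASHTAG = re.compile(r"(?<!\w)\$([A-Za-z]{1,6})")

def extract_entities(text: str) -> dict:
    """Recover hashtag/mention/cashtag columns from the raw tweet text."""
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
        "cashtags": CASHTAG.findall(text),
    }

print(extract_entities("Buy $TSLA now says @elonmusk #stonks"))
# {'hashtags': ['stonks'], 'mentions': ['elonmusk'], 'cashtags': ['TSLA']}
```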
Should probably fix #36 and #47 before tackling this.
I'm interested in hearing where the need for this optimization arose. Was it a problem generating the CSV, or reading the generated CSV in another application? It sounds like the latter?
Just trying to deduplicate columns and remove mostly empty ones, so that more rows fit into memory and other tools like Great Expectations or pandas-profiling have an easier time.
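For example, the sort of column pruning this mode would replace might look like this in pandas (a sketch under assumptions: `tweets.csv` is a placeholder filename and the 1% fill threshold is arbitrary, not anything this project prescribes):

```python
import pandas as pd

def prune_columns(df: pd.DataFrame, min_fill: float = 0.01) -> pd.DataFrame:
    """Drop near-empty columns, then drop columns duplicating an earlier one."""
    # Keep a column only if at least `min_fill` of its values are non-null.
    df = df.loc[:, df.notna().mean() >= min_fill]
    # Transposing lets duplicated() compare whole columns by value.
    return df.loc[:, ~df.T.duplicated()]

df = prune_columns(pd.read_csv("tweets.csv"))
df.info(memory_usage="deep")  # check the smaller footprint before profiling
```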