Is your feature request related to a problem? Please describe.

I would like to use `qsv` with a 14-16 GB dataset on a 16 GB machine, of which the OS occupies 1-2 GB. The file is stored on a slow HDD, so I want all programs to use `mmap`, to avoid reads into anonymous memory evicting cached pages.

Describe the solution you'd like

Detect and accept the `.arrow` extension where currently only `.csv` is expected, and load it using https://docs.rs/polars-arrow/0.45.1/polars_arrow/mmap/fn.mmap_unchecked.html, as sketched below.
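A minimal sketch of what that load path could look like, assuming the polars-arrow 0.45 API linked above together with the memmap2 crate; the file name, error handling, and single-batch access are illustrative, not qsv's actual code:

```rust
// Sketch: zero-copy load of an Arrow IPC file via polars-arrow's mmap module.
// Assumes polars-arrow 0.45 and memmap2 (illustrative, untested).
use std::fs::File;
use std::sync::Arc;

use memmap2::Mmap;
use polars_arrow::io::ipc::read::read_file_metadata;
use polars_arrow::mmap::{mmap_dictionaries_unchecked, mmap_unchecked};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut file = File::open("data.arrow")?;

    // Only the IPC footer is parsed eagerly; the column data stays on disk.
    let metadata = read_file_metadata(&mut file)?;

    // Map the file into the page cache instead of copying it onto the heap,
    // so the OS can evict and refault pages under memory pressure.
    let data = Arc::new(unsafe { Mmap::map(&file)? });

    // SAFETY: the *_unchecked functions skip validation, so the file must be
    // a valid Arrow IPC file and must not be mutated while the mapping lives.
    let dictionaries = unsafe { mmap_dictionaries_unchecked(&metadata, data.clone())? };
    let first_batch = unsafe { mmap_unchecked(&metadata, &dictionaries, data.clone(), 0)? };

    println!("first batch: {} columns", first_batch.arrays().len());
    Ok(())
}
```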
Describe alternatives you've considered

Compression merely lowers the minimum optimal page-cache size and the number of bytes that must be read from disk. Even with a miraculous 90% compression ratio, some swapping overhead remains: the on-disk file would shrink to ~1.5 GB, but the decompressed working set is still ~15 GB against ~14 GB of free RAM.

Parquet with Snappy would still have some memory problems, but is compatible with Pandas. However, it adds the maintenance burden of another parser, whereas I see Arrow support as just skipping the existing parser.

Reading via `mmap`, such as through https://docs.rs/polars-arrow/0.45.1/polars_arrow/io/ipc/read/index.html, allows keeping the dataset in `.arrow`. At least there would be no overhead of conversion to `.csv`, and other programs won't empty the page cache. A sketch of the safe, streaming variant of that read path follows.
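This assumes the arrow2-derived `FileReader::new(reader, metadata, projection, limit)` constructor that the linked module documents; the file name is illustrative:

```rust
// Sketch: buffered, validated read of an Arrow IPC file through
// polars_arrow::io::ipc::read (assumed 0.45 API; not qsv's actual code).
use std::fs::File;
use std::io::BufReader;

use polars_arrow::io::ipc::read::{read_file_metadata, FileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = BufReader::new(File::open("data.arrow")?);
    let metadata = read_file_metadata(&mut reader)?;

    // Iterate record batches lazily: no column projection, no row limit.
    let mut batches = 0usize;
    for batch in FileReader::new(reader, metadata, None, None) {
        let _batch = batch?;
        batches += 1;
    }
    println!("read {batches} record batches");
    Ok(())
}
```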
Additional context

Some stats are probably implemented by mutating the in-memory data, again occupying double the memory. Copy-on-write still provides the benefits of lazy and faster loading.

Polars is already a dependency since #828, and https://github.com/pola-rs/polars-cli exemplifies a CLI that reads both `.csv` and `.arrow`.
```bash
# use Parquet, JSONL and Arrow files in SQL queries
qsv sqlp data.csv "select * from data join read_parquet('data2.parquet') as t2 on data.c1 = t2.c1"
qsv sqlp data.csv "select * from data join read_ndjson('data2.jsonl') as t2 on data.c1 = t2.c1"
qsv sqlp data.csv "select * from data join read_ipc('data2.arrow') as t2 on data.c1 = t2.c1"
qsv sqlp SKIP_INPUT "select * from read_parquet('data.parquet') order by col1 desc limit 100"
qsv sqlp SKIP_INPUT "select * from read_ndjson('data.jsonl') as t1 join read_ipc('data.arrow') as t2 on t1.c1 = t2.c1"
```
I'll investigate having qsv support the Arrow format natively when the polars feature is enabled.

FYI, I'm also looking at Polars for cloud storage support (that's why the cloud Polars feature is in Cargo.toml, but commented out), and I'll fold Arrow in (along with the other formats Polars supports) while doing so (see #2263).
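For context, native support could presumably lean on Polars' lazy IPC scanner rather than a hand-rolled reader. A hypothetical sketch, assuming the polars crate's `LazyFrame::scan_ipc` API; the `is_arrow_path` helper and file names are illustrative, not qsv's actual dispatch:

```rust
// Sketch: extension-based dispatch on top of Polars' lazy scanners
// (requires the polars "lazy", "ipc" and "csv" features; illustrative only).
use std::path::Path;

use polars::prelude::*;

// Hypothetical helper: treat common Arrow IPC extensions as Arrow files.
fn is_arrow_path(path: &Path) -> bool {
    matches!(
        path.extension().and_then(|e| e.to_str()),
        Some("arrow") | Some("ipc") | Some("feather")
    )
}

fn main() -> PolarsResult<()> {
    let path = Path::new("data.arrow");
    let df = if is_arrow_path(path) {
        // Polars can memory-map IPC files, so pages stay in the page cache
        // and can be shared with other readers.
        LazyFrame::scan_ipc(path, ScanArgsIpc::default())?
    } else {
        LazyCsvReader::new(path).finish()?
    }
    .limit(5)
    .collect()?;
    println!("{df}");
    Ok(())
}
```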