
Support almost larger-than-memory datasets in Apache Arrow format #2410

Open · danielzgtg opened this issue Jan 5, 2025 · 1 comment
Labels: enhancement (New feature or request. Once marked with this label, it's in the backlog.)

@danielzgtg
Is your feature request related to a problem? Please describe.
I would like to use qsv with a 14-16 GB dataset on a 16 GB machine, of which the OS occupies 1-2 GB. The file is stored on a slow HDD, so I want all programs to use mmap to avoid reads into anonymous memory that evict cached pages.

Describe the solution you'd like
Detect and accept the .arrow extension where currently only .csv is expected, and load the file using polars_arrow::mmap::mmap_unchecked (https://docs.rs/polars-arrow/0.45.1/polars_arrow/mmap/fn.mmap_unchecked.html).
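
A minimal sketch of what that loading path could look like, assuming polars-arrow keeps arrow2's mmap API (read_file_metadata, mmap_dictionaries_unchecked, mmap_unchecked) and that the memmap2 crate provides the mapping; the exact module paths and field names are assumptions, not a confirmed qsv design:

// Sketch only: assumes polars-arrow mirrors arrow2's mmap API and that
// memmap2 is available. Not an actual qsv code path.
use std::{fs::File, sync::Arc};

use polars_arrow::io::ipc::read::read_file_metadata;
use polars_arrow::mmap::{mmap_dictionaries_unchecked, mmap_unchecked};

fn load_arrow_mmap(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut file = File::open(path)?;
    // Reads only the IPC footer/metadata; the record batches stay on disk.
    let metadata = read_file_metadata(&mut file)?;

    // Map the whole file; the OS faults pages in lazily and can evict
    // them under pressure instead of swapping anonymous memory.
    let mmap = Arc::new(unsafe { memmap2::Mmap::map(&file)? });

    // Safety: the file must be a valid Arrow IPC file and must not be
    // truncated or mutated while mapped.
    let dictionaries = unsafe { mmap_dictionaries_unchecked(&metadata, mmap.clone())? };
    for block in 0..metadata.blocks.len() {
        let chunk = unsafe { mmap_unchecked(&metadata, &dictionaries, mmap.clone(), block)? };
        // ... compute stats over `chunk` without copying it into anonymous memory
        drop(chunk);
    }
    Ok(())
}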

Describe alternatives you've considered

  • Arrow support without mmap, such as via https://docs.rs/polars-arrow/0.45.1/polars_arrow/io/ipc/read/index.html , would still allow keeping the dataset in .arrow (see the sketch after this list). At least there'd be no overhead from conversion to .csv, and other programs wouldn't empty the page cache.
  • Compression merely lowers the minimum useful page cache size and the number of bytes that must be read from disk. Some swapping overhead remains even with a miraculous 90% compression ratio.
    • Snappy (with CSV): interoperating with my other Python scripts would require Snappy support to be added to https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
    • Parquet with Snappy would still have some memory problems but is compatible with Pandas. However, it adds the maintenance burden of another parser, whereas I see Arrow support as just skipping the existing parser.
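
For the first, no-mmap alternative, here is a sketch of what reading the IPC file could look like through Polars' higher-level lazy API, which qsv already pulls in via its polars feature; ScanArgsIpc and its defaults are assumptions about the polars version in use:

// Sketch only: assumes the polars crate's IPC scan API; ScanArgsIpc's
// fields vary between polars versions.
use polars::prelude::*;

fn scan_arrow(path: &str) -> PolarsResult<DataFrame> {
    // Lazy scan: nothing is parsed up front, and predicate/projection
    // pushdown limits how much of the file gets materialized.
    LazyFrame::scan_ipc(path, ScanArgsIpc::default())?
        .limit(100)
        .collect()
}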

Additional context
Some stats are probably implemented by mutating the in-memory data, which again doubles memory usage. Even so, copy-on-write mapping still provides the benefits of lazy and faster loading.

Polars has already been a dependency since #828. https://github.com/pola-rs/polars-cli exemplifies a CLI that reads both .csv and .arrow.

@jqnatividad (Collaborator) commented Jan 5, 2025

Thanks for the detailed request, @danielzgtg.

BTW, sqlp was actually patterned after polars-cli and also supports reading Arrow files.

qsv/src/cmd/sqlp.rs (lines 113 to 118 in 0c8a827):

# use Parquet, JSONL and Arrow files in SQL queries
qsv sqlp data.csv "select * from data join read_parquet('data2.parquet') as t2 ON data.c1 = t2.c1"
qsv sqlp data.csv "select * from data join read_ndjson('data2.jsonl') as t2 on data.c1 = t2.c1"
qsv sqlp data.csv "select * from data join read_ipc('data2.arrow') as t2 ON data.c1 = t2.c1"
qsv sqlp SKIP_INPUT "select * from read_parquet('data.parquet') order by col1 desc limit 100"
qsv sqlp SKIP_INPUT "select * from read_ndjson('data.jsonl') as t1 join read_ipc('data.arrow') as t2 on t1.c1 = t2.c1"

I'll investigate having qsv support the Arrow format natively when the polars feature is enabled.
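
One hypothetical shape for that native support, purely illustrative: dispatch on the input extension before falling back to the existing CSV path. None of these function names come from the qsv codebase:

// Hypothetical sketch of extension-based dispatch; not qsv's actual code.
use std::path::Path;

use polars::prelude::*;

fn scan_input(path: &Path) -> PolarsResult<LazyFrame> {
    match path.extension().and_then(|e| e.to_str()) {
        // Arrow IPC inputs go straight to Polars' IPC scanner.
        Some("arrow" | "ipc" | "feather") => {
            LazyFrame::scan_ipc(path, ScanArgsIpc::default())
        }
        // Everything else falls back to the existing CSV reader.
        _ => LazyCsvReader::new(path).finish(),
    }
}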

FYI, I'm also looking at Polars to get cloud storage support (that's why the cloud Polars feature is in Cargo.toml, but commented out), and I'll bundle Arrow in (along with the other formats Polars supports) while doing so (see #2263).

@jqnatividad added the enhancement label on Jan 5, 2025