
Support almost larger-than-memory datasets in Apache Arrow format #2410

Open · danielzgtg opened this issue Jan 5, 2025 · 1 comment
Labels: enhancement (New feature or request. Once marked with this label, it's in the backlog.)

@danielzgtg
Is your feature request related to a problem? Please describe.
I would like to use qsv with a 14-16 GB dataset on a 16 GB machine, of which the OS occupies 1-2 GB. The file is stored on a slow HDD, so I want all programs to use mmap to avoid reads into anonymous memory that evict cached pages.

Describe the solution you'd like
Detect and accept the .arrow extension where currently only .csv is expected, and load the file using polars_arrow::mmap::mmap_unchecked (https://docs.rs/polars-arrow/0.45.1/polars_arrow/mmap/fn.mmap_unchecked.html).
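
A minimal sketch of what that loading path could look like, assuming polars-arrow keeps arrow2's mmap API (read_file_metadata, mmap_dictionaries_unchecked, mmap_unchecked) and that the memmap2 crate provides the mapping; the exact module paths and field names are assumptions, not a confirmed qsv design:

// Sketch only: assumes polars-arrow mirrors arrow2's mmap API and that
// memmap2 is available. Not an actual qsv code path.
use std::{fs::File, sync::Arc};

use polars_arrow::io::ipc::read::read_file_metadata;
use polars_arrow::mmap::{mmap_dictionaries_unchecked, mmap_unchecked};

fn load_arrow_mmap(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut file = File::open(path)?;
    // Reads only the IPC footer/metadata; the record batches stay on disk.
    let metadata = read_file_metadata(&mut file)?;

    // Map the whole file; the OS faults pages in lazily and can evict
    // them under pressure instead of swapping anonymous memory.
    let mmap = Arc::new(unsafe { memmap2::Mmap::map(&file)? });

    // Safety: the file must be a valid Arrow IPC file and must not be
    // truncated or mutated while mapped.
    let dictionaries = unsafe { mmap_dictionaries_unchecked(&metadata, mmap.clone())? };
    for block in 0..metadata.blocks.len() {
        let chunk = unsafe { mmap_unchecked(&metadata, &dictionaries, mmap.clone(), block)? };
        // ... compute stats over `chunk` without copying it into anonymous memory
        drop(chunk);
    }
    Ok(())
}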

Describe alternatives you've considered

  • Arrow support without mmap, such as via https://docs.rs/polars-arrow/0.45.1/polars_arrow/io/ipc/read/index.html , would still allow keeping the dataset in .arrow (see the sketch after this list). At least there'd be no overhead from conversion to .csv, and other programs wouldn't empty the page cache.
  • Compression merely lowers the minimum useful page cache size and the number of bytes that must be read from disk. Some swapping overhead remains even with a miraculous 90% compression ratio.
    • Snappy (with CSV): interoperating with my other Python scripts would require Snappy support to be added to https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
    • Parquet with Snappy would still have some memory problems but is compatible with Pandas. However, it adds the maintenance burden of another parser, whereas I see Arrow support as just skipping the existing parser.
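
For the first, no-mmap alternative, here is a sketch of what reading the IPC file could look like through Polars' higher-level lazy API, which qsv already pulls in via its polars feature; ScanArgsIpc and its defaults are assumptions about the polars version in use:

// Sketch only: assumes the polars crate's IPC scan API; ScanArgsIpc's
// fields vary between polars versions.
use polars::prelude::*;

fn scan_arrow(path: &str) -> PolarsResult<DataFrame> {
    // Lazy scan: nothing is parsed up front, and predicate/projection
    // pushdown limits how much of the file gets materialized.
    LazyFrame::scan_ipc(path, ScanArgsIpc::default())?
        .limit(100)
        .collect()
}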

Additional context
Some stats are probably implemented by mutating the in-memory data, which again doubles memory usage. Even so, copy-on-write mapping still provides the benefits of lazy and faster loading.

Polars has already been a dependency since #828. https://github.com/pola-rs/polars-cli exemplifies a CLI that reads both .csv and .arrow.

@jqnatividad (Collaborator) commented Jan 5, 2025

Thanks for the detailed request, @danielzgtg.

BTW, sqlp was actually patterned after polars-cli and also supports reading Arrow files.

qsv/src/cmd/sqlp.rs (lines 113 to 118 in 0c8a827):

# use Parquet, JSONL and Arrow files in SQL queries
qsv sqlp data.csv "select * from data join read_parquet('data2.parquet') as t2 ON data.c1 = t2.c1"
qsv sqlp data.csv "select * from data join read_ndjson('data2.jsonl') as t2 on data.c1 = t2.c1"
qsv sqlp data.csv "select * from data join read_ipc('data2.arrow') as t2 ON data.c1 = t2.c1"
qsv sqlp SKIP_INPUT "select * from read_parquet('data.parquet') order by col1 desc limit 100"
qsv sqlp SKIP_INPUT "select * from read_ndjson('data.jsonl') as t1 join read_ipc('data.arrow') as t2 on t1.c1 = t2.c1"

I'll investigate having qsv support the Arrow format natively when the polars feature is enabled.
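
One hypothetical shape for that native support, purely illustrative: dispatch on the input extension before falling back to the existing CSV path. None of these function names come from the qsv codebase:

// Hypothetical sketch of extension-based dispatch; not qsv's actual code.
use std::path::Path;

use polars::prelude::*;

fn scan_input(path: &Path) -> PolarsResult<LazyFrame> {
    match path.extension().and_then(|e| e.to_str()) {
        // Arrow IPC inputs go straight to Polars' IPC scanner.
        Some("arrow" | "ipc" | "feather") => {
            LazyFrame::scan_ipc(path, ScanArgsIpc::default())
        }
        // Everything else falls back to the existing CSV reader.
        _ => LazyCsvReader::new(path).finish(),
    }
}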

FYI, I'm also looking at Polars to get cloud storage support (that's why the cloud Polars feature is in Cargo.toml, but commented out), and I'll bundle Arrow in (along with the other formats Polars supports) while doing so (see #2263).

@jqnatividad added the enhancement label on Jan 5, 2025