Releases: lancedb/lance
v0.3.12 Upgrade arrow-rs and bug fixes
- Upgraded arrow-rs dependency to 33.0 (Waiting on datafusion for 34.0 upgrade).
- Nested Dictionary fields are now parsed and written correctly.
- More progress towards OPQ implementation.
What's Changed
- Matrix mul and transpose by @eddyxu in #661
- Recursively set dictionaries in struct fields by @gsilvestrin in #662
- Upgrading arrow version to 33.0 by @gsilvestrin in #665
- [Rust] sampling over matrix. by @eddyxu in #666
- Sorting dataset versions by @gsilvestrin in #668
Full Changelog: v0.3.11...v0.3.12
v0.3.11 Bug fix release
Bug fix for reading variable length list arrays (welcome @gsilvestrin).
We're working on Windows support (welcome to @dnsco) and an OPQ implementation for the vector index, so stay tuned!
What's Changed
- Windows support by @dnsco in #651
- Trigger CI when workflow changes by @changhiskhan in #653
- Compute SVD by @eddyxu in #658
- Fix offsets when reading arrays by @gsilvestrin in #657
New Contributors
- @dnsco made their first contribution in #651
- @gsilvestrin made their first contribution in #657
Full Changelog: v0.3.10...v0.3.11
v0.3.10 Easier debugging for vector index
You can now bypass the ANN index, even when one is available, and run vector search using brute force. This helps with debugging ANN results. Note that SIMD still applies during brute-force search.
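As a rough illustration of why an exact search helps with debugging, the sketch below computes exact nearest neighbors by brute force and measures how many of them an approximate result recovered. It is plain Python, not the Lance API: `brute_force_knn`, `recall`, and the `ann_result` ids are hypothetical names and made-up values used only for this example.

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_knn(vectors, query, k):
    """Return the indices of the k vectors closest to query (exact search)."""
    order = sorted(range(len(vectors)), key=lambda i: l2_distance(vectors[i], query))
    return order[:k]

def recall(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result recovered."""
    exact = set(exact_ids)
    return len(exact & set(approx_ids)) / len(exact)

vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
query = [0.1, 0.0]

exact = brute_force_knn(vectors, query, k=2)

# Pretend an ANN index returned these ids (made up for illustration).
ann_result = [0, 2]

print(exact, recall(ann_result, exact))
```

A recall well below 1.0 on a sample of queries suggests the index, rather than the data or the query, is the source of unexpected results.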
What's Changed
- [Bug] Fix passing metric type during PQ index building by @eddyxu in #644
- [python] Allow user to bypass ANN index and search using brute-force … by @changhiskhan in #645
- expand tilde paths in python by @ananis25 in #621
- Fix binary encoder handling array buffer slicing by @eddyxu in #649
Full Changelog: v0.3.9...v0.3.10
v0.3.9 limited python support for predicate pushdown
By default, pyarrow compute Expressions don't serialize to SQL strings. This patch release enables a limited set of filter pushdowns via Python. Supported syntax:
- field references
- Operators: > < >= <= = == !=
- conjunctions / disjunctions
This enables querying via duckdb without needing to load the whole dataset into memory first.
e.g., duckdb.query("SELECT * FROM dataset WHERE id=5")
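To make the benefit concrete, here is a minimal sketch of the idea behind filter pushdown: the predicate is applied while batches are scanned, so only matching rows are ever materialized. This is a toy model in plain Python, not how Lance or duckdb implement it; `scan` and the in-memory `batches` are hypothetical stand-ins for an on-disk dataset.

```python
def scan(batches, predicate=None):
    """Yield rows batch by batch, applying the predicate during the scan."""
    for batch in batches:
        for row in batch:
            if predicate is None or predicate(row):
                yield row

# Two small "batches" standing in for on-disk data.
batches = [
    [{"id": 1, "score": 0.2}, {"id": 5, "score": 0.9}],
    [{"id": 7, "score": 0.4}, {"id": 5, "score": 0.1}],
]

# Equivalent in spirit to: WHERE id = 5 AND score >= 0.5
matches = list(scan(batches, lambda r: r["id"] == 5 and r["score"] >= 0.5))
print(matches)
```

Because `scan` is a generator, rows that fail the predicate are dropped as each batch is read instead of being collected first and filtered afterwards.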
What's Changed
Full Changelog: v0.3.8...v0.3.9
v0.3.8 Improved random access for non-numeric columns and duckdb extension
You can now query lance datasets outside of python using duckdb! Thanks to @dacort for making the lance extension play nice with duckdb. dbt-duckdb-lance anyone? You can find the extension under integration/duckdb_lance.
We're also excited to release a substantial performance optimization for random access on non-numeric columns.
Previously, if you wanted to fetch a string or blob column along with nearest neighbor search results, the non-optimized binary decoder take could add up to 5-20x latency overhead, depending on the sparsity of the indices. In this release we've optimized the take performance so this is basically a free operation.
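For readers unfamiliar with how variable-length columns are laid out, the sketch below shows an Arrow-style binary layout (one data buffer plus an offsets list) and a `take` that slices out only the requested rows. It illustrates the general technique under that assumed layout; it is not Lance's actual decoder, and `encode_binary` and `take` are hypothetical helper names.

```python
def encode_binary(values):
    """Pack byte strings into one data buffer plus an offsets list."""
    offsets = [0]
    data = bytearray()
    for v in values:
        data.extend(v)
        offsets.append(len(data))
    return bytes(data), offsets

def take(data, offsets, indices):
    """Fetch only the requested rows by slicing the buffer with their offsets."""
    return [data[offsets[i]:offsets[i + 1]] for i in indices]

data, offsets = encode_binary([b"cat", b"", b"horse", b"ox"])

# Row i occupies data[offsets[i]:offsets[i + 1]], so each lookup is direct.
picked = take(data, offsets, [2, 0, 3])
print(offsets, picked)
```

With this layout the cost of a take depends on how many rows are requested and how scattered they are, which is why the sparsity of the indices matters.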
While most of the work in Rust is completed for filter pushdown, we've had to delay the general release of this feature until we're able to overcome some rough edges in making pyarrow compute Expressions play nice with datafusion and sqlparser-rs. It'll be worth the wait, though, we promise!
Cosine similarity is shipped but the recall performance is lower, due to some issues during index creation. We recommend that you stick with the default L2 distance metric until we address this in the coming few releases.
We'd love to hear from you!
What's Changed
- Update extension for v0.7.0 compatibility by @dacort in #599
- Remove -j from DuckDB build script by @changhiskhan in #601
- a minor preparatory refactor by @changhiskhan in #598
- fix gha duckdb trigger paths by @changhiskhan in #602
- Use MetricType to specify the metric / distance compute function by @eddyxu in #600
- [Python] Specify metric type in Dataset.create_index by @eddyxu in #603
- [Rust] Implement a datafusion phyiscal expr Column that can reads nested columns by @eddyxu in #610
- benchmark query performance on 768D vectors by @changhiskhan in #607
- Parse sql filter clause to create datafusion physical expression by @eddyxu in #609
- Schema exclude fields by @eddyxu in #613
- Exec filter during Scan by @eddyxu in #612
- workaround to prevent the segfault until we figure out the real problem by @changhiskhan in #616
- Improve random access on binary encoding by @eddyxu in #615
- [Python] Support filter pushdown from Python Dataset API by @eddyxu in #618
- refactor benchmark to use cosine similarity by @changhiskhan in #611
- Encoding shared slices of arrays. by @eddyxu in #620
- Fix plain encoding by @eddyxu in #622
- Fix crash with column projection with ann search by @eddyxu in #624
- Relax data type matching float numbers in filter pushdown by @eddyxu in #625
- python integration tests for vector index by @changhiskhan in #623
- Remove filter pushdown from python api for now by @changhiskhan in #628
- PlainDecoder take on boolean values by @eddyxu in #627
- remove debug prints by @changhiskhan in #633
- Scan node to detect channel close and gracefully break the scan. by @eddyxu in #635
New Contributors
Full Changelog: v0.3.7...v0.3.8
v0.3.7 Duration and Null support
Thanks @ananis25 for implementing Lance support for Duration and Null arrow arrays!
We've also completed the core implementation of cosine distance (with SIMD) and refactored the distance functions to be pluggable. The next release will expose this as a public API in Rust and Python.
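A minimal sketch of what "pluggable" distance functions can look like, and of how L2 and cosine distance can disagree about the nearest neighbor. This is plain Python for illustration only; the `METRICS` table and `nearest` helper are hypothetical and do not reflect the Rust implementation or its SIMD code.

```python
import math

def l2(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine distance: 1 - cos(angle). Assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

# The metric is looked up by name, so adding one means adding an entry here.
METRICS = {"l2": l2, "cosine": cosine}

def nearest(vectors, query, metric="l2"):
    """Index of the vector closest to query under the chosen metric."""
    dist = METRICS[metric]
    return min(range(len(vectors)), key=lambda i: dist(vectors[i], query))

vectors = [[1.0, 0.0], [10.0, 10.0]]
query = [3.0, 3.0]

print(nearest(vectors, query, "l2"), nearest(vectors, query, "cosine"))
```

Here L2 prefers the first vector because it is closer in space, while cosine prefers the second because it points in the same direction as the query.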
What's Changed
- support arrow-rs duration type in lance by @ananis25 in #590
- Don't inherit the index_section if mode is Overwrite by @changhiskhan in #592
- [Rust] Cosine Distance by @eddyxu in #595
- Refactor L2 distance into a separate mod by @eddyxu in #596
- support arrow-rs null array in lance by @ananis25 in #594
- refactor Scanner::try_into_stream by @changhiskhan in #597
Full Changelog: v0.3.6...v0.3.7
v0.3.6 Time travel
Welcome to @ananis25 and @yah01 !
This release enables time travel capability allowing you to check out the latest version as of a certain date and time.
We've refactored the query and index creation code to make room for multiple distance metrics.
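The time travel lookup amounts to finding the newest version whose timestamp is not after the requested time. The sketch below shows that selection with a binary search over sorted timestamps. It assumes versions are available as `(timestamp, version_id)` pairs; `version_as_of` and the sample data are hypothetical and are not the Lance API.

```python
from bisect import bisect_right
from datetime import datetime

def version_as_of(versions, as_of):
    """Pick the latest version created at or before as_of.

    versions: list of (timestamp, version_id) tuples sorted by timestamp.
    """
    timestamps = [ts for ts, _ in versions]
    pos = bisect_right(timestamps, as_of)
    if pos == 0:
        raise ValueError("no version exists at or before the requested time")
    return versions[pos - 1][1]

# Made-up version history for illustration.
versions = [
    (datetime(2023, 2, 1, 9, 0), 1),
    (datetime(2023, 2, 3, 12, 0), 2),
    (datetime(2023, 2, 10, 8, 30), 3),
]

picked = version_as_of(versions, datetime(2023, 2, 5))
print(picked)
```

Using `bisect_right` means a request that lands exactly on a version's timestamp returns that version rather than the one before it.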
What's Changed
- Refactor distance computation to allow pluggable dist function by @eddyxu in #589
- Add an argument to checkout a dataset as of a certain timestamp by @ananis25 in #585
New Contributors
Full Changelog: v0.3.5...v0.3.6
v0.3.5 Fast take and Decimal{128, 256} support
What's Changed
- Add take() to retrieve rows by indices by @yah01 in #562
- add crateio badge by @changhiskhan in #580
- quick update for main Rust readme by @changhiskhan in #583
- Fix version timestamp issue by @changhiskhan in #582
- Decimal128 and Decimal256 support by @changhiskhan in #584
Full Changelog: v0.3.4...v0.3.5
v0.3.4 Bug fixes and ergonomics
This is a minor release with bug fixes, documentation, and ergonomic improvements for vector indices.
Welcome our newest contributor @yah01.
What's Changed
- Support converting Boolean Array in DuckDB by @eddyxu in #554
- [DuckDB] Fix scan column projection in duckdb extension by @eddyxu in #555
- [DuckDB] add a missing boolean array check by @eddyxu in #556
- Use Datafusion physical ExecutionPlan node as I/O exec node by @eddyxu in #558
- Fix missing duckdb symbol on Linux by @eddyxu in #560
- Improve take_rows() performance by binary search by @yah01 in #564
- Use the released crate for DataFusion by @eddyxu in #570
- Update duckdb version. Closes #569 by @changhiskhan in #571
- [python] Add version to python package. Closes #561 by @changhiskhan in #572
- utility to convert to pyarrow table with vectors by @changhiskhan in #574
- Clippy fixes by @eddyxu in #579
- Fix IVF_PQ index creation crash if one of the cluster has no value. by @eddyxu in #575
- Enforce SIMD alignment during index creation by @changhiskhan in #578
New Contributors
Preview
- We're most of the way there for a DuckDB extension to read Lance datasets natively (i.e., without python)
- We're integrating datafusion to enable pushdown of filter predicates
Full Changelog: v0.3.2...v0.3.4
v0.3.2 Speed up index creation by more than 60x
We discovered two things making index creation unnecessarily slow, and changed both:
- Instead of using KMeans++ to initialize, just do random initialization
- Turn off BLAS on macOS because it turned out to be super slow
On a MacBook Air, index creation on sift1m goes from 25 min to 24 s.
On Ubuntu, it's roughly a 6x speedup.
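For context, random initialization simply picks k data points as the starting centroids, whereas KMeans++ makes a pass over the data for each new centroid to weight candidates by distance. The sketch below shows the random variant and checks the MacBook Air speedup quoted above. `random_init` is a hypothetical helper in plain Python, not the code used in Lance.

```python
import random

def random_init(vectors, k, seed=None):
    """Choose k distinct data points uniformly at random as initial centroids."""
    rng = random.Random(seed)
    return rng.sample(vectors, k)

# Toy data: 100 two-dimensional points.
vectors = [[float(i), float(i % 7)] for i in range(100)]
centroids = random_init(vectors, k=3, seed=0)

# Speedup from the figures in this release: 25 minutes down to 24 seconds.
speedup = (25 * 60) / 24
print(len(centroids), speedup)
```

The speedup works out to 62.5x, consistent with the "more than 60x" in the release title.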
What's Changed
- Remove duckdb extension based on c++ codebase. by @eddyxu in #530
- Reader honor batch size by @eddyxu in #532
- Rust based duckdb extension by @eddyxu in #529
- pretty-print: add pretty print by @AsadullahFarooqi in #469
- Add unit tests to verify the read of struct by @eddyxu in #537
- [DuckDB] Support Struct, and a separate crate for re-usable components to build duckdb extensions by @eddyxu in #534
- [Rust] Fix main branch CI failure and bump arrow version by @eddyxu in #538
- metrics for vector index by @changhiskhan in #535
- [DuckDB] Add list support. by @eddyxu in #540
- Update README for vector index copy by @jaichopra in #542
- Support writing pandas dataframes directly by @changhiskhan in #543
- Improve IVF_PQ index creation performance by @eddyxu in #545
- update metrics by @changhiskhan in #547
- Fix duckdb CI failure by @eddyxu in #546
- Remove Mac accelerate code by @eddyxu in #548
- update docs by @changhiskhan in #549
- Set intel mac release version to 10.15 by @eddyxu in #551
- Test cli in Github Action by @eddyxu in #550
- Changhiskhan/nb docs by @changhiskhan in #552
- minor fix for GHA and github by @changhiskhan in #553
New Contributors
- @AsadullahFarooqi made their first contribution in #469
Full Changelog: v0.3.1...v0.3.2