Releases: lancedb/lance
v0.3.1 Index creation tool
We added an index creation tool that's 2x faster than FAISS.
It is accessible from Python via Dataset.create_index.
What's Changed
- Add unit test for Dataset::take_rows by @eddyxu in #523
- Create index API from Python by @eddyxu in #524
- Implement a kmean optimized for Arrow-backed vectors in pure rust. by @eddyxu in #525
- Reimplement IVF_PQ index. by @eddyxu in #519
- Use pure rust kmean in IVF and PQ by @eddyxu in #526
- update release actions by @changhiskhan in #527
- [rust] make scanning in order configurable by @changhiskhan in #528
Full Changelog: v0.3.0...v0.3.1
v0.3.0 Rusty Lances and Friendly Neighbors
Sayonara C++, bonjour Rust
What started out as a holiday hack has become a full-blown Rust rewrite.
As we say farewell to our much beloved C++ implementation, we welcome a major new feature to Lance: the vector index.
- Lance's vector index is fast and has a small memory footprint. Reading from disk, we benchmarked average latencies of 1 ms for 1M vectors on a stock MacBook Air.
- Your data, vectors, and index can live in harmony under one roof so you don't need to manage a separate index or service.
- You can choose to manage and retrieve additional features alongside the vectors with very little performance impact.
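For readers new to partitioned vector indices, here is a minimal pure-Python sketch of the general IVF idea that the IVF_PQ and nprobes changes in this release relate to. It is an illustration only, not Lance's implementation: every function name is hypothetical, randomly sampled rows stand in for trained k-means centroids, and product quantization and the refine stage are left out entirely.

```python
import random


def l2(a, b):
    """Squared Euclidean (L2) distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors, query, k):
    """Exact search: score every vector and keep the k closest row ids."""
    return sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))[:k]


def build_ivf(vectors, centroids):
    """Assign each row id to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for row_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(centroids[c], vec))
        partitions[nearest].append(row_id)
    return partitions


def ivf_search(vectors, centroids, partitions, query, k, nprobes):
    """Approximate search: scan only the nprobes partitions closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(centroids[c], query))[:nprobes]
    candidates = [row_id for c in probe for row_id in partitions[c]]
    return sorted(candidates, key=lambda i: l2(vectors[i], query))[:k]


def recall(approx, exact):
    """Fraction of the exact top-k that the approximate search also returned."""
    return len(set(approx) & set(exact)) / len(exact)


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(500)]
centroids = random.sample(vectors, 10)  # assumption: stand-in for k-means centroids
partitions = build_ivf(vectors, centroids)
query = [random.random() for _ in range(8)]

exact = flat_search(vectors, query, k=5)
approx = ivf_search(vectors, centroids, partitions, query, k=5, nprobes=3)
full = ivf_search(vectors, centroids, partitions, query, k=5, nprobes=len(centroids))
print("recall with nprobes=3:", recall(approx, exact))
print("recall with all partitions:", recall(full, exact))
```

Raising nprobes scans more partitions, which trades latency for recall. Probing every partition degenerates to the exact flat search.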
What's Changed
- Only increase cursor if file success to write by @eddyxu in #435
- GHA to add python 3.11 (and upgrade to duckdb 0.6.1) by @changhiskhan in #434
- ScannerStream accepts early stop by @eddyxu in #437
- upgrade arrow-rs to 31.0 by @eddyxu in #438
- L2 distance by @eddyxu in #439
- Create DataFragment and DataFile during Dataset write process by @eddyxu in #440
- Rust Dataset Write API by @eddyxu in #441
- [Rust] Read Partially from a plain encoded batch by @eddyxu in #443
- Get range in var-binary encoding by @eddyxu in #444
- Productionlize Flat Index by @eddyxu in #442
- Make Scan an ExecNode by @eddyxu in #445
- Take record by Row ID by @eddyxu in #446
- Implement Take for dictionary decoder. by @eddyxu in #447
- Merge two RecordBatch by @eddyxu in #449
- Integrate flat index by @eddyxu in #448
- Support limit offset as ExecNode by @changhiskhan in #450
- Read IVF_PQ index by @eddyxu in #451
- Cli to operate on dataset indices by @eddyxu in #452
- [RUST] python (re)integration v1 by @changhiskhan in #436
- Support writing dictionary values (at the dataset level). by @eddyxu in #454
- Replace ObjectReader as a pub trait. by @eddyxu in #459
- [Rust] Implement LocalObjectReader that holds an open file to improve performance. by @eddyxu in #460
- inherit from pyarrow Dataset/Scanner by @changhiskhan in #462
- [RUST] Flat index benchmark by @eddyxu in #461
- Generate spotify dataset with embeddings. by @eddyxu in #453
- Fix pylance typo and float32 array conversion. by @eddyxu in #463
- Write index metadata with a new version by @eddyxu in #466
- [rust] fix projection in Dataset:take_rows by @changhiskhan in #464
- blas feature flag by @changhiskhan in #467
- Sift dataset generation by @eddyxu in #472
- Improve scan perf by re-enable prefetching in ScanNode by @eddyxu in #473
- Changhiskhan/new docs by @changhiskhan in #474
- Fix AVX and NEON L2 distance computation. by @eddyxu in #476
- add recall metric computation by @changhiskhan in #475
- Fix reader assertion on manifest buffer size by @eddyxu in #478
- [Rust] Minimal dataset append support by @eddyxu in #482
- Pass nprobes parameter from python by @changhiskhan in #480
- add a test_dataset function to compute the recall for lance by @changhiskhan in #479
- Split sparse index read into chunks based on optimal I/O size for the media by @eddyxu in #483
- Fix codespace prebuild by @eddyxu in #485
- Make ObjectReader prefetch size configurable by @eddyxu in #486
- Add a refine stage for vector search by @eddyxu in #488
- add nprobes as parameter to benchmark by @changhiskhan in #484
- refine factor by @changhiskhan in #489
- Use ordered buffer in plain decoder by @eddyxu in #493
- New rust+pyo3 based pylance by @eddyxu in #494
- Fast count rows by @eddyxu in #490
- Count rows in python dataset, and setup GHA again by @eddyxu in #495
- Sayonara C++ by @eddyxu in #497
- [Rust] Dataset Overwrite, and Version Checkout by @eddyxu in #496
- Load S3 credentials using default credentials chain by @eddyxu in #498
- Fix doc build by @eddyxu in #499
- File format spec by @eddyxu in #500
- Doc build fix by @eddyxu in #501
- Schema evolution document by @eddyxu in #503
- update the python readme for pypi by @changhiskhan in #504
- Handle null strings for both cases where nullability is set or not. by @eddyxu in #509
- update main github readme by @changhiskhan in #508
- [python] write_dataset returns new dataset by @changhiskhan in #517
- Changhiskhan/list versions by @changhiskhan in #516
- Refine Factor is None by default by @eddyxu in #518
Full Changelog: v0.2.9...v0.3.0
v0.2.9 pandas extension type for inline images
We've also started to implement Lance in Rust. A new kickass vector indexing feature will be coming soon, once we do some more cleanup and hook the Rust module back into Python.
What's Changed
- [DuckDB] Add macro to check window size by @eddyxu in #395
- [pandas] Add pandas extension type for ImageBinary by @changhiskhan in #398
- python 3.11 is updating and causing error by @changhiskhan in #397
- [RUST] Initialize read support in Rust. by @eddyxu in #401
- Add missing logical type conversions by @eddyxu in #404
- [RUST] Schema projection by @eddyxu in #403
- [RUST] Data file reader by @eddyxu in #402
- [Rust] Decoder for dictionary encoding by @eddyxu in #406
- [Rust] Support full scan for BooleanArray by @changhiskhan in #407
- [Rust] Basic reading support for nested fields. by @eddyxu in #408
- Add unit tests for all supported primitive types by @changhiskhan in #409
- [RUST] Binary encoder and null support. by @eddyxu in #411
- [Rust] Fix Cargo publish by @eddyxu in #410
- [RUST] Large binary support by @eddyxu in #412
- Add support for fixed size list by @changhiskhan in #413
- Jaichopra/nuscenes converter by @jaichopra in #364
- Add Support for Fixed Size Binary Full scan by @changhiskhan in #414
- Bare minimal scanner in Rust by @eddyxu in #415
- Set field IDs. by @eddyxu in #417
- [Rust] Read/Write Protobuf-backed struct directly from file or buffers. by @eddyxu in #418
- [Rust] Lance File Writer by @eddyxu in #419
- [Rust] Write dictionary data by @eddyxu in #420
- [RUST] Write List/LargeList/FixedSizeList/FixedSizeBinary by @eddyxu in #421
- fix byte range and iterator bug by @changhiskhan in #422
- Fix dict order in logical type to be consistent with C++ by @eddyxu in #425
- Limits notebook GHA to only run when C++ / Python changes. by @eddyxu in #427
- Implement futures::Stream for Scanner by @eddyxu in #426
- Append column to RecordBatch by @eddyxu in #429
- [Rust] Read batch with rowid as a meta column. by @eddyxu in #430
- [RUST] argmin and argmax kernel for numeric array by @eddyxu in #432
Full Changelog: v0.2.8...v0.2.9
v0.2.8 Happy Holidays!
This release contains the following:
- A full-fledged ML data quality improvement workflow using Lance showing model performance insights, detecting mislabels, and doing active learning. An experimental integration with Label Studio is demonstrated as well.
- A critical bug fix for reading and writing dictionary columns
- An ImageNet dataset converter
What's Changed
- [BUG] Fix reading version aux data reading and writing by @eddyxu in #384
- [Benchmark] upload scripts for coco / imagenet benchmark dataset by @eddyxu in #385
- Closes #387 by @changhiskhan in #388
- Data quality notebook and associated code by @changhiskhan in #389
- [DUCKDB] Do not build PyTorch by default by @eddyxu in #392
- brew pin python by @changhiskhan in #391
- fix off by one error using negative indices for diff'ing by @changhiskhan in #383
- Fix GHA for duckdb extension by @changhiskhan in #394
- [DUCKDB] Add a Derivative macro by @eddyxu in #393
- [Benchmark] Create imagenet from raw dataset by @eddyxu in #386
- Various fixes for imagenet and fmt changes by @changhiskhan in #396
Full Changelog: v0.2.7...v0.2.8
v0.2.7 Dataset Diff and Metrics computation, and Dataset Version Metadata
What's Changed
- create and update tarball for pets by @changhiskhan in #372
- [C++] Sanity check to verify column does not overlap when merging a new table by @eddyxu in #375
- update notebooks so s3 credentials are not required by @changhiskhan in #376
- Add function to get version as of a certain date. Also formatting by @changhiskhan in #378
- convenience for comparing metrics across versions by @changhiskhan in #379
- Changhiskhan/datadiff by @changhiskhan in #380
- Refactor dataset diff and compute metric by @changhiskhan in #381
- [C++] Attach new schema update when update dataset by @eddyxu in #374
Full Changelog: v0.2.6...v0.2.7
v0.2.6 Schema evolution bug fixes, Google Colab support, and more datasets
What's Changed
- [C++] Remove unused Reader APIs by @eddyxu in #344
- [Python] fix timezone issue with version timestamp by @changhiskhan in #345
- [C++] add Dataset::Make(string) API by @eddyxu in #346
- [DUCKDB] Native duckdb lance reader by @eddyxu in #347
- [DUCKDB] Read a special version of dataset by @eddyxu in #350
- [DUCKDB] Fix duckdb manylinux build by @eddyxu in #351
- [Python] Add colab badge to notebooks by @eddyxu in #354
- [Notebook] ML dev cycle for DINO by @eddyxu in #355
- [DUCKDB] fix type mapping for other int types by @changhiskhan in #359
- [Python] Fix lance.dataset open local related path by @eddyxu in #365
- [C++] Store relative path for data files by @eddyxu in #368
- [C++] Add RAII util (defer) to auto cleanup / close resources after exiting the scope by @eddyxu in #369
- [Python] Convert of ImageNet 1K into Lance dataset by @eddyxu in #366
- [Python] Imagenet data quality analytics notebook by @eddyxu in #370
Full Changelog: v0.2.5...v0.2.6
v0.2.5 Schema evolution, support merging with arrow Table
What's Changed
- [DOC] Fix notebook build by @eddyxu in #339
- [Python] lance.write_dataset takes pandas DataFrame by @eddyxu in #342
- [DOC] update readme docs to cater for import pathways from df/parquet by @jaichopra in #340
- [Python] Improve PyTorch dataset ergonomic by @eddyxu in #336
- [C++] Add columns from in-memory table by @eddyxu in #337
- [Python] append column with a in-memory Pyarrow Table by @eddyxu in #338
- [C++][Python] Add timestamp to each manifest version. by @eddyxu in #343
Full Changelog: v0.2.4...v0.2.5
v0.2.4: Schema Evolution and Append Column
Support Schema Evolution via Append Column.
What's Changed
- [Notebook] fixes for notebook backing the blog post by @changhiskhan in #316
- [C++] Append column by @eddyxu in #299
- [Python] Append columns by @eddyxu in #318
- Use column projection during update by @eddyxu in #322
- update to duckdb 0.6 by @changhiskhan in #312
- [Python] Support add column via Expression. by @eddyxu in #324
- [Python] Expose projection for append column by @eddyxu in #325
- [C++] Support column projection during add_columns via expression by @eddyxu in #326
- [Python] Pytorch Dataset uses Fragment instead of files and support versions by @eddyxu in #327
- [C++] Move writer API a private API by @eddyxu in #329
- [C++] Refectory Metadata class to eliminate protobuf reference. by @eddyxu in #328
- [C++] Performance profiling and improvement by @eddyxu in #333
- [C++] Upgrade lq cmd tool to be able to inspect new versioned format by @eddyxu in #334
Full Changelog: v0.2.3...v0.2.4
v0.2.3 Bugfix release; breaks dataset proto schema
What's Changed
- [C++] Project schema via field Ids and Schema intersection by @eddyxu in #305
- when writing in batches, handle all na arrays properly by @changhiskhan in #306
- [C++] Use LanceFragment to build I/O exec plan by @eddyxu in #307
- [CI] Fix Github Action warning to upgrade nodejs 12 based actions by @eddyxu in #309
- Update README.md by @changhiskhan in #310
- Temporarily pin duckdb to 0.5.1 by @changhiskhan in #313
- Notebook for new blog post on versioning by @changhiskhan in #311
- [C++] Fix reading dictionary values from manifest files by @eddyxu in #314
Full Changelog: v0.2.2...v0.2.3
v0.2.2 Python notebooks and CV dataset conversion.
What's Changed
- [DOC] Update README.md by @jaichopra in #294
- [DUCKDB] Script to upload lance extension zip by @changhiskhan in #295
- [C++] Scan Node reads multiple files by @eddyxu in #300
- [Python] Add lance.util.duckdb to help install the extension transparently by @changhiskhan in #301
- [Python] Notebook fixes by @changhiskhan in #303
- [Python] Make dataset conversion a feature by @changhiskhan in #304
Full Changelog: v0.2.1...v0.2.2