v0.2.1
Breaking Changes
- SemanticChunker no longer accepts SentenceTransformer models directly; instead, this release uses the SentenceTransformerEmbeddings class, which can take in a model directly. Future releases will add the functionality to auto-detect and create embeddings inside the AutoEmbeddings class.
- By default, the semantic optional installation now depends on Model2VecEmbeddings, and hence the model2vec Python package, from this release onwards, due to size and speed benefits. Model2Vec uses static embeddings, which are good enough for the task of chunking while being 10x faster than standard Sentence Transformers and a 10x lighter dependency.
- SemanticChunker and SDPMChunker now use the argument chunk_size instead of max_chunk_size for uniformity across the chunkers, but the internal representation remains the same.
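For code written against v0.2.0, the chunk_size rename can be bridged with a small keyword shim. The sketch below is only an illustration: migrate_chunker_kwargs is a hypothetical helper, not part of Chonkie, and some_option is a placeholder for whatever other keyword arguments the caller already passes.

```python
import warnings
from typing import Any, Dict


def migrate_chunker_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of kwargs with max_chunk_size renamed to chunk_size.

    Hypothetical helper for illustration only; not part of Chonkie.
    """
    migrated = dict(kwargs)  # copy so the caller's dict is not mutated
    if "max_chunk_size" in migrated:
        if "chunk_size" in migrated:
            # Both spellings at once is ambiguous, so refuse to guess.
            raise TypeError("pass either chunk_size or max_chunk_size, not both")
        warnings.warn(
            "max_chunk_size was renamed to chunk_size in v0.2.1",
            DeprecationWarning,
            stacklevel=2,
        )
        migrated["chunk_size"] = migrated.pop("max_chunk_size")
    return migrated


# Old-style keyword arguments, as they might appear in pre-0.2.1 code.
old_kwargs = {"max_chunk_size": 512, "some_option": True}
new_kwargs = migrate_chunker_kwargs(old_kwargs)
```

The resulting dictionary could then be unpacked into a chunker constructor, assuming the remaining keys match that chunker's signature.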
What's Changed
- [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
- [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
- Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
- Use __slots__ instead of slots=True for python3.9 support by @bhavnicksm in #34
- Bump version to 0.2.0.post1 in pyproject.toml and __init__.py by @bhavnicksm in #35
- [FEAT] Add SentenceTransformerEmbeddings, EmbeddingsRegistry and AutoEmbeddings provider support by @bhavnicksm in #44
- Refactor BaseChunker, SemanticChunker and SDPMChunker to support BaseEmbeddings by @bhavnicksm in #45
- Add initial OpenAIEmbeddings support to Chonkie ✨ by @bhavnicksm in #46
- [DOCS] Add info about initial embeddings support and how to add custom embeddings by @bhavnicksm in #47
- [FEAT] - Add model2vec embedding models by @sky-2002 in #41
- [FEAT] Add support for Model2VecEmbeddings + Switch default embeddings to Model2VecEmbeddings by @bhavnicksm in #49
- [fix] Reorganize optional dependencies in pyproject.toml: rename 'sem… by @bhavnicksm in #51
- [Fix] Token counts from Tokenizers and Transformers adding special tokens by @bhavnicksm in #52
- [Fix] Refactor WordChunker, SentenceChunker pre-chunk splitting for reconstruction tests + minor changes by @bhavnicksm in #53
- [Refactor] Optimize similarity calculation by using np.divide for imp… by @bhavnicksm in #54
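To illustrate what character-based start_index and end_index (#29) and the reconstruction tests (#53) imply for callers, here is a minimal sketch. It assumes a toy whitespace tokenizer; ToyChunk and chunk_by_words are hypothetical names and do not reflect Chonkie's actual implementation.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class ToyChunk:
    text: str
    start_index: int  # character offset where the chunk starts in the source
    end_index: int    # character offset one past the chunk's last character
    token_count: int  # number of toy tokens in the chunk


def chunk_by_words(text: str, chunk_size: int) -> List[ToyChunk]:
    """Group toy tokens into chunks of at most chunk_size tokens each."""
    # A toy token is a word plus its trailing whitespace (or a run of leading
    # whitespace), so every character of the input belongs to exactly one token.
    spans = [m.span() for m in re.finditer(r"\S+\s*|\s+", text)]
    chunks = []
    for i in range(0, len(spans), chunk_size):
        group = spans[i : i + chunk_size]
        start, end = group[0][0], group[-1][1]
        chunks.append(ToyChunk(text[start:end], start, end, len(group)))
    return chunks


text = "Chonkie splits text into chunks.  Each chunk keeps offsets."
chunks = chunk_by_words(text, chunk_size=3)

# Character indices let a caller slice the source string directly.
assert all(text[c.start_index : c.end_index] == c.text for c in chunks)
# Concatenating the chunks reproduces the input exactly.
assert "".join(c.text for c in chunks) == text
```

With token indices, the slice in the first assertion would not line up with the chunk text, because a token position is not a character position.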
New Contributors
- @mrmps made their first contribution in #29
- @jasonacox made their first contribution in #30
- @sky-2002 made their first contribution in #41
Full Changelog: v0.2.0...v0.2.1