v0.2.1

@bhavnicksm bhavnicksm released this 22 Nov 10:40
· 346 commits to main since this release
f5768e8

Breaking Changes

  • SemanticChunker no longer accepts SentenceTransformer models directly. This release uses the SentenceTransformerEmbeddings class instead, which can take a model directly. Future releases will add the ability to auto-detect and create embeddings inside the AutoEmbeddings class.
  • From this release onwards, the semantic optional installation depends by default on Model2VecEmbeddings, and hence on the model2vec Python package, for its size and speed benefits. Model2Vec uses static embeddings, which are good enough for chunking while being 10x faster than standard Sentence Transformers and a 10x lighter dependency.
  • SemanticChunker and SDPMChunker now use the argument chunk_size instead of max_chunk_size, for uniformity across the chunkers. The internal representation remains the same.
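
Callers that still pass the old keyword can handle the chunk_size rename with a small shim. The following is a minimal sketch and not part of Chonkie: `migrate_chunker_kwargs` is a hypothetical helper that only rewrites a keyword-argument dictionary. It assumes the old and new arguments carry the same meaning, as the note on the internal representation suggests.

```python
def migrate_chunker_kwargs(kwargs: dict) -> dict:
    """Return a copy of kwargs with max_chunk_size renamed to chunk_size.

    Hypothetical helper for illustration; it is not part of Chonkie.
    """
    migrated = dict(kwargs)  # never mutate the caller's dictionary
    if "max_chunk_size" in migrated:
        if "chunk_size" in migrated:
            raise TypeError("pass either chunk_size or max_chunk_size, not both")
        migrated["chunk_size"] = migrated.pop("max_chunk_size")
    return migrated


old_style = {"max_chunk_size": 512}
new_style = migrate_chunker_kwargs(old_style)
print(new_style)  # {'chunk_size': 512}
```

The migrated dictionary can then be unpacked into the chunker constructor alongside any other arguments the caller already passes.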

What's Changed

  • [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
  • [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
  • Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
  • Use __slots__ instead of slots=True for python3.9 support by @bhavnicksm in #34
  • Bump version to 0.2.0.post1 in pyproject.toml and __init__.py by @bhavnicksm in #35
  • [FEAT] Add SentenceTransformerEmbeddings, EmbeddingsRegistry and AutoEmbeddings provider support by @bhavnicksm in #44
  • Refactor BaseChunker, SemanticChunker and SDPMChunker to support BaseEmbeddings by @bhavnicksm in #45
  • Add initial OpenAIEmbeddings support to Chonkie ✨ by @bhavnicksm in #46
  • [DOCS] Add info about initial embeddings support and how to add custom embeddings by @bhavnicksm in #47
  • [FEAT] - Add model2vec embedding models by @sky-2002 in #41
  • [FEAT] Add support for Model2VecEmbeddings + Switch default embeddings to Model2VecEmbeddings by @bhavnicksm in #49
  • [fix] Reorganize optional dependencies in pyproject.toml: rename 'sem… by @bhavnicksm in #51
  • [Fix] Token counts from Tokenizers and Transformers adding special tokens by @bhavnicksm in #52
  • [Fix] Refactor WordChunker, SentenceChunker pre-chunk splitting for reconstruction tests + minor changes by @bhavnicksm in #53
  • [Refactor] Optimize similarity calculation by using np.divide for imp… by @bhavnicksm in #54
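
The first fix in this list (#29) is easiest to see with a toy example. The sketch below is not Chonkie's implementation: it uses a hypothetical whitespace tokenizer, a hypothetical `ToyChunk` dataclass, and a hypothetical `chunk_with_char_offsets` function. It shows what character-based start_index and end_index make possible: slicing the original text to recover each chunk.

```python
import re
from dataclasses import dataclass


@dataclass
class ToyChunk:
    text: str
    start_index: int  # character offset of the chunk's first character
    end_index: int    # character offset one past the chunk's last character
    token_count: int


def chunk_with_char_offsets(text: str, chunk_size: int) -> list:
    """Group whitespace-delimited tokens into chunks of at most chunk_size tokens."""
    # Each match records the character span of one token in the original text.
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    chunks = []
    for i in range(0, len(spans), chunk_size):
        group = spans[i:i + chunk_size]
        start, end = group[0][0], group[-1][1]
        chunks.append(ToyChunk(text[start:end], start, end, len(group)))
    return chunks


text = "Chonkie splits long text into small chunks"
for chunk in chunk_with_char_offsets(text, chunk_size=3):
    # Character indices slice the source text directly; token indices would not.
    assert text[chunk.start_index:chunk.end_index] == chunk.text
    print(chunk.start_index, chunk.end_index, repr(chunk.text))
```

With token indices, the same slice would land on unrelated characters whenever tokens are longer than one character, which is why character offsets are the more useful contract for downstream code.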

New Contributors

Full Changelog: v0.2.0...v0.2.1