Commit
Merge pull request #9 from bhavnicksm/development
Disentangle the Embedding Model from SemanticChunker + Update DOCS and README
bhavnicksm authored Nov 7, 2024
2 parents 68e3272 + e82ed56 commit 977d1d6
Showing 9 changed files with 478 additions and 141 deletions.
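
The core of this change is that the semantic chunkers no longer hard-wire a `sentence_transformer_model` string: `SemanticChunker` (and the renamed `SDPMChunker`) now take an `embedding_model` argument that can be either a model name or an already-loaded `SentenceTransformer` instance. A minimal sketch of both call patterns, assuming `sentence-transformers` is installed; the model name and chunking parameters below are illustrative, not part of this commit:

```python
from tokenizers import Tokenizer
from sentence_transformers import SentenceTransformer
from chonkie import SemanticChunker

tokenizer = Tokenizer.from_pretrained("gpt2")

# Option 1: pass a model name string; the chunker loads it internally
chunker = SemanticChunker(
    tokenizer=tokenizer,
    embedding_model="all-MiniLM-L6-v2",  # illustrative model name
    max_chunk_size=512,
    similarity_threshold=0.7,
)

# Option 2: pass a pre-loaded SentenceTransformer instance directly
model = SentenceTransformer("all-MiniLM-L6-v2")
chunker = SemanticChunker(
    tokenizer=tokenizer,
    embedding_model=model,
    max_chunk_size=512,
    similarity_threshold=0.7,
)
```
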
368 changes: 306 additions & 62 deletions DOCS.md

Large diffs are not rendered by default.

50 changes: 34 additions & 16 deletions README.md
@@ -1,19 +1,31 @@
![Chonkie Logo](https://github.com/bhavnicksm/chonkie/blob/6b1b1953494d47dda9a19688c842975184ccc986/assets/chonkie_logo_br_transparent_bg.png)
# 🦛 Chonkie
<div align='center'>

so i found myself making another RAG bot (for the 2342148th time) and meanwhile, explaining to my juniors about why we should use chunking in our RAG bots, only to realise that i would have to write chunking all over again unless i use the bloated software library X or the extremely feature-less library Y. _WHY CAN I NOT HAVE GOOD THINGS IN LIFE, UGH?_
![Chonkie Logo](/assets/chonkie_logo_br_transparent_bg.png)

# 🦛 Chonkie ✨

[![PyPI version](https://img.shields.io/pypi/v/chonkie.svg)](https://pypi.org/project/chonkie/)
[![License](https://img.shields.io/github/license/bhavnicksm/chonkie.svg)](https://github.com/bhavnicksm/chonkie/blob/main/LICENSE)
[![Documentation](https://img.shields.io/badge/docs-DOCS.md-blue.svg)](DOCS.md)
![Package size](https://img.shields.io/badge/size-21MB-blue)
[![Downloads](https://static.pepy.tech/badge/chonkie)](https://pepy.tech/project/chonkie)
[![GitHub stars](https://img.shields.io/github/stars/bhavnicksm/chonkie.svg)](https://github.com/bhavnicksm/chonkie/stargazers)

</div>

so i found myself making another RAG bot (for the 2342148th time) and meanwhile, explaining to my juniors about why we should use chunking in our RAG bots, only to realise that i would have to write chunking all over again unless i use the bloated software library X or the extremely feature-less library Y. _WHY CAN I NOT HAVE SOMETHING JUST RIGHT, UGH?_

Can't i just install, import and run chunking and not have to worry about dependencies, bloat, speed or other factors?

Well, with chonkie you can! (chonkie boi is a gud boi)

Feature-rich: All the CHONKs you'd ever need </br>
Easy to use: Install, Import, CHONK </br>
Fast: CHONK at the speed of light! zooooom </br>
Wide support: Supports all your favorite tokenizer CHONKS </br>
Light-weight: No bloat, just CHONK </br>
Cute CHONK mascoot </br>
Moto Moto's favorite python library </br>
**🚀 Feature-rich**: All the CHONKs you'd ever need </br>
**Easy to use**: Install, Import, CHONK </br>
**Fast**: CHONK at the speed of light! zooooom </br>
**🌐 Wide support**: Supports all your favorite tokenizer CHONKS </br>
**🪶 Light-weight**: No bloat, just CHONK </br>
**🦛 Cute CHONK mascot**: psst it's a pygmy hippo btw </br>
**❤️ [Moto Moto](#acknowledgements)'s favorite python library** </br>

What're you waiting for, **just CHONK it**!

@@ -47,8 +59,13 @@ tokenizer = Tokenizer.from_pretrained("gpt2")
chunker = TokenChunker(tokenizer)

# Chunk some text
chunks = chunker("Woah! I believe Chonkie, the chunking library is so cool! I love the tiny hippo hehe.")
print(chunks)
chunks = chunker("Woah! Chonkie, the chunking library is so cool!",
"I love the tiny hippo hehe.")

# Access chunks
for chunk in chunks:
print(f"Chunk: {chunk.text}")
print(f"Tokens: {chunk.token_count}")
```

More example usages given inside the [DOCS](/DOCS.md)
@@ -69,16 +86,17 @@ More on these methods and the approaches taken inside the [DOCS](/DOCS.md)

Chonkie was developed with the support and contributions of the open-source community. We would like to thank the following projects and individuals for their invaluable help:

- **Hugging Face** for their amazing [tokenizers](https://github.com/huggingface/tokenizers) library, which provides the backbone for our tokenization needs.
- **OpenAI** for their amazing [tiktoken](https://github.com/openai/tiktoken) library, which provides the backbone for our tokenization needs.
- **spaCy** for their powerful [spaCy](https://spacy.io/) library, which we use for advanced sentence segmentation.
- **Sentence Transformers** for their [sentence-transformers](https://www.sbert.net/) library, which enables semantic chunking.
- The contributors and maintainers of various open-source projects that have inspired and supported the development of Chonkie.

Special thanks to **Moto Moto** for endorsing Chonkie with his famous quote:
> "I like them big, I like them chonkie."
And to all the users and contributors who have provided feedback, reported issues, and helped improve Chonkie.

Special thanks to **[Moto Moto](https://www.youtube.com/watch?v=I0zZC4wtqDQ&t=5s)** for endorsing Chonkie with his famous quote:
> "I like them big, I like them chonkie."
> ~ Moto Moto
# Citation

If you use Chonkie in your research, please cite it as follows:
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "chonkie"
version = "0.0.3"
version = "0.1.0"
description = "🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library"
readme = "README.md"
requires-python = ">=3.8"
@@ -23,7 +23,7 @@ classifiers = [
"Programming Language :: Python :: 3.11"
]
dependencies = [
"tokenizers>=0.13.0"
"autotiktokenizer", "tokenizers>=0.13.0"
]

[project.urls]
6 changes: 3 additions & 3 deletions src/chonkie/__init__.py
@@ -4,15 +4,15 @@
WordChunker,
SentenceChunker,
SemanticChunker,
SPDMChunker,
SDPMChunker,
Chunk,
SentenceChunk,
SemanticChunk,
Sentence,
SemanticSentence
)

__version__ = "0.0.3"
__version__ = "0.1.0"
__name__ = "chonkie"
__author__ = "Bhavnick Minhas"

@@ -30,5 +30,5 @@
"WordChunker",
"SentenceChunker",
"SemanticChunker",
"SPDMChunker"
"SDPMChunker"
]
4 changes: 2 additions & 2 deletions src/chonkie/chunker/__init__.py
@@ -3,7 +3,7 @@
from .word import WordChunker
from .sentence import Sentence, SentenceChunk, SentenceChunker
from .semantic import SemanticSentence, SemanticChunk, SemanticChunker
from .spdm import SPDMChunker
from .sdpm import SDPMChunker


__all__ = [
@@ -17,5 +17,5 @@
"SemanticSentence",
"SemanticChunk",
"SemanticChunker",
"SPDMChunker"
"SDPMChunker"
]
25 changes: 20 additions & 5 deletions src/chonkie/chunker/spdm.py → src/chonkie/chunker/sdpm.py
@@ -1,11 +1,26 @@
from typing import List
from typing import List, Union
import warnings
import importlib

from .semantic import SemanticChunker, SemanticChunk, Sentence

class SPDMChunker(SemanticChunker):
# Check if sentence-transformers is available
SENTENCE_TRANSFORMERS_AVAILABLE = importlib.util.find_spec("sentence_transformers") is not None
if SENTENCE_TRANSFORMERS_AVAILABLE:
try:
from sentence_transformers import SentenceTransformer
except ImportError:
SENTENCE_TRANSFORMERS_AVAILABLE = False
warnings.warn("Failed to import sentence-transformers despite it being installed. SemanticChunker will not work.")
else:
warnings.warn("sentence-transformers is not installed. SemanticChunker will not work.")


class SDPMChunker(SemanticChunker):
def __init__(
self,
tokenizer,
sentence_transformer_model: str,
embedding_model: Union[str, SentenceTransformer],
similarity_threshold: float = None,
similarity_percentile: float = None,
max_chunk_size: int = 512,
@@ -14,15 +29,15 @@ def __init__(
spacy_model: str = "en_core_web_sm",
skip_window: int = 1 # How many chunks to skip when looking for similarities
):
"""Initialize the SPDMChunker.
"""Initialize the SDPMChunker.
Args:
Same as SemanticChunker, plus:
skip_window: Number of chunks to skip when looking for similarities
"""
super().__init__(
tokenizer=tokenizer,
sentence_transformer_model=sentence_transformer_model,
embedding_model=embedding_model,
max_chunk_size=max_chunk_size,
similarity_threshold=similarity_threshold,
similarity_percentile=similarity_percentile,
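
For reference, a hedged sketch of how the renamed `SDPMChunker` might be constructed after this change; the parameter names (`embedding_model`, `skip_window`, `similarity_percentile`) come from the diff, while the model name and values are illustrative:

```python
from tokenizers import Tokenizer
from chonkie import SDPMChunker

tokenizer = Tokenizer.from_pretrained("gpt2")

# skip_window controls how many chunks to skip when looking for
# similar (non-adjacent) groups to merge
chunker = SDPMChunker(
    tokenizer=tokenizer,
    embedding_model="all-MiniLM-L6-v2",  # illustrative model name
    max_chunk_size=512,
    similarity_percentile=80,
    skip_window=1,
)
```
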
65 changes: 48 additions & 17 deletions src/chonkie/chunker/semantic.py
@@ -1,5 +1,5 @@
from dataclasses import dataclass
from typing import List, Optional
from typing import List, Optional, Union
import numpy as np
import re
import importlib.util
@@ -8,6 +8,28 @@
from .base import BaseChunker
from .sentence import Sentence, SentenceChunk

import warnings

# Check if spacy is available
SPACY_AVAILABLE = importlib.util.find_spec("spacy") is not None
if SPACY_AVAILABLE:
try:
import spacy
except ImportError:
SPACY_AVAILABLE = False
warnings.warn("Failed to import spacy despite it being installed. Using heuristic mode only.")

SENTENCE_TRANSFORMERS_AVAILABLE = importlib.util.find_spec("sentence_transformers") is not None
if SENTENCE_TRANSFORMERS_AVAILABLE:
try:
from sentence_transformers import SentenceTransformer
except ImportError:
SENTENCE_TRANSFORMERS_AVAILABLE = False
warnings.warn("Failed to import sentence-transformers despite it being installed. SemanticChunker will not work.")
else:
warnings.warn("sentence-transformers is not installed. SemanticChunker will not work.")


@dataclass
class SemanticSentence(Sentence):
text: str
@@ -28,7 +50,7 @@ class SemanticChunker(BaseChunker):
def __init__(
self,
tokenizer: Tokenizer,
sentence_transformer_model: str,
embedding_model: Union[str, SentenceTransformer],
similarity_threshold: Optional[float] = None,
similarity_percentile: Optional[float] = None,
max_chunk_size: int = 512,
@@ -40,7 +62,7 @@ def __init__(
Args:
tokenizer: Tokenizer for counting tokens
sentence_transformer_model: Name of the sentence-transformers model to load
embedding_model: Name of the sentence-transformers model to load
max_chunk_size: Maximum tokens allowed per chunk
similarity_threshold: Absolute threshold for semantic similarity (0-1)
similarity_percentile: Percentile threshold for similarity (0-100)
@@ -74,31 +96,26 @@ def __init__(
self.sentence_mode = sentence_mode

# Initialize sentence transformer
if not importlib.util.find_spec("sentence_transformers"):
if not SENTENCE_TRANSFORMERS_AVAILABLE:
raise ImportError(
"sentence-transformers is not installed. "
"Install it with 'pip install sentence-transformers'"
)
try:
from sentence_transformers import SentenceTransformer
self.sentence_transformer = SentenceTransformer(sentence_transformer_model)
except Exception as e:
raise ImportError(
f"Failed to load sentence-transformers model '{sentence_transformer_model}'. "
f"Error: {str(e)}"
) from e
if isinstance(embedding_model, str):
self.embedding_model = self._load_sentence_transformer_model(embedding_model)
else:
self.embedding_model = embedding_model

# Initialize spaCy if explicitly requested
self.nlp = None
if sentence_mode == "spacy":
if not importlib.util.find_spec("spacy"):
if not SPACY_AVAILABLE:
raise ImportError(
"spaCy is not installed. Install it with 'pip install spacy' "
"and download the model with 'python -m spacy download en_core_web_sm', "
"or use sentence_mode='heuristic' instead."
)
try:
import spacy
self.nlp = spacy.load(spacy_model)
except OSError as e:
raise ImportError(
@@ -107,6 +124,18 @@ def __init__(
"or use sentence_mode='heuristic' instead."
) from e

def _load_sentence_transformer_model(self, model_name: str) -> SentenceTransformer:
"""Load a sentence-transformers model by name."""
try:
model = SentenceTransformer(model_name)
except Exception as e:
raise ImportError(
f"Failed to load sentence-transformers model '{model_name}'. "
f"Make sure it is installed and available."
) from e
return model


def _split_sentences_spacy(self, text: str) -> List[str]:
"""Split text into sentences using spaCy."""
doc = self.nlp(text)
@@ -157,7 +186,7 @@ def _prepare_sentences(self, text: str) -> List[Sentence]:
current_idx = end_idx

# Batch compute embeddings for all sentences
embeddings = self.sentence_transformer.encode(raw_sentences, convert_to_numpy=True)
embeddings = self.embedding_model.encode(raw_sentences, convert_to_numpy=True)

# Batch compute token counts
token_counts = [len(encoding) for encoding in self._encode_batch(raw_sentences)]
@@ -179,12 +208,14 @@ def _prepare_sentences(self, text: str) -> List[Sentence]:

def _get_semantic_similarity(self, embedding1: np.ndarray, embedding2: np.ndarray) -> float:
"""Compute cosine similarity between two embeddings."""
similarity = self.sentence_transformer.similarity(embedding1, embedding2)
similarity = self.embedding_model.similarity(embedding1, embedding2)
return similarity

def _compute_group_embedding(self, sentences: List[Sentence]) -> np.ndarray:
"""Compute mean embedding for a group of sentences."""
return np.mean([sent.embedding for sent in sentences], axis=0)
return np.divide(np.sum([(sent.embedding * sent.token_count) for sent in sentences], axis=0),
np.sum([sent.token_count for sent in sentences]),
dtype=np.float32)

def _group_sentences(self, sentences: List[Sentence]) -> List[List[Sentence]]:
"""Group sentences based on semantic similarity, ignoring token count limits.
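
The `_compute_group_embedding` change above replaces a plain mean of sentence embeddings with a token-count-weighted mean, so longer sentences contribute proportionally more to the group embedding. A standalone numpy sketch of the same computation, using made-up embeddings and token counts:

```python
import numpy as np

# Hypothetical per-sentence embeddings (3 sentences, 4 dims) and token counts
embeddings = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.1, 0.0, 0.2],
    [0.3, 0.3, 0.3, 0.3],
])
token_counts = np.array([12, 5, 8])

# Weighted mean: scale each embedding by its token count, sum, then divide
# by the total token count (mirroring the new _compute_group_embedding)
group_embedding = np.divide(
    np.sum(embeddings * token_counts[:, None], axis=0),
    np.sum(token_counts),
    dtype=np.float32,
)
print(group_embedding)
```
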