Commit
Merge pull request #9 from bhavnicksm/development
Disentangle the Embedding Model from SemanticChunker + Update DOCS and README
bhavnicksm authored Nov 7, 2024
2 parents 68e3272 + e82ed56 commit 977d1d6
Showing 9 changed files with 478 additions and 141 deletions.
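
The core of this change is that the semantic chunkers no longer hard-wire a `sentence_transformer_model` string: `SemanticChunker` (and the renamed `SDPMChunker`) now take an `embedding_model` argument that can be either a model name or an already-loaded `SentenceTransformer` instance. A minimal sketch of both call patterns, assuming `sentence-transformers` is installed; the model name and chunking parameters below are illustrative, not part of this commit:

```python
from tokenizers import Tokenizer
from sentence_transformers import SentenceTransformer
from chonkie import SemanticChunker

tokenizer = Tokenizer.from_pretrained("gpt2")

# Option 1: pass a model name string; the chunker loads it internally
chunker = SemanticChunker(
    tokenizer=tokenizer,
    embedding_model="all-MiniLM-L6-v2",  # illustrative model name
    max_chunk_size=512,
    similarity_threshold=0.7,
)

# Option 2: pass a pre-loaded SentenceTransformer instance directly
model = SentenceTransformer("all-MiniLM-L6-v2")
chunker = SemanticChunker(
    tokenizer=tokenizer,
    embedding_model=model,
    max_chunk_size=512,
    similarity_threshold=0.7,
)
```
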
368 changes: 306 additions & 62 deletions DOCS.md

Large diffs are not rendered by default.

50 changes: 34 additions & 16 deletions README.md
@@ -1,19 +1,31 @@
![Chonkie Logo](https://github.com/bhavnicksm/chonkie/blob/6b1b1953494d47dda9a19688c842975184ccc986/assets/chonkie_logo_br_transparent_bg.png)
# 🦛 Chonkie
<div align='center'>

so i found myself making another RAG bot (for the 2342148th time) and meanwhile, explaining to my juniors about why we should use chunking in our RAG bots, only to realise that i would have to write chunking all over again unless i use the bloated software library X or the extremely feature-less library Y. _WHY CAN I NOT HAVE GOOD THINGS IN LIFE, UGH?_
![Chonkie Logo](/assets/chonkie_logo_br_transparent_bg.png)

# 🦛 Chonkie ✨

[![PyPI version](https://img.shields.io/pypi/v/chonkie.svg)](https://pypi.org/project/chonkie/)
[![License](https://img.shields.io/github/license/bhavnicksm/chonkie.svg)](https://github.com/bhavnicksm/chonkie/blob/main/LICENSE)
[![Documentation](https://img.shields.io/badge/docs-DOCS.md-blue.svg)](DOCS.md)
![Package size](https://img.shields.io/badge/size-21MB-blue)
[![Downloads](https://static.pepy.tech/badge/chonkie)](https://pepy.tech/project/chonkie)
[![GitHub stars](https://img.shields.io/github/stars/bhavnicksm/chonkie.svg)](https://github.com/bhavnicksm/chonkie/stargazers)

</div>

so i found myself making another RAG bot (for the 2342148th time) and meanwhile, explaining to my juniors about why we should use chunking in our RAG bots, only to realise that i would have to write chunking all over again unless i use the bloated software library X or the extremely feature-less library Y. _WHY CAN I NOT HAVE SOMETHING JUST RIGHT, UGH?_

Can't i just install, import and run chunking and not have to worry about dependencies, bloat, speed or other factors?

Well, with chonkie you can! (chonkie boi is a gud boi)

Feature-rich: All the CHONKs you'd ever need </br>
Easy to use: Install, Import, CHONK </br>
Fast: CHONK at the speed of light! zooooom </br>
Wide support: Supports all your favorite tokenizer CHONKS </br>
Light-weight: No bloat, just CHONK </br>
Cute CHONK mascoot </br>
Moto Moto's favorite python library </br>
**🚀 Feature-rich**: All the CHONKs you'd ever need </br>
**Easy to use**: Install, Import, CHONK </br>
**Fast**: CHONK at the speed of light! zooooom </br>
**🌐 Wide support**: Supports all your favorite tokenizer CHONKS </br>
**🪶 Light-weight**: No bloat, just CHONK </br>
**🦛 Cute CHONK mascot**: psst it's a pygmy hippo btw </br>
**❤️ [Moto Moto](#acknowledgements)'s favorite python library** </br>

What're you waiting for, **just CHONK it**!

@@ -47,8 +59,13 @@ tokenizer = Tokenizer.from_pretrained("gpt2")
chunker = TokenChunker(tokenizer)

# Chunk some text
chunks = chunker("Woah! I believe Chonkie, the chunking library is so cool! I love the tiny hippo hehe.")
print(chunks)
chunks = chunker("Woah! Chonkie, the chunking library is so cool!",
"I love the tiny hippo hehe.")

# Access chunks
for chunk in chunks:
print(f"Chunk: {chunk.text}")
print(f"Tokens: {chunk.token_count}")
```

More example usages given inside the [DOCS](/DOCS.md)
@@ -69,16 +86,17 @@ More on these methods and the approaches taken inside the [DOCS](/DOCS.md)

Chonkie was developed with the support and contributions of the open-source community. We would like to thank the following projects and individuals for their invaluable help:

- **Hugging Face** for their amazing [tokenizers](https://github.com/huggingface/tokenizers) library, which provides the backbone for our tokenization needs.
- **OpenAI** for their amazing [tiktoken](https://github.com/openai/tiktoken) library, which provides the backbone for our tokenization needs.
- **spaCy** for their powerful [spaCy](https://spacy.io/) library, which we use for advanced sentence segmentation.
- **Sentence Transformers** for their [sentence-transformers](https://www.sbert.net/) library, which enables semantic chunking.
- The contributors and maintainers of various open-source projects that have inspired and supported the development of Chonkie.

Special thanks to **Moto Moto** for endorsing Chonkie with his famous quote:
> "I like them big, I like them chonkie."
And to all the users and contributors who have provided feedback, reported issues, and helped improve Chonkie.

Special thanks to **[Moto Moto](https://www.youtube.com/watch?v=I0zZC4wtqDQ&t=5s)** for endorsing Chonkie with his famous quote:
> "I like them big, I like them chonkie."
> ~ Moto Moto
# Citation

If you use Chonkie in your research, please cite it as follows:
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

[project]
name = "chonkie"
version = "0.0.3"
version = "0.1.0"
description = "🦛 CHONK your texts with Chonkie ✨ - The no-nonsense RAG chunking library"
readme = "README.md"
requires-python = ">=3.8"
@@ -23,7 +23,7 @@ classifiers = [
"Programming Language :: Python :: 3.11"
]
dependencies = [
"tokenizers>=0.13.0"
"autotiktokenizer", "tokenizers>=0.13.0"
]

[project.urls]
6 changes: 3 additions & 3 deletions src/chonkie/__init__.py
@@ -4,15 +4,15 @@
WordChunker,
SentenceChunker,
SemanticChunker,
SPDMChunker,
SDPMChunker,
Chunk,
SentenceChunk,
SemanticChunk,
Sentence,
SemanticSentence
)

__version__ = "0.0.3"
__version__ = "0.1.0"
__name__ = "chonkie"
__author__ = "Bhavnick Minhas"

@@ -30,5 +30,5 @@
"WordChunker",
"SentenceChunker",
"SemanticChunker",
"SPDMChunker"
"SDPMChunker"
]
4 changes: 2 additions & 2 deletions src/chonkie/chunker/__init__.py
@@ -3,7 +3,7 @@
from .word import WordChunker
from .sentence import Sentence, SentenceChunk, SentenceChunker
from .semantic import SemanticSentence, SemanticChunk, SemanticChunker
from .spdm import SPDMChunker
from .sdpm import SDPMChunker


__all__ = [
@@ -17,5 +17,5 @@
"SemanticSentence",
"SemanticChunk",
"SemanticChunker",
"SPDMChunker"
"SDPMChunker"
]
25 changes: 20 additions & 5 deletions src/chonkie/chunker/spdm.py → src/chonkie/chunker/sdpm.py
@@ -1,11 +1,26 @@
from typing import List
from typing import List, Union
import warnings
import importlib

from .semantic import SemanticChunker, SemanticChunk, Sentence

class SPDMChunker(SemanticChunker):
# Check if sentence-transformers is available
SENTENCE_TRANSFORMERS_AVAILABLE = importlib.util.find_spec("sentence_transformers") is not None
if SENTENCE_TRANSFORMERS_AVAILABLE:
try:
from sentence_transformers import SentenceTransformer
except ImportError:
SENTENCE_TRANSFORMERS_AVAILABLE = False
warnings.warn("Failed to import sentence-transformers despite it being installed. SemanticChunker will not work.")
else:
warnings.warn("sentence-transformers is not installed. SemanticChunker will not work.")


class SDPMChunker(SemanticChunker):
def __init__(
self,
tokenizer,
sentence_transformer_model: str,
embedding_model: Union[str, SentenceTransformer],
similarity_threshold: float = None,
similarity_percentile: float = None,
max_chunk_size: int = 512,
@@ -14,15 +29,15 @@ def __init__(
spacy_model: str = "en_core_web_sm",
skip_window: int = 1 # How many chunks to skip when looking for similarities
):
"""Initialize the SPDMChunker.
"""Initialize the SDPMChunker.
Args:
Same as SemanticChunker, plus:
skip_window: Number of chunks to skip when looking for similarities
"""
super().__init__(
tokenizer=tokenizer,
sentence_transformer_model=sentence_transformer_model,
embedding_model=embedding_model,
max_chunk_size=max_chunk_size,
similarity_threshold=similarity_threshold,
similarity_percentile=similarity_percentile,
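
For reference, a hedged sketch of how the renamed `SDPMChunker` might be constructed after this change; the parameter names (`embedding_model`, `skip_window`, `similarity_percentile`) come from the diff, while the model name and values are illustrative:

```python
from tokenizers import Tokenizer
from chonkie import SDPMChunker

tokenizer = Tokenizer.from_pretrained("gpt2")

# skip_window controls how many chunks to skip when looking for
# similar (non-adjacent) groups to merge
chunker = SDPMChunker(
    tokenizer=tokenizer,
    embedding_model="all-MiniLM-L6-v2",  # illustrative model name
    max_chunk_size=512,
    similarity_percentile=80,
    skip_window=1,
)
```
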
65 changes: 48 additions & 17 deletions src/chonkie/chunker/semantic.py
@@ -1,5 +1,5 @@
from dataclasses import dataclass
from typing import List, Optional
from typing import List, Optional, Union
import numpy as np
import re
import importlib.util
@@ -8,6 +8,28 @@
from .base import BaseChunker
from .sentence import Sentence, SentenceChunk

import warnings

# Check if spacy is available
SPACY_AVAILABLE = importlib.util.find_spec("spacy") is not None
if SPACY_AVAILABLE:
try:
import spacy
except ImportError:
SPACY_AVAILABLE = False
warnings.warn("Failed to import spacy despite it being installed. Using heuristic mode only.")

SENTENCE_TRANSFORMERS_AVAILABLE = importlib.util.find_spec("sentence_transformers") is not None
if SENTENCE_TRANSFORMERS_AVAILABLE:
try:
from sentence_transformers import SentenceTransformer
except ImportError:
SENTENCE_TRANSFORMERS_AVAILABLE = False
warnings.warn("Failed to import sentence-transformers despite it being installed. SemanticChunker will not work.")
else:
warnings.warn("sentence-transformers is not installed. SemanticChunker will not work.")


@dataclass
class SemanticSentence(Sentence):
text: str
@@ -28,7 +50,7 @@ class SemanticChunker(BaseChunker):
def __init__(
self,
tokenizer: Tokenizer,
sentence_transformer_model: str,
embedding_model: Union[str, SentenceTransformer],
similarity_threshold: Optional[float] = None,
similarity_percentile: Optional[float] = None,
max_chunk_size: int = 512,
@@ -40,7 +62,7 @@ def __init__(
Args:
tokenizer: Tokenizer for counting tokens
sentence_transformer_model: Name of the sentence-transformers model to load
embedding_model: Name of the sentence-transformers model to load
max_chunk_size: Maximum tokens allowed per chunk
similarity_threshold: Absolute threshold for semantic similarity (0-1)
similarity_percentile: Percentile threshold for similarity (0-100)
@@ -74,31 +96,26 @@ def __init__(
self.sentence_mode = sentence_mode

# Initialize sentence transformer
if not importlib.util.find_spec("sentence_transformers"):
if not SENTENCE_TRANSFORMERS_AVAILABLE:
raise ImportError(
"sentence-transformers is not installed. "
"Install it with 'pip install sentence-transformers'"
)
try:
from sentence_transformers import SentenceTransformer
self.sentence_transformer = SentenceTransformer(sentence_transformer_model)
except Exception as e:
raise ImportError(
f"Failed to load sentence-transformers model '{sentence_transformer_model}'. "
f"Error: {str(e)}"
) from e
if isinstance(embedding_model, str):
self.embedding_model = self._load_sentence_transformer_model(embedding_model)
else:
self.embedding_model = embedding_model

# Initialize spaCy if explicitly requested
self.nlp = None
if sentence_mode == "spacy":
if not importlib.util.find_spec("spacy"):
if not SPACY_AVAILABLE:
raise ImportError(
"spaCy is not installed. Install it with 'pip install spacy' "
"and download the model with 'python -m spacy download en_core_web_sm', "
"or use sentence_mode='heuristic' instead."
)
try:
import spacy
self.nlp = spacy.load(spacy_model)
except OSError as e:
raise ImportError(
@@ -107,6 +124,18 @@ def __init__(
"or use sentence_mode='heuristic' instead."
) from e

def _load_sentence_transformer_model(self, model_name: str) -> SentenceTransformer:
"""Load a sentence-transformers model by name."""
try:
model = SentenceTransformer(model_name)
except Exception as e:
raise ImportError(
f"Failed to load sentence-transformers model '{model_name}'. "
f"Make sure it is installed and available."
) from e
return model


def _split_sentences_spacy(self, text: str) -> List[str]:
"""Split text into sentences using spaCy."""
doc = self.nlp(text)
@@ -157,7 +186,7 @@ def _prepare_sentences(self, text: str) -> List[Sentence]:
current_idx = end_idx

# Batch compute embeddings for all sentences
embeddings = self.sentence_transformer.encode(raw_sentences, convert_to_numpy=True)
embeddings = self.embedding_model.encode(raw_sentences, convert_to_numpy=True)

# Batch compute token counts
token_counts = [len(encoding) for encoding in self._encode_batch(raw_sentences)]
@@ -179,12 +208,14 @@ def _prepare_sentences(self, text: str) -> List[Sentence]:

def _get_semantic_similarity(self, embedding1: np.ndarray, embedding2: np.ndarray) -> float:
"""Compute cosine similarity between two embeddings."""
similarity = self.sentence_transformer.similarity(embedding1, embedding2)
similarity = self.embedding_model.similarity(embedding1, embedding2)
return similarity

def _compute_group_embedding(self, sentences: List[Sentence]) -> np.ndarray:
"""Compute mean embedding for a group of sentences."""
return np.mean([sent.embedding for sent in sentences], axis=0)
return np.divide(np.sum([(sent.embedding * sent.token_count) for sent in sentences], axis=0),
np.sum([sent.token_count for sent in sentences]),
dtype=np.float32)

def _group_sentences(self, sentences: List[Sentence]) -> List[List[Sentence]]:
"""Group sentences based on semantic similarity, ignoring token count limits.
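
The `_compute_group_embedding` change above replaces a plain mean of sentence embeddings with a token-count-weighted mean, so longer sentences contribute proportionally more to the group embedding. A standalone numpy sketch of the same computation, using made-up embeddings and token counts:

```python
import numpy as np

# Hypothetical per-sentence embeddings (3 sentences, 4 dims) and token counts
embeddings = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.1, 0.0, 0.2],
    [0.3, 0.3, 0.3, 0.3],
])
token_counts = np.array([12, 5, 8])

# Weighted mean: scale each embedding by its token count, sum, then divide
# by the total token count (mirroring the new _compute_group_embedding)
group_embedding = np.divide(
    np.sum(embeddings * token_counts[:, None], axis=0),
    np.sum(token_counts),
    dtype=np.float32,
)
print(group_embedding)
```
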