Add codespell support (config, workflow to detect/not fix) and make it fix few typos #78

Open · wants to merge 7 commits into base: main
25 changes: 25 additions & 0 deletions .github/workflows/codespell.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Codespell configuration is within pyproject.toml
---
name: Codespell

on:
push:
branches: [main]
pull_request:
branches: [main]

permissions:
contents: read

jobs:
codespell:
name: Check for spelling errors
runs-on: ubuntu-latest

steps:
- name: Checkout
uses: actions/checkout@v4
- name: Annotate locations with typos
uses: codespell-project/codespell-problem-matcher@v1
- name: Codespell
uses: codespell-project/actions-codespell@v2
8 changes: 8 additions & 0 deletions .pre-commit-config.yaml
@@ -19,3 +19,11 @@ repos:
- sentence-transformers>=3.0.1
- tiktoken>=0.7.0
- tqdm>=4.66.4

- repo: https://github.com/codespell-project/codespell
# Configuration for codespell is in pyproject.toml
rev: v2.3.0
hooks:
- id: codespell
additional_dependencies:
- tomli
4 changes: 2 additions & 2 deletions README.md
@@ -17,7 +17,7 @@

LOTUS makes LLM-powered data processing fast and easy.

LOTUS (**L**LMs **O**ver **T**ables of **U**nstructured and **S**structured Data) provides a declarative programming model and an optimized query engine for serving powerful reasoning-based query pipelines over structured and unstructured data! We provide a simple and intuitive Pandas-like API, that implements **semantic operators**.
LOTUS (**L**LMs **O**ver **T**ables of **U**nstructured and **S**tructured Data) provides a declarative programming model and an optimized query engine for serving powerful reasoning-based query pipelines over structured and unstructured data! We provide a simple and intuitive Pandas-like API, that implements **semantic operators**.

For trouble-shooting or feature requests, please raise an issue and we'll get to it promptly. To share feedback and applications you're working on, you can send us a message on our [community slack](https://join.slack.com/t/lotus-fnm8919/shared_invite/zt-2tnq6948j-juGuSIR0__fsh~kUmZ6TJw), or send an email ([email protected]).

@@ -88,7 +88,7 @@ LOTUS offers a number of semantic operators in a Pandas-like API, some of which
| sem_filter | Keep records that match the natural language predicate |
| sem_extract | Extract one or more attributes from each row |
| sem_agg | Aggregate across all records (e.g. for summarization) |
| sem_topk | Order the records by some natural langauge sorting criteria |
| sem_topk | Order the records by some natural language sorting criteria |
| sem_join | Join two datasets based on a natural language predicate |
| sem_sim_join | Join two DataFrames based on semantic similarity |
| sem_search | Perform semantic search the over a text column |
4 changes: 2 additions & 2 deletions docs/approximation_cascades.rst
@@ -8,11 +8,11 @@ LOTUS serves approximations for semantic operators to let you balance speed and
You can set accurayc targets according to the requirements of your application, and LOTUS
will use approximations to optimize the implementation for lower computaitonal overhead, while providing probabilistic accuracy guarantees.
One core technique for providing these approximations is the use of cascades.
Cascades provide a way to optimize certian semantic operators (Join Cascade and Filter Cascade) by blending
Cascades provide a way to optimize certain semantic operators (Join Cascade and Filter Cascade) by blending
a less costly but potentially inaccurate proxy model with a high-quality oracle model. The method seeks to achieve
preset precision and recall targets with a given probability while controlling computational overhead.

Cascades work by intially using a cheap approximation to score and filters/joins tuples. Using statistically
Cascades work by initially using a cheap approximation to score and filters/joins tuples. Using statistically
supported thresholds found from sampling prior, it then assigns each tuple to one of three actions based on the
proxy's score: accept, reject, or seek clarification from the oracle model.

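The accept/reject/defer routing described above can be sketched as follows; the function and threshold names are illustrative, not the LOTUS API, and the thresholds are assumed to come from the prior sampling step:

```python
def route_tuple(proxy_score: float, accept_threshold: float, reject_threshold: float) -> str:
    """Route one tuple based on the cheap proxy model's score.

    Thresholds are assumed to be calibrated by sampling so that the preset
    precision/recall targets hold with the desired probability.
    """
    if proxy_score >= accept_threshold:
        return "accept"  # proxy is confident enough: keep the tuple
    if proxy_score <= reject_threshold:
        return "reject"  # proxy is confident enough: drop the tuple
    return "oracle"      # uncertain region: defer to the high-quality model

scores = [0.95, 0.10, 0.55]
actions = [route_tuple(s, accept_threshold=0.9, reject_threshold=0.2) for s in scores]
print(actions)  # prints: ['accept', 'reject', 'oracle']
```

Only the middle band of scores incurs an oracle call, which is how the cascade controls computational overhead.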
2 changes: 1 addition & 1 deletion docs/configurations.rst
@@ -22,7 +22,7 @@ Configurable Parameters
--------------------------

1. enable_message_cache:
* Description: Enables or Disables cahcing mechanisms
* Description: Enables or Disables caching mechanisms
* Default: False
* Parameters:
- cache_type: Type of caching (SQLITE or In_MEMORY)
6 changes: 3 additions & 3 deletions docs/core_concepts.rst
@@ -2,10 +2,10 @@ Core Concepts
==================

LOTUS' implements the semantic operator programming model. Semantic operators are declarative transformations over one or more
datasets, parameterized by a natural langauge expression (*langex*) that can be implemnted by a variety of AI-based algorithms.
datasets, parameterized by a natural language expression (*langex*) that can be implemented by a variety of AI-based algorithms.
Semantic operators seamlessly extend the relational model, operating over datasets that may contain traditional structured data
as well as unstructured fields, such as free-form text or images. Because semantic operators are composable, modular and declarative, they allow you to write
AI-based piplines with intuitive, high-level logic, leaving the rest of the work to the query engine! Each operator can be implmented and
AI-based pipelines with intuitive, high-level logic, leaving the rest of the work to the query engine! Each operator can be implemented and
optimized in multiple ways, opening a rich space for execution plans, similar to relational operators. Here is a quick example of semantic operators in action:

.. code-block:: python
@@ -28,7 +28,7 @@ Here are some key semantic operators:
+--------------+----------------------------------------------------------+
| sem_agg | Aggregate across all records (e.g. for summarization) |
+--------------+----------------------------------------------------------+
| sem_topk | Order records by the natural langauge ranking criteria |
| sem_topk | Order records by the natural language ranking criteria |
+--------------+----------------------------------------------------------+
| sem_join | Join two datasets based on a natural language predicate |
+--------------+----------------------------------------------------------+
2 changes: 1 addition & 1 deletion docs/multimodal_models.rst
@@ -11,7 +11,7 @@ PIL images, numpy arrays, base64 strings, and image URLs
Initializing ImageArray
-----------------------
The ImageArray class is an extension array designed to handle images as data types in pandas.
You can initilize an ImageArray with a list of supported image formats
You can initialize an ImageArray with a list of supported image formats

.. code-block:: python

2 changes: 1 addition & 1 deletion docs/sem_index.rst
@@ -5,7 +5,7 @@ Overview
---------
The sem_index operator in LOTUS creates a semantic index over the specified column in the dataset.
This index enables efficient retrieval and ranking of records based on semantic similarity.
The index will be generated with the configured retreival model stored locally in the specified directory.
The index will be generated with the configured retrieval model stored locally in the specified directory.


Example
2 changes: 1 addition & 1 deletion docs/sem_map.rst
@@ -3,7 +3,7 @@ sem_map

Overview
----------
This operato performs a semantic projection over an input column. The langex parameter specifies this projection in natural language.
This operator performs a semantic projection over an input column. The langex parameter specifies this projection in natural language.

Motivation
-----------
2 changes: 1 addition & 1 deletion docs/sem_sim_join.rst
@@ -3,7 +3,7 @@ sem_sim_join

Overview
---------
The similairty join matches tuples from the right and left table according to their semantic similarity, rather than an arbitrary
The similarity join matches tuples from the right and left table according to their semantic similarity, rather than an arbitrary
natural-language predicate. Akin to an equi-join in standard relational algebra, the semantic similarity
join is a specialized semantic join, can be heavily optimized using the semantic index.

6 changes: 3 additions & 3 deletions docs/sem_topk.rst
@@ -59,9 +59,9 @@ Required Parameters
- **user_instruction** : The user instruction for sorting.
- **K**: The number of rows to return.

Optional Paramaters
---------------------
Optional Parameters
--------------------
- **method** : The method to use for sorting. Options are "quick", "heap", "naive", "quick-sem".
- **group_by** : The columns to group by before sorting. Each group will be sorted separately.
- **cascade_threshold**: The confidence threshold for cascading to a larger model.
- **return_stats** : Whether to return stats.
- **return_stats** : Whether to return stats.
2 changes: 1 addition & 1 deletion examples/model_examples/cache.py
@@ -23,7 +23,7 @@
df = pd.DataFrame(data)
user_instruction = "{Course Name} requires a lot of math"
df = df.sem_filter(user_instruction)
print("====== intial run ======")
print("====== initial run ======")
print(df)

# run a second time
2 changes: 1 addition & 1 deletion lotus/models/reranker.py
@@ -19,6 +19,6 @@ def __call__(self, query: str, docs: list[str], K: int) -> RerankerOutput:
K (int): The number of documents to keep after reranking.

Returns:
RerankerOutput: The indicies of the reranked documents.
RerankerOutput: The indices of the reranked documents.
"""
pass
2 changes: 1 addition & 1 deletion lotus/sem_ops/postprocessors.py
@@ -61,7 +61,7 @@ def extract_postprocess(llm_answers: list[str]) -> SemanticExtractPostprocessOut
Postprocess the output of the extract operator to extract the schema.

Args:
llm_answers (list[str]): The list of llm answers containging the extract.
llm_answers (list[str]): The list of llm answers containing the extract.

Returns:
SemanticExtractPostprocessOutput
2 changes: 1 addition & 1 deletion lotus/sem_ops/sem_join.py
@@ -384,7 +384,7 @@ def join_optimizer(
int: The number of LM calls from optimizing join plan.
"""

# Helper is currently default to similiarity join
# Helper is currently default to similarity join
if lotus.settings.helper_lm is not None:
lotus.logger.debug("Helper model is not supported yet. Default to similarity join.")

2 changes: 1 addition & 1 deletion lotus/utils.py
@@ -56,7 +56,7 @@ def ret(
rm.load_index(col_index_dir)
assert rm.index_dir == col_index_dir

ids = df.index.tolist() # assumes df index hasn't been resest and corresponds to faiss index ids
ids = df.index.tolist() # assumes df index hasn't been reset and corresponds to faiss index ids
vec_set = rm.get_vectors_from_index(col_index_dir, ids)
d = vec_set.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
8 changes: 7 additions & 1 deletion pyproject.toml
@@ -74,4 +74,10 @@ line-ending = "auto"
[tool.mypy]
python_version = "3.10"
strict = true
ignore_missing_imports = true
ignore_missing_imports = true
[tool.codespell]
# Ref: https://github.com/codespell-project/codespell#using-a-config-file
skip = '.git*'
check-hidden = true
ignore-regex = '\bParth\b'
ignore-words-list = 'ans'
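The `\b` word boundaries in `ignore-regex` mean codespell skips only the standalone token `Parth`, not words that merely contain it. A quick sketch of that matching behavior with Python's `re`:

```python
import re

# Same pattern as the config; \b anchors it to whole-word occurrences.
ignore = re.compile(r"\bParth\b")

print(bool(ignore.search("Reviewed-by: Parth")))   # prints: True
print(bool(ignore.search("the Parthenon ruins")))  # prints: False
```

Similarly, `ignore-words-list = 'ans'` suppresses the common false positive where "ans" (e.g. a variable name) would be flagged as a misspelling.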