Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor into llama-cloud-services #597

Merged
merged 14 commits into from
Feb 6, 2025
Merged
Show file tree
Hide file tree
Changes from 13 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .github/workflows/build_package.yml
Original file line number Diff line number Diff line change
Expand Up @@ -45,4 +45,4 @@ jobs:
- name: Test import
shell: bash
working-directory: ${{ vars.RUNNER_TEMP }}
run: python -c "import llama_parse"
run: python -c "import llama_cloud_services"
18 changes: 17 additions & 1 deletion .github/workflows/publish_release.yml
Original file line number Diff line number Diff line change
Expand Up @@ -23,16 +23,31 @@ jobs:
uses: actions/setup-python@v4
with:
python-version: ${{ env.PYTHON_VERSION }}

- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: ${{ env.POETRY_VERSION }}

- name: Install deps
shell: bash
run: pip install -e .
- name: Build and publish to pypi

- name: Build and publish llama-cloud-services
uses: JRubics/[email protected]
with:
poetry_version: ${{ env.POETRY_VERSION }}
python_version: ${{ env.PYTHON_VERSION }}
working_directory: "llama_cloud_services"
pypi_token: ${{ secrets.LLAMA_PARSE_PYPI_TOKEN }}
poetry_install_options: "--without dev"

- name: Build and publish llama-parse
uses: JRubics/[email protected]
with:
poetry_version: ${{ env.POETRY_VERSION }}
python_version: ${{ env.PYTHON_VERSION }}
working_directory: "llama_parse"
pypi_token: ${{ secrets.LLAMA_PARSE_PYPI_TOKEN }}
poetry_install_options: "--without dev"

Expand All @@ -52,6 +67,7 @@ jobs:
export PKG=$(ls dist/ | grep tar)
set -- $PKG
echo "name=$1" >> $GITHUB_ENV

- name: Upload Release Asset (sdist) to GitHub
id: upload-release-asset
uses: actions/upload-release-asset@v1
Expand Down
3 changes: 2 additions & 1 deletion .pre-commit-config.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -33,6 +33,7 @@ repos:
rev: v1.0.1
hooks:
- id: mypy
exclude: ^tests/
additional_dependencies:
[
"types-requests",
Expand All @@ -46,7 +47,7 @@ repos:
[
--disallow-untyped-defs,
--ignore-missing-imports,
--python-version=3.8,
--python-version=3.10,
]
- repo: https://github.com/adamchainz/blacken-docs
rev: 1.16.0
Expand Down
158 changes: 22 additions & 136 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,165 +1,51 @@
# LlamaParse

[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-parse)](https://pypi.org/project/llama-parse/)
[![GitHub contributors](https://img.shields.io/github/contributors/run-llama/llama_parse)](https://github.com/run-llama/llama_parse/graphs/contributors)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/llama-cloud-services)](https://pypi.org/project/llama-cloud-services/)
[![GitHub contributors](https://img.shields.io/github/contributors/run-llama/llama_cloud_services)](https://github.com/run-llama/llama_cloud_services/graphs/contributors)
[![Discord](https://img.shields.io/discord/1059199217496772688)](https://discord.gg/dGcwcsnxhU)

LlamaParse is a **GenAI-native document parser** that can parse complex document data for any downstream LLM use case (RAG, agents).

It is really good at the following:

- ✅ **Broad file type support**: Parsing a variety of unstructured file types (.pdf, .pptx, .docx, .xlsx, .html) with text, tables, visual elements, weird layouts, and more.
- ✅ **Table recognition**: Parsing embedded tables accurately into text and semi-structured representations.
- ✅ **Multimodal parsing and chunking**: Extracting visual elements (images/diagrams) into structured formats and return image chunks using the latest multimodal models.
- ✅ **Custom parsing**: Input custom prompt instructions to customize the output the way you want it.
# Llama Cloud Services

LlamaParse directly integrates with [LlamaIndex](https://github.com/run-llama/llama_index).
This repository contains the code for hand-written SDKs and clients for interacting with LlamaCloud.

The free plan is up to 1000 pages a day. Paid plan is free 7k pages per week + 0.3c per additional page by default. There is a sandbox available to test the API [**https://cloud.llamaindex.ai/parse ↗**](https://cloud.llamaindex.ai/parse).
This includes:

Read below for some quickstart information, or see the [full documentation](https://docs.cloud.llamaindex.ai/).

If you're a company interested in enterprise RAG solutions, and/or high volume/on-prem usage of LlamaParse, come [talk to us](https://www.llamaindex.ai/contact).
- [LlamaParse](./parse.md) - A GenAI-native document parser that can parse complex document data for any downstream LLM use case (Agents, RAG, data processing, etc.).
- [LlamaReport (beta/invite-only)](./report.md) - A prebuilt agentic report builder that can be used to build reports from a variety of data sources.
- [LlamaExtract (beta/invite-only)](./extract.md) - A prebuilt agentic data extractor that can be used to transform data into a structured JSON representation.

## Getting Started

First, login and get an api-key from [**https://cloud.llamaindex.ai/api-key ↗**](https://cloud.llamaindex.ai/api-key).

Then, make sure you have the latest LlamaIndex version installed.

**NOTE:** If you are upgrading from v0.9.X, we recommend following our [migration guide](https://pretty-sodium-5e0.notion.site/v0-10-0-Migration-Guide-6ede431dcb8841b09ea171e7f133bd77), as well as uninstalling your previous version first.

```
pip uninstall llama-index # run this if upgrading from v0.9.x or older
pip install -U llama-index --upgrade --no-cache-dir --force-reinstall
```

Lastly, install the package:

`pip install llama-parse`

Now you can parse your first PDF file using the command line interface. Use the command `llama-parse [file_paths]`. See the help text with `llama-parse --help`.
Install the package:

```bash
export LLAMA_CLOUD_API_KEY='llx-...'

# output as text
llama-parse my_file.pdf --result-type text --output-file output.txt

# output as markdown
llama-parse my_file.pdf --result-type markdown --output-file output.md

# output as raw json
llama-parse my_file.pdf --output-raw-json --output-file output.json
pip install llama-cloud-services
```

You can also create simple scripts:

```python
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse

parser = LlamaParse(
api_key="llx-...", # can also be set in your env as LLAMA_CLOUD_API_KEY
result_type="markdown", # "markdown" and "text" are available
num_workers=4, # if multiple files passed, split in `num_workers` API calls
verbose=True,
language="en", # Optionally you can define a language, default=en
)

# sync
documents = parser.load_data("./my_file.pdf")

# sync batch
documents = parser.load_data(["./my_file1.pdf", "./my_file2.pdf"])
Then, get your API key from [LlamaCloud](https://cloud.llamaindex.ai/).

# async
documents = await parser.aload_data("./my_file.pdf")

# async batch
documents = await parser.aload_data(["./my_file1.pdf", "./my_file2.pdf"])
```

## Using with file object

You can parse a file object directly:
Then, you can use the services in your code:

```python
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_cloud_services import LlamaParse, LlamaReport, LlamaExtract

parser = LlamaParse(
api_key="llx-...", # can also be set in your env as LLAMA_CLOUD_API_KEY
result_type="markdown", # "markdown" and "text" are available
num_workers=4, # if multiple files passed, split in `num_workers` API calls
verbose=True,
language="en", # Optionally you can define a language, default=en
)

file_name = "my_file1.pdf"
extra_info = {"file_name": file_name}

with open(f"./{file_name}", "rb") as f:
# must provide extra_info with file_name key with passing file object
documents = parser.load_data(f, extra_info=extra_info)

# you can also pass file bytes directly
with open(f"./{file_name}", "rb") as f:
file_bytes = f.read()
# must provide extra_info with file_name key with passing file bytes
documents = parser.load_data(file_bytes, extra_info=extra_info)
parser = LlamaParse(api_key="YOUR_API_KEY")
report = LlamaReport(api_key="YOUR_API_KEY")
extractor = LlamaExtract(api_key="YOUR_API_KEY")
```

## Using with `SimpleDirectoryReader`
See the quickstart guides for each service for more information:

You can also integrate the parser as the default PDF loader in `SimpleDirectoryReader`:

```python
import nest_asyncio

nest_asyncio.apply()

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
api_key="llx-...", # can also be set in your env as LLAMA_CLOUD_API_KEY
result_type="markdown", # "markdown" and "text" are available
verbose=True,
)

file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
"./data", file_extractor=file_extractor
).load_data()
```

Full documentation for `SimpleDirectoryReader` can be found on the [LlamaIndex Documentation](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader.html).

## Examples

Several end-to-end indexing examples can be found in the examples folder

- [Getting Started](examples/demo_basic.ipynb)
- [Advanced RAG Example](examples/demo_advanced.ipynb)
- [Raw API Usage](examples/demo_api.ipynb)
- [LlamaParse](./parse.md)
- [LlamaReport (beta/invite-only)](./report.md)
- [LlamaExtract (beta/invite-only)](./extract.md)

## Documentation

[https://docs.cloud.llamaindex.ai/](https://docs.cloud.llamaindex.ai/)
You can see complete SDK and API documentation for each service on [our official docs](https://docs.cloud.llamaindex.ai/).

## Terms of Service

See the [Terms of Service Here](./TOS.pdf).

## Get in Touch (LlamaCloud)

LlamaParse is part of LlamaCloud, our e2e enterprise RAG platform that provides out-of-the-box, production-ready connectors, indexing, and retrieval over your complex data sources. We offer SaaS and VPC options.

LlamaCloud is currently available via waitlist (join by [creating an account](https://cloud.llamaindex.ai/)). If you're interested in state-of-the-art quality and in centralizing your RAG efforts, come [get in touch with us](https://www.llamaindex.ai/contact).
You can get in touch with us by following our [contact link](https://www.llamaindex.ai/contact).
Binary file added examples/extract/data/resumes/ai_researcher.pdf
Binary file not shown.
Binary file added examples/extract/data/resumes/ml_engineer.pdf
Binary file not shown.
Binary file not shown.
Loading
Loading