docs: more README.md updates
nmammeri committed Sep 19, 2024
1 parent 9b73217 commit 3d10693
Showing 1 changed file with 94 additions and 20 deletions.
114 changes: 94 additions & 20 deletions README.md
@@ -19,7 +19,8 @@
<div align="center">

_Extractous offers a fast and efficient solution for extracting content and metadata from various document types such as PDF, Word, HTML, and [many other formats](#supported-file-formats).
Our goal is to deliver an efficient comprehensive solution with bindings for many programming languages._
Our goal is to deliver a fast and efficient comprehensive solution in Rust with bindings for many programming
languages._

</div>

@@ -30,32 +31,105 @@ Our goal is to deliver an efficient comprehensive solution with bindings for man
For complete benchmarking details, please consult our [benchmarking repository](https://github.com/yobix-ai/extractous-benchmarks).

![unstructured_vs_extractous](https://github.com/yobix-ai/extractous-benchmarks/raw/main/docs/extractous_vs_unstructured.gif)
<sup>* demo running at 5x recording speed </sup>
<sup>* demo running at 5x recording speed</sup>

## Why Extractous?

Extractous was mainly inspired by the [Unstructured Python library](https://github.com/Unstructured-IO/unstructured).
While Unstructured offers a good solution for parsing unstructured content, we see 2 main issues with it:
**Extractous** was born out of frustration with needing yet another service just to extract content from
unstructured data. Do we really need to call external APIs or run special servers just for content extraction? Can't
we perform the extraction locally and efficiently?

* Performance: data processing is mainly a cpu-bound problem and Python is not the best choice for such tasks
While researching this space, we found that **unstructured-io** offers a good solution for parsing unstructured
content and can run in-process. However, its performance is very poor and it has many limitations:
* **unstructured-io** wraps many heavy Python libraries, making it both slow and memory hungry. [See the benchmarks for more details](https://github.com/yobix-ai/extractous-benchmarks).
* Data processing is mainly a CPU-bound problem, and Python is not the best choice for such tasks
because of its Global Interpreter Lock (GIL), which makes it hard to utilize multiple cores.
* [Unstructured](https://github.com/Unstructured-IO/unstructured) is becoming more of an LLM framework rather than
just text and metadata parsing library.
* **unstructured-io** is becoming increasingly complex as it focuses on becoming more of a framework rather than
just a text and metadata extraction library.

In contrast, **Extractous** is built in Rust, a language renowned for its memory safety and high performance. By
leveraging Rust's multithreading capabilities and zero-cost abstractions, Extractous achieves significantly faster
processing speeds. **Extractous** maintains a dedicated focus on text and metadata extraction, ensuring optimized
performance and reliability in its core functionality.

## 🌳 Key Features
* Fast and efficient unstructured data extraction.
* Clear and simple API for extracting text and metadata content.
* Autodetects the document type and extracts content accordingly.
* Supports [many file formats](#supported-file-formats).
* Extracts text from images and scanned documents with OCR through [tesseract-ocr](https://github.com/tesseract-ocr/tesseract).
* Leverages Rust's performance and memory safety and provides bindings for [Python](https://pypi.org/project/extractous/)
and JavaScript/TypeScript (coming soon).
* Comprehensive documentation and examples to help you get started quickly.
* Free for Commercial Use: Apache 2.0 License.

Extractous will focus only on the text and metadata extraction part. The core is written in Rust, leveraging its
memory safety, multithreading and zero cost abstractions. Extractous will provide bindings for many programming
languages.
## 🚀 Quickstart
Extractous provides a simple and easy-to-use API for extracting content from various file formats. Below are examples:

## Features
### Python
* Extract a file's content to a string:
```python
from extractous import Extractor

* Clear simple API for extracting text and metadata content.
* Support for [many file formats](#supported-file-formats).
* Strives to be efficient and fast.
* Comprehensive documentation and examples to help you get started quickly.
# Create a new extractor
extractor = Extractor()
extractor.set_extract_string_max_length(1000)

# Extract text from a file
result = extractor.extract_file_to_string("README.md")
print(result)
```
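
The same call can be applied to a whole folder. The sketch below is illustrative and not part of the Extractous API: `extract_directory` and its `suffixes` parameter are hypothetical names, and the only Extractous method it assumes is `extract_file_to_string`, as used in the example above. The extractor is passed in as an argument, so any object with that method works.

```python
from pathlib import Path


def extract_directory(extractor, directory, suffixes=(".pdf", ".docx", ".html")):
    """Extract every matching file in `directory` (non-recursive).

    `extractor` is any object exposing extract_file_to_string(path), for
    example an extractous.Extractor. Returns a dict that maps each file name
    to whatever the extractor returned for it.
    """
    results = {}
    for path in sorted(Path(directory).iterdir()):
        # Skip sub-directories and files whose extension was not requested.
        if path.is_file() and path.suffix.lower() in suffixes:
            results[path.name] = extractor.extract_file_to_string(str(path))
    return results


# Hypothetical usage (requires the extractous package to be installed):
#   from extractous import Extractor
#   texts = extract_directory(Extractor(), "docs/")
```

Injecting the extractor keeps the helper independent of how the extractor was configured, for instance with `set_extract_string_max_length`.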

### Rust
* Extract a file's content to a string:
```rust
use extractous::Extractor;
use extractous::PdfParserConfig;

// Create a new extractor. Note it uses a consuming builder pattern
let mut extractor = Extractor::new().set_extract_string_max_length(1000);

// Extract text from a file
let text = extractor.extract_file_to_string("README.md").unwrap();
println!("{}", text);
```

## 🔥 Performance
* **Extractous** is fast. Don't take our word for it; you can run the [benchmarks](https://github.com/yobix-ai/extractous-benchmarks) yourself. For example, when extracting content from
PDF forms of sec10 filings, **Extractous** is 22x faster than **unstructured-io**:

![extractous_speedup_relative_to_unstructured](https://github.com/yobix-ai/extractous-benchmarks/raw/main/docs/extractous_speedup_relative_to_unstructured.png)

* It is not just fast but also memory efficient: **Extractous** allocates 12x less memory than **unstructured-io**:

![extractous_memory_efficiency_relative_to_unstructured](https://github.com/yobix-ai/extractous-benchmarks/raw/main/docs/extractous_memory_efficiency_relative_to_unstructured.png)
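
To check a speedup figure like the 22x quoted above on your own documents, the wall-clock time of each library can be measured and the two times divided. The following is a minimal sketch, not the harness used in the benchmarking repository; `best_time` and `speedup` are hypothetical helper names, and memory usage is not covered.

```python
import time


def best_time(fn, *args, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls to fn(*args)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


# Hypothetical usage, where extract_a and extract_b wrap the two libraries:
#   ratio = speedup(best_time(extract_a, "file.pdf"), best_time(extract_b, "file.pdf"))
```

Taking the minimum over several runs reduces noise from caches and other processes; a single run is rarely representative.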



## 📄 Supported file formats

| **Category** | **Supported Formats** | **Notes** |
|---------------------|---------------------------------------------------------|------------------------------------------------|
| **Microsoft Office**| DOC, DOCX, PPT, PPTX, XLS, XLSX, RTF | Includes legacy and modern Office file formats |
| **OpenOffice** | ODT, ODS, ODP | OpenDocument formats |
| **PDF** | PDF | Can extract embedded content and supports OCR |
| **Spreadsheets** | CSV, TSV | Plain text spreadsheet formats |
| **Web Documents** | HTML, XML | Parses and extracts content from web documents |
| **E-Books** | EPUB | EPUB format for electronic books |
| **Text Files** | TXT, Markdown | Plain text formats |
| **Images** | PNG, JPEG, TIFF, BMP, GIF, ICO, PSD, SVG | Extracts embedded text with OCR |
| **E-Mail** | EML, MSG, MBOX, PST | Extracts content, headers, and attachments |

[//]: # (| **Archives** | ZIP, TAR, GZIP, RAR, 7Z | Extracts content from compressed archives |)
[//]: # (| **Audio** | MP3, WAV, OGG, FLAC, AU, MIDI, AIFF, APE | Extracts metadata such as ID3 tags |)
[//]: # (| **Video** | MP4, AVI, MOV, WMV, FLV, MKV, WebM | Extracts metadata and basic information |)
[//]: # (| **CAD Files** | DXF, DWG | Supports CAD formats for engineering drawings |)
[//]: # (| **Other** | ICS &#40;Calendar&#41;, VCF &#40;vCard&#41; | Supports calendar and contact file formats |)
[//]: # (| **Geospatial** | KML, KMZ, GeoJSON | Extracts geospatial data and metadata |)
[//]: # (| **Font Files** | TTF, OTF | Extracts metadata from font files |)
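
The table above can also be written as a lookup, for example to filter files before handing them to an extractor. This is only a sketch: `SUPPORTED_FORMATS` and `category_for` are hypothetical names, the extensions are the usual ones for the formats listed (for instance `.md` for Markdown and `.jpg`/`.jpeg` for JPEG), and matching on the file extension is a simplification that says nothing about how Extractous itself detects document types.

```python
from pathlib import Path

# Categories and formats copied from the table; extensions are assumed.
SUPPORTED_FORMATS = {
    "Microsoft Office": {"doc", "docx", "ppt", "pptx", "xls", "xlsx", "rtf"},
    "OpenOffice": {"odt", "ods", "odp"},
    "PDF": {"pdf"},
    "Spreadsheets": {"csv", "tsv"},
    "Web Documents": {"html", "xml"},
    "E-Books": {"epub"},
    "Text Files": {"txt", "md"},
    "Images": {"png", "jpeg", "jpg", "tiff", "bmp", "gif", "ico", "psd", "svg"},
    "E-Mail": {"eml", "msg", "mbox", "pst"},
}


def category_for(filename):
    """Return the table category for a file name, or None if it is not listed."""
    ext = Path(filename).suffix.lower().lstrip(".")
    for category, extensions in SUPPORTED_FORMATS.items():
        if ext in extensions:
            return category
    return None
```

A file whose extension is missing from the lookup may still be readable by the library; the lookup only mirrors the table.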

## Supported file formats
## 🤝 Contributing
Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or new features to propose.

| File Format | Rust Core | Python Binding |
|-------------|-----------|----------------|
| pdf |||
| csv |||
## 🕮 License
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
