Commit

Rename to Parxy
avvertix committed Nov 20, 2024
1 parent 62e977f commit 58e0c35
Showing 2 changed files with 19 additions and 14 deletions.
6 changes: 3 additions & 3 deletions Dockerfile
@@ -14,10 +14,10 @@ RUN pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt
FROM python:3.9.17-slim-bullseye AS runtime-image

LABEL maintainer="OneOffTech <[email protected]>" \
org.label-schema.name="data-house/pdf-text-extractor" \
org.label-schema.description="Docker image for the Data House PDF text extractor service." \
org.label-schema.name="OneOffTech/parxy" \
org.label-schema.description="Docker image for the Parxy service. A PDF parser gateway." \
org.label-schema.schema-version="1.0" \
org.label-schema.vcs-url="https://github.com/data-house/pdf-text-extractor"
org.label-schema.vcs-url="https://github.com/OneOffTech/parxy"

RUN apt-get update -yqq && \
apt-get install -yqq --no-install-recommends tini \
27 changes: 16 additions & 11 deletions README.md
@@ -1,26 +1,29 @@
[![CI](https://github.com/data-house/pdf-text-extractor/actions/workflows/ci.yml/badge.svg)](https://github.com/data-house/pdf-text-extractor/actions/workflows/ci.yml) [![Build Docker Image](https://github.com/data-house/pdf-text-extractor/actions/workflows/docker.yml/badge.svg)](https://github.com/data-house/pdf-text-extractor/actions/workflows/docker.yml)
[![CI](https://github.com/OneOffTech/parxy/actions/workflows/ci.yml/badge.svg)](https://github.com/OneOffTech/parxy/actions/workflows/ci.yml) [![Build Docker Image](https://github.com/OneOffTech/parxy/actions/workflows/docker.yml/badge.svg)](https://github.com/OneOffTech/parxy/actions/workflows/docker.yml)

# PDF Text Extraction Service
# Parxy

A FastAPI application to extract text from pdf documents.
Parxy is a gateway service that provides a unified approach to accessing PDF parsing services and libraries. It is available as a library and as an http-based application.

> [!NOTE]
> Parxy is under active development.
## Getting started

The PDF Text Extraction service is available as a Docker image.
The easiest way to get started with Parxy is to use the Docker image provided.

```bash
docker pull ghcr.io/data-house/pdf-text-extractor:main
docker pull ghcr.io/OneOffTech/parxy:main
```

A sample [`docker-compose.yaml` file](./docker-compose.yaml) is available within the repository.


> Please refer to [Releases](https://github.com/data-house/pdf-text-extractor/releases) and [Packages](https://github.com/data-house/pdf-text-extractor/pkgs/container/pdf-text-extractor) for the available tags.
> Please refer to [Releases](https://github.com/OneOffTech/parxy/releases) and [Packages](https://github.com/OneOffTech/parxy/pkgs/container/parxy) for the available tags.

## Usage

The PDF Text Extract service expose a web application. The available API receive a PDF file via a URL and return the extracted text as a JSON response.
Parxy exposes a web-based application programming interface (API). The API receives a PDF file via a URL and returns the extracted text as a JSON response.

The exposed service is unauthenticated, so consider exposing it only within a trusted network. If you plan to make it publicly available, consider adding a reverse proxy with authentication in front of it.

@@ -33,9 +36,11 @@ with the following input as a `json` body:
- `mime_type`: the MIME type of the file (it is expected to be `application/pdf`).
- `driver`: the extraction backend to use. Two drivers are currently implemented: `pymupdf` and `pdfact`.
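
As an illustration of the request body described above, here is a minimal Python sketch that builds the JSON payload. The `mime_type` and `driver` fields come from the list above. The `url` field name is an assumption based on the statement that the API receives the PDF via a URL, and `build_extract_payload` is a hypothetical helper, not part of Parxy. The endpoint path is not shown in this diff, so the sketch stops at building the body.

```python
import json

# Drivers named in the README; anything else is rejected client-side.
SUPPORTED_DRIVERS = ("pymupdf", "pdfact")


def build_extract_payload(pdf_url: str, driver: str = "pymupdf") -> dict:
    """Build the JSON body for an extraction request (hypothetical helper)."""
    if driver not in SUPPORTED_DRIVERS:
        raise ValueError(f"unsupported driver: {driver!r}")
    return {
        "url": pdf_url,  # assumed field name, not confirmed by the diff
        "mime_type": "application/pdf",
        "driver": driver,
    }


if __name__ == "__main__":
    payload = build_extract_payload("https://example.com/sample.pdf", "pdfact")
    print(json.dumps(payload, indent=2))
```

Because processing is synchronous, a client sending this body should expect the request to block until extraction finishes, so a generous timeout may be appropriate for large files.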

> **warning** The processing is performed synchronously
> [!WARNING]
> The processing is performed synchronously
The response is a JSON structure following the [Parse Document Model](https://github.com/OneOffTech/parse-document-model-python).

The response is a JSON with the extracted text organized into typed nodes, making it easy to navigate and understand the different components of a document.
In particular, the structure is as follows:
- `category`: A string specifying the node category, which is `doc`
- `content`: A list of `page` nodes representing the pages within the document.
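
A small Python sketch of how a client might check the top-level shape described above, relying only on the two fields listed here (`category` equal to `doc`, and `content` as a list of page nodes). The function name `count_pages` is hypothetical, and the page nodes in the sample are empty placeholders because their fields are not shown in this part of the diff.

```python
def count_pages(document: dict) -> int:
    """Validate the top-level node and return the number of page nodes."""
    if document.get("category") != "doc":
        raise ValueError("expected a top-level node with category 'doc'")
    content = document.get("content")
    if not isinstance(content, list):
        raise ValueError("expected 'content' to be a list of page nodes")
    return len(content)


if __name__ == "__main__":
    # Placeholder response: real page nodes carry fields not shown here.
    sample = {"category": "doc", "content": [{}, {}]}
    print(count_pages(sample))
```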
@@ -93,7 +98,7 @@ The body of the response can contain a JSON with the following fields:

## Development

The PDF text extract service is built using [FastAPI](https://fastapi.tiangolo.com/) and Python 3.9.
Parxy is built using [FastAPI](https://fastapi.tiangolo.com/) and Python 3.9.

Given the selected stack the development requires:

@@ -121,7 +126,7 @@ _to be documented_

## Contributing

Thank you for considering contributing to the PDF text extract service! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.
Thank you for considering contributing to Parxy! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.


## Supporters