From 58e0c35c52216d4273dcd4b68d3f9e5c0bc069b1 Mon Sep 17 00:00:00 2001
From: Alessio Vertemati
Date: Wed, 20 Nov 2024 15:21:12 +0100
Subject: [PATCH] Rename to Parxy

---
 Dockerfile |  6 +++---
 README.md  | 27 ++++++++++++++++-----------
 2 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/Dockerfile b/Dockerfile
index 6591947..3be6627 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -14,10 +14,10 @@ RUN pip install --upgrade pip && pip install --no-cache-dir -r requirements.txt
 FROM python:3.9.17-slim-bullseye AS runtime-image

 LABEL maintainer="OneOffTech " \
-    org.label-schema.name="data-house/pdf-text-extractor" \
-    org.label-schema.description="Docker image for the Data House PDF text extractor service." \
+    org.label-schema.name="OneOffTech/parxy" \
+    org.label-schema.description="Docker image for the Parxy service. A PDF parser gateway." \
     org.label-schema.schema-version="1.0" \
-    org.label-schema.vcs-url="https://github.com/data-house/pdf-text-extractor"
+    org.label-schema.vcs-url="https://github.com/OneOffTech/parxy"

 RUN apt-get update -yqq && \
     apt-get install -yqq --no-install-recommends tini \
diff --git a/README.md b/README.md
index 904d4ef..5f06f03 100644
--- a/README.md
+++ b/README.md
@@ -1,26 +1,29 @@
-[![CI](https://github.com/data-house/pdf-text-extractor/actions/workflows/ci.yml/badge.svg)](https://github.com/data-house/pdf-text-extractor/actions/workflows/ci.yml) [![Build Docker Image](https://github.com/data-house/pdf-text-extractor/actions/workflows/docker.yml/badge.svg)](https://github.com/data-house/pdf-text-extractor/actions/workflows/docker.yml)
+[![CI](https://github.com/OneOffTech/parxy/actions/workflows/ci.yml/badge.svg)](https://github.com/OneOffTech/parxy/actions/workflows/ci.yml) [![Build Docker Image](https://github.com/OneOffTech/parxy/actions/workflows/docker.yml/badge.svg)](https://github.com/OneOffTech/parxy/actions/workflows/docker.yml)

-# PDF Text Extraction Service
+# Parxy

-A FastAPI application to extract text from pdf documents.
+Parxy is a gateway service that provides a unified approach to accessing PDF parsing services and libraries. It is available as a library and as an http-based application.
+
+> [!NOTE]
+> Parxy is under active development.

 ## Getting started

-The PDF Text Extraction service is available as a Docker image.
+The easiest way to get started with Parxy is to use the Docker image provided.

 ```bash
-docker pull ghcr.io/data-house/pdf-text-extractor:main
+docker pull ghcr.io/OneOffTech/parxy:main
 ```

 A sample [`docker-compose.yaml` file](./docker-compose.yaml) is available within the repository.

-> Please refer to [Releases](https://github.com/data-house/pdf-text-extractor/releases) and [Packages](https://github.com/data-house/pdf-text-extractor/pkgs/container/pdf-text-extractor) for the available tags.
+> Please refer to [Releases](https://github.com/OneOffTech/parxy/releases) and [Packages](https://github.com/OneOffTech/parxy/pkgs/container/parxy) for the available tags.

 ## Usage

-The PDF Text Extract service expose a web application. The available API receive a PDF file via a URL and return the extracted text as a JSON response.
+Parxy expose a web-based application programming interface (API). The available API receive a PDF file via a URL and return the extracted text as a JSON response.

 The exposed service is unauthenticated therefore consider exposing it only within a trusted network. If you plan to make it available publicly consider adding a reverse proxy with authentication in front.

@@ -33,9 +36,11 @@ with the following input as a `json` body:
 - `mime_type`: the mime type of the file (it is expected to be `application/pdf`).
 - `driver`: two drivers are currently implemented `pymupdf` and `pdfact`. It defines the extraction backend to use.

-> **warning** The processing is performed synchronously
+> [!WARNING]
+> The processing is performed synchronously
+
+The response is a JSON structure following the [Parse Document Model](https://github.com/OneOffTech/parse-document-model-python).

-The response is a JSON with the extracted text organized into typed nodes, making it easy to navigate and understand the different components of a document.
 In particular, the structure is as follows:
 - `category`: A string specifying the node category, which is `doc`
 - `content`: A list of `page` nodes representing the pages within the document.
@@ -93,7 +98,7 @@ The body of the response can contain a JSON with the following fields:

 ## Development

-The PDF text extract service is built using [FastAPI](https://fastapi.tiangolo.com/) and Python 3.9.
+Parxy is built using [FastAPI](https://fastapi.tiangolo.com/) and Python 3.9.

 Given the selected stack the development requires:

@@ -121,7 +126,7 @@ _to be documented_

 ## Contributing

-Thank you for considering contributing to the PDF text extract service! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.
+Thank you for considering contributing to Parxy! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.

 ## Supporters
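
The request and response shapes described in the README's Usage section can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not Parxy's own client: the `url` key name and both helper names (`build_request_body`, `page_nodes`) are hypothetical, since the hunks above only show the `mime_type` and `driver` inputs and the `category`/`content` fields of the root node. The endpoint path is not visible in the patch, so the sketch makes no HTTP call.

```python
import json

# The two drivers named in the README hunk.
SUPPORTED_DRIVERS = ("pymupdf", "pdfact")


def build_request_body(url, driver):
    """Build the JSON body for an extraction request.

    Assumption: the key carrying the PDF location is called `url`;
    the patch only shows the `mime_type` and `driver` keys.
    """
    if driver not in SUPPORTED_DRIVERS:
        raise ValueError(f"unsupported driver: {driver!r}")
    return {
        "url": url,  # hypothetical key name
        "mime_type": "application/pdf",  # the README expects this value
        "driver": driver,
    }


def page_nodes(response):
    """Return the list of `page` nodes from a parsed JSON response.

    The README states that the root node has category `doc` and that
    `content` is a list of `page` nodes.
    """
    if response.get("category") != "doc":
        raise ValueError("expected a root node with category 'doc'")
    return list(response.get("content", []))


if __name__ == "__main__":
    body = build_request_body("https://example.com/sample.pdf", "pdfact")
    print(json.dumps(body, indent=2))
```

Rejecting an unknown driver before sending anything is a design choice of this sketch; because the README warns that processing is synchronous, failing early on the client side avoids waiting on a request that cannot succeed.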