Add Web API for MarkItDown
Related to microsoft#133

Improved with suggestions by @markthepixel, @GerardSmit and @ranma42.
vs4vijay authored and ranma42 committed Jan 28, 2025
1 parent bfde857 commit 2413ea7
Showing 4 changed files with 123 additions and 1 deletion.
3 changes: 2 additions & 1 deletion .dockerignore
@@ -1 +1,2 @@
*
*
!/src
26 changes: 26 additions & 0 deletions Dockerfile.api
@@ -0,0 +1,26 @@
FROM python:3.13-slim-bullseye

USER root

ARG INSTALL_GIT=false
RUN if [ "$INSTALL_GIT" = "true" ]; then \
    apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*; \
    fi

# Runtime dependency
RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
    && rm -rf /var/lib/apt/lists/*

# FIXME: should use markitdown from sources
RUN pip install markitdown fastapi[standard] uvicorn

# Default USERID and GROUPID
ARG USERID=10000
ARG GROUPID=10000

USER $USERID:$GROUPID

COPY src/markitdown/api.py /src/markitdown/api.py

ENTRYPOINT ["uvicorn", "src.markitdown.api:app", "--host", "0.0.0.0", "--port", "8000"]
40 changes: 40 additions & 0 deletions README.md
@@ -69,6 +69,46 @@ print(result.text_content)
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```

### Web API

You can also use MarkItDown via a REST endpoint. The Web API is built using FastAPI and can be run using Docker.

#### Running the Web API

1. Build the Docker image:

```sh
docker build -f Dockerfile.api -t markitdown-api:latest .
```

2. Run the Docker container:

```sh
docker run --rm -p 8000:8000 markitdown-api:latest
```

The Web API will be available at `http://localhost:8000`.

#### Using the Web API

The Web API provides a single endpoint `/convert` that accepts a file and returns the converted markdown.

- **Endpoint:** `/convert`
- **Method:** `POST`
- **Request Body:** Multipart form data with a file field named `file`
- **Response:** depends on the `Accept` header:
  - `application/json`: the JSON serialization of the `DocumentConverterResult`,
    i.e. an object with a `text_content` field containing the converted markdown
    and (optionally) a `title` field containing the title of the document
  - otherwise: a `text/markdown` response containing the converted markdown
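
One way to read the rule above in code is the following sketch. `choose_media_type` is a hypothetical helper, not the server's implementation: the handler in this commit compares the header against the exact string `application/json`, so sending exactly that value is the safest choice for clients. The sketch is slightly more lenient and, as a simplification, ignores `q` weights.

```python
from typing import Optional


def choose_media_type(accept_header: Optional[str]) -> str:
    """Pick the response media type for /convert from an Accept header value."""
    for media_range in (accept_header or "").split(","):
        # Drop parameters such as ";q=0.9" and normalize case and whitespace
        media_type = media_range.split(";", 1)[0].strip().lower()
        if media_type == "application/json":
            return "application/json"
    # Anything else, including a missing header, falls back to markdown
    return "text/markdown"
```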

Example using `curl`:

```sh
curl -X POST "http://localhost:8000/convert" -F "file=@path/to/your-file.pdf"
```
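
The same request can be made from Python. The sketch below uses only the standard library and assumes the API is listening on `http://localhost:8000` as described above; `build_multipart` and `convert_file` are hypothetical helper names, not part of MarkItDown.

```python
import json
import os
import uuid
from urllib.request import Request, urlopen


def build_multipart(filename: str, data: bytes, content_type: str = "application/octet-stream"):
    """Encode one file as multipart/form-data under the field name `file`."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert_file(path: str, base_url: str = "http://localhost:8000", as_json: bool = False):
    """POST a local file to /convert; return markdown text, or the parsed JSON object."""
    with open(path, "rb") as f:
        body, content_type = build_multipart(os.path.basename(path), f.read())
    headers = {"Content-Type": content_type}
    if as_json:
        # Ask for the JSON form of the result instead of text/markdown
        headers["Accept"] = "application/json"
    req = Request(f"{base_url}/convert", data=body, headers=headers, method="POST")
    with urlopen(req) as resp:
        text = resp.read().decode("utf-8")
    return json.loads(text) if as_json else text
```

With the container running, `convert_file("your-file.pdf")` would return the markdown as a string, and `convert_file("your-file.pdf", as_json=True)` the JSON object described above.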

<details>

<summary>Batch Processing Multiple Files</summary>
55 changes: 55 additions & 0 deletions src/markitdown/api.py
@@ -0,0 +1,55 @@
from mimetypes import guess_extension
from multiprocessing import Pool
from os.path import splitext
from shutil import copyfileobj
from tempfile import NamedTemporaryFile
from fastapi import FastAPI, Request, UploadFile, HTTPException
from fastapi.responses import JSONResponse, Response
from markitdown import MarkItDown


def convert_simple(local_path: str, **kwargs):
    return MarkItDown().convert(local_path, **kwargs)


pool = Pool()


def convert_upload(upload_file: UploadFile):
    file_extension = None
    ext = None

    # Guess the extension from the MIME type, if the client sent one
    if upload_file.content_type:
        file_extension = guess_extension(upload_file.content_type)

    # Read the extension from the filename
    if upload_file.filename:
        _, ext = splitext(upload_file.filename)

    # Save the upload to a temporary file. It is deleted when the `with` block exits,
    # so the conversion has to run inside the block
    with NamedTemporaryFile(suffix=ext) as temp:
        copyfileobj(upload_file.file, temp)
        temp.flush()
        upload_file.file.close()

        # The keyword must be the string "file_extension", not the variable's value
        return pool.apply(convert_simple, [temp.name], {"file_extension": file_extension})


app = FastAPI()


@app.post("/convert")
def convert(request: Request, file: UploadFile) -> Response:
    if not file.filename:
        raise HTTPException(status_code=400, detail="No file uploaded")

    try:
        result = convert_upload(file)

        if request.headers.get("Accept") == "application/json":
            # JSONResponse cannot serialize the result object directly, so build a
            # plain dict with the two fields documented in the README
            return JSONResponse(content={"title": result.title, "text_content": result.text_content})
        else:
            return Response(content=result.text_content, media_type="text/markdown")

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
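
To see what the two extension hints in `convert_upload` look like for a given upload, the lookup can be exercised on its own. This is a minimal sketch using only the standard library; `extension_hints` is a hypothetical name, and the guard for a missing content type is an assumption about how that case should be treated.

```python
from mimetypes import guess_extension
from os.path import splitext
from typing import Optional, Tuple


def extension_hints(content_type: Optional[str], filename: Optional[str]) -> Tuple[Optional[str], str]:
    """Return (extension guessed from the MIME type, extension taken from the filename)."""
    # guess_extension returns None for MIME types it does not know
    mime_ext = guess_extension(content_type) if content_type else None
    # splitext returns "" when the name has no extension; case is preserved
    name_ext = splitext(filename)[1] if filename else ""
    return mime_ext, name_ext


print(extension_hints("application/pdf", "Report.PDF"))
print(extension_hints(None, "notes"))
```

The two values can differ, since the MIME type is whatever the client declares. In `convert_upload` the first is intended as the `file_extension` hint for the converter and the second becomes the suffix of the temporary file.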
