saltcorn/docling
docling

Convert document files (PDF, DOCX, PPTX) to Markdown

This module provides a Saltcorn action that uses docling to convert files (PDF, .docx Word documents and more) to markdown.

Installation

Before the action can run, the module needs to install docling in a new Python virtual environment. Please make sure your server is able to create virtual environments. On Debian/Ubuntu, this can be done by installing the python3-venv system package.

To set up the virtual environment, open the configuration for the Saltcorn docling module. You will see a short message about this installation requirement. Click Finish to perform the actual installation.

Action Usage

To use this module, create a table that has a file field (which will contain the source documents) and a string or markdown field (which will receive the markdown converted from the source document). Then link the action docling_to_markdown to a button or to a table trigger on this table. In the configuration for this action, choose the appropriate fields. Running the action will then perform the conversion and set the contents of the string or markdown field.

Function Usage

The module also provides two functions, docling_file_to_markdown and docling_html_to_markdown, which each take a single argument and return a markdown string.

docling_file_to_markdown takes a string containing a file path in your file store and produces markdown.

docling_html_to_markdown takes a string containing HTML (this should be a full document with DOCTYPE) and produces markdown.
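
Because docling_html_to_markdown expects a full document rather than a fragment, a small helper can wrap a fragment first. The following is a minimal sketch, not part of the module: wrapHtmlDocument and the fixed title are hypothetical names introduced for illustration, and the commented-out call assumes the module function is in scope inside a run_js_code action. Whether the call returns a promise is not stated here, so the sketch uses await, which is harmless for a synchronous result.

```javascript
// Hypothetical helper (not provided by the module): wrap an HTML fragment
// in a minimal full document with a DOCTYPE.
function wrapHtmlDocument(fragment) {
  // Leave the input untouched if it already starts with a DOCTYPE.
  if (/^\s*<!doctype\s+html/i.test(fragment)) return fragment;
  return [
    "<!DOCTYPE html>",
    "<html>",
    '<head><meta charset="utf-8"><title>Document</title></head>',
    `<body>${fragment}</body>`,
    "</html>",
  ].join("\n");
}

const fullHtml = wrapHtmlDocument("<h1>Hello</h1><p>world</p>");

// Assumed usage inside a run_js_code action, where the module's function
// is expected to be available:
// const md = await docling_html_to_markdown(fullHtml);
```

Passing an already complete document through the helper returns it unchanged, so it is safe to apply to input of unknown origin.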

This module also exports two other functions to generate text from HTML:

The function htmlToText is the htmlToText function (an alias for convert) from the html-to-text module. It generates plain text (not markdown) from a single argument, which is an HTML string.

The function turndown_html_to_markdown uses the turndown module to generate markdown from HTML. It takes two arguments: the first is an HTML string, and the second, which is optional, is an object with the turndown options.

Example for run_js_code action:

const md = turndown_html_to_markdown(`<h1>Hello</h1>world`)
console.log(md)
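
The optional second argument can be sketched as follows. The option names used here (headingStyle, bulletListMarker, codeBlockStyle) come from the turndown library itself. That the module passes this object to turndown unchanged is an assumption based on the description above. The typeof guard is only there so the snippet also runs outside Saltcorn, where the function is not defined.

```javascript
// Turndown options: "#"-style headings, "-" list bullets, fenced code blocks.
// Turndown's own defaults are setext headings, "*" bullets, indented code.
const turndownOptions = {
  headingStyle: "atx",
  bulletListMarker: "-",
  codeBlockStyle: "fenced",
};

const html = "<h1>Hello</h1><ul><li>one</li><li>two</li></ul>";

// Inside a run_js_code action the function is assumed to be in scope;
// elsewhere this block is skipped.
if (typeof turndown_html_to_markdown === "function") {
  const mdWithOptions = turndown_html_to_markdown(html, turndownOptions);
  console.log(mdWithOptions);
}
```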
