TLDR-853 added info about GOST frame processing into docs
oksidgy committed Nov 18, 2024
1 parent cb701b0 commit 7edaa1b
Showing 8 changed files with 78 additions and 3 deletions.
2 changes: 1 addition & 1 deletion dedoc/api/web/index.html
@@ -122,7 +122,7 @@ <h4>Tables handling </h4>

<div class="parameters">
<h4>PDF handling</h4>
<details><summary>pdf_with_text_layer, fast_textual_layer_detection, language, pages, is_one_column_document, document_orientation, need_header_footer_analysis, need_binarization</summary>
<details><summary>pdf_with_text_layer, fast_textual_layer_detection, language, pages, is_one_column_document, document_orientation, need_header_footer_analysis, need_binarization, need_gost_frame_analysis</summary>
<br>
<p>
<label>
Binary file modified docs/source/_static/code_examples/test_dir/example.docx
Binary file not shown.
Binary file added docs/source/_static/page_with_gost_frame_1.png
Binary file added docs/source/_static/page_with_gost_frame_2.png
Binary file added docs/source/_static/result_gost_frame.png
1 change: 1 addition & 0 deletions docs/source/conf.py
@@ -38,6 +38,7 @@
("py:class", "abc.ABC"),
("py:class", "pydantic.main.BaseModel"),
("py:class", "scipy.stats._multivariate.dirichlet_multinomial_gen.cov"),
("py:class", "scipy.stats._multivariate.random_table_gen.rvs"),
("py:class", "pandas.core.series.Series"),
("py:class", "numpy.ndarray"),
("py:class", "pandas.core.frame.DataFrame"),
68 changes: 68 additions & 0 deletions docs/source/parameters/gost_frame_handling.rst
@@ -0,0 +1,68 @@
.. _gost_frame_handling:

GOST frame handling
====================

.. flat-table:: Parameters for GOST frame handling
    :widths: 5 5 3 15 72
    :header-rows: 1
    :class: tight-table

    * - Parameter
      - Possible values
      - Default value
      - Where can be used
      - Description

    * - need_gost_frame_analysis
      - True, False
      - False
      - * :meth:`dedoc.DedocManager.parse`
        * method :meth:`~dedoc.readers.BaseReader.read` of inheritors of :class:`~dedoc.readers.BaseReader`
        * :meth:`dedoc.readers.PdfTabbyReader.read`
      - This option is used to enable GOST (Russian government standard "ГОСТ Р 21.1101") frame recognition for PDF documents or images.


In some technical documents, the content of each page is placed inside a special GOST frame (see :ref:`example_gost_frame`).
Such frames contain meta-information and are not part of the text content of the document. For this reason, we have implemented functionality for ignoring GOST frames in documents, which works for:

* Copyable and non-copyable PDF documents (:class:`dedoc.readers.PdfTxtlayerReader` and :class:`dedoc.readers.PdfTabbyReader`);
* Images (:class:`dedoc.readers.PdfImageReader`).

If the parameter ``need_gost_frame_analysis`` is set to ``True``, the GOST frame itself is ignored and only the content inside the frame is extracted.
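
The same parameter can be passed when dedoc is used as a Python library. The sketch below is a minimal illustration, not the canonical way to call dedoc: it assumes that dedoc is installed and that :meth:`dedoc.DedocManager.parse` accepts a ``parameters`` dictionary of strings. The helper names ``build_gost_parameters`` and ``parse_without_gost_frame`` are hypothetical and are not part of dedoc.

```python
from typing import Dict


def build_gost_parameters(enable: bool = True) -> Dict[str, str]:
    """Build a parameters dict that toggles GOST frame analysis (hypothetical helper)."""
    return {
        # Same reader choice as in the API request example of this page
        "pdf_with_text_layer": "auto_tabby",
        # Parameter values are passed as strings: "true" or "false"
        "need_gost_frame_analysis": str(enable).lower(),
    }


def parse_without_gost_frame(file_path: str):
    """Parse a document while ignoring its GOST frames (hypothetical helper)."""
    # Imported lazily so that build_gost_parameters works without dedoc installed
    from dedoc import DedocManager

    manager = DedocManager()
    return manager.parse(file_path, parameters=build_gost_parameters(True))
```

Calling ``parse_without_gost_frame`` with the path to a PDF file or an image should return the parsed document with only the content inside the frames.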

.. _example_gost_frame:

Examples of GOST frame
----------------------
For example, suppose you send a PDF document with two pages:

.. image:: ../_static/page_with_gost_frame_1.png
    :width: 30%
.. image:: ../_static/page_with_gost_frame_2.png
    :width: 30%

Parameter's usage
-----------------

.. code-block:: python

    import requests

    # Placeholder path to the PDF document to send (replace with your own file)
    filename = "example.pdf"

    # need_gost_frame_analysis enables GOST frame recognition
    data = {
        "pdf_with_text_layer": "auto_tabby",
        "need_gost_frame_analysis": "true",
        "return_format": "html"
    }
    with open(filename, "rb") as file:
        files = {"file": (filename, file)}
        r = requests.post("http://localhost:1231/upload", files=files, data=data)
        result = r.content.decode("utf-8")

Request's result
----------------

.. image:: ../_static/result_gost_frame.png
    :width: 50%


10 changes: 8 additions & 2 deletions docs/source/parameters/pdf_handling.rst
@@ -159,8 +159,8 @@ PDF and images handling
* :meth:`dedoc.readers.ReaderComposition.read`
- This option is used to enable GOST (Russian government standard) frame recognition for PDF documents or images.
The GOST frame recognizer is used in :meth:`dedoc.readers.PdfBaseReader.read`. Its main function is to recognize and
ignore the GOST frame on the document. It allows :class:`dedoc.readers.PdfImageReader` and :class:`dedoc.readers.PdfTxtlayerReader`
to properly process the content of the document containing GOST frame.
ignore the GOST frame on the document. It allows :class:`dedoc.readers.PdfImageReader`, :class:`dedoc.readers.PdfTxtlayerReader`
and :class:`dedoc.readers.PdfTabbyReader` to properly process the content of a document containing a GOST frame. See :ref:`gost_frame_handling` for more details.

* - orient_analysis_cells
- True, False
@@ -185,3 +185,9 @@ PDF and images handling

* **270** -- cells are rotated 90 degrees clockwise;
* **90** -- cells are rotated 90 degrees counterclockwise (or 270 clockwise).


.. toctree::
    :maxdepth: 1

    gost_frame_handling
