Source Code of our Paper:
Multi-Type-TD-TSR - Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition
TSR for partially bordered tables uses the same erosion algorithm as for bordered tables to detect the existing borders. Instead of using them to build the grid of cells, however, it deletes them from the table image, which yields an unbordered table. The algorithm for unbordered tables can then be applied to create the grid-cell image and contours, by analogy to the variants discussed above. A key feature of this approach is that it works with both bordered and unbordered tables: it is type-independent.
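The border-removal idea can be illustrated with a minimal sketch. It is not the code from this repository: it uses NumPy only, assumes a binarized image in which ink pixels are `True`, and the helper names `long_runs` and `remove_borders` as well as the `min_len` threshold are made up for this example. An erosion with a long line-shaped kernel keeps only pixels that sit on long straight runs, a dilation restores those runs to full length, and the result is subtracted from the image.

```python
import numpy as np


def long_runs(binary: np.ndarray, min_len: int, axis: int) -> np.ndarray:
    """Mark ink pixels lying on a straight run of >= min_len pixels along `axis`.

    This is a morphological opening with a line kernel: erosion, then dilation.
    """
    line_last = np.moveaxis(binary, axis, -1)
    # Erosion: a window start survives only if all min_len pixels in it are ink.
    windows = np.lib.stride_tricks.sliding_window_view(line_last, min_len, axis=-1)
    starts = windows.all(axis=-1)
    # Dilation: spread each surviving start over the min_len pixels it covers.
    opened = np.zeros_like(line_last)
    n = starts.shape[-1]
    for offset in range(min_len):
        opened[..., offset:offset + n] |= starts
    return np.moveaxis(opened, -1, axis)


def remove_borders(binary: np.ndarray, min_len: int):
    """Return (image without long lines, mask of the detected border lines)."""
    borders = long_runs(binary, min_len, axis=1) | long_runs(binary, min_len, axis=0)
    return binary & ~borders, borders


# Toy "partially bordered" table: two border lines and a blob standing in for text.
table = np.zeros((20, 20), dtype=bool)
table[5, :] = True        # horizontal border
table[:, 10] = True       # vertical border
table[12:15, 2:5] = True  # 3x3 blob of cell content

content, borders = remove_borders(table, min_len=10)
print(int(borders.sum()), int(content.sum()))
```

After this step only the cell content remains, so the image can be handed to the procedure for unbordered tables. The threshold `min_len` has to be longer than any stroke in the text, otherwise parts of characters would be deleted as well.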
Team | IoU 0.6 | IoU 0.7 | IoU 0.8 | IoU 0.9 | Weighted Average |
---|---|---|---|---|---|
CascadeTabNet | 0.438 | 0.354 | 0.19 | 0.036 | 0.232 |
NLPR-PAL | 0.365 | 0.305 | 0.195 | 0.035 | 0.206 |
Multi-Type-TD-TSR | 0.589 | 0.404 | 0.137 | 0.015 | 0.253 |
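
The last column can be reproduced from the four per-threshold scores if one assumes that each score is weighted by its IoU threshold, as is common for ICDAR 2019 style evaluations. This weighting is an assumption here, since the table itself does not state the formula; the function name is chosen for this sketch only.

```python
def weighted_average(scores: dict) -> float:
    """IoU-weighted mean: sum(iou * score) / sum(iou). Assumed weighting scheme."""
    return sum(iou * score for iou, score in scores.items()) / sum(scores)


# Scores copied from the table above, keyed by IoU threshold.
cascade_tab_net = {0.6: 0.438, 0.7: 0.354, 0.8: 0.19, 0.9: 0.036}
multi_type_td_tsr = {0.6: 0.589, 0.7: 0.404, 0.8: 0.137, 0.9: 0.015}

print(round(weighted_average(cascade_tab_net), 3))
print(round(weighted_average(multi_type_td_tsr), 3))
```

Because higher thresholds receive larger weights, a method that is strong at IoU 0.6 and 0.7 can still lead the weighted average while scoring lower at IoU 0.8 and 0.9.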
The source code was developed with the following library dependencies:
- PyTorch = 1.7.0
- Torchvision = 0.8.1
- Cuda = 10.1
- PyYAML = 5.1
The table detection model is based on detectron2. Follow the detectron2 installation guide to set it up.
For the image alignment pre-processing step there is one script available:
deskew.py
To apply the image alignment pre-processing algorithm to all images in one folder, you need to execute:
python3 deskew.py
with the following parameters
--folder
the input folder including document images
--output
the output folder for the deskewed images
For table structure recognition, we offer a single script that covers the different approaches:
tsr.py
To apply a table structure recognition algorithm to all images in one folder, you need to execute:
python3 tsr.py
with the following parameters
--folder
path of the input folder including table images
--type
the table structure recognition type, one of ["borderd", "unbordered", "partially", "partially_color_inv"]
--img_output
output folder path for the processed images
--xml_output
output folder path for the xml files including bounding boxes
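As a usage illustration, the following sketch assembles a complete `tsr.py` call from the parameters listed above and rejects an unknown type before the script is started. The helper `build_tsr_command` and the example paths are hypothetical and not part of this repository; the type values are copied verbatim from the list above, including their spelling.

```python
import shlex
from typing import List

# Allowed values, copied verbatim from the parameter description.
TSR_TYPES = ["borderd", "unbordered", "partially", "partially_color_inv"]


def build_tsr_command(folder: str, tsr_type: str,
                      img_output: str, xml_output: str) -> List[str]:
    """Build the argument list for tsr.py (hypothetical helper)."""
    if tsr_type not in TSR_TYPES:
        raise ValueError(f"unknown type {tsr_type!r}, expected one of {TSR_TYPES}")
    return [
        "python3", "tsr.py",
        "--folder", folder,
        "--type", tsr_type,
        "--img_output", img_output,
        "--xml_output", xml_output,
    ]


# Example paths are placeholders.
command = build_tsr_command("tables", "partially", "out/img", "out/xml")
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```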
To apply table detection followed by table structure recognition, we provide the script:
tdtsr.py
To apply table detection and table structure recognition to all images in one folder, you need to execute:
python3 tdtsr.py
with the following parameters
--folder
path of the input folder including table images
--type
the table structure recognition type, one of ["borderd", "unbordered", "partially", "partially_color_inv"]
--tsr_img_output
output folder path for the processed table images
--td_img_output
output folder path for the produced table cutouts
--xml_output
output folder path for the xml files for tables and cells including bounding boxes
--config
path of detectron2 configuration file for table detection
--yaml
path of detectron2 yaml file for table detection
--weights
path of detectron2 model weights for table detection
To evaluate the table structure recognition algorithm we provide the following script:
evaluate.py
For the evaluation, each table image and its label file in XML format must share the same name, and all files must lie in a single folder. The evaluation can be started with:
python3 evaluate.py
with the following parameter
--dataset
dataset folder path containing table images and labels in .xml format
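Since the evaluation expects every table image to have a label file of the same name in the same folder, it can help to check the folder first. The sketch below is not part of this repository: the function name and the set of image extensions are assumptions, and the demo uses a temporary folder with empty placeholder files.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed image formats


def unpaired_files(dataset: Path) -> Tuple[List[str], List[str]]:
    """Return (image names without an .xml label, label names without an image)."""
    files = list(dataset.iterdir())
    images = {p.stem for p in files if p.suffix.lower() in IMAGE_SUFFIXES}
    labels = {p.stem for p in files if p.suffix.lower() == ".xml"}
    return sorted(images - labels), sorted(labels - images)


# Demo on a throwaway folder with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["table_1.png", "table_1.xml", "table_2.jpg", "table_3.xml"]:
        (folder / name).touch()
    missing_labels, missing_images = unpaired_files(folder)

print(missing_labels, missing_images)
```

Both lists should be empty before `evaluate.py` is started on the folder.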
- test dataset for table structure recognition including table images and annotations can be downloaded here
- table detection detectron2 model weights and configuration files can be downloaded here
@misc{fischer2021multitypetdtsr,
title={Multi-Type-TD-TSR - Extracting Tables from Document Images using a Multi-stage Pipeline for Table Detection and Table Structure Recognition: from OCR to Structured Table Representations},
author={Pascal Fischer and Alen Smajic and Alexander Mehler and Giuseppe Abrami},
year={2021},
eprint={2105.11021},
archivePrefix={arXiv},
primaryClass={cs.CV}
}