PDF Figure Parsing

Contains Python toolkit for parsing figures from old PDF figures from the 80's.
For now there is just a tool for getting the y-value counts from histograms based on screenshotted images from the PDF.

HistogramParser Overview

The Histogram Parser is a Python-based tool designed to extract, process, and analyze data from histogram plots saved as images. It leverages OpenCV for image processing and Tesseract OCR to handle text detection, enabling users to generate CSV reports of histogram data while accurately mapping bins to labeled sections (e.g., cervical levels).

HistogramParser Features

Automatically preprocesses images by removing text regions detected using OCR.
Allows manual selection of histogram and reference pixel ROIs.
Scales histogram bin heights using user-defined vertical scales.
Maps histogram bins to labeled segments (e.g., cervical levels).
Generates detailed CSV reports with segment-level bin data.

Installation

Prerequisites

Python 3.8 or higher
Tesseract OCR
Required Python packages (listed in requirements.txt)

Steps

Clone the repository:

git clone <repository-url>
cd <repository-folder>

Install dependencies:
```
pip install -r requirements.txt
```
Ensure Tesseract OCR is installed and added to your system's PATH:
- On Windows: Download and install Tesseract OCR from Tesseract GitHub.
- On Linux:
```
sudo apt-get install tesseract-ocr
```
- On macOS:
```
brew install tesseract
```
Verify installation by running:
```
tesseract --version
```

HistogramParser Usage

Command-Line Interface

Use the provided main.py to parse single or batch image files.

Arguments:

--input_file: Path to a single histogram image file.
--batch: Path to a directory containing multiple histogram images.
--tesseract_exe: Full path to the Tesseract executable (default: C:\msys64\mingw64\bin\tesseract.exe).

Example:

Process a single file:

python main.py --input_file data/example.png --tesseract_exe "C:\msys64\mingw64\bin\tesseract.exe"

Process a batch of files:

python main.py --batch data/ --tesseract_exe "C:\msys64\mingw64\bin\tesseract.exe"

Interactive Steps

Set Histogram ROI:
- Manually select the region containing the histogram.
Set Reference Pixel ROI for Y-Axis:
- Manually select a reference region for determining vertical scaling.
Input Parameters:
- Enter the vertical scale (counts) and labeled segments (e.g., C7, C8, T1).
Generate Output:
- The tool processes the image and saves results in a CSV file (e.g., example.csv).

CSV Output

The generated CSV contains three columns:

Cervical Level: The label of the segment (e.g., C7).
Sub-index: The index of the bin within the segment.
Count: The scaled value for the bin.

File Structure

.
├── data/                      # Directory for input images
├── pipeline/                  # Core processing module
│   ├── histogram_parser.py    # Main HistogramParser class
│   ├── __init__.py            # Package initialization
├── main.py                    # CLI script for running the parser
├── requirements.txt           # Python dependencies
└── README.md                  # Documentation (this file)

Troubleshooting

Tesseract OCR not found: Ensure Tesseract is installed and its path is correctly provided in the --tesseract_exe argument.
ROI selection issues: Make sure to accurately select the histogram and reference ROIs during the interactive process.
Unexpected CSV results: Verify that the input vertical scale matches the unit of the Y-axis in the image.

Contributing

Contributions are welcome! Please follow these steps:

Fork the repository.
Create a feature branch:
```
git checkout -b feature-name
```
Commit your changes and open a pull request.

License

This project is licensed under the MIT License. See LICENSE for details.

Acknowledgments

OpenCV for image processing.
Tesseract OCR for text detection and recognition.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PDF Figure Parsing

HistogramParser Overview

HistogramParser Features

Installation

Prerequisites

Steps

HistogramParser Usage

Command-Line Interface

Arguments:

Example:

Interactive Steps

CSV Output

File Structure

Troubleshooting

Contributing

License

Acknowledgments

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
pipeline		pipeline
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
main.py		main.py
requirements.txt		requirements.txt

Neuro-Mechatronics-Interfaces/pdf_figure_parsing

Folders and files

Latest commit

History

Repository files navigation

PDF Figure Parsing

HistogramParser Overview

HistogramParser Features

Installation

Prerequisites

Steps

HistogramParser Usage

Command-Line Interface

Arguments:

Example:

Interactive Steps

CSV Output

File Structure

Troubleshooting

Contributing

License

Acknowledgments

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages