A Python toolkit for parsing figures in old PDFs from the 1980s.
For now it contains a single tool, which extracts the y-value counts of histograms from screenshots taken from the PDFs.
The Histogram Parser is a Python-based tool designed to extract, process, and analyze data from histogram plots saved as images. It leverages OpenCV for image processing and Tesseract OCR to handle text detection, enabling users to generate CSV reports of histogram data while accurately mapping bins to labeled sections (e.g., cervical levels).
- Automatically preprocesses images by removing text regions detected using OCR.
- Allows manual selection of histogram and reference pixel ROIs.
- Scales histogram bin heights using user-defined vertical scales.
- Maps histogram bins to labeled segments (e.g., cervical levels).
- Generates detailed CSV reports with segment-level bin data.
- Python 3.8 or higher
- Tesseract OCR
- Required Python packages (listed in requirements.txt)
- Clone the repository:
  git clone <repository-url>
  cd <repository-folder>
- Install dependencies:
  pip install -r requirements.txt
- Ensure Tesseract OCR is installed and added to your system's PATH:
  - On Windows: Download and install Tesseract OCR from Tesseract GitHub.
  - On Linux: sudo apt-get install tesseract-ocr
  - On macOS: brew install tesseract
- Verify installation by running:
  tesseract --version
Use the provided main.py to parse single or batch image files.
- --input_file: Path to a single histogram image file.
- --batch: Path to a directory containing multiple histogram images.
- --tesseract_exe: Full path to the Tesseract executable (default: C:\msys64\mingw64\bin\tesseract.exe).
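The three arguments above could be wired up with argparse roughly as follows. This is only a sketch, not the contents of main.py: the `build_parser` name, the help strings, and the choice to make `--input_file` and `--batch` mutually exclusive are assumptions for illustration.

```python
import argparse

# Default path taken from the argument description above.
DEFAULT_TESSERACT = r"C:\msys64\mingw64\bin\tesseract.exe"


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with the three documented arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Parse histogram images into CSV files."
    )
    # Assumption: exactly one of the two input modes must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--input_file", help="Path to a single histogram image file.")
    source.add_argument("--batch", help="Directory containing multiple histogram images.")
    parser.add_argument(
        "--tesseract_exe",
        default=DEFAULT_TESSERACT,
        help="Full path to the Tesseract executable.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--input_file", "data/example.png"])
print(args.input_file, args.batch, args.tesseract_exe)
```

With this layout, passing both `--input_file` and `--batch` would be rejected by argparse; whether main.py behaves that way is not stated here.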
Process a single file:
python main.py --input_file data/example.png --tesseract_exe "C:\msys64\mingw64\bin\tesseract.exe"
Process a batch of files:
python main.py --batch data/ --tesseract_exe "C:\msys64\mingw64\bin\tesseract.exe"
- Set Histogram ROI:
  - Manually select the region containing the histogram.
- Set Reference Pixel ROI for Y-Axis:
  - Manually select a reference region for determining vertical scaling.
- Input Parameters:
  - Enter the vertical scale (counts) and labeled segments (e.g., C7, C8, T1).
- Generate Output:
  - The tool processes the image and saves results in a CSV file (e.g., example.csv).
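The vertical scaling step can be pictured as a simple proportion. The sketch below assumes that the height of the reference ROI, in pixels, corresponds to the vertical scale (counts) entered by the user; the function name `scale_bin_heights` and its parameters are hypothetical and not taken from the HistogramParser class.

```python
def scale_bin_heights(bin_heights_px, reference_height_px, vertical_scale):
    """Convert bin heights from pixels to counts.

    Assumption: a bar as tall as the reference ROI (reference_height_px)
    represents exactly vertical_scale counts.
    """
    if reference_height_px <= 0:
        raise ValueError("reference_height_px must be positive")
    return [h * vertical_scale / reference_height_px for h in bin_heights_px]


# Example: the reference ROI is 100 px tall and stands for 20 counts.
counts = scale_bin_heights([50, 125, 0, 200], reference_height_px=100, vertical_scale=20)
print(counts)  # [10.0, 25.0, 0.0, 40.0]
```

If the entered scale is in a different unit than the Y-axis of the image, every count is off by the same factor, which is worth checking first when the CSV values look wrong.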
The generated CSV contains three columns:
- Cervical Level: The label of the segment (e.g., C7).
- Sub-index: The index of the bin within the segment.
- Count: The scaled value for the bin.
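As an illustration of this layout, the sketch below writes the three columns for a list of scaled counts. It rests on assumptions that are not confirmed here: bins are split evenly across the labels from left to right, the sub-index starts at 0, and the header strings match the column names exactly. `write_segment_csv` is a hypothetical helper.

```python
import csv
import io


def write_segment_csv(counts, labels, stream):
    """Write one row per bin: segment label, index within the segment, count.

    Assumption: len(counts) is a multiple of len(labels) and each label
    owns an equal, contiguous run of bins.
    """
    if not labels or len(counts) % len(labels) != 0:
        raise ValueError("counts must divide evenly across labels")
    bins_per_segment = len(counts) // len(labels)
    writer = csv.writer(stream)
    writer.writerow(["Cervical Level", "Sub-index", "Count"])
    for i, count in enumerate(counts):
        label = labels[i // bins_per_segment]
        writer.writerow([label, i % bins_per_segment, count])


# Six bins spread over three labeled segments, written to an in-memory buffer.
buffer = io.StringIO()
write_segment_csv([10.0, 25.0, 0.0, 40.0, 5.0, 15.0], ["C7", "C8", "T1"], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open("example.csv", "w", newline="")` as `stream`.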
.
├── data/ # Directory for input images
├── pipeline/ # Core processing module
│ ├── histogram_parser.py # Main HistogramParser class
│ └── __init__.py # Package initialization
├── main.py # CLI script for running the parser
├── requirements.txt # Python dependencies
└── README.md # Documentation (this file)
- Tesseract OCR not found: Ensure Tesseract is installed and its path is correctly provided in the --tesseract_exe argument.
- ROI selection issues: Make sure to accurately select the histogram and reference ROIs during the interactive process.
- Unexpected CSV results: Verify that the input vertical scale matches the unit of the Y-axis in the image.
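For the first item, a quick way to check whether Tesseract can be found is to test the explicit path, or fall back to a PATH lookup. This is a standalone diagnostic sketch using only the standard library; `resolve_tesseract` is a hypothetical helper and not part of the tool.

```python
import os
import shutil


def resolve_tesseract(explicit_path=None):
    """Return a path to the Tesseract executable, or None if it cannot be found."""
    if explicit_path:
        # An explicit path (as given to --tesseract_exe) must point to an existing file.
        return explicit_path if os.path.isfile(explicit_path) else None
    # Otherwise search the directories listed in PATH.
    return shutil.which("tesseract")


found = resolve_tesseract()
print(found if found else "Tesseract not found on PATH; pass --tesseract_exe explicitly.")
```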
Contributions are welcome! Please follow these steps:
- Fork the repository.
- Create a feature branch:
git checkout -b feature-name
- Commit your changes and open a pull request.
This project is licensed under the MIT License. See LICENSE for details.
- OpenCV for image processing.
- Tesseract OCR for text detection and recognition.