This project uses OpenCV and Tesseract OCR to detect and extract text from images. The program preprocesses images, identifies text regions, and converts them into readable text, which is saved to a file and optionally displayed. This project is ideal for automating text recognition tasks in scanned documents, photographs, or other image files.
- Preprocessing:
- Converts images to grayscale for simpler processing.
- Uses adaptive thresholding to binarize the image and make text regions prominent.
- Applies dilation to expand text regions for better detection.
- Text Detection:
- Finds contours to identify potential text regions.
- Filters out small contours to remove irrelevant areas.
- Text Extraction:
- Crops and resizes text regions for improved OCR accuracy.
- Extracts text using Tesseract OCR.
- Output:
- Saves the extracted text to `detected text.txt`.
- Prints the extracted text directly to the terminal.
- Python (3.x recommended)
- Required Libraries:
- Install using pip:
pip install opencv-python pytesseract
- Tesseract OCR:
- macOS:
brew install tesseract
- Linux:
sudo apt update
sudo apt install tesseract-ocr
- Windows:
- Download and install from Tesseract OCR GitHub.
- Clone the repository:
git clone https://github.com/your-username/text-extraction-opencv.git
cd text-extraction-opencv
- Ensure Tesseract is installed and accessible. Update the path to Tesseract in the script if necessary:
pytesseract.pytesseract.tesseract_cmd = '/path/to/tesseract'
- Add your input image (e.g., `sample.jpg`) to the project directory.
- Run the script:
python text_extraction.py
- Output:
- The extracted text will be saved in `detected text.txt`.
- Detected text will also be displayed in the terminal.
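For the step about making Tesseract accessible, one possible way to avoid hard-coding the path is to look it up at start-up. The sketch below uses only the standard library. The helper name `resolve_tesseract_cmd` and the `TESSERACT_CMD` environment variable are assumptions made for this example and are not part of the project or of pytesseract.

```python
import os
import shutil


def resolve_tesseract_cmd(explicit_path=None):
    """Return a path to an executable tesseract binary, or None.

    Hypothetical helper. Candidates are tried in order: an explicit path,
    the (assumed) TESSERACT_CMD environment variable, then the system PATH.
    """
    candidates = [
        explicit_path,
        os.environ.get("TESSERACT_CMD"),
        shutil.which("tesseract"),
    ]
    for candidate in candidates:
        if candidate and os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


if __name__ == "__main__":
    cmd = resolve_tesseract_cmd()
    if cmd is None:
        print("Tesseract not found; install it or pass an explicit path.")
    else:
        print(f"Using Tesseract at {cmd}")
        # In the script this value would then be assigned as shown above:
        # pytesseract.pytesseract.tesseract_cmd = cmd
```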
The program works as follows:
- Load and Preprocess the Image:
- Converts the input image to grayscale.
- Applies adaptive thresholding to binarize the image.
- Uses dilation to merge close text components.
- Detect Text Regions:
- Finds contours in the processed image.
- Filters small contours to avoid noise.
- Crops and resizes potential text regions.
- Extract and Output Text:
- Passes the cropped regions to Tesseract for text recognition.
- Writes the detected text to a file and displays it in the terminal.
- Improved OCR Accuracy:
- Implement advanced preprocessing techniques like noise reduction, skew correction, and edge enhancement for cleaner text regions.
- Experiment with different OCR configurations for handling complex layouts.
- Multi-Language Support:
- Use Tesseract's additional language data to support text extraction in languages other than English.
- Support for Curved Text:
- Enhance the program to detect and process curved or rotated text, for example by using the Hough transform or deskewing algorithms to correct rotation.
- Error Handling and Validation:
- Add error handling for missing dependencies, invalid input files, and unreadable text regions.
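As a starting point for the error-handling and OCR-configuration ideas above, a wrapper along these lines could be used. It is only a sketch under stated assumptions: the function name `extract_text_safely`, the error messages, and the default `--psm 6` page segmentation mode are choices made for illustration. The `lang` and `config` arguments are passed through to `pytesseract.image_to_string`; a value such as `lang="eng+fra"` only works if the corresponding Tesseract language data is installed.

```python
import os


def extract_text_safely(image_path, lang="eng", config="--psm 6"):
    """Run OCR on one image and raise descriptive errors on failure.

    Hypothetical helper; the name and defaults are assumptions.
    """
    # Invalid input file: fail early with a clear message.
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"Input image not found: {image_path}")

    # Missing dependencies: import lazily so the problem can be reported.
    try:
        import cv2
        import pytesseract
    except ImportError as exc:
        raise RuntimeError(
            "Missing dependency; try: pip install opencv-python pytesseract"
        ) from exc

    # cv2.imread returns None instead of raising when it cannot decode a file.
    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"Could not decode image: {image_path}")

    try:
        text = pytesseract.image_to_string(image, lang=lang, config=config)
    except pytesseract.TesseractNotFoundError as exc:
        raise RuntimeError(
            "Tesseract binary not found; install it or set tesseract_cmd."
        ) from exc

    # An empty result is returned as-is; callers decide how to handle
    # unreadable regions.
    return text.strip()


if __name__ == "__main__":
    try:
        print(extract_text_safely("sample.jpg"))
    except (FileNotFoundError, ValueError, RuntimeError) as err:
        print(f"Error: {err}")
```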