This project demonstrates Optical Character Recognition (OCR) with Python and Pytesseract, used to convert voter information provided in Tamil into machine-readable text. OCR is a technology that extracts text from images or scanned documents. The project uses Pytesseract together with NumPy, Pandas, PyPDF2, PIL (Python Imaging Library), and Google Translator (through the googletrans package) to perform OCR tasks in a Jupyter Notebook environment.
Make sure you have the following installed:
- Python: Download Python
- Jupyter Notebook: Installation Guide
- Pytesseract:
pip install pytesseract
- Tesseract OCR Engine: Tesseract Installation Guide
Additionally, install the required Python libraries:
pip install numpy pandas PyPDF2 googletrans==4.0.0-rc1
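Before running the notebook, it can help to confirm that the Tesseract binary is reachable and that Tamil language data is installed. The following is a minimal sketch, assuming the `tesseract` executable is on your PATH; the helper names `parse_language_list` and `tesseract_languages` are illustrative and not part of this project.

```python
import shutil
import subprocess


def parse_language_list(output):
    """Return language codes from `tesseract --list-langs` output, skipping the header line."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def tesseract_languages():
    """Return the installed Tesseract languages, or None if the binary is not on PATH."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True
    )
    # The list may arrive on stdout or stderr depending on the release, so read both.
    # Any warning lines Tesseract prints would also end up in the result.
    return parse_language_list(result.stdout + "\n" + result.stderr)


if __name__ == "__main__":
    langs = tesseract_languages()
    if langs is None:
        print("Tesseract was not found on PATH.")
    elif "tam" not in langs:
        print("Tesseract is installed, but Tamil ('tam') language data is missing.")
    else:
        print("Tesseract is installed with Tamil support.")
```

If `tam` is missing, the Tamil trained data file has to be added to your Tesseract installation before Tamil text can be recognized.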
Language Translation (Optional):
- To translate the extracted text into another language, modify the notebook to pass it through Google Translator using the googletrans package installed above. This step needs an internet connection.
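The optional translation step could look roughly like this. It is a sketch only, assuming the pinned `googletrans==4.0.0-rc1` release (where `Translator.translate` is a synchronous call) and a Tamil-to-English translation; the function name `translate_text` and its parameters are hypothetical, not taken from the notebook.

```python
def translate_text(text, translator=None, src="ta", dest="en"):
    """Translate OCR output; returns an empty string for blank input.

    `translator` is any object with a googletrans-style `translate` method.
    If omitted, a googletrans Translator is created (requires network access).
    """
    if not text.strip():
        return ""
    if translator is None:
        # Imported lazily so the helper can be loaded without googletrans installed.
        from googletrans import Translator

        translator = Translator()
    # 'ta' and 'en' are the language codes for Tamil and English.
    return translator.translate(text, src=src, dest=dest).text
```

In the notebook this would be called as `translate_text(extracted_text)`. Because googletrans is an unofficial client for Google Translate, requests can fail or be rate limited, so wrapping the call in error handling is advisable for larger batches.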
import pytesseract
from PIL import Image
# Read an image from file
image_path = 'images/sample_image.png'
image = Image.open(image_path)
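# Perform OCR using Pytesseract
# lang='tam' selects Tesseract's Tamil model (assumes the Tamil language data is installed);
# without it, Tesseract defaults to English and Tamil script will not be recognized correctly
extracted_text = pytesseract.image_to_string(image, lang='tam')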
# Print the extracted text
print("Extracted Text:")
print(extracted_text)
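Raw OCR output usually contains blank lines, uneven spacing, and a trailing form-feed character, so some cleanup helps before loading it into Pandas. The sketch below is an assumption about how that step might look, not the notebook's actual processing: `clean_ocr_lines` and `lines_to_dataframe` are hypothetical helpers, and the one-column layout is purely illustrative.

```python
def clean_ocr_lines(text):
    """Split OCR output into non-empty lines with whitespace collapsed."""
    cleaned = []
    # splitlines() also breaks on the form feed that Tesseract appends per page
    for line in text.splitlines():
        line = " ".join(line.split())
        if line:
            cleaned.append(line)
    return cleaned


def lines_to_dataframe(lines):
    """Place cleaned lines in a single-column DataFrame (illustrative layout)."""
    import pandas as pd  # imported lazily; only needed for this step

    return pd.DataFrame({"line": lines})
```

For example, `lines_to_dataframe(clean_ocr_lines(extracted_text))` gives one row per recognized line, which can then be split into fields once the layout of the source document is known.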
Feel free to contribute, open issues, or provide feedback.🚀