pytesseract is a Python wrapper for Google's Tesseract-OCR Engine. It lets you extract text from images with just a few lines of code. It is very helpful if you work on document digitization, data extraction, or image-to-text conversion. All you need is the basics of Python, which you will have if you keep visiting PythonCentral 🙂
In this detailed guide, we will learn how to use pytesseract effectively, including setup, usage examples, advanced techniques, best practices, common pitfalls, and tips for better OCR accuracy. Ready? Get. Set. Learn!
How to Install pytesseract
Before you start using pytesseract, make sure you have Tesseract-OCR installed. Here is how you can install it on different operating systems:
- Windows: Download the installer from GitHub
- macOS: Open Terminal and execute "brew install tesseract"
- Linux (Debian/Ubuntu): Open Terminal and execute "sudo apt-get install tesseract-ocr"
Next, install the pytesseract package itself with pip:
pip install pytesseract
Then, install Pillow for image processing by executing the command:
pip install Pillow
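Before moving on, it can help to confirm that the Tesseract executable is actually reachable from Python. The following is a minimal sketch that uses only the standard library; the helper names find_tesseract and tesseract_version_line are made up for this example and are not part of pytesseract.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version_line():
    """Return the first line printed by `tesseract --version`, or None if unavailable."""
    path = find_tesseract()
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some Tesseract builds print the version to stderr, so check both streams.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print(find_tesseract() or "Tesseract not found on PATH")
    print(tesseract_version_line())
```

If the first line prints "Tesseract not found on PATH", either the installation did not finish or the install folder still needs to be added to your PATH.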
How Can You Use pytesseract
Here is how you can extract text from an image using pytesseract:
from PIL import Image
import pytesseract

# This step loads the image
image = Image.open('ExampleImage.png')

# Now let us perform OCR
text = pytesseract.image_to_string(image)
print(text)
Specifying Tesseract Path in Windows
On Windows, if Tesseract is not on your PATH, point pytesseract to the executable by setting this variable before you call any OCR function:
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
How to Read Text from Different Languages
Here is how you can specify a language for OCR:
text = pytesseract.image_to_string(image, lang='fra') # For French
The language pack for that code must be installed for this to work. Download additional language packs (traineddata files) from the official tessdata repository.
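Tesseract also accepts several languages at once when the codes are joined with a plus sign, for example 'eng+fra'. Here is a small sketch of that idea, assuming a pytesseract version recent enough to provide get_languages. The helpers build_lang, missing_languages, and ocr_multilingual are hypothetical names used only for illustration.

```python
def build_lang(codes):
    """Join Tesseract language codes into the 'eng+fra' form that the lang argument accepts."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def missing_languages(requested, installed):
    """Return the requested codes that are absent from the installed list, in order."""
    available = set(installed)
    return [code for code in requested if code not in available]


def ocr_multilingual(image, codes):
    # Imported lazily so the helpers above work without pytesseract installed.
    import pytesseract

    # get_languages() lists the language packs Tesseract can find.
    missing = missing_languages(codes, pytesseract.get_languages(config=""))
    if missing:
        raise RuntimeError(f"Missing language packs: {', '.join(missing)}")
    return pytesseract.image_to_string(image, lang=build_lang(codes))
```

Calling ocr_multilingual(image, ['eng', 'fra']) would then read a page that mixes English and French, and it fails early with a clear message if a pack is missing.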
Extracting Structured Data
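Now that we have covered the basics, let us look at some practical applications.
Creating Bounding Boxes for Characters
image_to_boxes returns one bounding box per recognized character. For word-level boxes, use image_to_data instead. Use this script to get the character boxes:
boxes = pytesseract.image_to_boxes(image)
print(boxes)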
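The string returned by image_to_boxes has one line per recognized character in the form "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image. Most drawing libraries measure from the top-left instead, so a conversion step is usually needed. The parser below is a sketch of that conversion; parse_boxes is a hypothetical helper, not a pytesseract function.

```python
def parse_boxes(box_text, image_height):
    """Parse image_to_boxes output into dicts using top-left-origin coordinates."""
    boxes = []
    for line in box_text.splitlines():
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or unexpected lines
        char, left, bottom, right, top, page = parts[0], *map(int, parts[1:])
        boxes.append({
            "char": char,
            "left": left,
            # Flip the y-axis: Tesseract measures from the bottom edge of the image.
            "top": image_height - top,
            "right": right,
            "bottom": image_height - bottom,
            "page": page,
        })
    return boxes
```

With a Pillow image you would call it as parse_boxes(pytesseract.image_to_boxes(image), image.height).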
Extracting Metadata at Word-Level
To extract word-level metadata from an image, you can use this script:
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
print(data['text'])
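Besides 'text', the dictionary holds parallel lists such as 'conf' (recognition confidence) and 'left', 'top', 'width', 'height' (word position). One common use is to drop words that Tesseract was unsure about. The sketch below assumes a cutoff of 60, which is an arbitrary starting point rather than a recommended value, and confident_words is a made-up helper name.

```python
def confident_words(data, min_conf=60.0):
    """Return (word, confidence) pairs from an image_to_data dict at or above min_conf."""
    words = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # may arrive as int, float, or numeric string
        # Entries that are not words have empty text and a confidence of -1.
        if text.strip() and conf >= min_conf:
            words.append((text, conf))
    return words
```

Pass it the dictionary from the script above: confident_words(data, min_conf=70).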
What is Preprocessing
Preprocessing means cleaning up an image before running OCR on it. Tesseract is more accurate on clean, high-contrast images, so a simple first step is to convert the image to grayscale and stretch its contrast:
from PIL import ImageOps

# Convert the image to grayscale and increase contrast
image = ImageOps.grayscale(image)
image = ImageOps.autocontrast(image)
For advanced preprocessing, follow these instructions:
- Use OpenCV to apply thresholding, denoising, and dilation.
- Crop unnecessary areas and remove noisy backgrounds.
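If you would rather not add OpenCV just for thresholding, the same idea can be sketched with Pillow alone. The example below uses Otsu's method, which picks the gray level that best separates dark pixels from light ones, and then turns the image into pure black and white. This is a Pillow-based stand-in for OpenCV's thresholding, and otsu_threshold and binarize are hypothetical helper names.

```python
def otsu_threshold(histogram):
    """Return the gray level that maximizes between-class variance (Otsu's method)."""
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_bg = 0
    sum_bg = 0
    best_threshold, best_variance = 0, -1.0
    for level, count in enumerate(histogram):
        weight_bg += count
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += level * count
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image):
    """Convert a Pillow image to black and white using an Otsu threshold."""
    gray = image.convert("L")
    threshold = otsu_threshold(gray.histogram())
    # Pixels brighter than the threshold become white, the rest black.
    return gray.point(lambda p: 255 if p > threshold else 0)
```

You would then pass the result straight to OCR, for example pytesseract.image_to_string(binarize(image)). Denoising and dilation still need OpenCV or a similar library.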
How to Perform OCR with PDF Files
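Tesseract cannot read PDF files directly, so you first use "pdf2image" to convert the PDF pages into images. pdf2image relies on Poppler, so make sure Poppler is installed on your system as well. As usual, we are going to use pip to install "pdf2image":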
pip install pdf2image
Once you have installed it, use this script to convert each PDF page to an image and then perform OCR on it:
import pytesseract
from pdf2image import convert_from_path

images = convert_from_path('sample.pdf')
for img in images:
    print(pytesseract.image_to_string(img))
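For longer documents it is often handy to keep the page number next to each page's text and to render the pages at a higher resolution. Here is a minimal sketch along those lines. The dpi value of 300 is an assumption that tends to suit scanned text, and ocr_pages and ocr_pdf are made-up helper names.

```python
def ocr_pages(images, ocr):
    """Run `ocr` on each page image and return a list of (page_number, text) pairs."""
    return [(number, ocr(img)) for number, img in enumerate(images, start=1)]


def ocr_pdf(pdf_path, dpi=300):
    # Imported lazily so ocr_pages can be reused without these packages installed.
    import pytesseract
    from pdf2image import convert_from_path

    # A higher dpi gives Tesseract more pixels per character, at the cost of speed.
    images = convert_from_path(pdf_path, dpi=dpi)
    return ocr_pages(images, pytesseract.image_to_string)
```

A call such as ocr_pdf('sample.pdf') returns one (page_number, text) pair per page, which you can loop over or write to a file.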
Some Advanced Use Cases
By now, you should be familiar with the basic use cases. Now it is time for some advanced real-world applications.
- Use pytesseract to extract fields like names, dates, and invoice numbers from scanned forms.
- You can capture screen content with "pyautogui" or "mss", then extract text using pytesseract.
- Use Tesseract’s CLI or pytesseract's image_to_pdf_or_hocr function to add a searchable text layer to scanned PDFs.
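To give the first use case a concrete shape, here is a sketch that pulls an invoice number and a date out of OCR text with regular expressions. The patterns are assumptions about how a form might be laid out, and extract_fields is a hypothetical helper, so expect to adjust both for your own documents.

```python
import re

# Assumed layouts: "Invoice No: INV-1042", "Invoice Number 2024-17", "Invoice # 881"
INVOICE_RE = re.compile(
    r"Invoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]*\d[A-Z0-9-]*)",
    re.IGNORECASE,
)
# Assumed date formats: 2024-03-15 or 15/03/2024
DATE_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})\b")


def extract_fields(text):
    """Return the first invoice number and date found in OCR text (None if absent)."""
    invoice = INVOICE_RE.search(text)
    date = DATE_RE.search(text)
    return {
        "invoice_number": invoice.group(1) if invoice else None,
        "date": date.group(1) if date else None,
    }
```

You would feed it the OCR result directly, for example extract_fields(pytesseract.image_to_string(image)). OCR output is rarely perfect, so it is worth validating the extracted values before you rely on them.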
Best Practices
- Always preprocess images to improve contrast and remove noise.
- Use the correct "lang" code for documents in non-English languages.
- Train Tesseract on custom fonts or layouts if OCR is inaccurate.
- Use bounding boxes to validate OCR results visually.
- Avoid heavily compressed images, because lossy compression artifacts blur character edges. Prefer PNG or high-quality JPEG.
Common Errors and How to Fix Them
Here are the common errors we face when we work with pytesseract and their solutions:
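- TesseractNotFoundError or FileNotFoundError: Tesseract is not installed or is not on your PATH. Install it, or make sure the Tesseract executable path is correctly specified (commonly needed on Windows).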
- Empty output: Check image quality and ensure text is not distorted.
- Incorrect characters: Use appropriate language packs and image preprocessing.
- Slow processing: Reduce image size or optimize the OCR pipeline using multiprocessing.
- Installation issues: Check if all dependencies (Tesseract, Pillow, pdf2image) are properly installed.
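For the slow-processing case, here is one way a batch of images could be processed in parallel. pytesseract launches the tesseract executable as a separate process for every call, so this sketch uses a thread pool from the standard library instead of the multiprocessing module. The names ocr_file and ocr_many and the default of 4 workers are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_file(path):
    # Imported lazily so ocr_many can be tried with a stand-in function.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(image)


def ocr_many(paths, ocr=ocr_file, max_workers=4):
    """OCR several images concurrently and return {path: text}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so zip pairs each path with its own result.
        return dict(zip(paths, pool.map(ocr, paths)))
```

A call such as ocr_many(['page1.png', 'page2.png']) returns a dictionary keyed by file name. Any exception raised while reading one image is re-raised when the results are collected, so wrap the call in try/except if some files may be unreadable.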
Wrapping Up
pytesseract helps Python developers add OCR capabilities to their applications with ease. Whether you are building document automation tools, digitizing printed media, or scraping screen data, pytesseract provides a versatile, open-source solution for text extraction that turns images into data your applications can use.