Basic OCR Usage with Docprompt
This guide walks through the basics of Optical Character Recognition (OCR) with Docprompt: setting up the OCR provider, processing a document, and accessing the results.
Prerequisites
Before you begin, ensure you have:
- Installed Docprompt with OCR support:
pip install "docprompt[google]"
- A Google Cloud Platform account with the Document AI API enabled
- A GCP service account key file
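Before handing the key file to Docprompt, it can help to confirm that the file exists and looks like a service account key. The sketch below uses only the Python standard library; `check_service_account_file` is a hypothetical helper written for this guide, not part of Docprompt, and the field list reflects what a GCP service account key file normally contains.

```python
import json
from pathlib import Path

# Fields that a GCP service account key file normally contains.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def check_service_account_file(path):
    """Return a list of problems found with the key file (empty if none)."""
    key_path = Path(path)
    if not key_path.is_file():
        return [f"file not found: {key_path}"]
    try:
        data = json.loads(key_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    # Report every missing field, then check the declared key type.
    problems = [f"missing field: {k}" for k in sorted(REQUIRED_KEYS - data.keys())]
    if data.get("type") != "service_account":
        problems.append("field 'type' is not 'service_account'")
    return problems
```

An empty list means the file passed these basic checks; it does not prove that the credentials are valid or that they have access to Document AI.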
Setting Up the OCR Provider
First, let's set up the Google OCR provider:
from docprompt.tasks.factory import GCPTaskProviderFactory
# Initialize the GCP Task Provider Factory
gcp_factory = GCPTaskProviderFactory(
    service_account_file="path/to/your/service_account_key.json"
)
# Create the OCR provider
ocr_provider = gcp_factory.get_page_ocr_provider(
    project_id="your-gcp-project-id",
    processor_id="your-document-ai-processor-id"
)
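Hard-coding the key path, project ID, and processor ID works for a quick test, but many deployments keep such values in environment variables. The following is a minimal sketch of that approach; `load_ocr_settings` and the three variable names are hypothetical choices made for illustration, not something Docprompt defines.

```python
import os


def load_ocr_settings(env=os.environ):
    """Collect the three provider settings from environment variables.

    The variable names are hypothetical; pick whatever fits your deployment.
    """
    names = {
        "service_account_file": "DOCPROMPT_GCP_KEY_FILE",
        "project_id": "DOCPROMPT_GCP_PROJECT_ID",
        "processor_id": "DOCPROMPT_GCP_PROCESSOR_ID",
    }
    # Fail early with a clear message if any variable is unset or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

The returned dictionary holds the same three values that the factory and `get_page_ocr_provider` calls above expect, so each can be passed to the matching keyword argument.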
Loading and Processing a Document
Now, let's load a document and process it using OCR:
from docprompt import load_document, DocumentNode
# Load the document
document = load_document("path/to/your/document.pdf")
document_node = DocumentNode.from_document(document)
# Process the document
ocr_results = ocr_provider.process_document_node(document_node)
Accessing OCR Results
After processing, you can access the OCR results in various ways:
1. Page-level Text
To get the full text of a specific page:
page_number = 1 # Pages are 1-indexed
page_text = ocr_results[page_number].page_text
print(f"Text on page {page_number}:\n{page_text[:500]}...") # Print first 500 characters
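If you need the text of the whole document rather than one page, the per-page results can be joined in page order. The sketch below assumes `ocr_results` behaves like a dictionary keyed by 1-indexed page number, as the lookup above suggests; `join_page_text` is a hypothetical helper, and the stand-in objects only mimic the `page_text` attribute so the example runs without calling the OCR service.

```python
from types import SimpleNamespace


def join_page_text(ocr_results, separator="\n\n"):
    """Concatenate page_text for every page, in page-number order."""
    return separator.join(
        ocr_results[page_number].page_text for page_number in sorted(ocr_results)
    )


# Stand-in data with the same shape as the results used above (assumption).
fake_results = {
    2: SimpleNamespace(page_text="Second page."),
    1: SimpleNamespace(page_text="First page."),
}
full_text = join_page_text(fake_results)
print(full_text)
```

Sorting the keys keeps the output in reading order even if the mapping was filled out of order.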
2. Words, Lines, and Blocks
Docprompt provides access to words, lines, and blocks (paragraphs) extracted from the document:
# Get the first page's result
first_page_result = ocr_results[1]
# Print the first 5 words on the page
print("First 5 words:")
for word in first_page_result.word_level_blocks[:5]:
    print(f"Word: {word.text}, Confidence: {word.metadata.confidence}")
# Print the first line on the page
print("\nFirst line:")
if first_page_result.line_level_blocks:
    first_line = first_page_result.line_level_blocks[0]
    print(f"Line: {first_line.text}")
# Print the first block (paragraph) on the page
print("\nFirst block:")
if first_page_result.block_level_blocks:
    first_block = first_page_result.block_level_blocks[0]
    print(f"Block: {first_block.text[:100]}...")  # Print first 100 characters
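The per-word confidence shown above is useful for flagging text that may need manual review. Here is a minimal sketch of such a filter. It relies only on the `.text` and `.metadata.confidence` attributes used in this guide; `low_confidence_words`, `make_word`, the 0.8 threshold, and the handling of a missing confidence value are all assumptions made for illustration.

```python
from types import SimpleNamespace


def low_confidence_words(words, threshold=0.8):
    """Return words whose confidence is known and below the threshold."""
    flagged = []
    for word in words:
        confidence = word.metadata.confidence
        # Skip words with no reported confidence instead of guessing.
        if confidence is not None and confidence < threshold:
            flagged.append(word)
    return flagged


def make_word(text, confidence):
    # Hypothetical stand-in for an OCR word block (.text, .metadata.confidence).
    return SimpleNamespace(text=text, metadata=SimpleNamespace(confidence=confidence))


sample_words = [
    make_word("Invoice", 0.99),
    make_word("t0tal", 0.41),
    make_word("Date", None),
]
flagged = low_confidence_words(sample_words)
print([word.text for word in flagged])
```

With real results, the same function could be called on `first_page_result.word_level_blocks`.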
3. Bounding Boxes
Each word, line, and block comes with bounding box information:
# Get bounding box for the first word
if first_page_result.word_level_blocks:
    first_word = first_page_result.word_level_blocks[0]
    bbox = first_word.bounding_box
    print(f"\nBounding box for '{first_word.text}':")
    print(f"Top-left: ({bbox.x0}, {bbox.top})")
    print(f"Bottom-right: ({bbox.x1}, {bbox.bottom})")
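To draw a box on a rendered page image, the coordinates usually need to be in pixels. The sketch below assumes the bounding box values are normalized to the 0 to 1 range relative to the page size; that is an assumption to verify against your installed Docprompt version. `bbox_to_pixels` is a hypothetical helper, and the stand-in object only mimics the `x0`, `top`, `x1`, and `bottom` attributes used above.

```python
from types import SimpleNamespace


def bbox_to_pixels(bbox, page_width, page_height):
    """Scale a normalized box to integer pixels: (left, top, right, bottom)."""
    return (
        round(bbox.x0 * page_width),
        round(bbox.top * page_height),
        round(bbox.x1 * page_width),
        round(bbox.bottom * page_height),
    )


# Hypothetical normalized box; real values would come from first_word.bounding_box.
sample_bbox = SimpleNamespace(x0=0.25, top=0.1, x1=0.75, bottom=0.2)
pixels = bbox_to_pixels(sample_bbox, page_width=1000, page_height=2000)
print(pixels)
```

If the coordinates turn out to be absolute rather than normalized, no scaling is needed and the values can be used directly.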
Conclusion
You've now learned the basics of performing OCR with Docprompt: setting up the OCR provider, processing a document, and accessing the results at the page, block, line, and word level.
For more advanced usage, including customizing OCR settings, using other providers, and using Docprompt's search capabilities, see our other guides:
- Customizing OCR Providers and Settings
- Advanced Text Search with Provenance Locators
- Building OCR-based Workflows