Customizing OCR Providers
Docprompt uses a factory pattern to manage credentials and create task providers efficiently. This guide shows how to configure and customize OCR providers, using Amazon Textract and Google Cloud Platform (GCP) as examples.
Understanding the Factory Pattern
A task provider factory holds the credentials for one backend and creates providers for individual tasks. This approach allows for:
- Centralized credential management
- Easy creation of multiple task providers from a single backend
- Separation of provider-specific and task-specific configurations
Here's a simplified example of how the factory pattern works:
from docprompt.tasks.factory import GCPTaskProviderFactory, AmazonTaskProviderFactory

# Create a GCP factory
gcp_factory = GCPTaskProviderFactory(
    service_account_file="path/to/service_account.json"
)

# Create an Amazon factory
amazon_factory = AmazonTaskProviderFactory(
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
    region_name="us-west-2"
)
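To make the idea concrete without depending on Docprompt itself, here is a minimal standalone sketch of the same pattern. Every class name in it (SketchCredentials, SketchOcrProvider, SketchProviderFactory) is hypothetical and exists only for illustration; Docprompt's real factories may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class SketchCredentials:
    """Hypothetical credential holder; not a Docprompt class."""
    access_key: str
    secret_key: str
    region: str


@dataclass
class SketchOcrProvider:
    """Hypothetical provider: shared credentials plus its own task options."""
    credentials: SketchCredentials
    options: Dict[str, Any] = field(default_factory=dict)


class SketchProviderFactory:
    """Hypothetical factory: stores credentials once and reuses them."""

    def __init__(self, access_key: str, secret_key: str, region: str) -> None:
        self._credentials = SketchCredentials(access_key, secret_key, region)

    def get_page_ocr_provider(self, **options: Any) -> SketchOcrProvider:
        # Task-specific options are supplied per call; credentials are shared.
        return SketchOcrProvider(credentials=self._credentials, options=dict(options))


factory = SketchProviderFactory("YOUR_ACCESS_KEY", "YOUR_SECRET_KEY", "us-west-2")
fast_provider = factory.get_page_ocr_provider(max_workers=8)
lean_provider = factory.get_page_ocr_provider(max_workers=2, exclude_bounding_poly=True)
```

Both providers share one credentials object while carrying different task options, which is the separation the list above describes.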
Creating OCR Providers
Once you have a factory, you can create OCR providers with task-specific configurations:
# Create a GCP OCR provider
gcp_ocr_provider = gcp_factory.get_page_ocr_provider(
    project_id="YOUR_PROJECT_ID",
    processor_id="YOUR_PROCESSOR_ID",
    max_workers=4,
    return_images=True
)

# Create an Amazon Textract provider
amazon_ocr_provider = amazon_factory.get_page_ocr_provider(
    max_workers=4,
    exclude_bounding_poly=True
)
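The max_workers argument used above controls concurrency. As a rough illustration of what a worker limit means in practice, the sketch below fans pages out to a thread pool from the Python standard library. The fake_ocr_page and ocr_pages functions are hypothetical stand-ins, and this is not a description of how Docprompt implements concurrency internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_ocr_page(page_number: int) -> str:
    """Hypothetical stand-in for a per-page OCR request."""
    return f"text of page {page_number}"


def ocr_pages(page_numbers: List[int], max_workers: int) -> List[str]:
    """Process pages with at most max_workers requests in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in input order, even if pages
        # finish out of order.
        return list(executor.map(fake_ocr_page, page_numbers))


results = ocr_pages([1, 2, 3, 4, 5], max_workers=4)
```

A higher limit can shorten wall-clock time for large documents, but it also increases the number of simultaneous requests sent to the backend.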
Understanding Provider Configuration
When configuring OCR providers, you'll encounter two types of parameters:
- Docprompt generic parameters: These are common across different providers and control Docprompt's behavior.
  - max_workers: Controls concurrency for processing large documents
  - exclude_bounding_poly: Reduces memory usage by excluding detailed polygon data
- Provider-specific parameters: These are unique to each backend and control provider-specific features. For example, if using GCP as an OCR provider, you must specify project_id and processor_id, and you may optionally set return_image_quality_scores.
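The distinction between the two parameter groups can be sketched as a small helper that sorts a flat set of options into generic and provider-specific parts. The helper name split_ocr_options and the GENERIC_PARAMETERS set are assumptions for illustration; the set contains only the two generic names mentioned in this guide and is not an exhaustive list from Docprompt.

```python
from typing import Any, Dict, Tuple

# Assumption: only the generic parameter names named in this guide.
GENERIC_PARAMETERS = frozenset({"max_workers", "exclude_bounding_poly"})


def split_ocr_options(options: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (generic, provider_specific) dictionaries for the given options."""
    generic = {k: v for k, v in options.items() if k in GENERIC_PARAMETERS}
    specific = {k: v for k, v in options.items() if k not in GENERIC_PARAMETERS}
    return generic, specific


gcp_options = {
    "project_id": "YOUR_PROJECT_ID",
    "processor_id": "YOUR_PROCESSOR_ID",
    "max_workers": 4,
    "return_image_quality_scores": True,
}
generic, specific = split_ocr_options(gcp_options)
```

Keeping the generic part separate makes it easier to reuse the same concurrency and memory settings when switching from one backend to another.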
Provider-Specific Features and Limitations
Google Cloud Platform (GCP)
- Offers advanced layout analysis and image quality scoring
- Supports returning rasterized images of processed pages
- Requires GCP-specific project and processor IDs
Example configuration:
gcp_ocr_provider = gcp_factory.get_page_ocr_provider(
    project_id="YOUR_PROJECT_ID",        # GCP-specific
    processor_id="YOUR_PROCESSOR_ID",    # GCP-specific
    max_workers=4,                       # Docprompt generic
    return_images=True,                  # GCP-specific
    return_image_quality_scores=True     # GCP-specific
)
Amazon Textract
- Focuses on text extraction and layout analysis
- Provides confidence scores for extracted text
- Does not support returning rasterized images
Example configuration:
amazon_ocr_provider = amazon_factory.get_page_ocr_provider(
    max_workers=4,               # Docprompt generic
    exclude_bounding_poly=True   # Docprompt generic
)
Best Practices
- Use factories for credential management: This centralizes authentication and makes it easier to switch between providers.
- Consult provider documentation: Always refer to the latest documentation from AWS or GCP for the most up-to-date information on their OCR services.
- Check Docprompt API reference: Review Docprompt's API documentation for each provider to understand available configurations.
- Optimize for your use case: Configure providers based on your specific needs, balancing performance and feature requirements.
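One way to support the first practice is to keep secrets out of source code and read them from the environment. The sketch below maps the standard AWS environment variable names to the keyword arguments that this guide passes to AmazonTaskProviderFactory. The helper amazon_factory_kwargs is hypothetical, and the mapping assumes the factory accepts exactly those three keyword arguments.

```python
import os
from typing import Dict, Mapping

# Standard AWS environment variable names mapped to the keyword arguments
# shown for AmazonTaskProviderFactory earlier in this guide.
ENV_TO_KWARG = {
    "AWS_ACCESS_KEY_ID": "aws_access_key_id",
    "AWS_SECRET_ACCESS_KEY": "aws_secret_access_key",
    "AWS_DEFAULT_REGION": "region_name",
}


def amazon_factory_kwargs(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Hypothetical helper: build factory keyword arguments from the environment."""
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}


# Example with an explicit mapping instead of the real process environment.
example_env = {
    "AWS_ACCESS_KEY_ID": "YOUR_ACCESS_KEY",
    "AWS_SECRET_ACCESS_KEY": "YOUR_SECRET_KEY",
    "AWS_DEFAULT_REGION": "us-west-2",
}
kwargs = amazon_factory_kwargs(example_env)
```

The resulting dictionary could then be unpacked into the factory constructor, for example AmazonTaskProviderFactory(**kwargs), provided the constructor signature matches the one used in this guide.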
Conclusion
Understanding the factory pattern and the distinction between Docprompt generic and provider-specific parameters is key to effectively configuring OCR providers in Docprompt. While this guide provides an overview using Amazon Textract and GCP as examples, the principles apply to other providers as well. Always consult the specific provider's documentation and Docprompt's API reference for the most current and detailed information.