Lightning-Fast Document Search π₯π
Ever wished you could search through OCR-processed documents at the speed of light? Look no further! DocPrompt's Provenance Locator, powered by Rust, offers blazingly fast text search capabilities that will revolutionize your document processing workflows.
The Power of Rust-Powered Search
DocPrompt's DocumentProvenanceLocator
is not your average search tool. Implemented in Rust and leveraging the power of tantivy
and rtree
, it provides:
- β‘ Lightning-fast full-text search
- π― Precise text location within documents
- π§ Smart granularity refinement (word, line, block)
- πΊοΈ Spatial querying capabilities
Let's dive into how you can harness this power!
Setting Up the Locator
First, let's create a DocumentProvenanceLocator
from a processed DocumentNode
:
from docprompt import load_document, DocumentNode
from docprompt.tasks.factory import GCPTaskProviderFactory
# Load and process the document
document = load_document("path/to/your/document.pdf")
document_node = DocumentNode.from_document(document)
# Process with OCR (assuming you've set up the GCP factory)
gcp_factory = GCPTaskProviderFactory(service_account_file="path/to/credentials.json")
ocr_provider = gcp_factory.get_page_ocr_provider(
project_id="your-project-id",
processor_id="your-processor-id"
)
ocr_results = ocr_provider.process_document_node(document_node)
# Create the locator
locator = document_node.locator
Searching at the Speed of Rust π₯
Now that we have our locator, let's see it in action:
# Perform a simple search
results = locator.search("DocPrompt")
for result in results:
print(f"Found on page {result.page_number}")
print(f"Text: {result.text}")
print(f"Bounding box: {result.text_location.merged_source_block.bounding_box}")
print("---")
This search operation happens in milliseconds, even for large documents, thanks to the Rust-powered backend!
Advanced Search Capabilities
Refining to Word Level
DocPrompt can automatically refine search results to the word level:
This gives you pinpoint accuracy in locating text within your document.
Page-Specific Search
Need to search on a specific page? No problem:
Best Match Search
Find the best matches based on different criteria:
best_short_matches = locator.search_n_best("DocPrompt", n=3, mode="shortest_text")
best_long_matches = locator.search_n_best("DocPrompt", n=3, mode="longest_text")
best_overall_matches = locator.search_n_best("DocPrompt", n=3, mode="highest_score")
Spatial Queries: Beyond Text Search πΊοΈ
DocPrompt's locator isn't just fastβit's spatially aware! You can perform queries based on document layout:
from docprompt.schema.layout import NormBBox
# Get blocks near a specific area on page 1
bbox = NormBBox(x0=0.1, top=0.1, x1=0.2, bottom=0.2)
nearby_blocks = locator.get_k_nearest_blocks(bbox, page_number=1, k=5)
# Get overlapping blocks
overlapping_blocks = locator.get_overlapping_blocks(bbox, page_number=1)
This spatial awareness opens up possibilities for advanced document analysis and data extraction!
Conclusion: Search at the Speed of Thought π§ π¨
DocPrompt's DocumentProvenanceLocator
brings unprecedented speed and precision to document search and analysis. By leveraging the power of Rust, it offers:
- Lightning-fast full-text search
- Precise text location within documents
- Advanced spatial querying capabilities
- Scalability for large documents and datasets
Whether you're building a document analysis pipeline, a search system, or any text-based application, DocPrompt's Provenance Locator offers the speed and accuracy you need to stay ahead of the game.