Index
extract_dates_from_text(input_string, *, date_formats=default_date_formats)
Extract dates from a string using a set of predefined regex patterns.
Returns a list of tuples, where the first element is the date object and the second is the full date string.
Source code in docprompt/utils/date_extraction.py
get_page_count(fd)
Determines the number of pages in a PDF
hash_from_bytes(byte_data, hash_func=hashlib.md5, threshold=1024 * 1024 * 128)
Computes a hash of the given bytes. If the input is larger than the threshold, the hash is computed in chunks to limit memory use. The default hash function is MD5 and the default threshold is 128 MB (1024 * 1024 * 128 bytes), a default intended to suit most machines and use cases.
Source code in docprompt/utils/util.py
is_pdf(fd)
Determines if a file is a PDF
Source code in docprompt/utils/util.py
load_pdf_document(fp, *, file_name=None, password=None)
Loads a document from a file path
Source code in docprompt/utils/util.py
load_pdf_documents(fps, *, max_threads=12, passwords=None)
Loads multiple documents from file paths, using a thread pool
Source code in docprompt/utils/util.py
date_extraction
extract_dates_from_text(input_string, *, date_formats=default_date_formats)
Extract dates from a string using a set of predefined regex patterns.
Returns a list of tuples, where the first element is the date object and the second is the full date string.
Source code in docprompt/utils/date_extraction.py
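The description above (regex patterns in, a list of (date, matched string) tuples out) can be illustrated with a minimal standalone sketch. It does not import docprompt: the function name extract_dates_sketch and the two patterns in EXAMPLE_DATE_FORMATS are hypothetical stand-ins, and the real default_date_formats may be structured differently.

```python
import re
from datetime import datetime

# Hypothetical (compiled regex, strptime format) pairs for illustration only.
# These are NOT docprompt's default_date_formats.
EXAMPLE_DATE_FORMATS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),
]


def extract_dates_sketch(text, date_formats=EXAMPLE_DATE_FORMATS):
    """Return a list of (date object, matched string) tuples found in text."""
    results = []
    for pattern, fmt in date_formats:
        for match in pattern.finditer(text):
            raw = match.group(0)
            try:
                parsed = datetime.strptime(raw, fmt).date()
            except ValueError:
                # The text matched the pattern but is not a real calendar date.
                continue
            results.append((parsed, raw))
    return results
```

Results are grouped by pattern rather than by position in the text, which is a property of this sketch and not necessarily of the library.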
inference
A utility file for running inference with various LLM providers.
run_batch_inference_anthropic(model_name, messages, **kwargs)
async
Run batch inference using an Anthropic model asynchronously.
Source code in docprompt/utils/inference.py
run_inference_anthropic(model_name, messages, **kwargs)
async
Run inference using an Anthropic model asynchronously.
Source code in docprompt/utils/inference.py
masking
image
mask_image_from_bounding_boxes(image, *bounding_boxes, mask_color='#000000')
Create a copy of the image with the positions of the bounding boxes masked.
Source code in docprompt/utils/masking/image.py
splitter
pdf_split_iter_fast(file_bytes, max_page_count)
Quickly splits a PDF into batches of at most max_page_count pages.
Source code in docprompt/utils/splitter.py
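The batching idea can be sketched without any PDF library by working on page indices alone. The helper name batch_page_ranges is hypothetical and this is only an illustration of the grouping, not the code in docprompt/utils/splitter.py.

```python
def batch_page_ranges(page_count, max_page_count):
    """Yield (start, stop) page index pairs, with stop exclusive.

    Each pair covers at most max_page_count pages; the last batch may be smaller.
    """
    if max_page_count < 1:
        raise ValueError("max_page_count must be at least 1")
    for start in range(0, page_count, max_page_count):
        yield start, min(start + max_page_count, page_count)
```

For a 10-page document and max_page_count=4, this yields three batches, the last holding the remaining two pages.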
pdf_split_iter_with_max_bytes(file_bytes, max_page_count, max_bytes)
Splits a PDF into batches of at most max_page_count pages and at most max_bytes bytes each.
Source code in docprompt/utils/splitter.py
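One way to satisfy both limits at once is greedy packing: keep adding pages to the current batch until either limit would be exceeded. The sketch below assumes per-page sizes are known and additive, which is a simplification (real PDF pages can share fonts and other resources). The name batch_by_pages_and_bytes and the page_sizes input are hypothetical.

```python
def batch_by_pages_and_bytes(page_sizes, max_page_count, max_bytes):
    """Group page indices so each batch respects both limits where possible.

    page_sizes is a list of assumed per-page byte sizes. A single page larger
    than max_bytes still gets a batch of its own.
    """
    batches = []
    current, current_bytes = [], 0
    for index, size in enumerate(page_sizes):
        too_many = len(current) >= max_page_count
        too_big = bool(current) and current_bytes + size > max_bytes
        if too_many or too_big:
            # Close the current batch before placing this page.
            batches.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        batches.append(current)
    return batches
```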
split_pdf_to_bytes(file_bytes, *, start_page=None, stop_page=None)
Splits a PDF into a list of bytes.
Source code in docprompt/utils/splitter.py
util
determine_pdf_name_from_bytes(file_bytes)
Attempts to determine the name of a PDF by examining its metadata
Source code in docprompt/utils/util.py
get_page_count(fd)
Determines the number of pages in a PDF
hash_from_bytes(byte_data, hash_func=hashlib.md5, threshold=1024 * 1024 * 128)
Computes a hash of the given bytes. If the input is larger than the threshold, the hash is computed in chunks to limit memory use. The default hash function is MD5 and the default threshold is 128 MB (1024 * 1024 * 128 bytes), a default intended to suit most machines and use cases.
Source code in docprompt/utils/util.py
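The threshold behaviour can be illustrated with the standard hashlib module. This is a sketch under assumptions rather than the library's implementation: the name hash_bytes_sketch, the chunk_size parameter, and the choice to return a hex string are all hypothetical.

```python
import hashlib


def hash_bytes_sketch(
    byte_data,
    hash_func=hashlib.md5,
    threshold=1024 * 1024 * 128,
    chunk_size=1024 * 1024,  # assumed chunk size, not taken from docprompt
):
    """Hash byte_data in one call, or incrementally when it exceeds threshold."""
    if len(byte_data) <= threshold:
        return hash_func(byte_data).hexdigest()

    hasher = hash_func()
    # memoryview lets us slice the buffer without copying each chunk.
    view = memoryview(byte_data)
    for offset in range(0, len(view), chunk_size):
        hasher.update(view[offset:offset + chunk_size])
    return hasher.hexdigest()
```

Both paths produce the same digest for the same input; chunking only changes how much data is passed to the hash object at a time.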
is_pdf(fd)
Determines if a file is a PDF
Source code in docprompt/utils/util.py
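A common way to check for a PDF is to look for the %PDF- header marker near the start of the data. The sketch below shows that technique only; the name looks_like_pdf is hypothetical, accepting either raw bytes or a seekable binary file object is an assumption about what fd may be, and docprompt's own check may work differently.

```python
def looks_like_pdf(fd):
    """Return True if the %PDF- marker appears in the first 1024 bytes.

    fd may be bytes-like or a seekable binary file object (an assumption
    made for this sketch).
    """
    if isinstance(fd, (bytes, bytearray)):
        header = bytes(fd[:1024])
    else:
        position = fd.tell()
        header = fd.read(1024)
        fd.seek(position)  # leave the stream where the caller had it
    return b"%PDF-" in header
```

Searching the first 1024 bytes instead of requiring the marker at offset 0 is a lenient choice, since some files carry a few leading bytes before the header.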
load_pdf_document(fp, *, file_name=None, password=None)
Loads a document from a file path
Source code in docprompt/utils/util.py
load_pdf_documents(fps, *, max_threads=12, passwords=None)
Loads multiple documents from file paths, using a thread pool
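Loading several files through a thread pool can be sketched with concurrent.futures from the standard library. The helper load_many_sketch and its loader argument are hypothetical; loader stands in for any function that takes a path and an optional password, and nothing here reflects how docprompt pairs passwords with paths internally.

```python
from concurrent.futures import ThreadPoolExecutor


def load_many_sketch(paths, loader, max_threads=12, passwords=None):
    """Call loader(path, password) for each path using a thread pool.

    Results are returned in the same order as paths.
    """
    if passwords is None:
        passwords = [None] * len(paths)
    if len(passwords) != len(paths):
        raise ValueError("passwords must have the same length as paths")
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Executor.map preserves input order even when tasks finish out of order.
        return list(pool.map(loader, paths, passwords))
```

Threads suit this kind of work because reading files is mostly I/O-bound, so the workers spend much of their time waiting on disk.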