util
determine_pdf_name_from_bytes(file_bytes)
Attempts to determine the name of a PDF by examining its metadata
Source code in docprompt/utils/util.py
get_page_count(fd)
Determines the number of pages in a PDF
hash_from_bytes(byte_data, hash_func=hashlib.md5, threshold=1024 * 1024 * 128)
Computes a hash of the given bytes. If the input is larger than the threshold, the hash is computed in chunks to avoid memory issues. The default hash function is MD5 and the default threshold is 128 MB, which should suit most machines and use cases.
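The chunking behavior described above can be illustrated with a minimal sketch. This is not docprompt's implementation: the name `chunked_hash`, the choice to use the threshold as the chunk size, and the hex-string return value are all assumptions made for illustration.

```python
import hashlib


def chunked_hash(byte_data: bytes, hash_func=hashlib.md5, threshold: int = 1024 * 1024 * 128) -> str:
    """Hypothetical stand-in for hash_from_bytes; not the library's code."""
    # Small inputs: hash in a single call.
    if len(byte_data) <= threshold:
        return hash_func(byte_data).hexdigest()

    # Large inputs: feed the hasher one slice at a time.
    # Assumption: the chunk size equals the threshold.
    hasher = hash_func()
    view = memoryview(byte_data)  # slicing a memoryview does not copy the data
    for start in range(0, len(view), threshold):
        hasher.update(view[start:start + threshold])
    return hasher.hexdigest()
```

Feeding a hash object incrementally produces the same digest as hashing the whole input at once, so under these assumptions the result does not depend on which path is taken.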
is_pdf(fd)
Determines if a file is a PDF
load_pdf_document(fp, *, file_name=None, password=None)
Loads a document from a file path
load_pdf_documents(fps, *, max_threads=12, passwords=None)
Loads multiple documents from file paths, using a thread pool
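As a rough sketch of the thread-pool pattern described above, the following uses the standard library's `ThreadPoolExecutor`. The names `load_many` and `loader` are hypothetical, and the pairing of passwords with paths by position and the preservation of input order are assumptions, not confirmed behavior of `load_pdf_documents`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def load_many(
    paths: Sequence[str],
    loader: Callable[[str, Optional[str]], T],  # hypothetical per-file loader
    max_threads: int = 12,
    passwords: Optional[Sequence[Optional[str]]] = None,
) -> List[T]:
    """Hypothetical stand-in for load_pdf_documents; not the library's code."""
    if passwords is None:
        passwords = [None] * len(paths)
    if len(passwords) != len(paths):
        raise ValueError("passwords must match paths in length")
    if not paths:
        return []

    # Never start more workers than there are files to load.
    workers = max(1, min(max_threads, len(paths)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(loader, paths, passwords))
```

In this sketch a loader would be any callable that takes a path and an optional password, for example a thin wrapper around `load_pdf_document`.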