Index
BaseMetadata
Bases: BaseModel, MutableMapping, Generic[TMetadataOwner]
The base metadata class defines a basic yet flexible interface for metadata attached to various fields.
Out of the box, the metadata class adopts dictionary-like behavior, so you can access its fields as if it were a dictionary:
# Instantiate it with any kwargs you like
metadata = BaseMetadata(foo='bar', cow='moo')
metadata["foo"] # "bar"
metadata["cow"] # "moo"
# Update the value of the key
metadata["foo"] = "fighters"
# Set new key-value pairs
metadata['sheep'] = 'baa'
Otherwise, you may sub-class the metadata class in order to create a more strictly typed metadata model. This is useful when you want to enforce a specific structure for your metadata.
class CustomMetadata(BaseMetadata):
    foo: str
    cow: str
# Instantiate it with the required fields
metadata = CustomMetadata(foo='bar', cow='moo')
metadata.foo # "bar"
metadata.cow # "moo"
# Update the value of the key
metadata.foo = "fighters"
# Use the extra field to store dynamic metadata
metadata.extra['sheep'] = 'baa'
Additionally, the task results descriptor allows for controlled and easy access to the task results of various tasks that are run on the parent node.
Source code in docprompt/schema/pipeline/metadata.py
owner: TMetadataOwner (property, writable)
Return the owner of the metadata.
task_results: TaskResultsDescriptor (property, writable)
Return the task results descriptor.
__delattr__(name)
Ensure that we can delete attributes from the metadata class.
The attributes are deleted through the following hierarchy:
- If the attribute is task_results, we use the descriptor to delete the task results.
- Otherwise, if it is a sub-classed model, it will be deleted as normal.
- Finally, if we are deleting a public attribute on the base metadata class, we use the extra field.
Source code in docprompt/schema/pipeline/metadata.py
__delitem__(name)
Provide dictionary functionality to the metadata class.
This only works for the base metadata model. If sub-classed, this raises an error unless overridden, because BaseModel does not define a __delitem__ method.
Source code in docprompt/schema/pipeline/metadata.py
__getattr__(name)
Allow for getting of attributes on the metadata class.
The attributes are retrieved through the following hierarchy:
- If the model is sub-classed, it will be retrieved as normal.
- Otherwise, if the attribute is private, it will be retrieved as normal.
- Finally, if we are getting a public attribute on the base metadata class, we use the extra field.
- If the key is not set in the extra dictionary, we fall back to getting the field as normal.
  - This is when we grab the owner or task_result attribute.
Source code in docprompt/schema/pipeline/metadata.py
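The lookup order listed above can be sketched with a toy class. This is only an illustration under stated assumptions: ToyMetadata, TypedToy and the _extra attribute are hypothetical names, and the sketch uses __getattribute__ so that each step of the list is visible in the code. The real class may be implemented differently.

```python
class ToyMetadata:
    """Hypothetical stand-in, not docprompt's BaseMetadata."""

    def __init__(self, owner=None, **extra):
        # Write private state directly so the routing below is not involved.
        object.__setattr__(self, "_owner", owner)
        object.__setattr__(self, "_extra", dict(extra))

    @property
    def owner(self):
        return self._owner

    def __getattribute__(self, name):
        # Steps 1 and 2: sub-classes and private names use normal lookup.
        if type(self) is not ToyMetadata or name.startswith("_"):
            return object.__getattribute__(self, name)
        # Step 3: public names on the base class are looked up in the extra dict.
        extra = object.__getattribute__(self, "_extra")
        if name in extra:
            return extra[name]
        # Step 4: fall back to normal lookup, which is how `owner` is reached.
        return object.__getattribute__(self, name)


class TypedToy(ToyMetadata):
    """A sub-class: attribute access is plain Python attribute access."""

    def __init__(self, foo):
        super().__init__()
        self.foo = foo


meta = ToyMetadata(owner="page-1", foo="bar")
typed = TypedToy(foo="baz")
```

Here meta.foo is served from the extra dictionary, meta.owner falls through to the property, and typed.foo is an ordinary instance attribute.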
__getitem__(name)
Provide dictionary functionality to the metadata class.
This only works for the base metadata model. If sub-classed, this raises an error unless overridden, because BaseModel does not define a __getitem__ method.
Source code in docprompt/schema/pipeline/metadata.py
__iter__()
Iterate over the keys in the metadata.
This only works for the base metadata model. If sub-classed, this raises an error unless overridden, because BaseModel does not define an __iter__ method.
Source code in docprompt/schema/pipeline/metadata.py
__len__()
Get the number of keys in the metadata.
This only works for the base metadata model. If sub-classed, this raises an error unless overridden, because BaseModel does not define a __len__ method.
Source code in docprompt/schema/pipeline/metadata.py
__repr__()
Provide a string representation of the metadata.
This only works for the base metadata model. If sub-classed, this raises an error unless overridden, because BaseModel does not define a __repr__ method.
Source code in docprompt/schema/pipeline/metadata.py
__setattr__(name, value)
Allow for setting of attributes on the metadata class.
The attributes are set through the following hierarchy:
- If the model is sub-classed, it will be set as normal.
- Otherwise, if the attribute is private, it will be set as normal.
- Finally, if we are setting a public attribute on the base metadata class, we use the extra field.
Source code in docprompt/schema/pipeline/metadata.py
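The setting rules above, together with the matching deletion rules, can be sketched as follows. This is a minimal sketch, not the library's code: ToyMetadata and _extra are hypothetical names, and the task_results descriptor step of __delattr__ is left out.

```python
class ToyMetadata:
    """Hypothetical stand-in, not docprompt's BaseMetadata."""

    def __init__(self, **extra):
        # Private attribute: takes the "set as normal" branch below.
        self._extra = dict(extra)

    def __setattr__(self, name, value):
        # Sub-classes and private names: ordinary attribute assignment.
        if type(self) is not ToyMetadata or name.startswith("_"):
            object.__setattr__(self, name, value)
        else:
            # Public name on the base class: store it in the extra dict.
            self._extra[name] = value

    def __delattr__(self, name):
        # Mirror of __setattr__ for deletion.
        if type(self) is not ToyMetadata or name.startswith("_"):
            object.__delattr__(self, name)
        else:
            try:
                del self._extra[name]
            except KeyError:
                raise AttributeError(name) from None


meta = ToyMetadata(foo="bar")
meta.sheep = "baa"   # routed into the extra dict
del meta.foo         # removed from the extra dict
remaining = dict(meta._extra)
```

Public names never land in the instance's own __dict__ in this sketch; only the private _extra attribute does.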
__setitem__(name, value)
Provide dictionary functionality to the metadata class.
This only works for the base metadata model. If sub-classed, this raises an error unless overridden, because BaseModel does not define a __setitem__ method.
Source code in docprompt/schema/pipeline/metadata.py
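The item methods documented above (__getitem__, __setitem__, __delitem__, __iter__ and __len__) are the five abstract methods that collections.abc.MutableMapping requires. As a rough illustration, a class that implements them receives the remaining dictionary methods as mixins. DictBackedMetadata is a hypothetical stand-in that stores everything in a plain extra dict; it does not reproduce the pydantic side of BaseMetadata.

```python
from collections.abc import MutableMapping


class DictBackedMetadata(MutableMapping):
    """Hypothetical stand-in, not docprompt's BaseMetadata."""

    def __init__(self, **extra):
        self.extra = dict(extra)

    def __getitem__(self, name):
        return self.extra[name]

    def __setitem__(self, name, value):
        self.extra[name] = value

    def __delitem__(self, name):
        del self.extra[name]

    def __iter__(self):
        return iter(self.extra)

    def __len__(self):
        return len(self.extra)

    def __repr__(self):
        return f"{type(self).__name__}({self.extra!r})"


meta = DictBackedMetadata(foo="bar", cow="moo")
meta["sheep"] = "baa"
del meta["cow"]

# Mixin methods supplied by MutableMapping, built on the five methods above.
meta.update(foo="fighters")
keys = sorted(meta)
sheep = meta.pop("sheep")
missing = meta.get("goat", "n/a")
```

Methods such as update, pop, get, keys, items and setdefault come from the abstract base class, which is why only the five core methods need documenting.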
from_owner(owner, **data) (classmethod)
Create a new instance of the metadata class with the owner set.
validate_data_fields_from_annotations(data) (classmethod)
Validate the data fields from the annotations.
Source code in docprompt/schema/pipeline/metadata.py
DocumentCollection
Bases: BaseModel, Generic[DocumentCollectionMetadata, DocumentNodeMetadata, PageNodeMetadata]
Represents a collection of documents with some common metadata.
Source code in docprompt/schema/pipeline/node/collection.py
DocumentNode
Bases: BaseNode, Generic[DocumentNodeMetadata, PageNodeMetadata]
Represents a single document, with some metadata.
Source code in docprompt/schema/pipeline/node/document.py
persistance_path (property, writable)
The base path to the storage location.
from_storage(path, file_hash, **kwargs) (classmethod)
Load the document node from storage.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The base path to the storage location. Example (S3): "s3://bucket-name/key/to/folder". Example (Local FS): "/tmp/docprompt/storage". | required |
file_hash | str | The hash of the document. | required |
**kwargs | | Additional keyword arguments for fsspec FileSystem | {} |
Returns:
Name | Type | Description |
---|---|---|
DocumentNode | Self | The loaded document node. |
Source code in docprompt/schema/pipeline/node/document.py
metadata_class() (classmethod)
Get the metadata class for instantiating metadata from the model.
Source code in docprompt/schema/pipeline/node/document.py
page_metadata_class() (classmethod)
Get the metadata class for the page nodes in the document.
Source code in docprompt/schema/pipeline/node/document.py
persist(path=None, **kwargs)
Persist a document node to storage.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Optional[str] | Overwrites the current | None |
**kwargs | | Additional keyword arguments for fsspec FileSystem | {} |
Returns:
Name | Type | Description |
---|---|---|
FileSidecarsPathManager | FileSidecarsPathManager | The file path manager for the persisted document node. |
Source code in docprompt/schema/pipeline/node/document.py
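To make the relationship between persist and from_storage concrete, the sketch below shows the general idea of storing a document under a base path and loading it back by its hash. Everything in it is an assumption made for illustration: the function names toy_persist and toy_from_storage, the SHA-256 hash, the file names and the JSON sidecar are hypothetical and do not describe docprompt's actual storage layout, which goes through fsspec and FileSidecarsPathManager.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def toy_persist(base_path, file_bytes, metadata):
    """Write the file and a JSON sidecar under base_path, keyed by content hash."""
    # Assumption: SHA-256 of the content serves as the file hash.
    file_hash = hashlib.sha256(file_bytes).hexdigest()
    root = Path(base_path)
    root.mkdir(parents=True, exist_ok=True)
    (root / f"{file_hash}.bin").write_bytes(file_bytes)
    (root / f"{file_hash}.metadata.json").write_text(json.dumps(metadata))
    return file_hash


def toy_from_storage(base_path, file_hash):
    """Load the file and its sidecar back using only the base path and the hash."""
    root = Path(base_path)
    file_bytes = (root / f"{file_hash}.bin").read_bytes()
    metadata = json.loads((root / f"{file_hash}.metadata.json").read_text())
    return file_bytes, metadata


with tempfile.TemporaryDirectory() as tmp:
    digest = toy_persist(tmp, b"%PDF-1.7 example", {"title": "demo"})
    loaded_bytes, loaded_metadata = toy_from_storage(tmp, digest)
```

The point of the sketch is the addressing scheme: the base path says where to look, and the hash identifies which document to load.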
refresh_locator()
Refreshes the locator for this document node.
Source code in docprompt/schema/pipeline/node/document.py
PageNode
Bases: BaseNode, Generic[PageNodeMetadata]
Represents a single page in a document, with some metadata.
Source code in docprompt/schema/pipeline/node/page.py
metadata
The metadata class defines a basic yet flexible interface for metadata attached to various fields.
In essence, developers can either create their metadata in an unstructured manner (i.e. as a dictionary) or sub-class the base metadata class to create a more strictly typed metadata model for their page and document nodes.
node
base
collection
document
page
typing
rasterizer
DocumentRasterizer
Source code in docprompt/schema/pipeline/rasterizer.py
propagate_cache(name, rasters)
Should be one-indexed