Index
BaseMetadata
Bases: BaseModel, MutableMapping, Generic[TMetadataOwner]
The base metadata class defines a basic yet flexible interface for metadata attached to various fields.
When used out of the box, the metadata class adopts dictionary-like behavior, so you can access its fields as if it were a dictionary:
# Instantiate it with any kwargs you like
metadata = BaseMetadata(foo='bar', cow='moo')
metadata["foo"] # "bar"
metadata["cow"] # "moo"
# Update the value of the key
metadata["foo"] = "fighters"
# Set new key-value pairs
metadata['sheep'] = 'baa'
Otherwise, you may sub-class the metadata class in order to create a more strictly typed metadata model. This is useful when you want to enforce a specific structure for your metadata.
class CustomMetadata(BaseMetadata):
    foo: str
    cow: str
# Instantiate it with the required fields
metadata = CustomMetadata(foo='bar', cow='moo')
metadata.foo # "bar"
metadata.cow # "moo"
# Update the value of the key
metadata.foo = "fighters"
# Use the extra field to store dynamic metadata
metadata.extra['sheep'] = 'baa'
Additionally, the task results descriptor allows for controlled and easy access to the task results of various tasks that are run on the parent node.
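As a rough illustration of the dictionary-like mode described above, the sketch below builds a stand-in on top of the standard library's MutableMapping. DictLikeMetadata is a hypothetical name, and the real class also derives from Pydantic's BaseModel, which this sketch leaves out:

```python
from collections.abc import MutableMapping


class DictLikeMetadata(MutableMapping):
    """Hypothetical stand-in for the untyped mode of BaseMetadata."""

    def __init__(self, **kwargs):
        # Every keyword argument lands in one backing dictionary.
        self.extra = dict(kwargs)

    def __getitem__(self, name):
        return self.extra[name]

    def __setitem__(self, name, value):
        self.extra[name] = value

    def __delitem__(self, name):
        del self.extra[name]

    def __iter__(self):
        return iter(self.extra)

    def __len__(self):
        return len(self.extra)

    def __repr__(self):
        return f"{type(self).__name__}({self.extra!r})"


metadata = DictLikeMetadata(foo="bar", cow="moo")
metadata["foo"] = "fighters"  # update an existing key
metadata["sheep"] = "baa"     # add a new key
```

Implementing these five methods is enough for MutableMapping to supply the remaining dictionary helpers such as get, update, and pop.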
Source code in docprompt/schema/pipeline/metadata.py
owner: TMetadataOwner
property
writable
Return the owner of the metadata.
task_results: TaskResultsDescriptor
property
writable
Return the task results descriptor.
__delattr__(name)
Ensure that we can delete attributes from the metadata class.
The attributes are deleted through the following hierarchy:
- If the attribute is task_results, we use the descriptor to delete the task results.
- Otherwise, if it is a sub-classed model, it will be deleted as normal.
- Finally, if we are deleting a public attribute on the base metadata class, we use the extra field.
Source code in docprompt/schema/pipeline/metadata.py
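The deletion hierarchy above can be sketched with plain Python. SketchMetadata and its _task_results dictionary are hypothetical stand-ins rather than docprompt's implementation, and treating private names as normal attributes is an assumption the text does not state:

```python
class SketchMetadata:
    """Hypothetical model of the deletion order; not docprompt's code."""

    def __init__(self, **kwargs):
        self.extra = dict(kwargs)
        # Stand-in for whatever the real task results descriptor manages.
        self._task_results = {"ocr": "done"}

    def __delattr__(self, name):
        if name == "task_results":
            # 1. Task results are cleared through their own storage.
            self._task_results.clear()
        elif type(self) is not SketchMetadata or name.startswith("_"):
            # 2. Sub-classed models (and, by assumption, private names)
            #    are deleted as normal attributes.
            object.__delattr__(self, name)
        else:
            # 3. Public attributes on the base class live in `extra`.
            try:
                del self.extra[name]
            except KeyError:
                raise AttributeError(name) from None


meta = SketchMetadata(foo="bar", cow="moo")
del meta.foo
del meta.task_results
```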
__delitem__(name)
Provide dictionary functionality to the metadata class.
This only works for the base metadata model. If sub-classed, this will raise an error unless overridden, as BaseModels do not have a __delitem__ method.
Source code in docprompt/schema/pipeline/metadata.py
__getattr__(name)
Allow for getting of attributes on the metadata class.
The attributes are retrieved through the following hierarchy:
- If the model is sub-classed, it will be retrieved as normal.
- Otherwise, if the attribute is private, it will be retrieved as normal.
- Finally, if we are getting a public attribute on the base metadata class, we use the extra field.
- If the key is not set in the extra dictionary, we resort back to just trying to get the field.
  - This is when we grab the owner or task_results attribute.
Source code in docprompt/schema/pipeline/metadata.py
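A minimal sketch of this lookup, assuming a plain Python class in place of the Pydantic model. LookupMetadata is a hypothetical name. The sketch leans on the fact that Python only calls __getattr__ after normal lookup has failed, so the order of checks is simplified compared with the hierarchy described above:

```python
class LookupMetadata:
    """Hypothetical model of the attribute lookup; not docprompt's code."""

    def __init__(self, owner=None, **kwargs):
        self._owner = owner
        self.extra = dict(kwargs)

    @property
    def owner(self):
        # Found by normal lookup, so __getattr__ is never consulted for it.
        return self._owner

    def __getattr__(self, name):
        # Reached only when normal lookup fails. Private names and
        # sub-classed models get no fallback into `extra`.
        if name.startswith("_") or type(self) is not LookupMetadata:
            raise AttributeError(name)
        try:
            # Go through __dict__ to avoid recursing into __getattr__.
            return self.__dict__["extra"][name]
        except KeyError:
            raise AttributeError(name) from None


meta = LookupMetadata(owner="page-1", foo="bar")
foo_value = meta.foo      # served from the extra dictionary
owner_value = meta.owner  # served by the property
```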
__getitem__(name)
Provide dictionary functionality to the metadata class.
This only works for the base metadata model. If sub-classed, this will raise an error unless overridden, as BaseModels do not have a __getitem__ method.
Source code in docprompt/schema/pipeline/metadata.py
__iter__()
Iterate over the keys in the metadata.
This only works for the base metadata model. If sub-classed, this will raise an error unless overridden, as BaseModels do not have an iter method.
Source code in docprompt/schema/pipeline/metadata.py
__len__()
Get the number of keys in the metadata.
This only works for the base metadata model. If sub-classed, this will raise an error unless overridden, as BaseModels do not have a __len__ method.
Source code in docprompt/schema/pipeline/metadata.py
__repr__()
Provide a string representation of the metadata.
This only works for the base metadata model. If sub-classed, this will raise an error unless overridden, as BaseModels do not have a repr method.
Source code in docprompt/schema/pipeline/metadata.py
__setattr__(name, value)
Allow for setting of attributes on the metadata class.
The attributes are set through the following hierarchy:
- If the model is sub-classed, it will be set as normal.
- Otherwise, if the attribute is private, it will be set as normal.
- Finally, if we are setting a public attribute on the base metadata class, we use the extra field.
Source code in docprompt/schema/pipeline/metadata.py
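The routing rules for setting attributes can be sketched as follows. RoutedMetadata and TypedMetadata are hypothetical names, and a plain class stands in for the Pydantic model:

```python
class RoutedMetadata:
    """Hypothetical model of the setattr routing; not docprompt's code."""

    def __init__(self, **kwargs):
        # Bypass our own __setattr__ while creating the backing dictionary.
        object.__setattr__(self, "extra", {})
        for key, value in kwargs.items():
            setattr(self, key, value)

    def __setattr__(self, name, value):
        if type(self) is not RoutedMetadata or name.startswith("_"):
            # Sub-classed models and private names are set as normal.
            object.__setattr__(self, name, value)
        else:
            # Public names on the base class go into `extra`.
            self.extra[name] = value


class TypedMetadata(RoutedMetadata):
    """Stands in for a sub-classed, strictly typed metadata model."""


base = RoutedMetadata(cow="moo")
base.foo = "fighters"
base._cache = 1

typed = TypedMetadata()
typed.foo = "bar"
```

In this sketch, base.extra ends up holding cow and foo while _cache becomes a regular instance attribute, and typed.foo is stored directly on the instance.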
__setitem__(name, value)
Provide dictionary functionality to the metadata class.
This only works for the base metadata model. If sub-classed, this will raise an error unless overridden, as BaseModels do not have a __setitem__ method.
Source code in docprompt/schema/pipeline/metadata.py
from_owner(owner, **data)
classmethod
Create a new instance of the metadata class with the owner set.
validate_data_fields_from_annotations(data)
classmethod
Validate the data fields from the annotations.
Source code in docprompt/schema/pipeline/metadata.py
DocumentCollection
Bases: BaseModel, Generic[DocumentCollectionMetadata, DocumentNodeMetadata, PageNodeMetadata]
Represents a collection of documents with some common metadata
Source code in docprompt/schema/pipeline/node/collection.py
DocumentNode
Bases: BaseNode, Generic[DocumentNodeMetadata, PageNodeMetadata]
Represents a single document, with some metadata
Source code in docprompt/schema/pipeline/node/document.py
persistance_path
property
writable
The base path to storage location.
from_storage(path, file_hash, **kwargs)
classmethod
Load the document node from storage.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The base path to storage location. Example (S3): "s3://bucket-name/key/to/folder". Example (Local FS): "/tmp/docprompt/storage" | required |
file_hash | str | The hash of the document. | required |
**kwargs | | Additional keyword arguments for fsspec FileSystem | {} |
Returns:
Name | Type | Description |
---|---|---|
DocumentNode | Self | The loaded document node. |
Source code in docprompt/schema/pipeline/node/document.py
metadata_class()
classmethod
Get the metadata class for instantiating metadata from the model.
Source code in docprompt/schema/pipeline/node/document.py
page_metadata_class()
classmethod
Get the metadata class for the page nodes in the document.
Source code in docprompt/schema/pipeline/node/document.py
persist(path=None, **kwargs)
Persist a document node to storage.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Optional[str] | Overwrites the current | None |
**kwargs | | Additional keyword arguments for fsspec FileSystem | {} |
Returns:
Name | Type | Description |
---|---|---|
FileSidecarsPathManager | FileSidecarsPathManager | The file path manager for the persisted document node. |
Source code in docprompt/schema/pipeline/node/document.py
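Taken together, persist and from_storage form a round trip keyed by a base path and a file hash. The sketch below illustrates that idea on the local filesystem with the standard library only. The function names, the file naming scheme, and the use of SHA-256 are all assumptions made for illustration; docprompt's actual layout and its fsspec-based I/O are not shown on this page:

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sketch_persist(base_path, payload, metadata):
    """Write a document and a metadata sidecar under base_path (hypothetical layout)."""
    file_hash = hashlib.sha256(payload).hexdigest()
    base = Path(base_path)
    base.mkdir(parents=True, exist_ok=True)
    (base / f"{file_hash}.bin").write_bytes(payload)
    (base / f"{file_hash}.metadata.json").write_text(json.dumps(metadata))
    return file_hash


def sketch_from_storage(base_path, file_hash):
    """Load the document and its metadata sidecar back by hash."""
    base = Path(base_path)
    payload = (base / f"{file_hash}.bin").read_bytes()
    metadata = json.loads((base / f"{file_hash}.metadata.json").read_text())
    return payload, metadata


with tempfile.TemporaryDirectory() as tmp:
    file_hash = sketch_persist(tmp, b"%PDF-1.7 example", {"title": "demo"})
    payload, metadata = sketch_from_storage(tmp, file_hash)
```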
refresh_locator()
Refreshes the locator for this document node
Source code in docprompt/schema/pipeline/node/document.py
PageNode
Bases: BaseNode, Generic[PageNodeMetadata]
Represents a single page in a document, with some metadata
Source code in docprompt/schema/pipeline/node/page.py
metadata
The metadata class defines a basic yet flexible interface for metadata attached to various fields.
In essence, this allows developers to either create their metadata in an unstructured manner (i.e. as a dictionary) or sub-class the base metadata class to create a more strictly typed metadata model for their page and document nodes.
node
base
collection
document
page
typing
rasterizer
DocumentRasterizer
Source code in docprompt/schema/pipeline/rasterizer.py
propagate_cache(name, rasters)
Should be one-indexed
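The note above presumably means that rasters are keyed by page number starting at 1 rather than 0. Assuming the rasters are held as a mapping from page number to image bytes (an assumption, since the signature is not shown on this page), a one-indexed mapping can be built from an ordered list like this. The helper name one_indexed is hypothetical:

```python
def one_indexed(rasters):
    """Key each raster by its page number, starting at 1 (hypothetical helper)."""
    return dict(enumerate(rasters, start=1))


cache = one_indexed([b"page-a", b"page-b"])
```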