Binary Classification with DocPrompt
Binary classification is a fundamental task in document analysis, where you categorize pages into one of two classes. This guide will walk you through performing binary classification using DocPrompt with the Anthropic provider.
Setting Up
First, let's import the necessary modules and set up our environment:
from docprompt import load_document, DocumentNode
from docprompt.tasks.factory import AnthropicTaskProviderFactory
from docprompt.tasks.classification import ClassificationInput, ClassificationTypes
# Initialize the Anthropic factory
# Make sure you have set the ANTHROPIC_API_KEY environment variable
factory = AnthropicTaskProviderFactory()
# Create the classification provider
classification_provider = factory.get_page_classification_provider()
Preparing the Document
Load your document and create a DocumentNode:
document = load_document("path/to/your/document.pdf")
document_node = DocumentNode.from_document(document)
Configuring the Classification Task
For binary classification, we need to create a ClassificationInput
object. This object acts as a prompt for the model, guiding its classification decision. Here's how to set it up:
classification_input = ClassificationInput(
type=ClassificationTypes.BINARY,
instructions="Determine if the page contains information about financial transactions.",
confidence=True # Optional: request confidence scores
)
Let's break down the ClassificationInput
:
type
: Set toClassificationTypes.BINARY
for binary classification.instructions
: This is crucial for binary classification. Provide clear, specific instructions for what the model should look for.confidence
: Set toTrue
if you want confidence scores with the results.
Note: For binary classification, you don't need to specify labels. The model will automatically use "YES" and "NO" as labels.
Performing Classification
Now, let's run the classification task:
Interpreting Results
The results
dictionary contains the classification output for each page. Let's examine the results:
for page_number, result in results.items():
label = result.labels
confidence = result.score
print(f"Page {page_number}:")
print(f"\tClassification: {label} ({confidence})")
print('---')
This will print the classification (YES or NO) for each page, along with the confidence level if requested.
Tips for Effective Binary Classification
-
Clear Instructions: The
instructions
field inClassificationInput
is crucial. Be specific about what constitutes a "YES" classification. -
Consider Page Context: Remember that the model analyzes each page independently. If your classification requires context from multiple pages, you may need to adjust your approach.
-
Confidence Scores: Use the
confidence
option to get an idea of how certain the model is about its classifications. This can be helpful for identifying pages that might need human review. -
Iterative Refinement: If you're not getting the desired results, try refining your instructions. You might need to be more specific or provide examples of what constitutes a positive classification.
Conclusion
Binary classification with DocPrompt allows you to quickly categorize pages in your documents. By leveraging the Anthropic provider and carefully crafting your ClassificationInput
, you can achieve accurate and efficient document analysis.
Remember that the quality of your results heavily depends on the clarity and specificity of your instructions. Experiment with different phrasings to find what works best for your specific use case.