As you train your skill, you'll create one or more models. Each time you train a model, you'll see metrics to help show you how well it is able to fulfill its purpose. There is no single metric that can tell you that your model is ready to use in production. Instead, you'll want to evaluate this data based on your use case and consideration of how your process might be impacted.
This page explains the metrics you'll see during model training, how to interpret them, and how to use them to evaluate if the model meets your requirements.
As training begins, the model divides the sample files into two groups: training files and test files.
This split between training files and test files occurs when your model begins training. The samples are split into the two groups randomly each time you train a model.
The metrics you'll see reflect the model's performance classifying or extracting values from test files.
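As a rough illustration of a random split like the one described above, here is a minimal Python sketch. The function name `split_samples`, the file names, and the 35% test fraction are assumptions made for the example; the AI skill's actual split logic and proportions aren't documented here.

```python
import random


def split_samples(samples, test_fraction=0.35, seed=None):
    """Randomly split samples into (training_files, test_files).

    test_fraction is an illustrative assumption, not a documented setting.
    """
    rng = random.Random(seed)
    shuffled = list(samples)      # copy so the caller's list is untouched
    rng.shuffle(shuffled)         # new random order on each call (unless seeded)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical sample files
files = [f"sample_{i:03d}.pdf" for i in range(100)]
training_files, test_files = split_samples(files, seed=7)
print(len(training_files), len(test_files))
```

Because the shuffle is random, two training runs without a fixed seed would generally evaluate the model against different test files, which is one reason metrics can vary between models trained on the same samples.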
Each time you train a model in the document classification or email classification AI skill, you'll see the following metrics:
Each time you train a model in the document extraction AI skill, you'll see the following metrics:
You can view these metrics By field or By document. The By field view shows the aggregated metrics for each field wherever it appears across the sample documents you used to train. The By document view shows aggregated metrics for how the model performed when extracting data from each one of the documents you uploaded.
In either view, you can click the value in the Number of Labels column to see the actual value vs. the extracted value during training.
Before you can act upon these metrics, it's important to understand what each one means. First, let's recap some of the key concepts:
Now that we've refreshed our memories, we can start to evaluate whether the model's training results meet our needs:
A model can be in one of five states:
Available as a training metric in document classification and email classification AI skills.
Accuracy is the ratio of correct predictions to total predictions, expressed as a percentage.
Accuracy is a broader metric than precision and recall, and it gives you an idea of how many total correct predictions the model made.
Accuracy includes true positives (correctly predicting that the document is part of a certain type) and true negatives (correctly predicting that the document isn't part of a certain type).
When is accuracy useful?: When your model classifies documents or emails in roughly equal numbers.
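The accuracy calculation described above can be sketched in Python as follows. The function name and the counts are hypothetical and chosen only to illustrate the ratio.

```python
def accuracy(true_pos, true_neg, false_pos, false_neg):
    """Return accuracy as a percentage: correct predictions / total predictions."""
    total = true_pos + true_neg + false_pos + false_neg
    if total == 0:
        raise ValueError("accuracy is undefined when there are no predictions")
    # Both true positives and true negatives count as correct predictions.
    return 100.0 * (true_pos + true_neg) / total


# Hypothetical counts for a single document type
acc = accuracy(true_pos=40, true_neg=45, false_pos=5, false_neg=10)
print(f"{acc:.1f}%")  # 85.0%
```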
Available as a training metric in document classification and email classification AI skills.
Tip: Download the Prediction Result CSV to see the model's confidence score for each classification prediction made during training. This CSV is not available in document extraction AI skills.
The confidence score measures how sure the model is of its prediction. A confidence is calculated for each prediction the model makes.
Most decisions (made by humans or computers) can vary in confidence. If someone asked you, "what's 2 + 2?" and you answer "4," your confidence in that answer is probably very high. But if someone asked you, "what did you eat for lunch last Tuesday?" you might not be so sure of your reply. "I think I had pizza" indicates you aren't very confident in your answer. Machine learning models apply the same confidence to their predictions.
In training and in production, the model outputs a confidence score for each prediction it makes. The Classify Documents, Classify Emails, and Extract from Document smart services include a Confidence threshold output that you can configure to meet your requirements. This way, you can configure the process to proceed only when the model's confidence in a prediction is above the threshold value you set.
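To make the idea of a threshold concrete, here is a minimal Python sketch of routing a prediction based on its confidence. It does not use the smart services' actual interface: the `Prediction` class, the `route` function, the route names, and the 0.8 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    document: str
    predicted_type: str
    confidence: float  # assumed to be on a 0.0 to 1.0 scale


def route(prediction, threshold=0.8):
    """Proceed automatically only when confidence exceeds the threshold."""
    if prediction.confidence > threshold:
        return "proceed"
    return "manual_review"


# Hypothetical predictions
print(route(Prediction("doc_a.pdf", "invoice", 0.92)))  # proceed
print(route(Prediction("doc_b.pdf", "invoice", 0.55)))  # manual_review
```

Raising the threshold sends more predictions to manual review but reduces the chance that a low-confidence prediction moves through the process unchecked.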
Confidence and accuracy are similar but distinct concepts and it's important to not confuse them. Confidence is determined per prediction, whereas accuracy is an average that's determined from the entire data set.
To understand the difference between accuracy and confidence, consider the following example:
You've created a model to recognize and classify two document types: invoices and purchase orders. During training, you uploaded 100 examples of invoices your business receives. The model uses 65 documents as training documents and 35 as test documents. The model learns from the 65 training documents, and then puts its knowledge to the test with the 35 test documents.
Of those 35 test documents, the model positively identifies 27 of them as invoices and the remaining 8 as purchase orders. Because this is training data, we (and the model) know that all 35 are invoices.
Looking back at the equation for accuracy, this model's accuracy is (27 / 35) * 100, or 77%.
Next is confidence. Confidence is the model's estimate of the likelihood that a given prediction is correct, not the actual result. Even 100% confidence doesn't guarantee that the prediction is correct. Let's break it down by continuing our example:
Confidence is calculated for each document. Let's say the model is analyzing a document called acme_receiving.pdf. The model's average accuracy for predicting an invoice or a purchase order is 77%, so we can expect that the model will do reasonably well predicting the document's type.
Based on what it's learned about invoices and purchase orders, the model predicts that this is an invoice. However, the model's analysis only picked up on some signals that this document is an invoice. Based on those traits, the model is fairly confident this is an invoice and gives it a confidence score of 70%.
Next, the model analyzes a document called stock_request.pdf. This time, the model recognizes some sure signs that this is a purchase order, which it learned during training. For one, the label "purchase order" appears at the top. During model training, every example purchase order had this label too. Other labels, like "description," "unit price," and "quantity" also appear here and are strong indicators of this document being a purchase order. Based on this information, the model classifies this as a purchase order with a confidence score of 90%.
Both predictions are true positives: the first document is indeed an invoice and the second one is a purchase order. But the model was more confident when it came to the purchase order, because it had more experience identifying similar documents in the past.
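The numbers from this example can be laid out in a short Python sketch to show the difference in scope: accuracy is one figure for the whole test set, while confidence is attached to each individual prediction. The variable names are illustrative only.

```python
# Accuracy: a single aggregate figure over all 35 test documents
correct, total = 27, 35
accuracy_pct = 100 * correct / total
print(f"Accuracy across the test set: {accuracy_pct:.0f}%")

# Confidence: one score per prediction
predictions = [
    ("acme_receiving.pdf", "invoice", 0.70),
    ("stock_request.pdf", "purchase order", 0.90),
]
for name, predicted_type, confidence in predictions:
    print(f"{name}: {predicted_type} (confidence {confidence:.0%})")
```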
Accuracy and confidence are important for some use cases. Jump to Evaluating the data to learn more.
Available in document classification and email classification AI skills.
To view the Confusion Matrix, select the All Data view within the metrics.
The confusion matrix visually represents the division of predicted and actual types. The confusion matrix grows larger based on the number of document or email types within the model.
For example, in the image above, the two types (invoice and purchase order) are represented on both axes: predicted and actual. This creates four scenarios to measure:
We want the actual type to match the predicted type, so we look for higher numbers in the cells on the diagonal. These are the cells which overlap predicted and actual type, indicating a match.
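As a minimal sketch of how such a matrix is built, the Python below counts each (actual, predicted) pair and then sums the diagonal. The sample data is hypothetical, and placing actual types on rows and predicted types on columns is an assumption; the product's matrix may be oriented differently.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Return a matrix where rows are actual types and columns are predicted types."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


labels = ["invoice", "purchase order"]
# Hypothetical test results
actual = ["invoice"] * 4 + ["purchase order"] * 4
predicted = [
    "invoice", "invoice", "invoice", "purchase order",
    "purchase order", "purchase order", "purchase order", "invoice",
]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>15}: {row}")

# Diagonal cells are the matches between actual and predicted type
matches = sum(matrix[i][i] for i in range(len(labels)))
print("Matches on the diagonal:", matches)  # 6
```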
The macro average is the mean of the metrics across all types. It doesn't consider how many samples of each email or document type were added for training.
The weighted average is the mean of the metrics across all types, but it also takes into account the number of documents that were uploaded per document or email type.
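The difference between the two averages is easiest to see with an imbalanced example. The Python sketch below uses hypothetical per-type scores and sample counts; how the skill counts samples for weighting is an assumption here.

```python
def macro_average(scores):
    """Plain mean of the per-type scores, ignoring how many samples each type has."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores, sample_counts):
    """Mean of the per-type scores, weighted by each type's number of samples."""
    total = sum(sample_counts.values())
    return sum(scores[t] * sample_counts[t] for t in scores) / total


# Hypothetical per-type scores (for example, F-1) and sample counts
scores = {"rental car receipt": 0.90, "towing receipt": 0.60}
sample_counts = {"rental car receipt": 80, "towing receipt": 20}

macro = macro_average(scores)
weighted = weighted_average(scores, sample_counts)
print(f"Macro average:    {macro:.2f}")
print(f"Weighted average: {weighted:.2f}")
```

Here the weighted average is pulled toward the score of the type with more samples, while the macro average treats both types equally.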
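Precision is the proportion of the model's positive predictions that are correct: the number of true positives divided by the total number of positive predictions (true positives plus false positives).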
The metric only considers the number of correct predictions the model made regarding a certain type or field. It doesn't include correct predictions for documents or emails that aren't part of a certain type, nor does it consider correct predictions that the model didn't make (false negatives).
For example, a model is given 10 documents and tasked with identifying how many of them are invoices. There are 4 invoices in the set of documents to classify, but the model positively identifies only 3 as being invoices. However, those 3 identifications are correct. In this example, the model's precision is 1.0 because all of its predictions were correct, even though it didn't identify all of the invoices.
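The example above can be checked with a short Python sketch using the standard definition of precision. The function name is illustrative.

```python
def precision(true_pos, false_pos):
    """True positives divided by all positive predictions."""
    predicted_pos = true_pos + false_pos
    if predicted_pos == 0:
        raise ValueError("precision is undefined when there are no positive predictions")
    return true_pos / predicted_pos


# The model predicted "invoice" 3 times and all 3 were correct.
# The invoice it missed is a false negative, which precision ignores.
print(precision(true_pos=3, false_pos=0))  # 1.0
```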
Precision is closely related to recall, and together they are used to calculate the F-1 score.
When is precision useful?: When you want your model to avoid making false positive predictions.
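Recall is the proportion of actual positives that the model correctly identified: the number of true positives divided by the sum of true positives and false negatives. Unlike precision, recall also considers the correct predictions the model didn't make (false negatives).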
For example, a model is given 10 documents and tasked with identifying how many of them are invoices. There are 4 invoices in the set of documents, but the model identifies only 3. However, those 3 identifications are correct. In this example, the model's recall is .75 because it missed predicting one of the invoices.
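The same example works as a quick Python check of recall, using its standard definition. The function name is illustrative.

```python
def recall(true_pos, false_neg):
    """True positives divided by all actual positives."""
    actual_pos = true_pos + false_neg
    if actual_pos == 0:
        raise ValueError("recall is undefined when there are no actual positives")
    return true_pos / actual_pos


# 4 invoices exist; the model found 3 and missed 1 (a false negative).
print(recall(true_pos=3, false_neg=1))  # 0.75
```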
Recall is closely related to precision, and together they are used to calculate the F-1 score.
When is recall useful?: When you want your model to avoid making false negative predictions.
The F-1 score (aka F-score) is a metric used to measure a model's accuracy in machine learning. It's a quick way to understand the model's ability to fulfill its purpose and make correct predictions. It is computed from precision and recall (in its standard form, as their harmonic mean).
When is the F-1 score useful?: When your model classifies documents or emails in unequal numbers.
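As a sketch, the standard F-1 formula is the harmonic mean of precision and recall. The Python below applies it to the precision (1.0) and recall (0.75) from the invoice example; it assumes the skill uses this standard definition.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F-1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall from the invoice example
score = f1_score(precision=1.0, recall=0.75)
print(round(score, 3))  # 0.857
```

Because it is a harmonic mean, the F-1 score stays low if either precision or recall is low, so a model can't score well by doing well on only one of the two.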
The flexibility of the AI skill means that there is no one-size-fits-all solution. The results of model training are subjective, meaning that you'll need to determine if the metrics meet your requirements. As you evaluate the results of training you'll want to consider the following:
The answers to these questions can help you focus on different training metrics.
For the sake of demonstration and to help answer some of these questions, we'll use an example from a fictional company: Acme Insurance.
Acme Insurance offers vehicle insurance to customers. To file a new claim, a customer submits an online form. The customer can upload supporting paperwork (such as police reports, repair shop quotes, medical bills, and rental car bills), or they can request those organizations/businesses to send the documents directly to Acme.
After a customer files a claim, their case is automatically assigned to an adjustor in the state of their accident. The adjustor reviews the claim details and determines if any documentation is missing. If the claim has all of the necessary documentation, the adjustor creates a summary report to send to the at-fault party's insurance company to begin negotiations for reimbursement. If the claim is missing documentation, the adjustor reaches out to the customer (or businesses) to request it. In both cases, an automated process also extracts key data from the claim's supporting documents.
The adjustor takes special care with medical bills, prescriptions, and other sensitive information, since Acme is legally bound to not share this without the explicit permission of the customer first. Upon extraction, this data is also saved in a record type with extra security configured.
The adjustor spends a lot of time manually reviewing each incoming claim to see if the supporting documents are attached. On average, it takes an adjustor 15 minutes per incoming claim to review it for completeness. If the adjustor spent their 8-hour work day on this, they would get through roughly 30 new claims a day. Acme Insurance decides to pilot an AI project to see if they can reduce the amount of manual review each adjustor has to do.
The Acme developer gets started on a model to classify the following document types that are attached to incoming claims:
Now that we have a sense of the use case, we can discuss the evaluation questions:
Different metrics are suited to help you evaluate model performance based on your use case. The table below outlines which metrics Acme Insurance wants to pay attention to:
Description | Classification use case | Extraction use case | Metric of interest |
---|---|---|---|
Imbalanced data set | The model classifies rental car receipts and towing receipts. Historically, 80% of the total are rental car receipts. | The model extracts data from rental car receipts. Rental car companies include different fields on their receipts. | F-1 score |
When individual documents are of more interest than the whole document type | The model classifies police reports and towing bills at roughly equal volumes. | | Accuracy |
When false positives are annoying, but false negatives are detrimental | The model classifies medical bills. It's a bit annoying when the model accidentally classifies a repair estimate as a medical bill, but it's unacceptable when the model misclassifies a medical bill as another document type. | When extracting data from medical bills, it's unacceptable when the model identifies a social security number as a phone number. | Recall |
When false positive is a major failure and false negative is fine | The model classifies multiple document types, but if it classifies any documents as fraudulent, the case is routed to another department and consumes many resources to investigate. | If the model misidentifies the accident date on the police report, it could be flagged as outside of the claim window and require manual verification to proceed. | Precision |
As you consider these metrics, it's helpful to think about the impact that classification and extraction may have in the context of your process. If a document is classified or a field is extracted incorrectly, what happens next in the process? Does someone have to manually reclassify or identify it? And if so, how much time will that take? Is someone's personal information at risk? Will the process take much more time to fix these issues?
If the answer is yes to these questions, you'll have to evaluate the repercussions against the time savings of integrating AI. A model won't always make correct predictions, so you'll need to determine your appetite for uncertainty or incorrect predictions. The best way to do that is to think of the practical impacts.
It's tempting to aim for 100% accuracy when training your classification model. Why not try to get every single document classified correctly?
This is a natural goal, but it's misguided. The model is trained on a data set of sample files that represent the files it will encounter in production. By nature, the data set is a selection of the whole, and therefore can never be completely representative.
Overfitting is when a developer trains the model too specifically on the training data set. An overfitted model is very good at classifying the files that were used to train it. However, in production, the model is looking for these specific characteristics or patterns to the exclusion of others – so it misses some files because they don't fit within the model's narrow understanding of the document or email types.
Rather than trying to reach 100% accuracy, we recommend instead using the model in a test environment to see how it performs in the context of your process model.
If the model's training results don't yet meet your requirements, you can train another model using additional samples. The more documents, emails, or labels you have, the better.
You'll want to provide a comprehensive set of data to help train the model. Learn how to build a comprehensive machine learning data set.
As with skills like programming, swimming, or photography, the AI skill will need to regularly refresh its knowledge and understanding of your document types. This can occur if:
These are all occasions when you should consider training the model further to improve its performance for your purposes.