Appian's new suite of document extraction features makes it easy to extract text and data from documents. Appian auto-generates a form so that a person can confirm or correct the extraction, which provides human-in-the-loop validation of automated extraction results. The user input then trains the extraction to perform better over time.
Document extraction identifies the contents of fields in forms (key/value pairs) from PDF files. For example, you can extract data from invoices by identifying the field names and values that are paired together, such as Invoice Number and INV-12.
Appian maps each extracted key to a data type field. Appian also auto-generates a form for human-in-the-loop validation of automated extraction results. After this manual reconciliation of the extracted data, Appian stores and recalls the mapping of the extracted key to an Appian field. This means that, over time, Appian learns to extract more information automatically.
We think you are going to get so much benefit out of Appian Document Extraction that we're giving you access to a pre-built application called Intelligent Document Processing (IDP). IDP takes advantage of all of the document extraction features, plus adds the capability to automatically classify documents, monitor the performance, and securely process documents across multiple teams. Check out the Intelligent Document Processing Documentation for more information.
This page walks you through how to use the document extraction functionality together. For more details about the individual smart services and functions, see the list below:
Many companies currently extract data from their documents manually or use expensive, high-maintenance optical character recognition (OCR) software.
Appian's document extraction overcomes these challenges by using flexible AI to analyze documents and accurately extract text and data. Once the information is captured, you can take action on it within your Appian applications.
Common use cases for document extraction include:
Before you get started, it is important to understand what kind of documents work best for data extraction.
Appian Document Extraction supports files that meet the following standards:
Document extraction works best with forms that tend to contain similar information in each document. For example, almost all invoices have an invoice number and a total, and almost all purchase orders have a PO number and a purchaser.
Document extraction also works best with forms that have clearly labeled values to help extract data from the document.
For example, the following form contains numerous labels and values associated with those labels, such as the label INVOICE and the value 101 and the label DATE and the value MAR 20, 2020.
Since Appian Document Extraction is meant to extract labels and values, extracting paragraphs of text is not a good use case for it. If you need to extract paragraphs of text, try using the Google Cloud Vision Connected System.
Likewise, if you need to find specific information in text, such as footnotes in a document, you will be better served by the Google Cloud Vision Connected System along with expressions to analyze the output.
The instructions below provide an overview of how to set up and create a simple document extraction process.
Appian document extraction uses Google Cloud Document AI. You must set up a Google Service Account in the Administration Console > Document Extraction section for authentication.
When creating your data type for your document extraction process, take the following notes into consideration:
For example, an Invoice data type with fields like customerName, customerNumber, billingAddress, phoneNumber.
[...] field is not currently supported.
To configure a document extraction process:
Start Doc Extraction Smart Service
Input: Runtime Document
Output: Doc Extraction Id

Input: docExtractionId
Output: status, which is one of the following values:
COMPLETE: Analysis is done and Appian has downloaded the results. You can now proceed to extract results using a!docExtractionResult() or the Reconcile Doc Extraction Smart Service.
IN PROGRESS: Analysis is still in progress. Appian recommends checking again in 60 seconds.
INVALID ID: Check that the provided ID is valid or start a new doc extraction run on the same document.
ERROR: An internal evaluation error happened during processing.

Drag in a Script Task node and use the a!docExtractionResult() function. Store the populated CDT with the extracted data in a process variable.
Inputs: docExtractionId, typeNumber, confidenceThreshold
Output: extractedData

For example, a condition such as isnull(index(pv!invoice,"invoiceNumber",null)) could result in going to reconciliation, but a populated result would skip the reconciliation task.

Reconcile Doc Extraction Smart Service
Inputs: Doc Extraction Id, Data Type Number
Outputs: extractedData, isSubmit, isException

Use isSubmit and isException to route the process model. For example, use isException=true() to route to a chained user input task, where the user provides more information. The user can also classify what is wrong, for example: document is missing information, incorrect document type.

After creating your process model, run it with a few samples to test extraction and to see how your auto-extracted results change.
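The routing decisions in the steps above can be sketched in Python, purely to illustrate the decision logic. This is not Appian expression code: the function names next_step and needs_reconciliation, the returned step labels, and the dictionary standing in for the invoice data type are all hypothetical. Only the four status values, the 60-second recheck, and the invoiceNumber check come from this page.

```python
# Illustrative model of the process routing; not Appian expression code.
# Function names and step labels are hypothetical.

RECHECK_SECONDS = 60  # this page recommends checking again in 60 seconds


def next_step(status):
    """Map a doc extraction status to the next action in the process."""
    if status == "COMPLETE":
        return "extract_results"
    if status == "IN PROGRESS":
        return "wait_and_recheck"  # wait RECHECK_SECONDS, then check again
    if status == "INVALID ID":
        return "check_id_or_restart"
    if status == "ERROR":
        return "handle_error"
    raise ValueError(f"unexpected status: {status!r}")


def needs_reconciliation(invoice):
    """Mirror the check isnull(index(pv!invoice, "invoiceNumber", null)).

    A missing invoiceNumber sends the document to the reconciliation
    task; a populated value skips it.
    """
    return invoice.get("invoiceNumber") is None
```

With these definitions, next_step("IN PROGRESS") returns "wait_and_recheck", and needs_reconciliation({"invoiceNumber": "INV-12"}) returns False, so that document would skip the reconciliation task.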
To reduce the document processing time, you will want Appian Document Extraction to auto-extract as much document data as possible. To start, Appian uses the field names from the data type to find a match. Then, Appian learns how to map your data to your data type fields from the user interactions with the reconcile interface. For example, if users provide mappings, then eventually Appian Document Extraction will recognize that Invoice Number, Invoice #, and Invoice No. all map to the invoiceNumber Appian data type field.
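One way to picture this learning behavior is as a lookup table that starts from the data type's field names and grows as users confirm mappings. The Python sketch below is a conceptual model only, not how Appian implements it: the LabelMapper class, its methods, and the normalize rule (lowercase, keep only letters, digits, and #) are all assumptions made for illustration.

```python
# Conceptual model of learned auto-mapping; not Appian's implementation.
# LabelMapper, learn, lookup, and normalize are hypothetical names.
import re


def normalize(label):
    """Lowercase a label and keep only letters, digits, and '#'."""
    return re.sub(r"[^a-z0-9#]", "", label.lower())


class LabelMapper:
    def __init__(self, field_names):
        # Start from the data type's own field names.
        self.known = {normalize(name): name for name in field_names}

    def learn(self, document_label, field_name):
        """Record a mapping that a user confirmed during reconciliation."""
        self.known[normalize(document_label)] = field_name

    def lookup(self, document_label):
        """Return the mapped field name, or None if the label is unknown."""
        return self.known.get(normalize(document_label))


mapper = LabelMapper(["invoiceNumber", "total"])
print(mapper.lookup("Invoice Number"))  # matches the field name directly
print(mapper.lookup("Invoice #"))       # not known until a user maps it
mapper.learn("Invoice #", "invoiceNumber")
mapper.learn("Invoice No.", "invoiceNumber")
print(mapper.lookup("Invoice No."))     # now resolves to the same field
```

In this model, the first documents need the most reconciliation work, and each confirmed mapping removes one label variant from future manual review.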
Once you set up your end-to-end process, Appian recommends testing the process with a series of sample documents to see how Appian learns to auto-map.
Some important notes about learned auto-mapping behavior:
When deploying your application to another environment, consider the following notes:
This means you may see different behavior between environments for auto-extraction depending on which documents have been processed and how they have been reconciled by users.
After the Start Doc Extraction Smart Service extracts data from a document, the Reconcile Doc Extraction Smart Service generates a task for a user to validate the data.
To complete this task, users compare the data that was extracted to an image of the uploaded document. They can use the information that displays in the document preview to update any incorrect or missing information.
If any information is missing, you can populate the information in three ways:
Values selected from the document preview will improve data extraction results. Values entered manually will not. For example, if you select the value for a PO number from the document preview in two different documents, Appian can learn that PO No. and PO # both mean PO Number. If you have the option, select correctly labeled values from the document preview instead of entering them manually.
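The difference between selected and manually entered values can be illustrated with a small Python sketch. The entry structure, the "selected" and "manual" source labels, and the training_pairs function are hypothetical; the sketch only models the rule stated above, that values picked from the document preview contribute to learning and typed values do not.

```python
# Illustrative only: which reconciled entries contribute to learning.
# The entry structure and the "selected"/"manual" labels are hypothetical.

def training_pairs(entries):
    """Return (document label, field) pairs for values picked in the preview.

    Manually typed values are skipped, so they teach the mapping nothing.
    """
    return [
        (entry["label"], entry["field"])
        for entry in entries
        if entry["source"] == "selected"
    ]


entries = [
    {"field": "poNumber", "label": "PO No.", "source": "selected"},
    {"field": "poNumber", "label": "PO #", "source": "selected"},
    {"field": "purchaser", "label": None, "source": "manual"},
]
print(training_pairs(entries))
```

Here the two selected entries yield label-to-field pairs for poNumber, while the manually typed purchaser value yields nothing that could improve later extractions.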
Click the box that surrounds the desired value in the document reconciliation viewer on the right, then click the arrow next to the field to populate the field.
While you are reconciling the data, icons indicate how the information was entered for each field: