Appian's new suite of document extraction features makes it easy to extract text and data from documents. Appian auto-generates a form that a person uses to confirm or correct the extraction, providing human-in-the-loop validation of the automated results. The user's inputs then train the extraction to perform better over time.
Document extraction identifies the contents of form fields (key/value pairs) in PDF files. For example, you can extract data from invoices by identifying the field names and values that are paired together, e.g., Invoice Number and INV-12.
Appian maps each extracted key to a field in a data type. After a person reconciles the extracted data in the auto-generated form, Appian stores and recalls the mapping of the extracted key to an Appian field. This means that, over time, Appian learns to extract more information automatically.
We think you are going to get so much benefit out of Appian Document Extraction that we're giving you access to a pre-built application called Intelligent Document Processing (IDP). IDP takes advantage of all of the document extraction features, plus adds the capability to automatically classify documents, monitor the performance, and securely process documents across multiple teams. Check out the Intelligent Document Processing Documentation for more information.
This page walks you through how to use the document extraction functionality together. For more details about the individual smart services and functions, see the list below:
Many companies currently extract data from their documents manually or use expensive, high-maintenance optical character recognition (OCR) software.
Appian's document extraction overcomes these challenges by using flexible AI to analyze documents and accurately extract text and data. Once the information is captured, you can take action on it within your Appian applications.
Common use cases for document extraction include:
Before you get started, it is important to understand what kind of documents work best for data extraction.
Appian Document Extraction supports files that meet the following standards:
Document extraction works best with forms that tend to contain similar information in each document. For example, almost all invoices will have an invoice number and total and almost all purchase orders will have a PO number and purchaser.
Appian Document Extraction can use field position to learn more about your data and improve extraction results. To help train the feature, you can use consistently structured documents that place the same fields in the same places. As you complete reconciliation tasks, the feature learns to recognize data based on its position.
Document extraction also works best with forms that have clearly labeled values to help extract data from the document.
For example, the following form contains numerous labels and values associated with those labels, such as the label INVOICE and the value 101 and the label DATE and the value MAR 20, 2020.
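As a rough illustration of what "labels and values" means here, the sketch below models the pairs from the example form as a plain Python dictionary. The `normalize_label` helper and the dictionary layout are assumptions made for illustration; they are not Appian's internal representation.

```python
# Illustrative only: a plain-Python model of label/value pairs.
# This is not how Appian stores extraction results.

def normalize_label(label: str) -> str:
    """Lowercase a label and strip surrounding whitespace and a trailing colon."""
    return label.strip().rstrip(":").strip().lower()

# Label/value pairs as they might be read off the example form.
raw_pairs = [("INVOICE", "101"), ("DATE", "MAR 20, 2020")]

# Each clearly labeled value becomes one key/value entry.
extracted = {normalize_label(label): value for label, value in raw_pairs}
```

A form with clear labels produces one unambiguous entry per label, which is why such documents are easier to extract from than free-flowing text.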
Appian document extraction allows you to extract information from tables in your documents. The data from each table is presented neatly in the reconciliation task for quick and simple verification.
Appian document extraction can process forms with fillable fields. By default, Appian itself extracts the values from these forms rather than sending them to Google's Document AI. To use Document AI for all of your forms, including fillable forms, flatten your PDFs before beginning extraction. For example, you can add the Community-supported PDF Tools plug-in to your process model to flatten PDFs before the extraction nodes.
Since Appian Document Extraction is meant to extract labels and values, extracting paragraphs of text is not a good use case for it. If you need to extract paragraphs of text, try using the Google Cloud Vision Connected System.
Likewise, if you need to find specific information in text, such as footnotes in a document, you will be better served by the Google Cloud Vision Connected System along with expressions to analyze the output.
The instructions below provide an overview of how to set up and create a simple document extraction process.
Appian Document Extraction uses Google Cloud Document AI. You must set up a Google Service Account in the Administration Console > Document Extraction section for authentication.
When creating your data type for your document extraction process, take the following notes into consideration:
Invoice data type with fields like customerName, customerNumber, billingAddress, phoneNumber.
field is not currently supported.
To extract and save table data from your document, you need to create a set of nested data types to process tables.
Create a custom data type that represents a single row of the table in your document that you would like to extract information from. These fields should correspond to the column labels of your table. This data type should only include fields of type text, and include a primary key ID field. Do not include any foreign key fields.
After you've finished, create a custom data type that represents the overall document. Add a field to this new CDT, and set its type to be the CDT you previously created to represent a row of the table. This nested CDT field should be marked as an array in the parent CDT, which enables you to extract more than one row of table data to the nested field.
In the CDT for the overall document, configure the foreign key relationship with the table CDT:
parentid in all lowercase. Use this value exactly.
When you publish a data store for the first time, Appian creates the tables in the database. This is the database table creation pattern we recommend. If you manually create database tables, you'll have to modify the CDT's XSD definition to map the CDTs to the corresponding tables and also manually add a parentid column to the child database table. This foreign key column is created in the child database table automatically when you publish the data store as recommended.
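The nested structure described above can be pictured with a small sketch. It uses Python dataclasses as a stand-in for CDTs; the type names (`Invoice`, `InvoiceLineItem`), the field names, and the `child_table_rows` helper are hypothetical, and the sketch only shows the shape of the data, not how Appian persists it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InvoiceLineItem:
    """One table row: a primary key plus text-only fields, no foreign key field."""
    id: Optional[int] = None
    description: str = ""
    quantity: str = ""
    amount: str = ""


@dataclass
class Invoice:
    """The overall document, holding an array of the row type."""
    id: Optional[int] = None
    invoiceNumber: str = ""
    lineItems: List[InvoiceLineItem] = field(default_factory=list)


def child_table_rows(invoice: Invoice) -> List[Dict[str, object]]:
    """Show how child rows would carry the parent's key in a 'parentid' column."""
    return [
        {
            "id": item.id,
            "description": item.description,
            "quantity": item.quantity,
            "amount": item.amount,
            "parentid": invoice.id,  # lives in the child table, not in the row type
        }
        for item in invoice.lineItems
    ]
```

Note that the row type itself has no foreign key field: the `parentid` value only appears when the rows are laid out as they would be in the child database table.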
To configure a document extraction process:
Doc Extraction Id
COMPLETE: Analysis is done and Appian has downloaded the results. You can now extract the results using a!docExtractionResult() or the Reconcile Doc Extraction Smart Service.
IN PROGRESS: Analysis is still in progress. Appian recommends checking again in 60 seconds.
INVALID ID: Check that the provided ID is valid or start a new doc extraction run on the same document.
ERROR: An internal evaluation error occurred during processing.
Drag in a Script Task node and use the a!docExtractionResult() function. Store the populated CDT with the extracted data in a process variable:
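The four status values above suggest a simple routing rule. The sketch below models that rule in Python purely to make the branching explicit; in Appian this logic would live in process model gateways and timers. The step names and the `get_status` callable are hypothetical placeholders.

```python
import time
from typing import Callable

RETRY_SECONDS = 60  # recheck interval recommended for IN PROGRESS


def next_step(status: str) -> str:
    """Map an extraction status to the next (hypothetical) process step."""
    if status == "COMPLETE":
        return "extract_results"
    if status == "IN PROGRESS":
        return "wait_and_recheck"
    if status == "INVALID ID":
        return "restart_extraction"
    if status == "ERROR":
        return "handle_error"
    raise ValueError(f"Unexpected status: {status!r}")


def wait_for_completion(
    get_status: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
    max_checks: int = 10,
) -> str:
    """Poll until the status is no longer IN PROGRESS, or give up."""
    for _ in range(max_checks):
        step = next_step(get_status())
        if step != "wait_and_recheck":
            return step
        sleep(RETRY_SECONDS)
    return "timed_out"
```

Capping the number of checks is a design choice of this sketch, so that a run that never completes does not loop forever.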
isnull(index(pv!invoice,"invoiceNumber",null)) could result in going to reconciliation, but a populated result would skip the reconciliation task.
Doc Extraction Id,
Data Type Number
isException to route the process model.
isException=true() to route to a chained user input task, where the user provides more information. The user can also classify what is wrong, for example: the document is missing information or is the incorrect document type.
After creating your process model, run it with a few samples to test extraction and to see how your auto-extracted results change.
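The two routing decisions in these steps (whether to send the document to reconciliation, and where to go when the task reports an exception) can be sketched as follows. This is a Python model of the decision logic only; the function names, step names, and the choice to treat a blank value like a missing one are assumptions of the sketch, not Appian behavior.

```python
from typing import Mapping, Optional, Sequence


def needs_reconciliation(
    extracted: Mapping[str, Optional[str]],
    required_fields: Sequence[str] = ("invoiceNumber",),
) -> bool:
    """Return True when any required field is missing, None, or blank."""
    for name in required_fields:
        value = extracted.get(name)
        if value is None or str(value).strip() == "":
            return True
    return False


def route_after_reconciliation(is_exception: bool) -> str:
    """Pick the next (hypothetical) step based on the isException output."""
    if is_exception:
        return "exception_user_input_task"
    return "continue_processing"
```

For example, a result with no invoice number would go to a person for reconciliation, while a fully populated result could skip that task.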
To reduce the document processing time, you will want Appian Document Extraction to auto-extract as much document data as possible. To start, Appian uses the field names from the data type to find a match. Over time, Appian learns how to map your data to your data type fields from the user interactions with the reconcile interface. For example, if users provide mappings, then eventually Appian Document Extraction will recognize that Invoice # and Invoice No. both map to the invoiceNumber Appian data type field.
Once you set up your end-to-end process, Appian recommends testing the process with a series of sample documents to see how Appian learns to auto-map.
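One way to picture learned auto-mapping is as a lookup table from document labels to data type fields that grows as users reconcile documents. The sketch below is a deliberately simplified model under that assumption; the `AliasMap` class and its methods are hypothetical and do not describe how Appian implements learning.

```python
from typing import Dict, Iterable, Optional


class AliasMap:
    """Toy model: document labels (aliases) mapped to data type field names."""

    def __init__(self, field_names: Iterable[str]) -> None:
        # Start with the field names themselves as the only known labels.
        self._aliases: Dict[str, str] = {
            self._normalize(name): name for name in field_names
        }

    @staticmethod
    def _normalize(label: str) -> str:
        """Lowercase the label and collapse runs of whitespace."""
        return " ".join(label.lower().split())

    def learn(self, label: str, field_name: str) -> None:
        """Record a mapping that a user confirmed during reconciliation."""
        self._aliases[self._normalize(label)] = field_name

    def lookup(self, label: str) -> Optional[str]:
        """Return the mapped field name, or None if the label is unknown."""
        return self._aliases.get(self._normalize(label))
```

In this model, a label such as `Invoice #` is unknown until a user maps it once; after that, later documents with the same label resolve to `invoiceNumber` without manual work.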
Some important notes about learned auto-mapping behavior:
When deploying your application to another environment, consider the following notes:
This means you may see different behavior between environments for auto-extraction depending on which documents have been processed and how they have been reconciled by users.
To complete this task, users compare the data that was extracted to an image of the uploaded document. They can use the information that displays in the document preview to update any incorrect or missing information.
To complete the reconciliation task:
If any information is missing, you can populate the information in three ways:
Values selected from the document preview will improve data extraction results. Values entered manually will not. For example, if you select the value for a PO number from the document preview in two different documents, Appian can learn that PO No. and PO # both mean PO Number. If you have the option, you should select correctly labeled values from the document preview instead of entering them manually.
Place your cursor in the field, then click the box that surrounds the desired value.
Click the box that surrounds the desired value in the document preview on the right, then click the arrow next to the field to populate the field.
To select text that was not automatically extracted, press and hold the Shift key while dragging the mouse.
Perform additional reconciliation for tables, if they appear in the document preview.
While you are reconciling the data, icons indicate how the information was entered for each field:
As users submit document extraction tasks, Appian will learn the aliases for your tables' column headers. It can then use what it has learned to automatically extract table values, reducing the need for human reconciliation.
When manually extracting table data, users can take a variety of actions by clicking on the menu icon next to a column or row.
For columns, users can:
For rows, users can:
Users can also remove individual rows by clicking the close icon on the right side of each row.