If your business spends countless hours extracting data from documents and forms, Appian is here to help. Appian includes a rich set of artificial intelligence (AI) features that accelerate the low-code development of document extraction processes. Leverage the power of AI to minimize repetitive and manual data extraction, and eliminate the need for expensive, high-maintenance optical character recognition (OCR) software.
This page covers how document extraction works in Appian, as well as what makes a document either a good or bad candidate for document extraction.
Appian is capable of processing PDF documents up to 15 pages or 7 MB. In the context of Appian document extraction, there are three types of PDFs that can be processed:
Appian is able to process most data from documents in all three of these digitization statuses.
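As a rough illustration of the size limits above, the following sketch checks a document before it is sent for extraction. It is not an Appian API: `PdfInfo`, `within_limits`, and `limit_problems` are hypothetical names, the page count is assumed to be known already, and the sketch assumes both limits must hold and that 1 MB means 1024 * 1024 bytes.

```python
from dataclasses import dataclass

MAX_PAGES = 15                 # page limit stated above
MAX_BYTES = 7 * 1024 * 1024    # 7 MB, assuming 1 MB = 1024 * 1024 bytes


@dataclass
class PdfInfo:
    """Hypothetical summary of a PDF; not an Appian type."""
    name: str
    page_count: int
    size_bytes: int


def limit_problems(pdf: PdfInfo) -> list:
    """Return a human-readable list of the limits this PDF exceeds."""
    problems = []
    if pdf.page_count > MAX_PAGES:
        problems.append(f"{pdf.name}: {pdf.page_count} pages exceeds {MAX_PAGES}")
    if pdf.size_bytes > MAX_BYTES:
        problems.append(f"{pdf.name}: {pdf.size_bytes} bytes exceeds {MAX_BYTES}")
    return problems


def within_limits(pdf: PdfInfo) -> bool:
    """True when the PDF exceeds neither limit (assumption: both must hold)."""
    return not limit_problems(pdf)
```

A check like this could run before a document enters the extraction process, so that oversized files are split or rejected early.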
Appian's built-in document extraction capabilities related to flattened PDFs, key-value pair extraction, and table extraction are only available for Cloud customers at this time. Self-managed and Appian Gov Cloud customers don't have access to these features. Other built-in capabilities are available for both Cloud and self-managed customers.
The digitization status and format of your documents will ultimately determine how much data Appian can successfully extract from them. Documents vary by how the content is organized, so you may find it helpful to have a few examples to reference as you evaluate.
Structured data includes information that's arranged in a fixed layout. Tax forms, passport applications, and hospital forms are good examples of documents with structured data. Appian can extract data from these types of documents easily due to the predictable and consistent positions of labels and values. Appian can use field position to learn more about your data and improve extraction results. To help train the feature, you can use consistently structured documents that place the same fields in the same places. As you complete reconciliation tasks, the feature learns to recognize data based on its position.
Here is an example of a structured document:
Semi-structured data includes similar pieces of information, but in varying layouts. Invoices, receipts, and utility bills are good examples of documents with semi-structured data. Appian's document extraction features are well equipped to identify and extract semi-structured data. Automatic extraction improves as you process additional documents.
Here is an example of a semi-structured document:
Unstructured data includes free-flowing paragraphs of text. Legal contracts and emails often include unstructured data. This type of information is more difficult to extract because the machine learning algorithms that identify the information are looking for key-value pairs. Larger blocks of text, or parts of that text, are more difficult to extract.
Here is an example of an unstructured document:
If your documents contain unstructured data, you may still be able to extract data from them:
Documents with these traits make good candidates for Appian's suite of document extraction features:
Documents with similar information. For example, invoices that all have invoice numbers and totals.
Documents with clearly labeled values. When the data is extracted, Appian pairs these labels with their corresponding values. For example, a date label is paired with the value March 1, 2021.
Documents with tables. Tables are another way to structure data in invoices and other documents with line items.
Documents in supported languages. Appian can extract data from documents in the following languages. Additional languages may be supported for documents with a digital layer (not flattened PDFs, i.e. scanned documents). Google's Document AI service also supports additional languages.
Does your document sound like a good fit so far? Great! Before you get started, also be sure to consider which documents might not be a good fit. We want to make sure you get the most value from Appian document extraction, and identifying which documents won't work is as important as identifying the documents that will.
Since Appian Document Extraction is meant to extract labels and values, extracting paragraphs of text is not a good use case for it. If you need to extract paragraphs of text, try using the Google Cloud Vision Connected System.
Likewise, if you need to find specific information in text, such as footnotes in a document, you will be better served by the Google Cloud Vision Connected System along with expressions to analyze the output.
This section provides more detail on how Appian extracts and maps data from your documents.
Appian learns about your documents in two ways:
The document extraction process consists of three parts:
See the glossary for a refresher on the terms used on this page.
Output: Identified text values, checkboxes, and tables
In the first step, the PDF goes to your custom ML model to run optical character recognition (OCR) and extract text values. Pre-trained models identify special document formatting, such as checkboxes and tables. The models return all identified values (represented by blue selection boxes in the image below). Fields are represented by purple selection boxes for reference, but they do not appear in Appian.
Input: Identified checkboxes and tables from step 1
Output: Reconciliation user input task
Next, Appian leverages previous mappings stored in the customer's environment to know which extracted tables and checkbox data to map to the document structure. These mappings are stored in a dictionary as you complete reconciliation tasks over time (step 3). So, the more mappings and reconciliation tasks you complete for a given document type, the better Appian is at mapping that data. Each subsequent reconciliation task is faster and more accurate.
If your Appian environment has previously mapped values to your table or checkbox fields, Appian leverages those previous keys to assist in mapping the data before assigning a reconciliation task. Each time a user completes a reconciliation task, Appian stores updated mappings in a simple dictionary of terms (keys and positions) to use next time it has to map data from the pre-trained model (output of step 1) to the structured fields in your application.
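The mapping dictionary described above can be pictured with a minimal sketch. This is not Appian's implementation: `MappingDictionary`, `normalize`, and the label Revised are hypothetical, and the sketch tracks only keys, not positions.

```python
def normalize(label: str) -> str:
    """Lowercase a label, drop colons, and collapse whitespace."""
    return " ".join(label.replace(":", " ").lower().split())


class MappingDictionary:
    """Hypothetical store of label-to-field mappings learned from reconciliation."""

    def __init__(self) -> None:
        self._key_to_field = {}

    def record_reconciliation(self, confirmed: dict) -> None:
        """Store each label a user confirmed, keyed by its normalized form."""
        for label, field in confirmed.items():
            self._key_to_field[normalize(label)] = field

    def suggest(self, extracted: dict) -> tuple:
        """Map extracted labels to fields; return (mapped, unmapped labels)."""
        mapped = {}
        unmapped = []
        for label, value in extracted.items():
            field = self._key_to_field.get(normalize(label))
            if field is None:
                unmapped.append(label)   # left for the user to map in reconciliation
            else:
                mapped[field] = value
        return mapped, unmapped


mappings = MappingDictionary()
mappings.record_reconciliation({"Updated:": "changed"})

# A later document uses a known label and a new one (hypothetical).
mapped, unmapped = mappings.suggest({"updated": True, "Revised": False})
```

Here `mapped` holds the value for the `changed` field and `unmapped` lists `Revised`. Once a user confirms that `Revised` also maps to `changed`, later documents with either label are mapped automatically, which is why each completed task makes the next one faster.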
Input: Identified and mapped checkboxes and tables from step 2
Output: Auto-extracted fields to Appian process model for use in your application
Finally, a user completes a reconciliation task to confirm that the mappings from step 2 are correct. When a user maps data to a field and submits the reconciliation task, Appian stores the label for the key that was mapped. For example, if you provide mappings, Appian will recognize that differently worded labels, such as Updated, map to the same changed Appian checkbox field.
Reconciliation doesn't impact how Appian learns about your text values because these are extracted with custom ML models. However, reconciliation can be helpful for you to correct any text values that weren't extracted correctly.
Reconciliation helps Appian manage variations in table and checkbox values in semi-structured and structured forms. In this way, reconciliation helps document extraction learn more about your data. Appian learns more about your text values by training the document extraction AI skill with a diverse and representative dataset.
As you complete reconciliation tasks, data mapping in step 2 improves because Appian can match the keys to more options. However, the model in step 1 is not retrained when you submit a reconciliation task with updated mappings. This means that if the ML model misses a field in step 1, Appian will continue to miss that field in step 2, so for some forms auto-extraction will not return all of the data you need. In these situations, customers can use manual extraction in step 3 to capture the remaining information.
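To make this limitation concrete, here is a small sketch under simplifying assumptions: labels are matched exactly, and `fields_needing_manual_extraction` and all field names are hypothetical, not part of Appian. It shows that a learned mapping only helps when the step 1 model actually returns the label.

```python
def fields_needing_manual_extraction(required_fields, model_labels, key_to_field):
    """Return required fields that auto-extraction cannot fill.

    required_fields: fields the application expects (hypothetical names)
    model_labels:    labels the step 1 model returned for this document
    key_to_field:    learned label-to-field mappings from past reconciliation
    """
    auto_filled = {
        key_to_field[label] for label in model_labels if label in key_to_field
    }
    return [field for field in required_fields if field not in auto_filled]


learned = {"Invoice No": "invoiceNumber", "Total": "total", "Due": "dueDate"}

# The model missed the "Due" label, so the learned mapping cannot be applied.
remaining = fields_needing_manual_extraction(
    ["invoiceNumber", "total", "dueDate"],
    ["Invoice No", "Total"],
    learned,
)
```

In this sketch `remaining` contains only `dueDate`: the mapping for it exists, but because the label never appears in the model output, the field still has to be captured manually.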
Document extraction is a powerful tool to use in your business, but before you put in the work to create your own process, think about what you want to do. For example:
If you want the ability to customize these aspects of the document extraction process, like how the data moves post-extraction or who corrects results, you can build your own document extraction process using the document extraction AI skill and related smart services in the Process Modeler.
We want everyone to have access to the power of automation, so we're offering Appian Cloud customers 20,000 pages of document extraction and classification per month included with the platform. This is substantially more than the free offerings of other document extraction vendors.
If your business processes a higher volume of documents, reach out to your account executive to learn about additional pricing options.
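To judge whether your volume fits within the included pages, a quick estimate like the one below may help. The function name and the sample volumes are hypothetical, and the sketch simply multiplies documents by average pages; it does not reflect how Appian meters usage.

```python
import math

INCLUDED_PAGES_PER_MONTH = 20_000  # included allotment stated above


def monthly_page_usage(documents_per_month: int, avg_pages_per_document: float):
    """Return (estimated pages, pages beyond the included allotment)."""
    total_pages = math.ceil(documents_per_month * avg_pages_per_document)
    overage = max(0, total_pages - INCLUDED_PAGES_PER_MONTH)
    return total_pages, overage


# Hypothetical volumes for illustration only.
within = monthly_page_usage(3_000, 4)   # 12,000 pages, no overage
beyond = monthly_page_usage(6_000, 4)   # 24,000 pages, 4,000 over
```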
How Document Extraction Works