Appian Document Extraction

Overview

Appian's new suite of document extraction features makes it easy to extract text and data from documents. Appian auto-generates a form so that a person can confirm or correct the extraction, which provides human-in-the-loop validation of the automated results. The user's input then trains the extraction to perform better over time.

doc_extraction_overview.png

Document extraction identifies the contents of fields in forms (key/value pairs) from PDF files. For example, you can extract data from invoices by identifying the field names and values that are paired together, such as Invoice Number and INV-12.

Appian maps each extracted key to a field in a data type. Appian also auto-generates a form for human-in-the-loop validation of the automated extraction results. After this manual reconciliation of the extracted data, Appian stores and recalls the mapping of the extracted key to an Appian field. This means that, over time, Appian learns to extract more information automatically.

We think you are going to get so much benefit out of Appian Document Extraction that we're giving you access to a pre-built application called Intelligent Document Processing (IDP). IDP takes advantage of all of the document extraction features, plus adds the capability to automatically classify documents, monitor the performance, and securely process documents across multiple teams. Check out the Intelligent Document Processing Documentation for more information.

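This page walks you through how to use the document extraction features together. For more details about the individual smart services and functions, see the documentation for the Start Doc Extraction Smart Service, the Reconcile Doc Extraction Smart Service, a!docExtractionStatus(), and a!docExtractionResult().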

Benefits

Many companies currently extract data from their documents manually or use expensive, high-maintenance optical character recognition (OCR) software.

Appian's document extraction overcomes these challenges by using flexible AI to analyze documents and accurately extract text and data. Once the information is captured, you can take action on it within your Appian applications.

Common use cases for document extraction include:

  • Perform invoice processing: With this feature, you can quickly and accurately identify key data points like invoice number, date, total, and supplier within your invoices.
  • Find identifying information: Extract fields that help identify a customer or account. This provides valuable insight and a more holistic view of your records. For example: extract a claim number for claims processing.

Selecting document types to process

Before you get started, it is important to understand what kind of documents work best for data extraction.

Use supported files

Appian Document Extraction supports files that meet the following standards:

  • File type supported: PDF
  • Maximum file size: 15 pages or 7 MB
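
As a rough illustration, a pre-check along the following lines could screen out unsupported files before they reach the extraction process. This is a Python sketch, not part of Appian. The function name is_supported_document, its parameters, the reading of the limit as "at most 15 pages and at most 7 MB", and the use of 1 MB = 1,048,576 bytes are all assumptions made for the example.

```python
MAX_PAGES = 15
MAX_BYTES = 7 * 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes


def is_supported_document(filename: str, size_bytes: int, page_count: int) -> bool:
    """Hypothetical helper: True if the file is a PDF within both limits.

    Assumes a file must satisfy the page limit AND the size limit.
    """
    is_pdf = filename.lower().endswith(".pdf")
    return is_pdf and size_bytes <= MAX_BYTES and page_count <= MAX_PAGES
```

The page count is passed in rather than computed, since reading it from a PDF would require a PDF library that this sketch does not assume.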

Use forms with similar information

IDP works best with forms that tend to contain similar information in each document. For example, almost all invoices will have an invoice number and total and almost all purchase orders will have a PO number and purchaser.

invoice examples

Use forms with clearly labeled values

IDP also works best with forms that have clearly labeled values to help extract data from the document.

For example, the following form contains numerous labels and values associated with those labels, such as the label INVOICE and the value 101 and the label DATE and the value MAR 20, 2020.

Purchase order good example

Don't use documents with big blocks of text

Since Appian Document Extraction is meant to extract labels and values, extracting paragraphs of text is not a good use case for it. If you need to extract paragraphs of text, try using the Google Cloud Vision Connected System.

Paragraph bad example

Likewise, if you need to find specific information in text, such as footnotes in a document, you will be better served by the Google Cloud Vision Connected System along with expressions to analyze the output.

Footnotes bad example

Getting started

The instructions below provide an overview of how to set up and create a simple document extraction process.

Add your credentials to the Administration Console

Appian document extraction uses Google Cloud Document AI. You must set up a Google Service Account in the Administration Console > Document Extraction section for authentication.

Create your data type

When creating your data type for your document extraction process, take the following notes into consideration:

  • Look for a document that is representative of the documents you want to process and decide which fields you'll want to extract. For example: you may have an invoice document with key data points like customer name, customer identifier, address, and phone number.
  • Create a custom data type (CDT) to match those fields. Use matching field names where possible, as this will improve the initial performance of the extraction.
    • For example: create an Invoice data type with fields like customerName, customerNumber, billingAddress, phoneNumber.
    • Currently, only non-array text fields are supported. Unsupported fields:
      • Will not be populated by automated extraction.
      • Will appear in the reconcile interface as "field is not currently supported."
      • Will not be populated on submission of the reconcile interface.
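
To make the non-array text rule concrete, the following Python sketch separates the fields of a data type into supported and unsupported groups. It is an illustration only and does not use any Appian API. The FieldSpec representation, the split_fields helper, the type names "Text" and "Decimal" as written here, and the lineItems and invoiceTotal fields are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical description of a data type: field name -> (type name, is_array).
FieldSpec = Tuple[str, bool]


def split_fields(fields: Dict[str, FieldSpec]) -> Tuple[List[str], List[str]]:
    """Return (supported, unsupported) field names under the non-array text rule."""
    supported: List[str] = []
    unsupported: List[str] = []
    for name, (type_name, is_array) in fields.items():
        if type_name == "Text" and not is_array:
            supported.append(name)
        else:
            unsupported.append(name)
    return supported, unsupported


invoice_fields = {
    "customerName": ("Text", False),
    "customerNumber": ("Text", False),
    "lineItems": ("Text", True),         # array field: not supported
    "invoiceTotal": ("Decimal", False),  # non-text field: not supported
}
supported, unsupported = split_fields(invoice_fields)
```

A check like this can help you spot, at design time, which fields a user would have to fill in outside the reconcile interface.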

Build an end-to-end process

To configure a document extraction process:

  1. Create a process model.
  2. Drag in a Start Doc Extraction Smart Service node - pass in your document and save the identifier for that document extraction run in a process variable.
    • Input: Runtime Document
    • Output: Doc Extraction Id
  3. Drag in a Script Task node and use the a!docExtractionStatus() function, which returns the status of the document extraction run as text.
    • Input: docExtractionId
    • Output: status
  4. Drag in an XOR gateway - check the status that was returned from the a!docExtractionStatus() function.
    • COMPLETE: Analysis is done and Appian has downloaded the results. You can now proceed to extract results using a!docExtractionResult() or the Reconcile Doc Extraction Smart Service.
    • IN PROGRESS: Analysis is still in progress. Appian recommends checking again in 60 seconds.
    • INVALID ID: Check that the provided ID is valid or start a new doc extraction run on the same document.
    • ERROR: An internal evaluation error occurred during processing.
  5. Drag in a Timer Event and configure it to wait one minute before the process checks the status again while the run is still IN PROGRESS.
  6. Drag in a Script Task node and use the a!docExtractionResult() function. Store the populated CDT with the extracted data in a process variable.
    • Input: docExtractionId, typeNumber, confidenceThreshold
    • Output: extractedData
    • NOTE: You do not need to use both a!docExtractionResult and Reconcile Doc Extraction Smart Service. Following this node, you could add logic to check if the data you need is populated. If not, you can route it to the Reconcile Doc Extraction node. For example: isnull(index(pv!invoice,"invoiceNumber",null)) could result in going to reconciliation, but a populated result would skip the reconciliation task.
  7. Drag in a Reconcile Doc Extraction Smart Service node - this node generates a task for a user to validate the extracted data.
    • Input: Doc Extraction Id, Data Type Number
    • Output: extractedData, isSubmit, isException
    • NOTE: On submission of the task, the values are populated into the output data type. However, if you return to the reconcile interface for the same doc extraction run, you will only see the auto-mapped data.
  8. (Optional) Route after Reconciliation.
    • Use isSubmit and isException to route the process model.
    • For example: You can use isException=true() to route to a chained user input task, where the user provides more information. The user can also classify what is wrong, for example: document is missing information, incorrect document type.
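
The control flow of the steps above (check the status, wait while the run is in progress, then decide whether reconciliation is needed) can be sketched in Python as follows. This is not Appian code and does not call any Appian function. The get_status callable stands in for whatever returns the run's status, and wait_for_extraction, next_step, max_checks, and the returned route names are hypothetical. Treating an empty string the same as a missing value is also an assumption.

```python
import time
from typing import Callable, Optional


def wait_for_extraction(
    get_status: Callable[[], str],
    sleep: Callable[[float], None] = time.sleep,
    wait_seconds: float = 60,
    max_checks: int = 10,
) -> str:
    """Poll until the status is no longer IN PROGRESS or max_checks is reached."""
    status = get_status()
    checks = 1
    while status == "IN PROGRESS" and checks < max_checks:
        sleep(wait_seconds)  # plays the role of the Timer Event
        status = get_status()
        checks += 1
    return status


def next_step(status: str, invoice_number: Optional[str]) -> str:
    """Pick a route, mirroring the XOR gateway and the optional null check."""
    if status != "COMPLETE":
        return "handle " + status
    if invoice_number is None or invoice_number == "":
        return "reconcile"
    return "skip reconciliation"


# Simulated run: two IN PROGRESS checks, then COMPLETE. No real waiting occurs
# because the sleep function is replaced by a list append.
statuses = iter(["IN PROGRESS", "IN PROGRESS", "COMPLETE"])
waits = []
final_status = wait_for_extraction(lambda: next(statuses), sleep=waits.append)
```

Passing the sleep function as a parameter keeps the sketch testable without waiting a full minute per check. In a process model, the same loop is expressed with the XOR gateway and the Timer Event rather than in code.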

doc_extraction_overview_process_model.png

Test extraction with example documents

After creating your process model, run it with a few samples to test extraction and to see how your auto-extracted results change.

Extract data using auto-mapping

To reduce the document processing time, you will want Appian Document Extraction to auto-extract as much document data as possible. To start, Appian uses the field names from the data type to find a match. Then, Appian learns how to map your data to your data type fields from the user interactions with the reconcile interface. For example, if users provide mappings, then eventually Appian Document Extraction will recognize that Invoice Number, Invoice #, and Invoice No. all map to the invoiceNumber Appian data type field.
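
The idea of starting from field names and then adding learned mappings can be illustrated with a small Python sketch. It is a simplified model for intuition only and is not how Appian implements auto-mapping. The KeyMapper class, the normalize function, and the normalization rule (lowercase, keep only letters, digits, and "#") are assumptions made for this example.

```python
import re
from typing import Dict, List, Optional


def normalize(label: str) -> str:
    """Lowercase a label and drop everything except letters, digits, and '#'."""
    return re.sub(r"[^a-z0-9#]", "", label.lower())


class KeyMapper:
    """Toy model: maps keys found in documents to data type field names."""

    def __init__(self, field_names: List[str]) -> None:
        # Initial matches come from the data type's own field names.
        self._by_key: Dict[str, str] = {normalize(f): f for f in field_names}

    def lookup(self, extracted_key: str) -> Optional[str]:
        """Return the mapped field name, or None if the key is unknown."""
        return self._by_key.get(normalize(extracted_key))

    def learn(self, extracted_key: str, field_name: str) -> None:
        """Record a mapping that a user confirmed during reconciliation."""
        self._by_key[normalize(extracted_key)] = field_name


mapper = KeyMapper(["invoiceNumber", "total"])
by_name = mapper.lookup("Invoice Number")     # matched from the field name alone
before_learning = mapper.lookup("Invoice #")  # unknown until a user maps it
mapper.learn("Invoice #", "invoiceNumber")
mapper.learn("Invoice No.", "invoiceNumber")
after_learning = mapper.lookup("Invoice #")
```

In this model, the learned entries live inside one mapper built for one list of fields, which loosely mirrors the note that learned mappings depend on the data type's fields.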

Once you set up your end-to-end process, Appian recommends testing the process with a series of sample documents to see how Appian learns to auto-map.

Some important notes about learned auto-mapping behavior:

  • Learned mappings depend on the data type's fields. If you change the fields or create a new data type, Appian will not use the previously learned mappings.
  • Learning happens independently in each environment. To learn more, see Deployment considerations.

Deployment considerations

When deploying your application to another environment, consider the following notes:

  • Learning happens independently in each environment.
  • Learned mappings do not move between environments.

This means you may see different behavior between environments for auto-extraction depending on which documents have been processed and how they have been reconciled by users.

How to use the reconcile interface

The reconcile interface is a task that displays auto-mapped results and allows a user to verify and update extracted data. Before using this interface, you should have started document extraction and checked that the status is complete.

When a user clicks on a task, they may see something like this:

  1. Title
  2. Fields from the provided data type. For example: invoiceNumber is shown as Invoice Number
  3. Auto-mapped field. This is denoted by the magic wand icon.
  4. This field was not auto-mapped.
  5. The provided document.
  6. Paging controls
  7. Identified values, outlined by a box.
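
The way a field name such as invoiceNumber is shown as Invoice Number can be approximated with a short Python sketch. The field_label function and its splitting rule are assumptions for illustration. Appian's actual label logic is not described on this page and may handle cases such as acronyms differently.

```python
import re


def field_label(field_name: str) -> str:
    """Hypothetical helper: turn a camelCase field name into a display label."""
    # Insert a space wherever a lowercase letter or digit is followed by an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", field_name)
    # Capitalize only the first character and leave the rest unchanged.
    return spaced[:1].upper() + spaced[1:]


label = field_label("invoiceNumber")
```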

How to see where the values came from

  • Click or tab into a field, and the document reconciliation viewer will automatically highlight the corresponding value in the document.

How to populate a field using the extracted value

  • Click the box in the image, and then click the arrow icon.
  • The link icon indicates that you have selected a value from the document.

Also, you can select a field and then the value:

  • Click a blank field and click the box in the document reconciliation viewer.
  • The link icon indicates that you have selected a value from the document.