Appian Document Extraction

Overview

Appian's new suite of document extraction features makes it easy to extract text and data from documents. Appian auto-generates a form where a person can confirm or correct the extraction results, providing human-in-the-loop validation. The user's input then trains the extraction to perform better over time.

doc_extraction_overview.png

Document extraction identifies the contents of fields in forms (key/value pairs) from PDF files. For example, you can extract data from invoices by identifying the field names and values that are paired together, such as Invoice Number and INV-12.

Appian maps each extracted key to a field in a data type. After a user manually reconciles the extracted data in the auto-generated form, Appian stores and recalls the mapping of the extracted key to an Appian field. This means that, over time, Appian learns to extract more information automatically.

We think you are going to get so much benefit out of Appian Document Extraction that we're giving you access to a pre-built application called Intelligent Document Processing (IDP). IDP takes advantage of all of the document extraction features, plus adds the capability to automatically classify documents, monitor the performance, and securely process documents across multiple teams. Check out the Intelligent Document Processing Documentation for more information.

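This page walks you through how to use the document extraction features together. For more details about the individual smart services and functions, see the list below:

  • Start Doc Extraction Smart Service
  • a!docExtractionStatus()
  • a!docExtractionResult()
  • Reconcile Doc Extraction Smart Service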

Benefits

Many companies currently extract data from their documents manually or use expensive, high-maintenance optical character recognition (OCR) software.

Appian's document extraction overcomes these challenges by using flexible AI to analyze documents and accurately extract text and data. Once the information is captured, you can take action on it within your Appian applications.

Common use cases for document extraction include:

  • Perform invoice processing: With this feature, you can quickly and accurately identify key data points like invoice number, date, total, and supplier within your invoices.
  • Find identifying information: Extract fields that help identify a customer or account. This provides valuable insight and a more holistic view of your records. For example: extract a claim number for claims processing.

Selecting document types to process

Before you get started, it is important to understand what kind of documents work best for data extraction.

Use supported files

Appian Document Extraction supports files that meet the following standards:

  • File type supported: PDF
  • Maximum file size: 15 pages or 7 MB
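
If you collect documents outside Appian before sending them for extraction, a quick pre-check against these limits can save a failed run. The following Python sketch is only an illustration and is not part of Appian. The CandidateFile type and unsupported_reasons function are hypothetical names, and the sketch assumes that both limits must hold and that 7 MB means 7 * 1024 * 1024 bytes.

```python
from dataclasses import dataclass

# Limits taken from the list above. Treating "7 MB" as 7 * 1024 * 1024 bytes
# and requiring BOTH limits to hold are assumptions of this sketch.
MAX_PAGES = 15
MAX_BYTES = 7 * 1024 * 1024


@dataclass
class CandidateFile:
    """Hypothetical description of a file you plan to send for extraction."""
    name: str
    size_bytes: int
    page_count: int


def unsupported_reasons(doc: CandidateFile) -> list[str]:
    """Return the reasons a file falls outside the documented limits."""
    reasons = []
    if not doc.name.lower().endswith(".pdf"):
        reasons.append("not a PDF")
    if doc.page_count > MAX_PAGES:
        reasons.append("more than 15 pages")
    if doc.size_bytes > MAX_BYTES:
        reasons.append("larger than 7 MB")
    return reasons
```

An empty list means the file is within the documented limits. Any other result lists each limit the file exceeds.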

Use forms with similar information

Document extraction works best with forms that tend to contain similar information in each document. For example, almost all invoices will have an invoice number and total, and almost all purchase orders will have a PO number and purchaser.

invoice examples

Use forms with clearly labeled values

Document extraction also works best with forms that have clearly labeled values to help extract data from the document.

For example, the following form contains numerous labels and values associated with those labels, such as the label INVOICE and the value 101 and the label DATE and the value MAR 20, 2020.

Purchase order good example

Don't use documents with big blocks of text

Since Appian Document Extraction is meant to extract labels and values, extracting paragraphs of text is not a good use case for it. If you need to extract paragraphs of text, try using the Google Cloud Vision Connected System.

Paragraph bad example

Likewise, if you need to find specific information in text, such as footnotes in a document, you will be better served by the Google Cloud Vision Connected System along with expressions to analyze the output.

Footnotes bad example

Getting Started

The instructions below provide an overview of how to set up and create a simple document extraction process.

Add your credentials to the Administration Console

Appian document extraction uses Google Cloud Document AI. You must set up a Google Service Account in the Administration Console > Document Extraction section for authentication.

Create your data type

When creating your data type for your document extraction process, take the following notes into consideration:

  • Look for a document that is representative of the documents you want to process and decide which fields you'll want to extract. For example: you may have an invoice document with key data points like customer name, customer identifier, address, and phone number.
  • Create a custom data type (CDT) to match those fields. Use matching field names where possible, as this will improve the initial performance of the extraction.
    • For example: create an Invoice data type with fields like customerName, customerNumber, billingAddress, phoneNumber.
    • Currently, only non-array text fields are supported. Unsupported fields:
      • Will not be populated by automated extraction.
      • Will appear in the reconcile interface with the message "field is not currently supported".
      • Will not be populated on submission of the reconcile interface.
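
To check a planned data type against this rule before you build it, you can list which fields would be populated. The Python sketch below is an illustration only and does not use any Appian API. The field table is a hypothetical description of an Invoice data type, and the lineItems and invoiceTotal fields are invented to show the two unsupported cases.

```python
# Hypothetical description of an Invoice data type:
# field name -> (field type, is_array).
INVOICE_FIELDS = {
    "customerName": ("Text", False),
    "customerNumber": ("Text", False),
    "billingAddress": ("Text", False),
    "phoneNumber": ("Text", False),
    "lineItems": ("Text", True),         # array field: not populated
    "invoiceTotal": ("Decimal", False),  # non-text field: not populated
}


def split_fields(fields):
    """Separate non-array text fields from fields extraction will skip."""
    supported = [
        name
        for name, (field_type, is_array) in fields.items()
        if field_type == "Text" and not is_array
    ]
    unsupported = [name for name in fields if name not in supported]
    return supported, unsupported
```

Here split_fields(INVOICE_FIELDS) places the four text fields in the first list and lineItems and invoiceTotal in the second.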

Build an end to end process

To configure a document extraction process:

  1. Create a process model.
  2. Drag in a Start Doc Extraction Smart Service node - pass in your document and save the identifier for that document extraction run in a process variable.
    • Input: Runtime Document
    • Output: Doc Extraction Id
  3. Drag in a Script Task node and use the a!docExtractionStatus() function, which returns the status of the document extraction run as text.
    • Input: docExtractionId
    • Output: status
  4. Drag in an XOR gateway - check the status that was returned from the a!docExtractionStatus() function.
    • COMPLETE: Analysis is done and Appian has downloaded the results. You can now proceed to extract results using a!docExtractionResult() or the Reconcile Doc Extraction Smart Service.
    • IN PROGRESS: Analysis is still in progress. Appian recommends checking again in 60 seconds.
    • INVALID ID: Check that the provided ID is valid or start a new doc extraction run on the same document.
    • ERROR: An internal evaluation error occurred during processing.
  5. Drag in a Timer Event and configure it to wait one minute before checking the status again.
  6. Drag in a Script Task node and use the a!docExtractionResult() function - store the populated CDT with the extracted data in a process variable.
    • Input: docExtractionId, typeNumber, confidenceThreshold
    • Output: extractedData
    • NOTE: You do not need to use both a!docExtractionResult and Reconcile Doc Extraction Smart Service. Following this node, you could add logic to check if the data you need is populated. If not, you can route it to the Reconcile Doc Extraction node. For example: isnull(index(pv!invoice,"invoiceNumber",null)) could result in going to reconciliation, but a populated result would skip the reconciliation task.
  7. Drag in a Reconcile Doc Extraction Smart Service node - this node generates a task for a user to validate the extracted data.
    • Input: Doc Extraction Id, Data Type Number
    • Output: extractedData, isSubmit, isException
    • NOTE: On submission of the task, the values are populated into the output data type. However, if you return to the reconcile interface for the same doc extraction run, you will only see the auto-mapped data.
  8. (Optional) Route after Reconciliation.
    • Use isSubmit and isException to route the process model.
    • For example: You can use isException=true() to route to a chained user input task, where the user provides more information. The user can also classify what is wrong, for example: document is missing information, incorrect document type.
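
The control flow of the steps above can be modeled outside Appian to make the routing easier to see. The Python sketch below is an illustration only. It does not call Appian, and the function names, the max_checks limit, and the route labels are hypothetical. The get_status argument stands in for whatever returns the status text in step 3.

```python
import time
from typing import Callable

# Status strings as listed in step 4.
COMPLETE = "COMPLETE"
IN_PROGRESS = "IN PROGRESS"
INVALID_ID = "INVALID ID"
ERROR = "ERROR"


def wait_for_extraction(
    get_status: Callable[[str], str],
    doc_extraction_id: str,
    wait_seconds: float = 60,
    max_checks: int = 30,  # hypothetical safety limit, not an Appian setting
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Steps 3 to 5: check the status, waiting between checks while in progress."""
    for _ in range(max_checks):
        status = get_status(doc_extraction_id)
        if status != IN_PROGRESS:
            return status  # COMPLETE, INVALID ID, or ERROR
        sleep(wait_seconds)
    return IN_PROGRESS  # still running after max_checks


def route_on_status(status: str) -> str:
    """Step 4: the XOR gateway decision, expressed as a route label."""
    if status == COMPLETE:
        return "get results"
    if status == IN_PROGRESS:
        return "wait and check again"
    return "handle failure"  # INVALID ID or ERROR


def needs_reconciliation(extracted: dict, required_fields: list[str]) -> bool:
    """Step 6 note: send to reconciliation only when a required field is empty."""
    return any(extracted.get(name) in (None, "") for name in required_fields)
```

For example, needs_reconciliation({"invoiceNumber": None}, ["invoiceNumber"]) returns True, which corresponds to routing the document to the Reconcile Doc Extraction node.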

doc_extraction_overview_process_model.png

Test extraction with example documents

After creating your process model, run it with a few sample documents to test extraction and to see how your auto-extracted results change.

Extract data using auto-mapping

To reduce the document processing time, you will want Appian Document Extraction to auto-extract as much document data as possible. To start, Appian uses the field names from the data type to find a match. Then, Appian learns how to map your data to your data type fields from the user interactions with the reconcile interface. For example, if users provide mappings, then eventually Appian Document Extraction will recognize that Invoice Number, Invoice #, and Invoice No. all map to the invoiceNumber Appian data type field.
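
The two stages described above, matching on field names first and then on learned mappings, can be pictured with a small model. This page does not describe how Appian implements matching, so the Python sketch below is a conceptual illustration under assumed rules, not Appian's algorithm. The normalize, map_label, and learn functions are hypothetical.

```python
import re


def normalize(label: str) -> str:
    """Lowercase a label and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def map_label(label: str, field_names: list[str], learned: dict[str, str]):
    """Return the data type field for a document label, or None if unknown."""
    # Stage 1: compare the label to the data type's field names.
    for field in field_names:
        if normalize(field) == normalize(label):
            return field
    # Stage 2: fall back to mappings learned from reconciliation.
    return learned.get(label.strip().lower())


def learn(label: str, field: str, learned: dict[str, str]) -> None:
    """Record that a user mapped this document label to this field."""
    learned[label.strip().lower()] = field
```

In this model, "Invoice Number" matches the invoiceNumber field by name alone, while "Invoice #" is only mapped after a user has provided that mapping once.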

Once you set up your end-to-end process, Appian recommends testing the process with a series of sample documents to see how Appian learns to auto-map.

Some important notes about learned auto-mapping behavior:

  • Learned mappings are dependent on the data type's fields, so if you change the fields or make a new data type, Appian will not use the learned mappings.
  • Learning happens independently in each environment. To learn more, see deployment considerations.

Deployment considerations

When deploying your application to another environment, consider the following notes:

  • Learning happens independently in each environment.
  • Learned mappings do not move between environments.

This means you may see different behavior between environments for auto-extraction depending on which documents have been processed and how they have been reconciled by users.

Completing a reconciliation task

After the Start Doc Extraction Smart Service extracts data from a document, the Reconcile Doc Extraction Smart Service generates a task for a user to validate the data.

To complete this task, users compare the data that was extracted to an image of the uploaded document. They can use the information that displays in the document preview to update any incorrect or missing information.

To complete the reconciliation task:

  1. On the left side of the page, review the information in the fields. Use the document preview on the right to verify the accuracy of the data.
    • Note: To see where the information in the fields came from, select the field and the value is automatically highlighted in the document preview.

  2. If any information is missing, you can populate the information in three ways:

    • Place your cursor in the field, then click the box that surrounds the desired value.

    • Click the box that surrounds the desired value in the document preview on the right, then click the arrow next to the field to populate the field.

    • To select text that was not automatically extracted, press and hold the Shift key while dragging the mouse.

    Note: Values selected from the document preview will improve data extraction results. Values entered manually will not. For example, if you select the value for a PO number from the document preview in two different documents, Appian can learn that PO No. and PO # both mean PO Number. When you have the option, select correctly labeled values from the document preview instead of entering them manually.

  3. After all fields are verified and populated, click RECONCILE.

While you are reconciling the data, icons indicate how the information was entered for each field:

  • No icon: Value was entered manually.
  • Magic wand icon: Value was entered automatically during data extraction.
  • Link icon: Value was selected from the document preview.
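
The icons also show which entries can improve future extraction, because only values selected from the document preview do so. The Python sketch below models that rule. It is an illustration only: the EntrySource enum and mappings_to_learn function are hypothetical, and treating auto-extracted values as adding no new mapping is an assumption.

```python
from enum import Enum


class EntrySource(Enum):
    """How a field value was entered, keyed to the icon shown in the task."""
    MANUAL = "no icon"
    AUTO_EXTRACTED = "magic wand icon"
    SELECTED_FROM_PREVIEW = "link icon"


def mappings_to_learn(entries):
    """Return label -> field for values selected from the document preview.

    entries is a list of (field, document_label, source) tuples. Manual
    entries are skipped because they do not improve extraction results.
    """
    return {
        label: field
        for field, label, source in entries
        if source is EntrySource.SELECTED_FROM_PREVIEW
    }
```

Given one manual entry and one preview selection for "PO #", only the preview selection appears in the result.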

user_guide_reconciliation_icons
