Intelligent Document Processing Overview

Introduction

Up to this point, companies that needed to extract data from documents and forms had two options: slow, labor-intensive manual entry or outdated, hard-to-customize optical character recognition software.

But if you have Appian, you have another option: Appian's Intelligent Document Processing (IDP) application. IDP builds on Appian Document Extraction, powered by Appian AI, so you can start extracting data from forms automatically and quickly. Not only that, we've made it available automatically, as part of the platform.

What does it do? The IDP application uses machine learning and artificial intelligence to quickly extract data from forms for use in your Appian applications. It even gets smarter and better the more you use it.

Read on to get an overview of the IDP application, including how it works, how the data moves in IDP, and some recommendations on how to get started with the application.

Overview of IDP

How does it work?

The IDP application uses Appian Document Extraction to transform unstructured data from PDF documents into structured data. This data can then be stored in your database to be used by your applications. In addition, IDP adds an initial classification step that identifies your documents using machine learning services, which are available at no additional cost through Appian AI.

It works best with fairly standard forms, such as invoices, that tend to contain similar information in each document. For example, almost all invoices have an invoice number and total, and almost all purchase orders have a PO number and purchaser. IDP doesn't require these forms to be standardized; it only requires them to contain similar information. See Appian Document Extraction for more information on what types of documents work best for data extraction.

Invoice examples

Processing documents in IDP consists of two steps: classification and extraction.

Document type classification

After you upload a PDF file, IDP automatically classifies the document as a specific document type. If it isn't quite sure how to classify the document, a task is created for a user to classify it manually. With Google Cloud AutoML Natural Language, IDP learns from these manual classifications and becomes better at classifying documents over time.
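The routing decision described above can be sketched as follows. This is only an illustration, not IDP's actual implementation: the `route_classification` function, the `ClassificationResult` type, and the 0.8 confidence cutoff are all hypothetical, since this page does not state how IDP decides that it "isn't quite sure."

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical cutoff: the documentation does not publish the real value.
CONFIDENCE_THRESHOLD = 0.8


@dataclass
class ClassificationResult:
    doc_type: Optional[str]   # None when a user has to decide
    needs_manual_task: bool   # True when a classification task should be created


def route_classification(scores: Dict[str, float],
                         threshold: float = CONFIDENCE_THRESHOLD) -> ClassificationResult:
    """Accept the top-scoring document type, or fall back to a manual task."""
    if not scores:
        # No prediction at all: a user must classify the document.
        return ClassificationResult(None, True)
    best_type, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score >= threshold:
        return ClassificationResult(best_type, False)
    # The model is not confident enough, so hand the document to a user.
    return ClassificationResult(None, True)
```

In this sketch, a document scored `{"Invoice": 0.95, "Purchase Order": 0.03}` is classified automatically, while one scored `{"Invoice": 0.5, "Receipt": 0.45}` produces a manual task. The user's answer would then become a new labeled example for the model.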

IDP classification

Data extraction

After the document's type is determined through classification, IDP takes advantage of Appian Document Extraction to extract certain fields from the document. For example, for an invoice, IDP extracts the Invoice Number, Invoice Date, Total, and Supplier.
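To make "structured data" concrete, here is a minimal sketch of how the four invoice fields might look once they are typed. The `InvoiceRecord` class and `to_invoice_record` function are hypothetical names, and the sketch assumes ISO dates (YYYY-MM-DD) and totals written as plain decimals with optional comma separators. Real documents vary, and IDP's own storage format is not described on this page.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import Dict


@dataclass(frozen=True)
class InvoiceRecord:
    invoice_number: str
    invoice_date: date
    total: Decimal
    supplier: str


def to_invoice_record(fields: Dict[str, str]) -> InvoiceRecord:
    """Convert extracted text values into typed values.

    Assumes ISO dates and decimal totals with optional thousands commas.
    """
    return InvoiceRecord(
        invoice_number=fields["Invoice Number"].strip(),
        invoice_date=date.fromisoformat(fields["Invoice Date"].strip()),
        # Decimal avoids the rounding surprises of float for money amounts.
        total=Decimal(fields["Total"].replace(",", "").strip()),
        supplier=fields["Supplier"].strip(),
    )
```

A value that cannot be parsed raises an exception in this sketch. In IDP, catching such problems is the job of the reconciliation task described next.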

After the data is extracted, a task is created for a user to verify the data and make any necessary corrections. This is referred to as reconciliation. This process improves over time because Appian Document Extraction learns the different ways the fields can be displayed. For example, it will learn that PO No. and PO # both mean PO Number.
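One simple way to picture this learning is a lookup table that grows with each reconciliation. The `LabelMapper` class below is a hypothetical sketch, not how Appian Document Extraction is implemented. It only remembers exact labels (ignoring case, surrounding whitespace, and a trailing colon) that a user has already mapped to a field.

```python
from typing import Dict, Optional


class LabelMapper:
    """Remembers which document labels users have mapped to which fields."""

    def __init__(self) -> None:
        self._aliases: Dict[str, str] = {}

    @staticmethod
    def _normalize(label: str) -> str:
        # Ignore case, surrounding whitespace, and a trailing colon.
        return label.strip().rstrip(":").strip().lower()

    def learn(self, label: str, field: str) -> None:
        """Record a mapping confirmed by a user during reconciliation."""
        self._aliases[self._normalize(label)] = field

    def suggest(self, label: str) -> Optional[str]:
        """Return the field previously mapped to this label, if any."""
        return self._aliases.get(self._normalize(label))


mapper = LabelMapper()
mapper.learn("PO No.", "PO Number")
mapper.learn("PO #", "PO Number")
```

After these two reconciliations, `mapper.suggest("po #:")` returns `"PO Number"`, while a label that was never seen, such as `"Order Ref"`, returns `None` and would still need a user to map it.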

IDP reconciliation

Finally, the once-unstructured data is now structured data, stored in a database for further use in applications.

The following graphic illustrates how IDP turns unstructured data into structured data, by first classifying the data and improving the classification model, then extracting the data and improving the results of the extraction.

IDP flowchart

How does my data move in IDP?

Data security is important. We want to make sure you understand where your data goes when you use IDP. IDP provides data privacy and protection because it secures your data with Appian as well as Google Cloud.

The following sections describe how your data moves between Appian and Google Cloud, as well as how it moves within Appian.

Classification model training

The labeled documents that are uploaded are stored in a Google Cloud storage bucket in one of the supported regions. Then the documents are converted to their text form, or "digitized," to form a dataset to train the classification model. Once the model is trained and deployed, which can take up to 24 hours, the documents and the dataset of document digitizations are deleted. The dataset (before it is deleted) and the model are stored and processed in the supported location corresponding with the region of the storage bucket.

Document type classification

The uploaded document is stored in a Google Cloud storage bucket. After the model returns a prediction, which can take up to 3 minutes, the document is deleted.

Data extraction

The uploaded document is stored in a Google Cloud storage bucket in one of the supported regions. The document is then sent to the Google Cloud Document AI API, which analyzes it for labels and values. The results of the analysis are stored as a JSON document in a Google Cloud storage bucket.

If you are using Appian AI, the uploaded document and the JSON analysis document are deleted after 24 hours. If you are not using Appian AI and you want to temporarily store the JSON analysis document, you will need to arrange the deletion of the document yourself.
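If you manage your own bucket, one possible way to arrange that deletion is a Cloud Storage object lifecycle rule. The sketch below only builds the lifecycle configuration as JSON. The `delete_after_days` helper is a hypothetical name, and whether a lifecycle rule suits your IDP setup is an assumption you should check against your own retention requirements. Note that the `age` condition is measured in whole days.

```python
import json
from typing import Any, Dict


def delete_after_days(age_days: int) -> Dict[str, Any]:
    """Build a Cloud Storage lifecycle configuration that deletes old objects.

    The rule applies to every object in the bucket it is set on.
    """
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": age_days},  # age in days since creation
            }
        ]
    }


if __name__ == "__main__":
    # Print the configuration so it can be saved to a file.
    print(json.dumps(delete_after_days(1), indent=2))
```

If you save the output as `lifecycle.json`, you can apply it with `gsutil lifecycle set lifecycle.json gs://BUCKET_NAME`, where `BUCKET_NAME` is a placeholder for your own bucket.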

The auto-mapping learning of labels and values is stored in the Appian environment. The learning happens independently in each environment. See Deployment considerations on the Appian Document Extraction page for more information.

Document reconciliation

After a user completes the task to reconcile the document content with the extracted information, the document data is written to the database. This data can be referenced in other applications as well.

How do I get started?

Setting up and using IDP

Starting with version 20.1 of Appian, if you are a new Appian customer, you are pretty much ready to go. The application is preinstalled on your cloud site. Check out Updating a pre-installed application for the few steps you will have to follow to use your Google Cloud values on your Appian instance.

If you are an on-premise customer with version 20.1 (or higher) of Appian, or if you are an existing cloud customer upgrading to version 20.1 (or higher), you will need to follow these steps to install the application.

After you have customized the application to suit your needs, run through the guided configuration to start training the classification model right away.

After your application is configured, see our Intelligent Document Processing User Guide to learn more about how to:

  • Upload documents.
  • Complete classification and reconciliation tasks.
  • View document processing status, extracted data, and metrics.

Customizing IDP

We wanted to give you the power to get started quickly. So out of the box we offer four document types that are already configured to classify and extract data: invoices, purchase orders, claims, and receipts.

Want to capture different fields from these documents or need a different document type? Appian makes it easy to extend the application to modify the fields that are being extracted for each document type or add more document types.

If your organization needs to process documents from multiple channels, you can add new document channels. For example, if you process documents from the email inboxes of both the Finance and Legal departments, each department likely uses different document types. Moreover, these documents presumably only need to be viewed by their respective teams. This would be a good use case for adding multiple document channels.

You can use IDP directly in the Intelligent Document Processing site. However, you can also use IDP in sub-processes to process documents in larger workflows. Furthermore, you can upload documents from other systems, view the status of a document being processed, and get the extracted data through prebuilt web APIs.


This version of the Intelligent Document Processing documentation was written for Appian 20.1, and does not represent the interfaces or functionality of Appian 20.2.