Create a Doc Extraction Process [Document Extraction Suite]

Create a Doc Extraction Process

Overview

In Appian, document extraction occurs within a process model.

This process model relies on a series of nodes that leverage artificial intelligence (AI) to map fields from a document to fields in a custom data type (CDT). Once the data is extracted, users can confirm or correct the results using a simple task interface. As you test your process model and reconcile results, the extraction will become smarter and more accurate over time.

This page will walk you through how to create your own document extraction process in three steps:

Construct a CDT to map and save your extraction results.
Configure a basic document extraction process model.
Test your process.

Tip: If you plan to create a document extraction process without personalized business rules or criteria, consider using the Intelligent Document Processing application that is already pre-configured to extract data and classify documents.

Prerequisites

The AI that extracts your document's data can be powered by either Google Cloud Document AI or using Appian's built-in document extraction functionality. You have the option to choose which vendor extracts your data based on your document extraction needs.

Visit the Admin Console > Document Extraction section to configure document extraction vendor availability. Here you'll see options regarding Appian's built-in services as well as Google Cloud Document AI configuration options. If you plan to use Google as your document extraction vendor, you must set up a Google Service Account in the Admin Console for authentication.

Create CDTs to store data

When extracting data, the process will identify and map key-value pairs from the document to the fields of a CDT. This CDT should be constructed to reflect the data available in your document. Therefore, it's important that your CDT fields match the data that will be extracted from your document.

Depending on the structure of your document, you may only need to create one CDT. For example, if you plan to extract data from an employee application that consists of several form fields, you'll create one CDT to match these fields.

However, if your document contains field names and a table, like a invoice document that contains a table of items, you'll ultimately create two CDTs: one that represents the document, and one that represents the table.

Create a CDT for a document

To create a CDT for your document, you'll want to create all the form fields as fields in your CDT.

Note: If you're working with a document that includes a table, do not add the table fields to this CDT. You will create a second CDT to represent the table.

To create a CDT for a document:

Create a new CDT.
Make the first field a primary key. Do not use a value that will be processed or reconciled as your primary key.
Add a new field for each label on the document.
Name the fields so that they match the form field names as closely as possible.

For example, if your document has a field called Customer Name, create a CDT field called customerName.
Set each field Type to either Boolean or Text. Note that only non-array Text is supported. Any unsupported fields will not be populated by automated extraction or on submission of the reconcile interface.
If your form contains a checkbox, create a field for every checkbox option, even mutually exclusive options.

For example, if your document has the field Membership type ☐ Basic ☐ Premium, create two fields: isMembershipBasic and isMembershipPremium.
Click SAVE CHANGES.

Create a CDT for a table

To extract and save table data from your document, you need to create a separate CDT to represent the table. After, you will nest this new CDT inside the CDT representing the document.

Although Appian doesn't recommend nesting CDTs for writing data with a one-to-many relationship, nesting your CDTs is recommended in the case of tables in document extraction.

To create a CDT for a table:

Create another CDT.
Make the first field a primary key. Do not use a value that will be processed or reconciled as the primary key.
Add a new field for each column label in the table.
In the Name field, use matching column names where possible.
In the Type field, select Text. Do not use any other data type unless it is for the primary key.
Click SAVE CHANGES.

images/doc_extraction/invoice_table_cdt.png

Next, you will need to nest this CDT inside the document CDT:

Return to the original CDT you created to represent your document.
Add a new field and set the Type to the second CDT you created to represent the table.
Mark this field as an Array so you can extract several rows of data from the table.
Click Foreign Key to configure a relationship with the table CDT.
Under Child Field Type, select the following checkboxes:
- Updates to a parent value should also update associated child value(s)
- I know the name of the column(s) this field should use in the database
In the Column Name field, type parentid in all lowercase.
Click OK.
Click Save changes.

images/doc_extraction/invoice_nested_cdt.png

Once you have created your CDTs, create a data store and use it to map your CDTs to the external database. When you publish the data store for the first time, Appian will create tables in the database. This is the database table creation pattern we recommend.

If you manually create database tables, you'll have to modify the CDT's XSD definition to map the CDTs to the corresponding tables and also manually add a parentid column to the child database table. This foreign key column is created in the child database table automatically when you publish the datastore as recommended.

Build a process model

With your CDTs and data store in place, you can now start building your end-to-end process.

The following instructions walk you through how to configure your process model and the four key nodes of a document extraction process.

As you build your process, you have the flexibility to incorporate other design objects and decisions that fit your specific business needs. See some additional process configuration options you can add to your own process model.

Define process variables

To easily pass data throughout your process, you'll want to create process variables that represent your document, vendor, CDT, extraction ID, and extraction status:

Create a process model.
From the File menu, click Properties.
In the Process Model Properties dialog, go to the Variables tab.

Create the following process variables:

Name	Type	Value	Parameter?	Required?
document	Document	Blank	Yes	Yes
vendor	Text	`Google` or `Appian`	Yes	No
docExtractionId	Number (Int)	Blank	No	No
docExtractionStatus	Text	Blank	No	No
documentCDT	The CDT that represents your document	Blank	Yes	No

Click OK.

Configure the Start Doc Extraction Smart Service

After defining your process variables, the first node to add to the process is the Start Doc Extraction Smart Service. This smart service will start the document extraction process using the vendor of your choosing.

To configure the smart service:

From the Palette, drag in a Start Doc Extraction smart service.
Open the Start Doc Extraction smart service and select the Data tab.
On the Inputs tab, configure the default inputs with the following values:
- Runtime Document is the document to process. Use pv!document as the Value to dynamically pass documents.
- Preferred Vendor is either Google or Appian. Use pv!vendor as the Value to dynamically choose which vendor to process the documents. See which vendor best suites your document extraction needs.
- Processor ID is the ID of the processor you set up in Document AI. This is an optional value for customers who wish to leverage capabilities of Google's Document AI API v1. Processor ID is not supported when Appian is selected for Preferred Vendor.
On the Outputs tab, configure Doc Extraction Id so the Target is set to pv!docExtractionId. This will store the document extraction ID produced from the smart service.

Monitor and guide the extraction status

After the extraction process has started, you need to track the progress of the extraction results. To monitor the status, configure a Script Task to use the a!docExtractionStatus() function.

To configure the Script Task:

Drag in a Script Task node and rename it to Check Doc Extraction Status.
Open Check Doc Extraction Status and select the Data tab.
On the Inputs tab, create a new input called docExtractionId and set the following field properties:
- Type: Number (Integer)
- Value: pv!docExtractionId
- Required: Yes
- Save into: pv!docExtractionId
On the Outputs tab, create a new custom output.
Use the a!docExtractionStatus() function to return the status of the extraction. To dynamically pass the document extraction ID, use the expression a!docExtractionStatus(pv!docExtractionId). Set the Target to pv!docExtractionStatus.

The a!docExtractionStatus() function will return one of four possible extraction statuses: COMPLETE, IN PROGRESS, INVALID ID, or ERROR. When the extraction begins, the status will be IN PROGRESS until the analysis is finished. Once the analysis completes, the status will change to COMPLETE and you can move to the next node in the process. To account for each status, you will configure a gateway to direct the process.

To configure the XOR gateway:

Drag in an XOR gateway and rename it to Complete?.
Create a connection between the Complete? gateway and Check Doc Extraction Status. This will form a loop for when pv!docExtractionStatus="IN PROGRESS".
Add a Timer Event on the connection between the Complete? gateway and Check Doc Extraction Status. Rename the Timer Event to Sleep.
Open the Sleep event and select the Setup tab. Configure a delay for 1 Minute after activation of the timer event and click OK.

At this point in the design, your process model will look like this:

Next you'll add another Script Task to direct the process when pv!docExtractionStatus="COMPLETE". This Script Task will call the a!docExtractionResult() function to retrieve the results of the extraction.

To configure the Script Task:

Drag in a Script Task node and place it after the Complete? gateway. Rename the Script Task node to Get Doc Extraction Results.
Open Get Doc Extraction Results and select the Data tab.
On the Outputs tab, create a new custom output.
Use the a!docExtractionResult() function to retrieve the results of the extraction. Use the expression a!docExtractionResult(pv!docExtractionId, typeof(pv!documentCDT)) to dynamically pass the document extraction ID and get the type number of your CDT. Learn more about the typeof() function.
Set to Target to pv!documentCDT.

Finally, create a connection from the Complete? gateway to the End Node to handle the process when pv!docExtractionStatus is INVALID ID or ERROR.

With each connection configured, you can now define the gateway's decisions.

Open the Complete? gateway and select the Decisions tab.
Create a condition for each of the following statuses:
- pv!docExtractionStatus="COMPLETE": Analysis is done and Appian has downloaded the results, you can now proceed to Get Doc Extraction Results.
- pv!docExtractionStatus="IN PROGRESS": Analysis is still in progress. Return to the Check Doc Extraction Status after 1 minute.
- pv!docExtractionStatus="INVALID ID": The document extraction ID is invalid. End the process if this status occurs.
- pv!docExtractionStatus="ERROR": An internal evaluation error happened during processing. End the process if this status occurs.

The Complete? gateway configuration should appear like this:

Configure the Reconcile Doc Extraction Smart Service

The last node you will configure is the Reconcile Doc Extraction Smart Service. This smart service will assign a reconciliation task to a user to confirm or correct the extracted results.

To configure the smart service:

Drag in a Reconcile Doc Extraction Smart Service node.
Open the Reconcile Doc Extraction Smart Service and select the Data tab.
On the Inputs tab, configure the default inputs with the following values:
- Doc Extraction Id requires the extraction ID stored in pv!docExtractionId.
- Data Type Number is the type to which the values should be cast. Use the typeof() function with pv!documentCDT.
On the Outputs tab, configure , extractedData so the Target is set to pv!documentCDT. This will store the results of the document's extracted data and the data that was reconciled by a user.
- Note: If you configure your process model to return to the reconcile interface after the task is submitted, you will only see the auto-mapped data. The manually reconciled data is not stored until it is written to a database or external system.
(Optional) Add another node to write to the database or use an integration to write to other systems.

That's it! Your process is set up to extract data. It should look like this, but it may contain additional nodes based on how you customized it:

Additional process configuration options

The process model detailed above provides the basic nodes needed to create an extraction process, but you aren't bound to this model. In fact, the major benefit of creating your own document extraction process is the flexibility to add additional rules or decisions that are specific to your business needs.

There a few ways you can enhance or modify this model, for example:

Create a process start form to pass in different documents, choose a vendor, or pass in other process variable values.
After the Script Task calling a!docExtractionResults, add logic to check if the data you need is populated from the extraction. If the data is populated, you could skip the reconciliation task and go straight to saving the data to your database or external system. For example: isnull(index(pv!invoice,"invoiceNumber",null)) could result in going to reconciliation, but a populated result would skip the reconciliation task.
Use the two optional outputs in the Reconcile Doc Extraction Smart Service to route the process model after reconciliation:
- isSubmit is true when the user selects the Submit button on the reconciliation task. Add logic after this smart service to use isSubmit=true() to trigger an email notification or a confirmation dialog.
- isException is true when the user selects the Mark as Invalid button on the reconciliation task. Add logic after this smart service to use isException=true() to route to a chained user input task, where the user provides more information.
Add conditionality to the Reconcile Doc Extraction Smart Service to determine who should be assigned the reconciliation task based on certain business criteria or rules.
Create an additional process model to map the resulting Text values from your document extraction process to their corresponding system data type values. For example, if you extract invoice numbers from your document extraction process model, the values are returned as type Text. In an additional process model, you could take the returned values and save them to a new CDT as type Number (Integer). This allows you to use the data in other areas of your application without any casting restraints.

If you find that you don't need to customize your document extraction process, consider using the Intelligent Document Processing application that is already pre-configured to extract data and classify documents.

Test your process for auto-extraction

After creating your process model, run it with a few samples to test the extraction and to see how your auto-extracted results change.

To test the document extraction process created above, you can assign a document as the value of pv!document and use Start Process for Debugging to pass the document through the process model.

As you test, Appian will use the field names from the data type to find a match. Over time, Appian learns how to map your data to your data type fields from the user interactions with the reconcile interface.

Appian will delete document extraction runs after 30 days, or when the total amount of data surpasses 10 GB. If you attempt to access a run that has been deleted, you will see an error. Appian will not delete the documents you uploaded. Learn more about your document's security.

Create a Doc Extraction Process