Select a Document

When creating a document extraction process, build it around a specific type of document. Whether it is an invoice, an application, or a simple form, the type and format of the document determine how much data the process can successfully extract.

Certain types of forms work best with document extraction. This page provides guidance on which types of forms to use to get the best results.

Use supported files

Appian Document Extraction supports PDF documents up to 15 pages or 7 MB.
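As a rough illustration, the limits above could be checked before a file is submitted for extraction. This is a minimal sketch, not part of Appian: the function name `within_extraction_limits` is hypothetical, it assumes that both limits must be met, and it assumes 1 MB means 1,048,576 bytes. The page count and file size are passed in, so any PDF library can supply them.

```python
# Limits taken from the sentence above. How they combine (both must hold)
# and the byte size of a megabyte are assumptions made for this sketch.
MAX_PAGES = 15
MAX_BYTES = 7 * 1024 * 1024


def within_extraction_limits(page_count: int, size_bytes: int) -> bool:
    """Return True if a PDF meets both the page limit and the size limit."""
    if page_count < 1 or size_bytes < 0:
        raise ValueError("page_count must be >= 1 and size_bytes must be >= 0")
    return page_count <= MAX_PAGES and size_bytes <= MAX_BYTES


if __name__ == "__main__":
    print(within_extraction_limits(12, 3_500_000))
    print(within_extraction_limits(16, 3_500_000))
```

Files that fail the check could be split or compressed before they enter the extraction process.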

In the context of Appian document extraction, there are three types of PDFs that can be processed:

  • Flattened PDF: No text data associated with the file. It doesn't contain digital text or form fields. Often, these types of PDFs are created from paper documents that have been scanned.
  • Searchable PDF: Contains digital text data that can be highlighted, copied, searched, and accessed programmatically. This type of PDF has undergone previous processing or was saved from a word processor.
  • Fillable PDF: Similar to a searchable PDF, this file allows the user to input and save additional information into form fields.
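The three definitions above can be read as a simple decision rule. The sketch below is illustrative only: `classify_pdf` is a hypothetical helper, and the two boolean inputs are assumed to come from whatever PDF library you use to inspect the file.

```python
def classify_pdf(has_text_layer: bool, has_form_fields: bool) -> str:
    """Map two observed properties of a PDF to the types described above."""
    if has_form_fields:
        # Fillable PDFs are like searchable PDFs, plus form fields for input.
        return "fillable"
    if has_text_layer:
        # Digital text is present, but there are no form fields.
        return "searchable"
    # No digital text and no form fields, for example a scanned paper document.
    return "flattened"


if __name__ == "__main__":
    print(classify_pdf(has_text_layer=False, has_form_fields=False))
```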

The vendor you select depends on your security needs and the type of PDF you want to process.

Which vendor is best for my documents?

Data security is top of mind in many industries. Appian's document extraction features provide strict data privacy and protection, but if your business requires you to keep data within Appian, you have the option to choose which vendor processes the documents. Select Appian or Google in the Start Doc Extraction Smart Service.

IDP secures your data and protects privacy no matter which vendor you select. See Data Security in IDP to learn more about how your data moves in the application.

In addition to your security needs, you'll also want to consider the type of content you want to extract, and how you want to extract it, before choosing a vendor.

The sections below show which data extraction processes and tools each vendor supports for each type of PDF. The processes and tools are:

  • Key-value pairs: Identify and extract labeled values in documents.
  • Selection tool: Use Appian's selection tool in the reconcile interface to manually select the values to extract.
  • Positional extraction: Extract values from documents that follow a template.
  • Checkboxes: Extract values from checkboxes.
  • Tables: Extract information from tables.

Appian

Appian's native document extraction capabilities related to flattened PDFs and key-value pair extraction are currently in Preview. Before enabling this feature, please review its compliance to ensure that it aligns with your organization's security requirements. To enable this feature, select Enable additional native services in the Document Extraction page of the Admin Console.

When Appian is selected as the vendor, the following types of extraction are supported for each type of PDF:

  • Flattened PDFs: Key-value pair extraction, the selection tool, and positional extraction.
  • Searchable PDFs: Key-value pair extraction, the selection tool, and positional extraction.
  • Fillable PDFs: Key-value pair extraction, the selection tool, positional extraction, and checkbox extraction.

Appian's native document extraction capabilities related to flattened PDFs and key-value pair data are only available for Cloud customers at this time. Self-managed and Appian Gov Cloud customers don't have access to these features. Other native capabilities are available for both Cloud and self-managed customers.

Google

When Google is selected as the vendor, the following types of extraction are supported for each type of PDF:

  • Flattened PDFs: Key-value pair extraction, the selection tool, positional extraction, checkbox extraction, and table extraction.
  • Searchable PDFs: Key-value pair extraction, the selection tool, positional extraction, checkbox extraction, and table extraction.
  • Fillable PDFs: Processed automatically on your Appian environment.

Note that Appian's native document extraction capabilities automatically handle fillable PDFs, regardless of the vendor you select. To use Google for all of your forms, including fillable forms, flatten your PDFs before beginning extraction. For example, you can add the Community-supported PDF Tools plug-in to your process model to flatten PDFs before the extraction nodes.
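The vendor support lists above can be captured as a small lookup table, which may help when deciding which vendor to select for a given document. This is a sketch for planning purposes only, not an Appian API: the names `SUPPORT` and `supported_features` are hypothetical, and it assumes that a fillable PDF gets Appian's fillable feature set even when Google is selected, based on the note above.

```python
# Feature sets transcribed from the Appian and Google lists above.
BASE = frozenset({"key-value pairs", "selection tool", "positional extraction"})

SUPPORT = {
    "appian": {
        "flattened": BASE,
        "searchable": BASE,
        "fillable": BASE | {"checkboxes"},
    },
    "google": {
        "flattened": BASE | {"checkboxes", "tables"},
        "searchable": BASE | {"checkboxes", "tables"},
    },
}


def supported_features(vendor: str, pdf_type: str) -> frozenset:
    """Return the extraction features available for a vendor and PDF type."""
    vendor, pdf_type = vendor.lower(), pdf_type.lower()
    if vendor not in SUPPORT:
        raise ValueError(f"unknown vendor: {vendor}")
    if pdf_type == "fillable":
        # Assumption: fillable PDFs are handled natively by Appian
        # regardless of the selected vendor, so Appian's set applies.
        return SUPPORT["appian"]["fillable"]
    if pdf_type not in SUPPORT[vendor]:
        raise ValueError(f"unknown PDF type: {pdf_type}")
    return SUPPORT[vendor][pdf_type]


if __name__ == "__main__":
    print(sorted(supported_features("Google", "flattened")))
    print("tables" in supported_features("Appian", "searchable"))
```

For example, a scanned invoice that contains a line-item table would call for Google under this lookup, because table extraction appears only in the Google lists.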

Use forms with similar information

Document extraction works best with forms that tend to contain similar information in each document. For example, almost all invoices have an invoice number and a total, and almost all purchase orders have a PO number and a purchaser.

[Image: invoice examples]

Use forms with structured data

Appian Document Extraction can use field position to learn more about your data and improve extraction results. To help train the feature, you can use consistently structured documents that place the same fields in the same places. As you complete reconciliation tasks, the feature learns to recognize data based on its position.

Set the isStructuredDoc input to True in the Reconcile Doc Extraction Smart Service or the a!docExtractionResult function to take advantage of this learning.

Use forms with clearly labeled values

Document extraction also works best with forms whose values are clearly labeled, because the labels help identify the data to extract.

For example, the following form contains numerous labels and values associated with those labels, such as the label INVOICE and the value 101 and the label DATE and the value MAR 20, 2020.

[Image: purchase order good example]

Use forms with tables

Appian document extraction allows you to extract information from tables in your documents. The data from each table is presented neatly in the reconciliation task for quick and simple verification.

[Image: table good example]

Don't use documents with big blocks of text

Since Appian Document Extraction is meant to extract labels and values, extracting paragraphs of text is not a good use case for it. If you need to extract paragraphs of text, try using the Google Cloud Vision Connected System.

[Image: paragraph bad example]

Likewise, if you need to find specific information in text, such as footnotes in a document, you will be better served by the Google Cloud Vision Connected System along with expressions to analyze the output.

[Image: footnotes bad example]
