Before you begin integrating document extraction into your business processes, it's important to evaluate whether Appian's features are a good choice for the documents you work with.
This page covers what makes a document either a good or bad candidate for document extraction. Where possible, we offer additional guidance to help you find a solution to process your documents using Appian. Follow the steps below to determine if your documents would be a good fit for Appian document extraction.
Appian is capable of processing PDF documents up to 15 pages or 7 MB. In the context of Appian document extraction, there are three types of PDFs that can be processed:
If you know which category your document falls into, it can help you later on when choosing a vendor to process the document. However, knowing the digitization status isn't necessary, as Appian is able to process most data in all three of these statuses.
The digitization status and format of your documents will ultimately determine how much data Appian can successfully extract from them. Before we dive into the specifics, first consider the structure your documents contain. Documents vary by how the content is organized, so you may find it helpful to have a few examples to reference as you evaluate.
Structured data includes information that's arranged in a fixed layout. Tax forms, passport applications, and hospital forms are good examples of documents with structured data. Appian can extract data from these types of documents easily due to the predictable and consistent positions of labels and values. Appian can use field position to learn more about your data and improve extraction results. To help train the feature, you can use consistently structured documents that place the same fields in the same places. As you complete reconciliation tasks, the feature learns to recognize data based on its position.
Here is an example of a structured document:
Tip: When you get ready to implement document extraction in your processes, you can use the IsStructuredDoc input in the Reconcile Doc Extraction smart service to use position to improve data recognition and extraction.
Semi-structured data includes similar pieces of information, but in varying layouts. Invoices, receipts, and utility bills are good examples of documents with semi-structured data. Appian's document extraction features are well equipped to identify and extract semi-structured data. Automatic extraction improves as you process additional documents.
Here is an example of a semi-structured document:
Unstructured data includes free-flowing paragraphs of text. Legal contracts and emails often include unstructured data. This type of information is more difficult to extract because the machine learning algorithms that identify the information are looking for key-value pairs. Larger blocks of text, or parts of that text, are more difficult to extract.
Here is an example of an unstructured document:
If your documents contain unstructured data, you may still be able to extract data from them:
Documents with these traits make good candidates for Appian's suite of doc extraction features:
Documents with similar information. For example, invoices that all have invoice numbers and totals.
Documents with clearly labeled values. When the data is extracted, Appian pairs these labels with their corresponding values. For example, the date label and March 1, 2021 value.
Documents with tables. Tables are another way to structure data in invoices and other documents with line items.
Documents in supported languages. Appian can extract data from documents in the following languages. Additional languages may be supported for documents with a digital layer (not flattened PDFs, i.e. scanned documents). Google's Document AI service also supports additional languages.
Does your document sound like a good fit so far? Great! Before you get started, also be sure to consider which documents might not be a good fit. We want to make sure you get the most value from Appian document extraction, and identifying which documents won't work is equally as important as identifying the documents that will.
Since Appian Document Extraction is meant to extract labels and values, extracting paragraphs of text is not a good use case for it. If you need to extract paragraphs of text, try using the Google Cloud Vision Connected System.
Likewise, if you need to find specific information in text, such as footnotes in a document, you will be better served by the Google Cloud Vision Connected System along with expressions to analyze the output.
After you've determined Appian can extract data from your documents, the next thing to consider is whether you want to use Appian's built-in capabilities or supplement them with Google's services. Both vendors provide strict data privacy and protection, but if your business requires you to keep data within Appian, you have the option to choose which vendor processes the documents.
So, which vendor is the best fit? As shown in the table below, both Appian and Google are capable of extracting data from flattened PDFs and searchable PDFs. Appian automatically extracts data from fillable PDFs regardless of the vendor you choose.
Refer to the table below to see which vendor supports certain data extraction processes and tools based on the type of data being processed:
Because both vendors offer comparable capabilities, the decision comes down to preference or requirements, such as:
Tip: Some customers may need to opt in to Appian's enhanced optical character recognition (OCR) service to process documents with handwriting, noisy, or skewed data.
Extraction of different data types from the following types of PDFs are supported when Appian
is selected as the vendor:
Tip: Appian's built-in document extraction capabilities related to flattened PDFs, key-value pair extraction, and table extraction are only available for Cloud customers at this time. Self-managed and Appian Gov Cloud customers don't have access to these features. Other built-in capabilities are available for both Cloud and self-managed customers.
Extraction of different data types from the following types of PDFs are supported when Google
is selected as the vendor:
Note that Appian's built-in document extraction capabilities automatically handle fillable PDFs, regardless of the vendor you select. To use Google for all of your forms, including fillable forms, flatten your PDFs before beginning extraction. For example, you can add the Community supported PDF Tools plug-in in your process model to flatten PDFs before the extraction nodes.
Appian offers low-code objects and built-in services to help you harness the power of artificial intelligence (AI) to extract data from your documents. However, it can still be tough to know where to start. And once you do, it can take time to learn the best ways to apply the features to your workflows.
That's why we built an application to help you get started: Intelligent Document Processing (IDP). With IDP, you can automatically extract data from forms in no time. You don't have to create a document extraction process, or even add new flows into your existing ones. After a few configuration steps, you can upload or send documents to IDP and see how quickly and accurately Appian can extract that data.
IDP is a low-stakes way for you to learn more about Appian's document extraction features without investing a lot of time in development or testing. Even better, we've made it available automatically, as part of the platform. Learn more about how to get started with Intelligent Document Processing.
Evaluating Doc Extraction Features