Document Extraction Glossary [Document Extraction Suite]


B

Built-in services

Appian's own document extraction capabilities. You can extract data from documents and keep all of this information within your Appian Cloud instance by using built-in services.

C

Channel

See Document channel.

Checkbox

A user interface component that allows a user to make a binary choice, often "yes" or "no."

In the context of document extraction, this information is extracted as a label and the selection is saved as a Boolean (true for checked and false for unchecked).
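As a minimal sketch of that representation, the label can be kept as text and the selection as a Boolean. The ExtractedCheckbox record and the sample labels below are hypothetical and used only for illustration; they are not Appian types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedCheckbox:
    """Hypothetical record for one extracted checkbox (not an Appian type)."""
    label: str     # text printed next to the box
    checked: bool  # True for checked, False for unchecked


# Sample values, invented for illustration.
newsletter = ExtractedCheckbox(label="Subscribe to newsletter", checked=True)
terms = ExtractedCheckbox(label="Accept terms", checked=False)
```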

Classification

In the Intelligent Document Processing (IDP) application, classification is the optional process of identifying a document's type based on certain traits and assigning it to a group accordingly. IDP document channels can identify a document's type automatically if the classification model is configured and trained to do so. Users may also be asked to complete a manual classification task if the model isn't confident that it can classify the document automatically.

Confidence threshold

A preset value that determines when IDP cannot classify a document automatically. If the auto-classification confidence is below the threshold (for example, 90%), IDP generates a task for a human to perform the review manually.
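The threshold check can be sketched as a single comparison. The function name and the 0-to-1 confidence scale are assumptions made for illustration; this is not how IDP is implemented internally.

```python
def needs_manual_classification(confidence: float, threshold: float = 0.90) -> bool:
    """Return True when the confidence is below the threshold.

    Hypothetical helper: confidence and threshold are assumed to be
    fractions between 0 and 1, so 0.90 stands for the 90% example.
    """
    return confidence < threshold


# A confident result is classified automatically; a weak one goes to a person.
auto_result = needs_manual_classification(0.95)
manual_result = needs_manual_classification(0.72)
```

In this sketch, a confidence exactly equal to the threshold does not trigger a manual task, because only values below the threshold do.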

D

Document channel

A document channel is a grouping of document types that have their own configurations and security settings for the purposes of processing. This allows various teams using IDP to control documents and data that may contain sensitive information.

Document structure

In the context of document extraction, structure describes how the content in a document is organized. Appian's document extraction features are more effective on certain structures than others. Document structures include: structured, semi-structured, and unstructured.

Document type

A category of document you use in your business operations. For example, a purchase order or invoice.

Not to be confused with the file type or extension, such as .pdf or .xlsx.

Document extraction

The process of identifying data relationships in a PDF document and digitally representing that information. Appian may use built-in capabilities to extract this information or, if the user prefers, Google's AI services are available as well. The process of extraction can also include a user reconciliation task. After reconciliation, Appian will store and recall the mapping of the extracted key to an Appian field.

E

Extraction

See Document extraction.

F

Field

A single piece of data to be extracted from a document and mapped to a custom data type (CDT).

Fillable PDF

Similar to a searchable PDF, this file allows the user to input and save additional information into form fields.

Flattened PDF

A PDF with no text data associated with the file. It doesn't contain digital text or form fields. Often, these types of PDFs are created from paper documents that have been scanned.

K

Key

See Label.

Key-value pair

A match between two data elements (a label and value) that are extracted from a document.
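A key-value pair can be pictured as a two-part record. The KeyValuePair type below is a hypothetical illustration, not an Appian data type.

```python
from typing import NamedTuple


class KeyValuePair(NamedTuple):
    """Hypothetical record pairing an extracted label with its value."""
    label: str  # the constant text printed on the document
    value: str  # the variable data entered for that label


pair = KeyValuePair(label="Name", value="John Smith")
```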

L

Label

The extracted constant that defines a part of a data set. This information isn't changed based on the user's selection or input. It is matched with the value to create a key-value pair in the extracted data. For example, "Name" is a label, and "John Smith" is a value.

M

Mapping

The act of matching data extracted from a field in a document to a field in a CDT.
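A minimal sketch of mapping, assuming the extracted data arrives as label-to-value pairs. The labels, the CDT field names, and the map_to_cdt helper are all hypothetical.

```python
# Hypothetical mapping from extracted labels to CDT field names.
LABEL_TO_CDT_FIELD = {
    "Invoice No.": "invoiceNumber",
    "Total Due": "totalAmount",
}


def map_to_cdt(extracted: dict, mapping: dict) -> dict:
    """Return CDT field values for every extracted label with a known mapping."""
    return {
        mapping[label]: value
        for label, value in extracted.items()
        if label in mapping  # labels without a mapping are skipped
    }


extracted = {"Invoice No.": "INV-001", "Total Due": "150.00", "Notes": "n/a"}
cdt_values = map_to_cdt(extracted, LABEL_TO_CDT_FIELD)
```

In this sketch, labels that have no known mapping (here, "Notes") are left out of the result.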

O

Optical character recognition (OCR)

OCR software recognizes text within a digital image. This technology is well-suited for unstructured documents, but it is less accurate and requires more maintenance than purpose-built document extraction models.

P

Positional extraction

The ability to extract data based on its location in a document. Appian can use positional extraction if it has previously processed similar documents and learned from the results.

R

Reconciliation

The manual task of confirming or updating data Appian extracted from a document. Functionally, users compare the data that was extracted to an image of the uploaded document. When reconciliation occurs, Appian learns how to map the data in documents to the proper fields in the corresponding data type. Over time, this will make auto-extraction more accurate and reconciliation easier and less frequent.

S

Searchable PDF

A PDF that contains digital text data that can be highlighted, copied, searched, and accessed programmatically. This type of PDF has undergone previous processing or was saved from a word processor.

Semi-structured document

Documents that include similar pieces of information, but in varying layouts. Invoices, receipts, and utility bills are good examples of documents with semi-structured data. Appian's document extraction features are well equipped to identify and extract semi-structured data. Through AI and machine learning, the services improve as you process additional documents.

Straight-through processing

Processing in which the extracted data is 100% accurate, which eliminates the need for a reconciliation task.

Structured document

A document containing information that is arranged in a fixed layout. Tax forms, passport applications, and hospital forms are good examples of documents with structured data. Appian can extract data from these types of documents easily due to the predictable and consistent positions of labels and values.

T

Table

Information displayed in a grid-like format, often using columns and rows to show similar information in a predictable way.

In document extraction, a table is a subset of the overall document data and requires additional configuration to extract and store the data properly.
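One way to picture a table as a subset of the document data is a list of row records stored alongside the single-value fields. The structure, the names, and the column helper below are assumptions for illustration, not Appian's storage format.

```python
# Hypothetical extracted document: single-value fields plus one table.
extracted_document = {
    "fields": {"Invoice No.": "INV-001"},
    "tables": {
        "line_items": [
            {"Description": "Widget", "Quantity": "2", "Unit Price": "5.00"},
            {"Description": "Gadget", "Quantity": "1", "Unit Price": "12.50"},
        ]
    },
}


def column(table: list, header: str) -> list:
    """Return the values under one column header, in row order."""
    return [row[header] for row in table]


descriptions = column(extracted_document["tables"]["line_items"], "Description")
```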

Train

The process of improving the ability of IDP to extract correct information. This is achieved by providing example documents for IDP to process and manually confirming or fixing the extracted data in a reconciliation task. When a user provides this feedback through correction or confirmation, the model that extracted the data learns and improves in the future.

Type

See Document type.

U

Unstructured document

Documents that include free-flowing paragraphs of text. Legal contracts and emails often include unstructured data. This type of information is more difficult to extract because the machine learning algorithms that identify the information are looking for key-value pairs. Larger blocks of text, or parts of that text, are more difficult to extract.

V

Value

The extracted variable that defines a part of a data set. This information is changed based on the user's selection or input. It is matched with the label to create a key-value pair in the extracted data. For example, "Name" is a label, and "John Smith" is a value.

Vendor

The company that provides document extraction services. Customers can choose to use either Appian or Google for document extraction, based on their preferences and use cases.

Built: Thu, May 23, 2024 (08:46:51 PM)