In today’s business landscape, organizations are constantly seeking ways to optimize their financial processes, enhance efficiency, and drive cost savings. One area that holds significant potential for improvement is accounts payable. On a high level, the accounts payable process includes receiving and scanning invoices, extraction of the relevant data from scanned invoices, validation, approval, and archival. The second step (extraction) can be complex. Each invoice and receipt look different. The labels are imperfect and inconsistent. The most important pieces of information such as price, vendor name, vendor address, and payment terms are often not explicitly labeled and have to be interpreted based on context. The traditional approach of using human reviewers to extract the data is time-consuming, error-prone, and not scalable.
In this post, we show how to automate the accounts payable process using Amazon Textract for data extraction. We also provide a reference architecture to build an invoice automation pipeline that enables extraction, verification, archival, and intelligent search.
The following architecture diagram shows the stages of a receipt and invoice processing workflow. It starts with a document capture stage to securely collect and store scanned invoices and receipts. The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related relationships between text such as vendor name, invoice receipt date, order date, amount due, amount paid, and so on. In the next stage, you use predefined expense rules to determine if you should automatically approve or reject the receipt. Approved and rejected documents go to their respective folders within the Amazon Simple Storage Service (Amazon S3) bucket. For approved documents, you can search all the extracted fields and values using Amazon OpenSearch Service. You can visualize the indexed metadata using OpenSearch Dashboards. Approved documents are also set up to be moved to Amazon S3 Intelligent-Tiering for long-term retention and archival using S3 lifecycle policies.
The following sections take you through the process of creating the solution.
To deploy this solution, you must have the following:
To create the AWS Cloud9 environment, provide a name and description. Keep everything else as default. Choose the IDE link on the AWS Cloud9 console to navigate to IDE. You’re now ready to use the AWS Cloud9 environment.
To set up the solution, you use the AWS Cloud Development Kit (AWS CDK) to deploy an AWS CloudFormation stack.
git clone pip install -r requirements.txt cdk bootstrap cdk deploy
The deployment takes around 25 minutes with the default configuration settings from the GitHub repo. Additional output information is also available on the AWS CloudFormation console.
aws dynamodb execute-statement --statement "INSERT INTO \"$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)\" VALUE [a-z][0-9]$', 'errorTxt': 'Receipt number is not valid. It is of the format: 123ABC456'>" aws dynamodb execute-statement --statement "INSERT INTO \"$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)\" VALUE "
In Amazon Cognito, you should already have an existing user pool called OpenSearchResourcesCognitoUserPool* . We use this user pool to create a new user.
Now let’s dive into each of the document processing steps.
Customers handle invoices and receipts in a multitude of formats from different vendors. These documents are received through channels like hard copies, scanned copies uploaded to file storage, or shared storage devices. In the document capture stage, you store all scanned copies of receipts and invoices in a highly scalable storage such as in an S3 bucket.
The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related relationships between text such as Vendor Name, Invoice Receipt Date, Order Date, Amount Due/Paid, etc.
AnalyzeExpense is an API dedicated to processing invoice and receipts documents. It is available both as a synchronous or asynchronous API. The synchronous API allows you to send images in bytes format, and the asynchronous API allows you to send files in JPG, PNG, TIFF, and PDF formats. The AnalyzeExpense API response consists of three distinct sections:
This post uses the Amazon Textract IDP CDK constructs (AWS CDK components to define infrastructure for intelligent document processing (IDP) workflows), which allows you to build use case-specific, customizable IDP workflows. The constructs and samples are a collection of components to enable definition of IDP processes on AWS and published to GitHub. The main concepts used are the AWS CDK constructs, the actual AWS CDK stacks, and AWS Step Functions.
The following figure shows the Step Functions workflow.
The extraction workflow includes the following steps:
We discuss the details of the next three steps in the following sections.
For the verification stage, the SetMetaData Lambda function verifies whether the uploaded file is a valid expense as per the rules configured previously in DynamoDB table. For this post, you use the following sample rules:
After the files are processed, the expense verification function moves the input files to either approved or declined folders in the same S3 bucket.
For the purposes of this solution, we use DynamoDB to store the expense validation rules. However, you can modify this solution to integrate with your own or commercial expense validation or management solutions.
With the OpenSearchPushInvoke Lambda function, the extracted expense metadata is pushed to an OpenSearch Service index and is available for search.
The final TaskOpenSearchMapping step clears the context, which otherwise could exceed the Step Functions quota of maximum input or output size for a task, state, or workflow run.
After the OpenSearch Service index is created, you can search for keywords from the extracted text via OpenSearch Dashboards.
To manage the lifecycle and archival of invoices and receipts, you can configure S3 lifecycle rules to transition S3 objects from Standard to Intelligent-Tiering storage classes. S3 Intelligent-Tiering monitors access patterns and automatically moves objects to the Infrequent Access tier when they haven’t been accessed for 30 consecutive days. After 90 days of no access, the objects are moved to the Archive Instant Access tier without performance impact or operational overhead.
For auditing and analytics, this solution uses OpenSearch Service for running analytics on invoice requests. OpenSearch Service enables you to effortlessly ingest, secure, search, aggregate, view, and analyze data for a number of use cases, such as log analytics, application search, enterprise search, and more.
Log in to OpenSearch Dashboards and navigate to Stack Management, Saved objects, then choose Import. Choose the invoices.ndjson file from the cloned repository and choose Import. This prepopulates indexes and builds the visualization.
Refresh the page and navigate to Home, Dashboard, and open Invoices. You can now select and apply filters and expand the time window to explore past invoices.
When you’re finished evaluating Amazon Textract for processing receipts and invoices, we recommend cleaning up any resources that you might have created. Complete the following steps:
cognito_user_pool=$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-CognitoUserPoolId`].Value' --output text) echo $cognito_user_pool cdk destroy aws cognito-idp delete-user-pool --user-pool-id $cognito_user_pool
In this post, we provided an overview of how we can build an invoice automation pipeline using Amazon Textract for data extraction and create a workflow for validation, archival, and search. We provided code samples on how to use the AnalyzeExpense API for extraction of critical fields from an invoice.
To get started, sign in to the Amazon Textract console to try this feature. To learn more about Amazon Textract capabilities, refer to the Amazon Textract Developer Guide or Textract Resources. To learn more about IDP, refer to the IDP with AWS AI services Part 1 and Part 2 posts.
Sushant Pradhan is a Sr. Solutions Architect at Amazon Web Services, helping enterprise customers. His interests and experience include containers, serverless technology, and DevOps. In his spare time, Sushant enjoys spending time outdoors with his family.
Shibin Michaelraj is a Sr. Product Manager with the AWS Textract team. He is focused on building AI/ML-based products for AWS customers.
Suprakash Dutta is a Sr. Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs intelligent document processing solutions.
Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel and ride his motorcycle in Texas Hill Country.