UiPath PDF Data Extraction

UiPath, one of the major providers of robotic process automation (RPA) software, has taken an interesting position in the market. Unlike other vendors, it provides a free, fully featured Community Edition of its product for anybody to test and develop with. The tool automates desktop and web applications and includes web scraping and screen scraping capabilities for both. The platform also has a lively community forum featuring jobs, automation contests and knowledge-sharing between UiPath users: www.forum.uipath.com.

The PDF data extraction and automation feature offers several activities and methods to navigate, identify and use PDF data, whether it is in native text format or in scanned images. The full-featured IDE has a graphical interface with straightforward drag-and-drop functionality and a built-in library of predefined ‘Activities’.

To start things off, you need all the actions and dependencies required for working with PDF files. You can install the ‘UiPath PDF Activities’ package from the Package Manager. A simple search for ‘PDF’ inside the Package Manager will get you there.

1. Extract larger pieces of text or entire documents

These three techniques can be used to extract larger pieces of text or entire documents.

Read PDF Text activity

For this action, the PDF file doesn’t need to be open. You simply select the file and the action outputs a text variable with the contents of the file. You can save the result as a text file or show it in a message box, and you can apply other string operations to modify the generated text or extract information from it. Pay attention to the range parameter, which defines what is actually read: it can be set to ‘All pages’, a specific page, or a range of pages.
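The kind of string operation you might apply to the extracted text can be sketched outside UiPath. The Python snippet below is only an illustration, assuming the extracted text contains a line such as ‘Total: 1,234.56’. The sample text, the label and the function name extract_total are hypothetical and are not part of UiPath.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical sample of what a text-extraction step might return.
SAMPLE_TEXT = """Invoice No: 1042
Customer: ACME Ltd
Total: 1,234.56
"""


def extract_total(text: str) -> Optional[Decimal]:
    """Return the amount that follows a 'Total' label, or None if absent."""
    match = re.search(
        r"^Total\s*:\s*([\d,]+(?:\.\d+)?)\s*$",
        text,
        re.MULTILINE | re.IGNORECASE,
    )
    if match is None:
        return None
    # Drop thousands separators before converting to a number.
    return Decimal(match.group(1).replace(",", ""))


print(extract_total(SAMPLE_TEXT))
```

Inside a UiPath workflow the same idea would be expressed with the string or regular-expression operations available there; the pattern itself would need adjusting to the layout of your own documents.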

OCR

There’s a specific action for reading images inside PDF files called ‘Read PDF with OCR’. It uses optical character recognition to scan the images inside the PDF and outputs all the text as a variable. Unlike its non-OCR sibling, it requires an OCR engine. You can find the available engines and add one by searching for ‘OCR’ in the ‘Activities’ pane. The engine itself carries the OCR parameters that are common throughout the app, such as ‘allowed characters’, ‘denied characters’, ‘language’ and ‘scale’, although different engines may have different parameters. If you need to go deeper into how they work, there’s an advanced ‘UI interactions’ video tutorial available.
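To get a feel for what ‘allowed characters’ and ‘denied characters’ mean, here is a conceptual Python sketch. It is a simplification that filters text after recognition, whereas an OCR engine may apply such restrictions during recognition itself. The function name filter_characters and the sample string are hypothetical and are not part of UiPath.

```python
def filter_characters(text: str, allowed: str = "", denied: str = "") -> str:
    """Keep only `allowed` characters (when given) and drop `denied` ones."""
    kept = []
    for ch in text:
        if allowed and ch not in allowed:
            continue  # an allow-list is set and this character is not on it
        if ch in denied:
            continue  # explicitly denied character
        kept.append(ch)
    return "".join(kept)


# Hypothetical raw OCR output for an amount field.
raw = "Total: 1,234.56 EUR"
print(filter_characters(raw, allowed="0123456789.,"))
print(filter_characters(raw, denied=":"))
```

Restricting the character set this way is mostly useful for fields with a known format, such as amounts or dates.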

If background operation is important to you, note that both the ‘Read PDF’ and the ‘Read PDF with OCR’ actions are self-contained: they work from the file itself and don’t need other applications open, so they can run in the background. Screen-based OCR is different, as it only works with on-screen images. In that case the user must open the PDF file before launching the UiPath PDF extraction robot.

The Screen Scraper Wizard

The third technique, suitable for both large and small blocks of text, is the screen scraper wizard found in the ‘Main’ toolbar. The wizard lets you compare scraping methods, choose one, and have the corresponding actions generated for you. Hover the mouse over the text elements you need to scrape and UiPath will identify them inside the selection you made and show them in a preview window.

The technology behind UiPath screen scraping senses the UI controls like a human instead of blindly using fixed screen coordinates. It extracts text from running Windows apps, even if they are hidden or covered by another app.

UiPath generally detects the best method for your situation, but you can change the scraping method and the preview will adapt accordingly.

2. Extract specific elements

PDFs in the most common format, Native Text, expose their elements directly to UiPath. For these files there are a few options for getting the data:

Get Text action

This action is also available in the integrated ‘Recorder’. Simply point to the element of your choice and UiPath will generate the ‘Get text’ action and its output variable, displaying it in a message box.

If you want to extract the total value from a series of similar PDF files instead of just a single one, you’ll need to tweak the Selector a bit. The ‘Get text’ Action – like most UI interactions – uses a Selector to identify the correct element and get its value.

You can do it automatically with the help of the ‘Attach to Live Element’ feature. Simply point to another similar element that should also match the current Selector and UiPath will try to fix the Selector for you.

In case it doesn’t turn out the way you want, you can also modify the Selector manually. Before doing so, it is advisable to get familiar with UiPath Selectors and learn how to edit and debug them. Selectors play a central role in UI automation, and knowing your way around them will help in many other ways. A video tutorial on the topic is available.

To edit it manually, open the Selector again, this time in the ‘UiExplorer’ feature to get a better view. After editing the key UI elements, copy the new Selector and paste it over the old one. It now works for both files.
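A common manual edit is to replace the file-specific part of an attribute, such as the window title, with a wildcard. UiPath Selectors accept ‘*’ and ‘?’ wildcards in attribute values. The Python sketch below only illustrates why such a generalized value matches both files. The window titles are hypothetical, and Python’s fnmatch also interprets square brackets, which goes beyond the two wildcards mentioned here.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical window titles for two similar PDF files.
TITLES = [
    "invoice_001.pdf - Adobe Acrobat Reader DC",
    "invoice_002.pdf - Adobe Acrobat Reader DC",
]

# Title value as recorded for the first file, and a generalized version.
RECORDED_TITLE = "invoice_001.pdf - Adobe Acrobat Reader DC"
GENERALIZED_TITLE = "*.pdf - Adobe Acrobat Reader DC"


def matching_titles(pattern: str, candidates: List[str]) -> List[str]:
    """Return the candidate titles that match a wildcard pattern."""
    return [title for title in candidates if fnmatchcase(title, pattern)]


print(len(matching_titles(RECORDED_TITLE, TITLES)))     # recorded value
print(len(matching_titles(GENERALIZED_TITLE, TITLES)))  # generalized value
```

The recorded value matches only the file it was captured from, while the generalized value matches every title that ends the same way.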

There is another method you can use to achieve the same result. To extract a fluctuating value from a series of PDF files, you can also explore the ‘Anchor Base’ Activity. It is quite flexible and allows you to use various actions inside it, for example replacing the ‘Find Element’ action with the ‘Find Image’ action. You also don’t have to deal with Selectors as much. And since PDF files look the same on all systems, you can use ‘Find Image’ without its usual drawbacks. Remember to set the zoom of the document to its actual size before indicating the image, so that you get a reliable result. This method also handles structural changes to the document, as long as the image and the data are present and keep the same position relative to each other.
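The idea behind anchoring can be sketched in a few lines of Python: locate a stable element (the anchor) and then pick the value by its position relative to that anchor. This is a conceptual illustration only, not how UiPath implements ‘Anchor Base’. The TextBox type, the coordinates and the function value_right_of are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextBox:
    text: str
    x: float  # left edge on the page
    y: float  # top edge on the page


def value_right_of(anchor_text: str, boxes: List[TextBox],
                   y_tolerance: float = 5.0) -> Optional[str]:
    """Return the text of the nearest box to the right of the anchor."""
    anchor = next((b for b in boxes if b.text == anchor_text), None)
    if anchor is None:
        return None
    # Candidates sit on roughly the same line, to the right of the anchor.
    same_line = [
        b for b in boxes
        if b is not anchor
        and abs(b.y - anchor.y) <= y_tolerance
        and b.x > anchor.x
    ]
    if not same_line:
        return None
    return min(same_line, key=lambda b: b.x - anchor.x).text


# Two hypothetical layouts: the 'Total' row moves, the relationship does not.
page_a = [TextBox("Invoice", 10, 10), TextBox("Total", 10, 200),
          TextBox("1,234.56", 120, 201)]
page_b = [TextBox("Invoice", 10, 10), TextBox("Notes", 10, 200),
          TextBox("Total", 10, 260), TextBox("987.00", 120, 259)]

print(value_right_of("Total", page_a))
print(value_right_of("Total", page_b))
```

Because the lookup depends only on the relationship between anchor and value, it keeps working when the row shifts up or down the page.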

Note that these last two methods require the PDF document to be opened, and the data with which you try to interact must be visible, otherwise it will most probably fail. Make sure you take that into account when building the final automation.

Automate any process with UiPath Studio

The video below explains how to extract data from a single PDF file. It covers extracting general text, whole PDF documents including images, and specific text from a PDF file.
https://www.youtube.com/watch?v=jncjBCY4Auw

If you want to extract data from multiple files in a batch, you can do so in the UiPath Studio workflow designer, where you model an automated process by assembling its steps into a visual flow-chart diagram. One activity reads one PDF at a time, but a workflow can read 1000 .pdf files in a few minutes. There are also some new features in Studio, such as a start screen that lets you begin from best-practice templates, making it easier to create automations.
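The batch pattern itself is a loop that applies a single-file extraction step to every PDF in a folder. The Python sketch below shows that structure under stated assumptions: the function extract_from_folder is hypothetical, and the single-file step is passed in as a parameter because no PDF reader is assumed here. The demo uses empty placeholder files and a stand-in extractor rather than real PDFs.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict


def extract_from_folder(folder: Path,
                        extract: Callable[[Path], str]) -> Dict[str, str]:
    """Apply a single-file extraction step to every .pdf file in a folder."""
    results: Dict[str, str] = {}
    for pdf_path in sorted(folder.glob("*.pdf")):
        results[pdf_path.name] = extract(pdf_path)
    return results


# Demo with placeholder files and a stand-in extractor.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        (folder / name).touch()
    demo_results = extract_from_folder(folder, lambda p: p.stem.upper())

print(demo_results)
```

In a UiPath workflow the equivalent would be a loop over the files of a folder with the PDF reading activity placed inside it.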

See the screenshots below:

PDFDataExtraction-uipath

PDFDataExtraction-uipath

PDFDataExtraction-uipath

Sum up

To sum up, the activities described above should allow you to handle most of the PDF extractions you’ll be faced with. There are a couple more activities, like ‘Find Relative Element’ and ‘Scrape Relative’, which you can discover on your own. UiPath is an advanced tool that makes PDF data extraction and automation easy.

If you’re dealing a lot with scanned documents, you may want to have a look at UiPath’s ‘Image-Based Automation’ video tutorials:

https://www.youtube.com/watch?v=LiiHAf-o_kE

https://www.youtube.com/watch?v=a6zAum7YaM4

Note: The free UiPath Community Edition of the product does not come with phone or e-mail support, but you can check out the Video Tutorials, User Guides and Knowledge Base, or be active on the Community Forum.