
Dexi.io – how to improve performance

Intro

Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that a Dexi extractor robot behaves like a full-blown modern browser and fetches all the resources the crawled pages request (CSS, JS, fonts, etc.).
In terms of performance, an extractor robot might not be as fast as a pure HTTP scraping script, but its advantage is the ability to extract data from dynamic websites that require running JavaScript to generate the user-facing content. It is also harder for anti-bot mechanisms to detect and block.

Improving

With some assistance from the Dexi support team, I learned how to speed up the data extraction process. Since an extractor robot loads every resource a page needs, you can block some of these requests without preventing the robot from extracting the target data. You can, for example, block images even if you intend to extract them (as links or as the actual files).
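Dexi applies such filters inside its own editor, but the underlying idea is plain request interception, the same as in any scriptable browser. The sketch below uses Playwright (not Dexi) purely as an illustration, with a placeholder URL: image downloads are blocked, yet the image URLs can still be pulled from the DOM.

```python
# Sketch only: Playwright request interception standing in for
# Dexi's network filters. Images are never downloaded, but their
# URLs remain in the DOM, so extracting images "as links" still works.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # Rough equivalent of a "Block requests" filter for image types.
    page.route("**/*.{png,jpg,jpeg,gif}", lambda route: route.abort())

    page.goto("https://example.com/products")  # placeholder URL

    # The <img> tags are still in the HTML, so their src attributes
    # can be collected even though the files were never fetched.
    image_links = page.eval_on_selector_all("img", "nodes => nodes.map(n => n.src)")
    print(image_links)

    browser.close()
```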

The Dexi WYSIWYG extractor editor makes adding network filters easy via the Network tab. When a robot is opened in Editor mode, you can switch to the Network tab to see a list of all the requests the crawled page has made.

You may notice that some images are being served from a CDN. So why not block these requests to save some bandwidth and speed up the process? You can do that by adding a new filter on the left side of the Network tab.

You can also rely on your browser’s developer tools to identify other requests that could be blocked (live chat widgets, advertising networks, analytics services…) and experiment with blocking JavaScript (.js as a pattern); a robot should keep working fine unless you need to extract dynamically generated content. Blocking CSS isn’t recommended: it won’t have a significant impact and would make building a robot a bit harder. To block images, add separate network filters for .gif, .jpeg, .jpg and .png with Block requests as the behavior.
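For comparison, here is how the same set of filters (per-extension image rules, a CDN host, and optionally .js) might look when expressed as interception rules in a Playwright script. This is only an analogy to the Dexi filters described above; the CDN domain and target URL are placeholders.

```python
# Sketch only: the filters described above, expressed as Playwright
# interception rules rather than Dexi network filters.
from playwright.sync_api import sync_playwright

IMAGE_PATTERNS = ("**/*.gif", "**/*.jpeg", "**/*.jpg", "**/*.png")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()

    # One "Block requests" rule per image extension.
    for pattern in IMAGE_PATTERNS:
        page.route(pattern, lambda route: route.abort())

    # Block everything served from a CDN host (placeholder domain).
    page.route("https://cdn.example.com/**", lambda route: route.abort())

    # Optional: blocking all .js also saves bandwidth, but it breaks
    # dynamically rendered content, so enable it only when the target
    # data is present in the static HTML.
    # page.route("**/*.js", lambda route: route.abort())

    page.goto("https://example.com")  # placeholder URL
    browser.close()
```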

Request blocking is a smart way to reduce an extractor robot’s bandwidth consumption.

In our case, after we set a filter to block all the resources served from alicdn, the resulting page load was noticeably lighter.

Block requests means the robot won’t download the resources in question at all, while with Ignore requests the robot still downloads them but doesn’t wait for the download to complete when navigating to a new page.