Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that Dexi extractor robots behave like full-blown modern browsers and fetch all the resources that crawled pages load (CSS, JS, fonts, etc.).
With some assistance from the Dexi support team, I learned how to speed up data extraction. Since extractor robots load all the resources needed to serve a particular web page, you can block some of these requests without preventing the robot from extracting the target data. You can, for example, block images even if you plan to extract them, whether as links or as actual files.
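To illustrate the idea, here is a minimal Python sketch (not Dexi's actual implementation) of the kind of URL filter such a robot could apply before sending a request; the patterns chosen here are just examples:

```python
from fnmatch import fnmatch

# Hypothetical patterns mimicking a network filter: block image
# and font downloads while leaving HTML, CSS, and JS requests alone.
BLOCK_PATTERNS = [
    "*.png", "*.jpg", "*.jpeg", "*.gif", "*.webp",  # images
    "*.woff", "*.woff2", "*.ttf",                   # fonts
]

def should_block(url: str) -> bool:
    """Return True if the request URL matches any blocking pattern."""
    path = url.split("?", 1)[0]  # ignore query strings when matching
    return any(fnmatch(path, pattern) for pattern in BLOCK_PATTERNS)
```

With a filter like this, the robot still sees the `<img>` tags and their `src` URLs in the page markup, so image links remain extractable even though the image bytes are never downloaded.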
The Dexi WYSIWYG extractor editor makes adding network filters easy via the Network tab. When a robot is opened in Editor mode, you can switch to the Network tab to see a list of all the requests that have been sent to the server:
You may notice that some images are being served from a CDN. So why not block these requests to save bandwidth and speed up the process? You can do that by adding a new filter on the left side of the Network tab, as highlighted in the following screenshot:
Blocking requests is a smart way to reduce an extractor robot's bandwidth consumption.
In our case, after we set the filter to block all the resources from alicdn, the resulting page load looked like this:
[box style=”info blue”]Block requests means that the robot won’t download the resources in question at all, while with Ignore requests, the robot will still download them but won’t wait for the download to complete when navigating to a new page.[/box]
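The Block/Ignore distinction can be sketched as a small decision function. This is only a conceptual model of the behavior described above, not Dexi's API, and the substring rules (including the analytics one) are hypothetical:

```python
from enum import Enum

class FilterAction(Enum):
    ALLOW = "allow"    # send the request and wait for the response
    IGNORE = "ignore"  # start the download, but don't wait before navigating on
    BLOCK = "block"    # never send the request at all

# Hypothetical rules echoing the example above: block CDN images outright,
# fire-and-forget any analytics beacons, allow everything else.
RULES = [
    ("alicdn", FilterAction.BLOCK),
    ("analytics", FilterAction.IGNORE),
]

def classify(url: str) -> FilterAction:
    """Map a request URL to a filter action using simple substring rules."""
    for needle, action in RULES:
        if needle in url:
            return action
    return FilterAction.ALLOW
```

The practical difference: a blocked resource costs no bandwidth at all, while an ignored one still consumes bandwidth but no longer delays page navigation.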