The modern web requires a huge amount of processing power to mine it for information. How can a start-up or a small business do comprehensive data crawling without building the giant server farms used by the major search engines?
80legs makes web-crawling technology more accessible to small companies and individuals by offering leased access and letting customers pay only for what they crawl. Nontechnical users can now set up a crawl job with a fair degree of control, and developers can incorporate the 80legs API into their applications to spread the crawling net. The distributed computing network itself is assembled by a third-party company that rents it to 80legs; the core idea is to have home PCs crawl web pages on demand during their idle time.
How it Works
Make a crawl application
Completing the 80legs form is all it takes to create a crawl application, which then runs across many computers on the web to perform a custom search. For custom crawling, a user can choose the MIME types to crawl (text, image, application, video and so on); analysis types range from URLs to metadata and MD5 hashes; a depth level (for a paid account) and a crawl type (Fast, Comprehensive or Breadth-First) must also be set. As soon as you’ve completed the form, your job is queued to run.
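For a sense of what the form collects, here is a minimal sketch of a crawl-job configuration. The field names are my own shorthand for the form fields described above, not 80legs’ actual parameter names.

```python
# Hypothetical representation of an 80legs crawl-job form. The keys below are
# illustrative shorthand for the form fields, not the service's real API names.
crawl_job = {
    "name": "my-first-crawl",
    "seed_urls": ["http://example.com/"],      # where the crawl starts
    "mime_types": ["text/html", "image/*"],    # content types to fetch
    "analysis": ["urls", "metadata", "md5"],   # what to extract per page
    "depth": 2,                                # link depth (paid accounts)
    "crawl_type": "Breadth-First",             # Fast | Comprehensive | Breadth-First
    "max_pages": 1000,                         # the free plan caps a job at 1,000 pages
}

if __name__ == "__main__":
    for field, value in crawl_job.items():
        print(f"{field}: {value}")
```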
Run the job
The job runs in a distributed fashion, with low memory and bandwidth consumption.
Download the results
Download the results when the job is done (you’ll be notified by email). For a free account the results typically come in two sets: crawled URLs and analyzed URLs, and the two counts may differ.
I quickly composed a crawl job by filling in 14 form fields, and within 10 minutes I had the crawl results. The Regex crawling plugin let me be more specific about the search terms. One thing missing is a way to edit a crawl job once it has been created, but copying it to create a clone works well.
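The plugin’s exact syntax isn’t documented here, but the idea is ordinary regular-expression matching against page text. A plain-Python equivalent of what such a plugin does per page might look like this (the search terms are made up for the example):

```python
import re

# Illustrative only: count occurrences of search terms on a fetched page,
# roughly what a regex crawling plugin reports for each crawled URL.
SEARCH_TERMS = [r"data\s+mining", r"web\s+crawl(?:er|ing)?"]

def count_matches(page_text: str) -> dict:
    """Return total and unique match counts per search term."""
    results = {}
    for pattern in SEARCH_TERMS:
        matches = re.findall(pattern, page_text, flags=re.IGNORECASE)
        results[pattern] = {
            "total": len(matches),
            "unique": len({m.lower() for m in matches}),
        }
    return results

print(count_matches("Web crawling and data mining; a web crawler crawls."))
```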
Detailed job summary
It’s important to be able to see the job details, especially the overview of successes and failures.
Custom or Pre-Built Crawling App
Admittedly, the crawl service isn’t especially adaptable or programmable for a layman. Some companies have therefore developed custom processing against the API, or used a pre-built 80app from the store, to crawl and index the web on their own terms on top of the 80legs service. These pre-built apps are plugins that can be attached to a crawl to parse or analyze the target data. Some of them are free, though the crawl fee still applies.
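As a rough illustration of the plugin model (this is not the actual 80app interface, only a hypothetical sketch of the “process each crawled page” idea), a per-page hook could be shaped like this:

```python
# Conceptual sketch of a per-page processing plugin, in the spirit of an 80app.
# The real 80apps run inside the 80legs service; this interface is hypothetical.
from dataclasses import dataclass

@dataclass
class CrawledPage:
    url: str
    content: bytes

def process_page(page: CrawledPage) -> dict:
    """Parse or analyze one crawled page and return the data worth keeping."""
    text = page.content.decode("utf-8", errors="ignore")
    return {
        "url": page.url,
        "size_bytes": len(page.content),
        "contains_price": "$" in text,   # example of custom target-data analysis
    }

print(process_page(CrawledPage("http://example.com/", b"Sale: $19.99")))
```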
Select a Chunk of Web, Crawl and Index it
If you know which data sources you want to grab, set up your crawl engine’s rules to search, index and save the data (for the last of these you definitely need a custom app). For instance, IndexTank has teamed up with 80legs so a customer can select a chunk of the web, crawl it and index it.
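IndexTank’s own client isn’t shown here; as a stand-in, the sketch below feeds crawl output into a trivial in-memory inverted index, which is conceptually what “crawl it and index it” amounts to.

```python
from collections import defaultdict

# Toy inverted index built from (url, text) pairs, standing in for a hosted
# index such as IndexTank: each term maps to the URLs where it appears.
def build_index(pages):
    index = defaultdict(set)
    for url, text in pages:
        for term in set(text.lower().split()):
            index[term].add(url)
    return index

pages = [
    ("http://example.com/a", "cheap real estate listings"),
    ("http://example.com/b", "retail business listings"),
]
index = build_index(pages)
print(sorted(index["listings"]))
```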
Pre-Configured Live Crawls of Websites from 80legs
The service also offers crawl packages on popular topics: business, real estate, retail and others. The data is updated frequently, with crawl results posted every 1–3 hours, and the output format is usually XML.
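The exact schema of these packaged feeds isn’t shown here; assuming a simple element per crawled page, consuming the XML output could look like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed layout; the real schema of 80legs' packaged crawls may
# differ. <page url="..."><title>...</title></page> is assumed purely to show
# how the XML output would be parsed.
SAMPLE_FEED = """
<crawl topic="real-estate">
  <page url="http://example.com/listing/1"><title>3-bed house</title></page>
  <page url="http://example.com/listing/2"><title>Downtown condo</title></page>
</crawl>
"""

root = ET.fromstring(SAMPLE_FEED)
for page in root.findall("page"):
    print(page.get("url"), "-", page.findtext("title"))
```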
Crawl Speed Claimed to Catch up with Modern Search Engines
The owners claim that a single crawl job can process a million pages in about 15 minutes, which works out to roughly 1,100 pages per second, or under a millisecond per page. The service offers this speed to every customer while respecting the bandwidth of the sites being crawled, and the developers say it is comparable to the engines of Google, Yahoo and the other giants. In my test of 1,000 crawled pages, the CPUs spent 0.01 hours, or about 36 seconds, roughly 36 ms of CPU time per page; spread across the many machines of the distributed network, that is consistent with the claimed throughput. Job execution priority is another factor to consider.
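The figures are easy to check; the short calculation below reproduces the claimed throughput and the per-page CPU time from my test.

```python
# Back-of-the-envelope check of the claimed and measured figures.
claimed_pages, claimed_minutes = 1_000_000, 15
throughput = claimed_pages / (claimed_minutes * 60)      # ~1,111 pages/second
ms_per_page_claimed = 1000 / throughput                  # ~0.9 ms per page

test_pages, test_cpu_hours = 1_000, 0.01
cpu_seconds = test_cpu_hours * 3600                      # 36 seconds of CPU time
ms_cpu_per_page = cpu_seconds / test_pages * 1000        # 36 ms of CPU time per page

print(f"claimed: {throughput:.0f} pages/s, {ms_per_page_claimed:.1f} ms/page")
print(f"measured: {ms_cpu_per_page:.0f} ms CPU per page")
```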
Pricing ranges from free for the Basic plan (low crawl priority, up to 1,000 pages per job and other limitations) to $299 per month for the Premium plan. On top of that you are charged usage fees: “$2.20 per million pages crawled & $0.03 per CPU-hour used & $0.00000008 per KB for results over 25 KB per page.” That is fairly affordable compared with the pricing of other crawling web services.
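To put the quoted rates in perspective, here is a small estimator for a hypothetical job; the fee figures are taken straight from the pricing quoted above, while the example job size is made up.

```python
# Estimate the usage fee for a crawl job using the quoted rates:
# $2.20 per million pages, $0.03 per CPU-hour, and $0.00000008 per KB of
# results beyond 25 KB per page.
def estimate_fee(pages, cpu_hours, avg_result_kb_per_page):
    page_fee = 2.20 * pages / 1_000_000
    cpu_fee = 0.03 * cpu_hours
    overage_kb = max(avg_result_kb_per_page - 25, 0) * pages
    data_fee = 0.00000008 * overage_kb
    return page_fee + cpu_fee + data_fee

# Example: a million-page job using 10 CPU-hours, returning 40 KB per page.
print(f"${estimate_fee(1_000_000, 10, 40):.2f}")
```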
My results were limited to crawled URLs and some indexed fields. Pulling the full content of every page would consume a lot of bandwidth on both ends, so for page content, developers can write custom code as an 80app.
The service created two zipped .csv files for me, which I downloaded: “Pages crawled” and “Pages analyzed”. The first file lists the crawled URLs with per-page details (process status, page size, parse time and other fields); the second lists the analyzed pages, indicating whether the search terms appeared on each page and how often (unique count, total count, search string_1 count and so on).
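Reading the downloaded results is straightforward; the sketch below unzips one of the files and iterates over its rows. The file and column names are paraphrased from what I saw, so adjust them to the headers in your own results.

```python
import csv
import io
import zipfile

# The file and column names here are paraphrased from the downloaded results
# ("Pages crawled", "Pages analyzed") and may differ in your own files.
def read_results(zip_path, csv_name):
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(csv_name) as fh:
            return list(csv.DictReader(io.TextIOWrapper(fh, encoding="utf-8")))

# Example usage (paths are hypothetical):
# for row in read_results("pages_analyzed.zip", "pages_analyzed.csv"):
#     print(row["url"], row.get("total_count"))
```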
API and More
- Results Downloader API. For developers who want to incorporate the data into other services or applications, APIs are available for Java and for Python (via the JPype library).
- Results Downloader application is available, too.
This web service works best for small and mid-sized businesses that want to crawl data on a budget. The general crawl form may not adapt to everyone’s specific requirements, so users have to choose between the crawled data as configured in the form, a pre-built 80app, and a crawl application they develop themselves.