80legs Review – Crawler for rent in the sky

80legs offers a crawling service that allows users to (1) easily compose crawl jobs and (2) run those jobs in the cloud over a distributed computing network.

Mining the modern web for information requires a huge amount of processing power. How can a start-up or a small business do comprehensive data crawling without building the giant server farms used by the major search engines?

80legs makes web crawling technology more accessible to small companies and individuals by offering leased access and letting customers pay only for what they crawl. Nontechnical users can set up a crawl job with a fair degree of control, and developers can incorporate the 80legs API into their applications to spread the crawling net. The distributed computing network itself is assembled by a third-party company that rents it to 80legs; the main idea is to have home PCs crawl web pages on demand during their idle time.

How it Works

Make a crawl application

Creating a crawl application is as simple as completing a form. Fill in the 80legs form and your custom crawl runs across many computers on the web. For custom crawling, you can choose which MIME types to crawl (text, image, application, video and so on), and the analysis type ranges from plain URL lists to metadata and MD5 hashes. You also set the depth level (for a paid account) and the crawl type (Fast, Comprehensive or Breadth-First). As soon as you submit the form, the job is queued to run; a rough sketch of what such a job definition might look like programmatically follows.
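The sketch below shows how a crawl job like the one described above might be composed programmatically. The endpoint URL, authentication scheme and field names are assumptions for illustration only; they mirror the options on the web form (MIME types, analysis, depth, crawl type), not the actual 80legs API specification.

```python
# Hypothetical sketch of composing a crawl job programmatically.
# Endpoint, auth scheme and field names are illustrative assumptions.
import json
import urllib.request

API_TOKEN = "your-80legs-api-token"          # assumption: token-based auth
ENDPOINT = "https://api.80legs.com/jobs"     # assumption: illustrative URL

job = {
    "name": "example-crawl",
    "seed_urls": ["http://example.com/"],
    "mime_types": ["text/html"],             # crawl only HTML pages
    "analysis": "keyword_match",             # e.g. count search terms per page
    "search_terms": ["scraping", "crawler"],
    "depth": 2,                              # deeper levels need a paid account
    "crawl_type": "breadth_first",           # Fast / Comprehensive / Breadth-First
    "max_pages": 1000,                       # free-plan limit per job
}

request = urllib.request.Request(
    ENDPOINT,
    data=json.dumps(job).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": API_TOKEN},
)
with urllib.request.urlopen(request) as response:
    print("Job queued:", response.read().decode("utf-8"))
```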

Run the job

The job runs in a distributed fashion, consuming little memory or bandwidth on any single machine.

Download the results

Download the results when the job is done (you'll be notified by email). For a free account the results typically come in two sets: crawled URLs and analyzed URLs, and the counts of the two may differ.

I quickly composed a crawl job by filling in 14 form fields, and within 10 minutes I had the crawl results. The regex crawling plugin helped me refine the search terms. One thing that is missing is a way to edit a created crawl job, but cloning it with the copy function works well.

Detailed job summary

It's important to be able to see the job details, especially the overview of successes and failures.

Custom or Pre-Built Crawling App

Admittedly, the crawl service is not very adaptable or programmable for a layman. Some companies therefore develop custom processing code against the API, or use a pre-built 80app from the store, running it on the 80legs infrastructure to crawl and index the web on their own terms. These pre-built apps are plugins that can be attached to a crawl to parse or analyze the target data. Some of them are free, although the usual crawl fees still apply. Conceptually, such a plugin receives each crawled page and returns the extracted data, as sketched below.
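The following is a conceptual sketch of what a crawl plugin does, written in Python for readability. The class and method names are assumptions for illustration; the real 80apps are written against 80legs' own plugin interface, not this one.

```python
# Conceptual sketch of a crawl plugin ("80app"): it is handed each crawled
# page and returns the extracted data. Names here are illustrative only.
import re


class KeywordCountPlugin:
    """Counts occurrences of search terms on every crawled page."""

    def __init__(self, search_terms):
        self.patterns = {term: re.compile(re.escape(term), re.IGNORECASE)
                         for term in search_terms}

    def process_page(self, url, html):
        # Called once per crawled page; the return value would become one
        # row of the "Pages Analyzed" output.
        counts = {term: len(pattern.findall(html))
                  for term, pattern in self.patterns.items()}
        return {"url": url, "total": sum(counts.values()), **counts}


# Example usage on a single page:
plugin = KeywordCountPlugin(["crawler", "scraping"])
print(plugin.process_page("http://example.com/",
                          "<html>A crawler for web scraping ...</html>"))
```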

Select a Chunk of Web, Crawl and Index it

If you know the data sources you want to grab, set up crawl rules to search, index and save the data (the last of these definitely requires a custom app). For instance, IndexTank has teamed up with 80legs so that a customer can select a chunk of the web, crawl it and index it.

Pre-Configured Live Crawls of Websites from 80legs

The service also offers pre-configured crawl packages on popular topics: business, real estate, retail and others. The update frequency is quite high, with crawl results posted every 1–3 hours, and the output format is usually XML; a minimal sketch of consuming such a feed follows.
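Here is a minimal sketch of consuming one of these feeds. The feed URL and the element names ("item", "url", "crawled_at") are assumptions for illustration; the real schema depends on the package you subscribe to.

```python
# Minimal sketch of polling a pre-configured crawl feed (hypothetical schema).
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://example.com/80legs/retail-crawl.xml"   # hypothetical URL

with urllib.request.urlopen(FEED_URL) as response:
    tree = ET.parse(response)

for item in tree.getroot().iter("item"):
    url = item.findtext("url")
    crawled_at = item.findtext("crawled_at")
    print(crawled_at, url)
```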

Crawl Speed Claimed to Catch up with Modern Search Engines

The owners claim that a single crawl job can process a million pages in about 15 minutes, which works out to roughly 1,100 pages per second, or under a millisecond of wall-clock time per page. The service offers this speed to every customer while respecting the bandwidth of the sites crawled, and the developers claim the crawl speed is comparable to the engines of Google, Yahoo and the other giants. In my test crawl of 1,000 pages, 0.01 CPU-hour (about 36 seconds, or roughly 36 ms of CPU time per page) was spent, which is broadly consistent with the claimed throughput given that the work is spread across many machines in parallel. Job execution priority is another factor to consider; the arithmetic is worked out below.
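The throughput arithmetic above, worked out explicitly. The figures come from the vendor's claim and from my own 1,000-page test job.

```python
# Throughput arithmetic for the claims and the 1,000-page test job above.

claimed_pages = 1_000_000
claimed_minutes = 15
pages_per_second = claimed_pages / (claimed_minutes * 60)
print(f"Claimed throughput: {pages_per_second:.0f} pages/second")
print(f"Wall-clock time per page: {1000 / pages_per_second:.2f} ms")

test_pages = 1_000
test_cpu_hours = 0.01
cpu_seconds = test_cpu_hours * 3600
print(f"My test: {cpu_seconds:.0f} s of CPU time for {test_pages} pages "
      f"= {cpu_seconds / test_pages * 1000:.0f} ms of CPU time per page")
```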

Pricing

Pricing ranges from free for the Basic plan (low crawl priority, up to 1,000 pages per job and other limitations) to $299 per month for the Premium plan. On top of the subscription you are charged usage fees: “$2.20 per million pages crawled & $0.03 per CPU-hour used & $0.00000008 per KB for results over 25 KB per page.” This is fairly affordable compared with other crawling web services; the sketch below works through an example bill at these rates.
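A rough cost estimate under the quoted usage fees. The job size, CPU usage and result volume are made-up example figures; the per-unit rates come straight from the pricing quote above.

```python
# Example bill under the quoted usage fees (example job figures are made up).

pages_crawled = 2_000_000
cpu_hours = 20
oversize_kb = 500_000          # total KB of results beyond 25 KB per page

page_fee = pages_crawled / 1_000_000 * 2.20
cpu_fee = cpu_hours * 0.03
data_fee = oversize_kb * 0.00000008

total = page_fee + cpu_fee + data_fee
print(f"Pages: ${page_fee:.2f}  CPU: ${cpu_fee:.2f}  Data: ${data_fee:.2f}")
print(f"Estimated usage fees: ${total:.2f} (plus any monthly plan fee)")
```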

Get Results

The results were limited to crawled URLs and some index data. Pulling the full content of every page would consume a lot of bandwidth on both sides, so if you need the actual page content, you have to write custom code as an 80app.

The service produced two zipped .csv files for me: “Pages Crawled” and “Pages Analyzed”. The first lists the crawled URLs along with process status, page size, parse time and other fields; the second lists the analyzed pages, showing whether the search terms appeared on each page and how often (unique count, total count, search string_1 count and so on). A sketch of reading these files follows.
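A sketch of reading the two zipped result files. The archive name and column names ("url", "total_count") are assumptions based on the fields described above, not an exact schema.

```python
# Sketch of reading a zipped result CSV (file and column names are assumed).
import csv
import io
import zipfile


def read_zipped_csv(path):
    """Yield each row of the first CSV inside a zip archive as a dict."""
    with zipfile.ZipFile(path) as archive:
        csv_name = archive.namelist()[0]
        with archive.open(csv_name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8")
            yield from csv.DictReader(text)


for row in read_zipped_csv("pages_analyzed.zip"):
    # Print pages where the search terms were found at least once.
    if int(row.get("total_count", 0)) > 0:
        print(row["url"], row["total_count"])
```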

API and More

  • Results Downloader API. For developers who want to incorporate the data into other services or applications, APIs are available for Java and for Python via the JPype library (see the sketch after this list).
  • A Results Downloader application is available as well.
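A hypothetical sketch of fetching a finished job's result archive over the Results Downloader API. The endpoint URL, authentication header and job id are assumptions for illustration; the actual API also ships Java bindings, which Python code can reach through JPype.

```python
# Hypothetical sketch of downloading a job's results (endpoint is assumed).
import urllib.request

API_TOKEN = "your-80legs-api-token"                      # assumption
JOB_ID = "12345"                                         # hypothetical job id
URL = f"https://api.80legs.com/results/{JOB_ID}"         # illustrative URL

request = urllib.request.Request(URL, headers={"Authorization": API_TOKEN})
with urllib.request.urlopen(request) as response, \
        open("pages_analyzed.zip", "wb") as out:
    out.write(response.read())
print("Saved results to pages_analyzed.zip")
```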

Conclusion

This web service works best for small- and mid-sized businesses that want to crawl data on a budget. The general crawl form may not adapt well to very specific crawl requirements, so users have to choose between the standard form-based crawls, the pre-built 80apps and developing a custom crawl application.

7 replies on “80legs Review – Crawler for rent in the sky”

80legs reduced one of our main webservers to a smoking crater with their botnet, something they’ve gained a reputation online for doing. (Google search “80legs ddos”)

They supposedly limit the number of requests they issue per second, per domain… which seems to completely ignore the concept of a single host containing multiple domains. Their botnet is slow to react to changes to robots.txt, so their botnet will attack for quite a while after the robots.txt changes – who, ever, thinks of blocking a badly-behaved webcrawler until the webcrawler becomes a problem? Do they seriously expect everyone to have a “User-Agent: * Disallow: /” by default?

It’s notable that they hide their domain contact information behind domainsbyproxy, just like a lot of other shady online businesses, *forcing* people to go through their contact form, which is NOT acceptable in an emergency… especially an emergency they created. And if you complain, expect them to blame you for not proactively blocking them via robots.txt.

i concur…

80legs just reduced my 3 load balanced backend servers to a smoking pile with thousands of requests from over 800 ip addresses.
What an appalling company.

Blocked at firewall (for those 800), blocked using mod_security , plus added to robots.txt

There went my sunday afternoon! scumbags.

I tried 80legs extensively last month. It is THE slowest web crawler I worked with till date. Even small jobs would queue indefinitely, and even the premium package (we paid $498) would take 24 hours on average to process 1000k urls!! Their pricing page claims seem exaggerated, there are some hidden parameters that they have deliberately not displayed on their pricing page, and they only use those excuses when you send them a query regarding the crawling speed not being up to the mark. Also their support is not too great, we have sent numerous emails that have gone unanswered and the ones that have been answered till date haven’t been straight answers (or any useful ones) but only suggestions to upgrade the package so that things improve (which was never the case) or to try out their data service Datfinity. So, don’t go by what they claim – esp their claims regarding crawling speeds. The pricing looks attractive, but only till you actually subscribe for the service, from that point onwards it is a total mess!

80legs.com are a bunch of thugs that figured out a way to DDOS your site legally. They brag openly that there’s no point in blocking them (http://www.80legs.com/webcrawler.html):

“Blocking our web crawler by IP address will not work. Due to the distributed nature of our infrastructure, we have thousands of constantly changing IP addresses. We strongly recommend you don’t try to block our web crawler by IP address, as you’ll most likely spend several hours of futile effort and be in a very bad mood at the end of it. You really should just include us in your robots.txt or contact us directly.”

We’ve had our site attacked by them on 9/27/2013 and my whole day (plus ecommerce revenue) went to hell. There has got to be a way to report these guys and have them blacklisted.
