Author: admin

Exception handling in PHP scrapers

Post author By admin
Post date March 2, 2013

Suppose we want a single exception handler function for all exceptions in a scraper program, and this handler should keep working in a multi-level program. Here is how it works in PHP.

Tags PHP

Data Science

Distributed File System Implementations and MapReduce strategy

Post author By admin
Post date March 1, 2013

In the previous post we mentioned MapReduce, a distributed computation style used for data analysis on computing clusters. Here we look more closely at how this strategy is implemented on distributed hardware.

Tags data mining, Google, MapReduce

Review

Inspyder Power Search Review

Post author By admin
Post date February 28, 2013

Inspyder Power Search is a crawling and scraping application aimed at straightforward scraping jobs, using both XPath and regular expressions. The program has a simple, clean interface that makes it easy to learn and use.

Inspyder is designed for multiple purposes:

Tags crawling, scraper

Uncategorized

Distil: Scrape Bot Protection Test

Post author By admin
Post date February 26, 2013

Testing the anti-scrape-bot service has been my focus for some time now. How well can the Distil service protect a real website from scraping? The only answer comes from an actual, active scrape. In the previous post we briefly reviewed the service’s features; here I will do a live test drive and share the log results and conclusions.

Tags anti-scrape, scrape detection, scrape protection, service

Review

Distil Review: Anti-Scrape-Bot Service

Post author By admin
Post date February 22, 2013

Are you thinking of protecting your website content from theft and illegal scraping? Do you suspect that some ‘innocent bots’ are continually visiting your web pages to retrieve data? This brings us to anti-scraping-bot software and services. In this post we briefly review a new anti-scrape-bot service called Distil.

Tags anti-scrape, scrape detection, scrape protection, service

Web Scraping Software

TEST DRIVE: Invalid HTML

Now we start a new Scraper Test Drive stage called ‘Invalid HTML’. How do scrapers behave with broken HTML code? On the whole they did well; the one problem nearly all of them shared was failing to recognize a link with unmatched quotes.

Tags Xpath

Uncategorized

Anti Web Scraping WordPress Plugins Review

Post author By admin
Post date February 19, 2013

So far we have considered web scraping for positive uses, but scraping is also used negatively, to steal other bloggers’ proprietary content. Let’s consider some anti-web-scraping WP plugins.

The main indicator of web content ownership is indexing, done mainly by Google. If content is scraped and immediately reposted, Google might be fooled into indexing the copy as the original, while the genuine source is treated as a content farm. Higher-ranking sites may be indexed earlier than the sites with the original content, and the latter might even be marked as spam. This is not necessarily a trend, but there have been precedents. It may seem ridiculous, but offenders can watch a published feed to detect new content and quickly scrape it for reposting.

Tags anti-scrape, plugin

Web Scraping Software

Web Scraper Shortcode WordPress Plugin Review

Post author By admin
Post date February 19, 2013
2 Comments on Web Scraper Shortcode WordPress Plugin Review

This short post covers the WP plugin called Web Scraper Shortcode, which lets you retrieve a portion of a web page, or a whole page, and insert it directly into a post. The plugin can be used to pull fresh data or images from web pages into your WordPress-driven page without even visiting them. You can find more scraping plugins and software here.

Data Science

Implementing frequent itemsets algorithm thru MapReduce

Post author By admin
Post date February 18, 2013

The problem of finding frequent itemsets in data analysis is described in this post; here I state the practical steps for finding frequent itemsets through MapReduce.
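
As a rough illustration of the idea, here is a minimal single-machine sketch in Python of counting frequent itemsets in map/reduce style. It is not the procedure from the post itself: the function names (`map_basket`, `shuffle`, `reduce_counts`, `frequent_itemsets`), the sample baskets and the `min_support` threshold are all assumptions made for this example, and a real cluster framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict
from itertools import combinations


def map_basket(basket, k):
    """Map step (hypothetical helper): emit (itemset, 1) for each k-item subset of one basket."""
    items = sorted(set(basket))  # sort so the same itemset always yields the same key
    for itemset in combinations(items, k):
        yield itemset, 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the partial counts for one itemset."""
    return key, sum(values)


def frequent_itemsets(baskets, k, min_support):
    """Return the k-itemsets that occur in at least min_support baskets."""
    pairs = (pair for basket in baskets for pair in map_basket(basket, k))
    reduced = (reduce_counts(key, values) for key, values in shuffle(pairs).items())
    return {itemset: count for itemset, count in reduced if count >= min_support}


# Made-up sample data, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "beer", "milk"],
    ["beer", "milk"],
    ["bread", "milk", "eggs"],
]
result = frequent_itemsets(baskets, k=2, min_support=2)
print(result)
```

With these sample baskets, only the pairs (bread, milk) and (beer, milk) reach the support threshold of 2. Emitting every k-subset of every basket is the naive approach; it is fine for a sketch but grows quickly with basket size.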

Tags data mining