Categories
Development Guest posting

Web Scraping with Java and HtmlUnit

Web scraping or crawling is the act of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want. It can be done manually, but generally this term refers to the automated process of downloading the HTML content of a page, parsing/extracting the data, and saving it into a database for further analysis or use.

Categories
Development

Finding duplicate rows in a SQL table and printing out their ids

Today I was trying to find duplicate rows in a database table, but besides finding them, I also needed to get their ids in order to address them.
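
One common way to get the ids of duplicated rows is to group by the columns that define a duplicate, keep only the groups with more than one row, and join that result back to the table. The sketch below is not the query from the post. It assumes a hypothetical `items` table in which two rows count as duplicates when both `name` and `price` match, and it uses Python's built-in sqlite3 module so it runs without a database server.

```python
import sqlite3

# Hypothetical table and sample data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items (id, name, price) VALUES (?, ?, ?)",
    [
        (1, "apple", 1.0),
        (2, "pear", 2.0),
        (3, "apple", 1.0),
        (4, "plum", 3.0),
        (5, "apple", 1.0),
        (6, "pear", 2.5),
    ],
)

# The inner query finds (name, price) pairs that occur more than once;
# the join then returns every row, with its id, that carries such a pair.
DUPLICATES_SQL = """
SELECT t.id, t.name, t.price
FROM items AS t
JOIN (
    SELECT name, price
    FROM items
    GROUP BY name, price
    HAVING COUNT(*) > 1
) AS d ON t.name = d.name AND t.price = d.price
ORDER BY t.name, t.price, t.id
"""


def find_duplicates(connection):
    """Map each duplicated (name, price) pair to the list of ids that share it."""
    groups = {}
    for row_id, name, price in connection.execute(DUPLICATES_SQL):
        groups.setdefault((name, price), []).append(row_id)
    return groups


duplicates = find_duplicates(conn)
for key, ids in duplicates.items():
    print(key, ids)
```

Note that the join compares columns with `=`, so rows whose compared columns contain NULL are not reported as duplicates of each other; that case would need separate handling.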

Categories
Development Guest posting

CaptchaSolutions test results for ReCaptcha v2.0

Recently we tested the CaptchaSolutions.com service. CaptchaSolutions.com provides an automated online captcha solver API service (the name speaks for itself). It also solves Google reCaptcha 2.0, so we decided to test it against this challenging captcha.

If you want to compare these results with other services' reCaptcha 2.0 test results, please refer to this post.
Categories
Challenge

Prevent automated services from solving captcha?

Question: Is there any way to include a captcha on a site and at the same time prevent services like 2captcha from solving it?

Categories
Miscellaneous

Scraping.pro load test

Recently I got a chance to perform a website load test. Since I run this blog, it's always useful to check its load capacity. So, I was offered a free opportunity for a load test by www.dotcom-monitor.com.

Categories
Guest posting

Death By Captcha now supporting recaptcha v2

The Death By Captcha developers have just released a beta of their shiny new NoCAPTCHA by token (reCaptcha v2) solving method!
They have been working on this for a while, and they promise the solution will soon be the solving reference for these challenges.

Categories
Miscellaneous

FunCaptcha solve algorithm needed

One of our readers is interested in whether there are any tools or algorithms to solve FunCaptcha.
If you have any ideas or you're willing to take on this project, please comment below.

Categories
Miscellaneous

Octoparse – a scraping tool designed for non-programmers

Octoparse is an easy and powerful visual web scraper enabling anyone, even those without much programming background, to collect and extract data from the web. Octoparse is designed in a way to help users easily deal with complex website structures, such as those with JavaScript; it can be compared to other web scraping tools such as Import.io and Mozenda.

Categories
Development

Proxy speed and performance test

I want to test a proxy [gateway] service. What would be the simplest script to check the proxy's IP speed and performance? See the following script.
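
The script from the original post is not reproduced in this excerpt. As a rough illustration of the idea only, here is a minimal Python sketch that times a few requests through a proxy and summarizes the results. The function names (`time_request`, `summarize`, `benchmark`) and their parameters are assumptions made for this example, not part of the original post.

```python
import http.client
import statistics
import time
import urllib.request


def time_request(url, proxy=None, timeout=10.0):
    """Return the seconds taken to fetch `url`, or None if the request failed.

    `proxy` is a URL such as "http://host:port"; with None the request is
    sent directly, ignoring any proxy configured in the environment.
    """
    proxies = {"http": proxy, "https": proxy} if proxy else {}
    opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies))
    start = time.perf_counter()
    try:
        with opener.open(url, timeout=timeout) as response:
            response.read()  # download the whole body so transfer time is included
    except (OSError, http.client.HTTPException):
        return None
    return time.perf_counter() - start


def summarize(timings):
    """Summarize a list of timings in seconds, where None marks a failed request."""
    ok = [t for t in timings if t is not None]
    return {
        "requests": len(timings),
        "failures": len(timings) - len(ok),
        "min": min(ok) if ok else None,
        "mean": statistics.mean(ok) if ok else None,
        "max": max(ok) if ok else None,
    }


def benchmark(url, proxy=None, attempts=5, timeout=10.0):
    """Fetch `url` several times and return the summarized timings."""
    return summarize([time_request(url, proxy, timeout) for _ in range(attempts)])
```

A call such as `benchmark("http://example.com/", proxy="http://127.0.0.1:8080")` (the proxy address here is a placeholder) returns the minimum, mean and maximum response time along with the failure count. Running the same call with `proxy=None` gives a direct-connection baseline to compare against.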

Categories
Development

Php Curl download file

We want to show how to download a file from a server with PHP cURL. The comments in the code explain each step.

<?php
// open the local file that will receive the download
$fp = fopen('image.png', 'w+') or die('Unable to open the file for writing');
// file to download
$ch = curl_init('http://scraping.pro/ewd64.png');
// skip SSL certificate verification (insecure; only for testing HTTPS URLs)
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// write the response body to the file handle
curl_setopt($ch, CURLOPT_FILE, $fp);
// follow HTTP redirects
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// set a large timeout (in seconds) to allow curl to run for a longer time
curl_setopt($ch, CURLOPT_TIMEOUT, 1000);
curl_setopt($ch, CURLOPT_USERAGENT, 'any');
// enable debug output (written to STDERR by default)
curl_setopt($ch, CURLOPT_VERBOSE, true);
// run the transfer and report a failure, if any
if (curl_exec($ch) === false) {
    echo 'Curl error: ' . curl_error($ch);
}
curl_close($ch);
fclose($fp);