Categories
Development

Crawling web pages with Netpeak Spider in conjunction with MarsProxies, NetNut and IPRoyal proxies

It is hard to overestimate the importance of information: “Master of information, master of situation”. Today we have everything we need to become a “master of situation”, including tools such as spiders and parsers that can scrape various data from websites. In this post we look at scraping Amazon with a web spider equipped with proxy services.

Categories
Development

Simple JAVA scraper that handles user-agent, headers and cookies

How do you handle cookies, the user-agent and headers when scraping with Java? For this we use a static class, ScrapeHelper, that handles all of them. The class uses Jsoup library methods to fetch data from the server and parse the HTML into a DOM document.
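
To make the idea concrete, here is a minimal sketch of attaching a user-agent, extra headers and cookies to a request. It is not the ScrapeHelper class from the post, and it deliberately swaps Jsoup for the JDK's own java.net.http API (Java 11+) so that it runs without external dependencies. The class name RequestSketch, the user-agent string, the header values and the cookie names are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not the ScrapeHelper class from the post.
public class RequestSketch {
    // Assumed user-agent string; replace with whatever your scraper should present.
    static final String USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/1.0";

    /** Joins cookies into a single Cookie header value, e.g. "a=1; b=2". */
    static String cookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    /** Builds a GET request carrying the user-agent, extra headers and cookies. */
    static HttpRequest build(String url, Map<String, String> headers, Map<String, String> cookies) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url))
                .GET()
                .header("User-Agent", USER_AGENT);
        headers.forEach(builder::header);
        if (!cookies.isEmpty()) {
            builder.header("Cookie", cookieHeader(cookies));
        }
        return builder.build();
    }

    public static void main(String[] args) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("Accept-Language", "en-US,en;q=0.9");

        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("session", "abc123"); // made-up cookie values
        cookies.put("lang", "en");

        // The request is only built and inspected here, not sent.
        HttpRequest request = build("https://example.com/", headers, cookies);
        System.out.println(request.headers().map());
    }
}
```

With Jsoup the same three pieces of state would be set on its connection object before the page is fetched and parsed; the sketch above only covers building the request.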

Categories
Development

Backconnect Proxy Service with authorization in JAVA

While working with a backconnect proxy service (Oxylab.io), we spent a long time looking for a way to authorize with it. Originally we used JSoup to get the web pages’ content. Its proxy() method can be used when setting up the connection, yet it only accepts the host and port; no authentication is possible. One of the options that we found was the following:
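
The post's own code is not reproduced in this excerpt. As a hedged sketch of one standard JDK mechanism (which may or may not be the option the authors settled on), proxy credentials can be supplied through a JVM-wide java.net.Authenticator, which HttpURLConnection-based clients consult when a proxy asks for authentication. The class name ProxyAuthSketch, the host, the port and the credentials below are placeholders, not values from the post.

```java
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;

// Hypothetical helper for illustration; host, port and credentials are placeholders.
public class ProxyAuthSketch {

    /** Registers a JVM-wide authenticator that answers proxy challenges only. */
    static void installProxyCredentials(String user, String password) {
        // Recent JDKs disable Basic auth for HTTPS tunneling through proxies by default.
        // Clearing this property re-enables it; set it early, before any HTTP connection is opened.
        System.setProperty("jdk.http.auth.tunneling.disabledSchemes", "");

        Authenticator.setDefault(new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                // Hand out the credentials to proxies, never to origin servers.
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, password.toCharArray());
                }
                return null;
            }
        });
    }

    /** Describes the HTTP proxy endpoint without resolving the host name yet. */
    static Proxy proxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    public static void main(String[] args) {
        installProxyCredentials("proxyUser", "proxyPassword");
        Proxy proxy = proxy("proxy.example.com", 8080);
        // The Proxy object (or just its host and port) is then handed to the HTTP client.
        System.out.println(proxy.type());
    }
}
```

Because the authenticator is global to the JVM, this approach suits a scraper that uses a single set of proxy credentials; per-connection credentials would need a different design.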

 

Categories
Development

JAVA, Selenium, headless Chrome, JSoup to scrape data of the web

In this post we share how to scrape a JS-rendered website. The tools, as seen in the title, are Java with the Selenium library driving headless Chrome instances (download the driver) and JSoup as the parser to extract data from the acquired HTML.

Categories
Development

JAVA library to scrape Linkedin & its data affiliates

In this post we want to share with you a new useful JAVA library that helps to crawl and scrape Linkedin companies. Get business directories scraped!

If you are concerned about the legal issues of scraping Linkedin data, please refer to the following post: Linkedin lost in court to data analytic company that scrapes Linkedin’s public profiles info
Categories
Uncategorized

Handy Web Extractor

Handy Web Extractor is a simple tool for everyday web content monitoring. It periodically downloads a web page, extracts the necessary content and displays it in a window on your desktop. One may consider it data extraction software that occupies its own niche among scraping software and plugins.

It’s totally free and available for download.

Categories
Challenge Development

Scrape with Google App Script

In this post I want to tell you how I’ve managed to complete the challenge of scraping a site with Google Apps Script (GAS).

Categories
Review

Test ReCaptcha 2.0 solving services

We’ve tested several captcha solving services. The test results are based on 1000 ReCaptcha 2.0 challenges submitted to each service.

Service            Avg. solving time, s   Fastest solving time, s   Performance, %   Notes
DeathByCaptcha     41                     16                        96.8             Dec. 2019
2Captcha           63                     15                        95.2             Dec. 2019
CaptchaSolutions   111                    37                        78               Oct. 2017

Useful testing codes

2Captcha Test Code (JAVA)

CaptchaSolutions Test Code (Python)

Categories
Uncategorized

Smartproxy Review

Getting precise and localized data is becoming difficult. For some companies, advanced proxy networks are the only thing that keeps intensive data gathering operations running.

Residential proxies are in extremely high demand, and only a few networks can offer millions of IP addresses around the world.

Smartproxy is one of those networks, rapidly growing to offer the best product in residential and data center proxies.

Categories
Legal

US court stated scraping, even when against TOS, is legal

Last month a legal case took place in a US court in which four professors and a media organization sued the US Government. The District Court for the District of Columbia concluded that moderate scraping, even when against ToS, is legal.