Categories
Miscellaneous Web Scraping Software

Hotel: scrape prices, Q&A

 

Question

I want to extract the hotel name and the current room price of some hotels daily from https://www.expedia.ca/Hotel-Search?#&destination=Quebec,%20Quebec,%20Canada&startDate=06/11/2016&endDate=07/11/2016&regionId=&adults=2

I am a small hotel owner and need this info quite often, and I hope I can get it automatically with code in some way. You are an expert in this field; what is the easiest way to get this information? Can you give me some example code?
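A possible starting point (my sketch, not from the original post): the Expedia results page is rendered with JavaScript, so the code below drives a real browser via Selenium in Python rather than making a plain HTTP request. The CSS selectors for the hotel cards, names and prices are hypothetical placeholders that would need to be adapted to the live page markup.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

URL = ("https://www.expedia.ca/Hotel-Search?#&destination=Quebec,%20Quebec,%20Canada"
       "&startDate=06/11/2016&endDate=07/11/2016&regionId=&adults=2")

driver = webdriver.Chrome()
driver.get(URL)
time.sleep(10)  # crude wait for the JS-rendered results; an explicit WebDriverWait is better

# ".hotel-card", "h3" and ".price" are hypothetical selectors, not Expedia's actual markup
for card in driver.find_elements(By.CSS_SELECTOR, ".hotel-card"):
    name = card.find_element(By.CSS_SELECTOR, "h3").text
    price = card.find_element(By.CSS_SELECTOR, ".price").text
    print(name, price)

driver.quit()

Scheduling such a script with cron or Windows Task Scheduler would cover the daily runs.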

Categories
Uncategorized

How to detect your site is being scraped?

In the age of the modern web there are a lot of data hunters: people who want to take the data that is on your website and re-use it. The reasons someone might want to scrape your site vary widely, but regardless, it is important for website owners to know if it is happening. You need to be able to identify any illegal bots and take the necessary action to make sure they aren’t bringing down your site.
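As one minimal illustration (my sketch, not part of the original post), a simple detection signal is request volume per IP address in the web server access log; the log path and threshold below are placeholders.

from collections import Counter

THRESHOLD = 1000  # requests per log window that look suspicious for a single visitor

hits = Counter()
with open("access.log") as log:      # hypothetical log location, common log format assumed
    for line in log:
        hits[line.split()[0]] += 1   # the client IP is the first field of each log line

for ip, count in hits.most_common(10):
    if count > THRESHOLD:
        print(f"possible scraper: {ip} made {count} requests")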

Categories
Development

Selenium using proxy gateway, how?

I am developing a web scraping project using Selenium. Since I need rotating proxies [in mass quantities] for the project, I’ve turned to proxy gateways (nohodo.com, charityengine.com and some others). The problem is how to incorporate those proxy gateways into Selenium for surfing the web.
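One way to do it, sketched below under the assumption that the gateway provider exposes a single unauthenticated HTTP endpoint that rotates exit IPs behind it (the host and port are placeholders), is to pass the gateway to Chrome through the --proxy-server argument:

from selenium import webdriver

PROXY = "gateway.example.com:8080"   # placeholder for the endpoint your gateway provider gives you

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server=http://{PROXY}")

driver = webdriver.Chrome(options=options)
driver.get("https://httpbin.org/ip")  # should report the gateway's exit IP, not yours
print(driver.page_source)
driver.quit()

If the gateway requires username/password authentication, the --proxy-server flag alone is not enough; a browser extension or a local forwarding proxy is typically used in that case.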

Categories
Web Scraping Software

Dexi.io – how to improve performance

Intro

Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that Dexi extractor robots behave like a full-blown modern browser and fetch all the resources that crawled pages load (CSS, JS, fonts, etc.).
In terms of performance, an extractor robot might not be as fast as a pure HTTP scraping script, but its advantage is the ability to extract data from dynamic websites that require running JavaScript code to generate user-facing content. It will also be harder for anti-bot mechanisms to detect and block it.
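For contrast, a pure-HTTP fetch (a sketch of my own, not from the post) downloads only the HTML document itself, with no CSS, JS or fonts, which is why it is faster but blind to JavaScript-generated content:

import requests

html = requests.get("https://example.com", timeout=30).text   # raw HTML only, nothing executed
print(len(html), "bytes fetched; content rendered by JavaScript will be absent")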

Categories
Development

Python – parameterized insert into a database to prevent SQL injection (example)

test.py

import MySQLdb, db_config

class Test:
    def connect(self):
        # connect to MySQL using the credentials defined in db_config.py
        self.conn = MySQLdb.connect(host=db_config.db_credentials["mysql"]["host"],
                                    user=db_config.db_credentials["mysql"]["user"],
                                    passwd=db_config.db_credentials["mysql"]["pass"],
                                    db=db_config.db_credentials["mysql"]["name"])
        self.conn.autocommit(True)
        return self.conn

    def insert_parametrized(self, test_value="L'Île-Perrot"):
        # the %s placeholders let the driver escape the values, so quotes in
        # test_value cannot break out of the statement
        cur = self.connect().cursor()
        cur.execute("INSERT INTO a_table (name, city) VALUES (%s, %s)", ('temp', test_value))

# run it: the injection attempt below is stored as a harmless literal string
t = Test().insert_parametrized("test city'; DROP TABLE a_table;")

db_config.py (place it in the same directory as the test.py file)

db_credentials = {
    "mysql": {
        "name": "db_name",
        "host": "db_host", # eg. '127.0.0.1'
        "user": "xxxx",
        "pass": "xxxxxxxx",
    }
}
Categories
Development

What are the ways of inserting web scraping results into an SQL server?

  1. Use a webhook service to deliver your target data and store it in the DB (a minimal sketch follows below).
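A minimal sketch of that webhook approach (not from the original post): a Flask endpoint receives scraped records as JSON and writes them into SQL Server through pyodbc. The connection string, table and column names are placeholders.

import pyodbc
from flask import Flask, request, jsonify

app = Flask(__name__)
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=localhost;DATABASE=scrape;UID=user;PWD=secret")   # placeholder credentials

@app.route("/webhook", methods=["POST"])
def store_results():
    record = request.get_json(force=True)     # e.g. {"name": "...", "price": "..."}
    conn = pyodbc.connect(CONN_STR)
    cur = conn.cursor()
    cur.execute("INSERT INTO scraped_items (name, price) VALUES (?, ?)",  # parameterized insert
                record["name"], record["price"])
    conn.commit()
    conn.close()
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=5000)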
Categories
Legal Miscellaneous

Is this a legal method of acquiring insurance leads?

Recently I received a question on insurance leads:

Is this a legal method of acquiring insurance leads [from the web]? Are there any agent testimonials on the efficiency of this type of service?

Legality issue in web scraping

With the matter of legality in web scraping, there should be a clear approach: it depends on the website and its privacy policy. There are at least two cases:

  1. Public info (prices, inventory info, public offers), i.e. everything that is not protected by copyright and is available for scraping.
  2. Copyright-protected info – here the website’s Terms of Use or Terms of Service restrictions make copying, and therefore web scraping, illegal.
The US Court of Appeals has affirmed that it is lawful for a certain data analytics company to scrape a data aggregator’s (LinkedIn’s) public profile info.

So far I have no insurance agent testimonials on the efficiency of any insurance lead scraping service. The websites I searched [on the insurance leads] have given me the impression that the customer info they gather is highly secured (not viewable). I doubt that any sites are going to expose insurance leads. On most of them the leads are available only through paid subscription plans.

If there are any such websites, like insurance lead directories (public insurance quotes), we might develop a scraper that consistently grabs fresh or new info for further analysis. It saves the agent’s time on re-searching, re-visiting and so on. One scraper might work through multiple directory pages.

A US district court has concluded that moderate scraping, even when it goes against the ToS, is legal.

You might find it interesting to read about web page change tracking if you only need to see updates (no data storing applied).

Categories
Miscellaneous

Death By Captcha Updated API clients

Death By Captcha is a reputable CAPTCHA solving service with more than 7 years in the CAPTCHA solving business. They have recently updated all their API clients, so users can experience maximum efficiency and faster solving times.

They enthusiastically recommend that users and software developers visit the API page and update their DBC API implementation in order to get the most out of it (the API and docs are available to registered users only). Free credits are provided for users to test or implement the new client API!
If you tell them you saw this info through the scraping.pro blog, they’ll give you an additional credit of 1K free CAPTCHAs!
For further info, you may contact them directly.

Categories
Development

Charles CA certificate with OpenSSL in Windows

Today I needed to enable the Charles proxy on my Windows PC. Later I also managed to get a Genymotion virtual device monitored by the Charles proxy.

Categories
Guest posting Web Scraping Software

UiPath PDF Data Extraction

UiPath, one of the big providers of robotic process automation software, has taken some very interesting positioning. Unlike the other players on the market, they provide a free and fully featured Community Edition of their product for anybody to test and develop with. The tool automates any application and is packed with web scraping and screen scraping capabilities for both desktop and web. The platform also has a lively community forum featuring jobs, automation contests and knowledge sharing between UiPath users: www.forum.uipath.com.