Categories
Miscellaneous

FunCaptcha solve algorithm needed

One of our readers is interested in whether there are any tools or algorithms to solve FunCaptcha.
If you have any ideas or you’re willing to take on this project, please comment below.

 

Categories
Miscellaneous

Octoparse – a scraping tool designed for non-programmers

Octoparse is an easy-to-use and powerful visual web scraper that enables anyone, even those without much programming background, to collect and extract data from the web. Octoparse is designed to help users handle complex website structures, such as pages that rely on JavaScript; it is comparable to other web scraping tools such as Import.io and Mozenda.

Categories
Development

Proxy speed and performance test

I want to test a proxy [gateway] service. What would be the simplest script to check the proxy’s speed and performance? See the following script.
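A minimal sketch of such a check, using only the Python standard library. The function names `fetch_via_proxy` and `benchmark` are made up for this illustration, and the proxy address and target URL you pass in are your own; nothing here is specific to any particular gateway provider. It times several requests through the proxy and reports how many succeeded and how long they took.

```python
import statistics
import time
import urllib.request


def fetch_via_proxy(url, proxy, timeout=10):
    """Fetch url through the given HTTP proxy and return the body size in bytes."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return len(response.read())


def benchmark(fetch, attempts=5):
    """Call fetch() several times; return the success count and latency statistics."""
    timings = []
    for _ in range(attempts):
        started = time.perf_counter()
        try:
            fetch()
        except OSError:
            # URLError, timeouts and connection errors are all OSError subclasses;
            # a failed attempt is counted as unsuccessful and not timed.
            continue
        timings.append(time.perf_counter() - started)
    return {
        "attempts": attempts,
        "succeeded": len(timings),
        "mean_seconds": statistics.mean(timings) if timings else None,
        "max_seconds": max(timings) if timings else None,
    }
```

To use it, call something like `benchmark(lambda: fetch_via_proxy("http://example.com/", "http://user:password@gateway.example.com:8080"))`, where the gateway address is a placeholder. Running the same benchmark without a proxy gives a baseline to compare against.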

Categories
Development

PHP cURL download file

This post shows how to download a file from a server with PHP cURL. The comments in the code explain each step.

// open the local file for writing
$fp = fopen('image.png', 'w+') or die('Unable to open the file for writing');
// file to download
$ch = curl_init('http://scraping.pro/ewd64.png');
// skip SSL certificate verification (insecure; use only for hosts with untrusted certificates)
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// write the response body to the file handle
curl_setopt($ch, CURLOPT_FILE, $fp);
// follow HTTP redirects
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// set a large timeout (in seconds) to allow curl to run for a longer time
curl_setopt($ch, CURLOPT_TIMEOUT, 1000);
curl_setopt($ch, CURLOPT_USERAGENT, 'any');
// enable debug output
curl_setopt($ch, CURLOPT_VERBOSE, true);
// run the transfer and report a failure, if any
if (curl_exec($ch) === false) {
    echo 'Download failed: ' . curl_error($ch);
}
curl_close($ch);
fclose($fp);

 

Categories
Miscellaneous, Web Scraping Software

Hotel: scrape prices, Q&A

 

Question

I want to extract the hotel name and the current room price of some hotels daily from https://www.expedia.ca/Hotel-Search?#&destination=Quebec,%20Quebec,%20Canada&startDate=06/11/2016&endDate=07/11/2016&regionId=&adults=2

I am a small hotel owner and need this information quite often, so I hope to collect it automatically with code in some way. You are an expert in this field; what is the easiest way to get this information? Can you give me some example code?
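As a rough illustration of the extraction step only, here is a sketch that pulls hotel names and prices out of an HTML fragment with Python's standard `html.parser`. The class names `hotel-name` and `hotel-price` are hypothetical; the real markup of the search page is not known here, and if the results are rendered by JavaScript the HTML would first have to be obtained with a browser-based tool.

```python
from html.parser import HTMLParser


class HotelPriceParser(HTMLParser):
    """Collect (name, price) pairs from markup with hypothetical class names."""

    def __init__(self):
        super().__init__()
        self.hotels = []    # finished (name, price) tuples
        self._field = None  # "name" or "price" while inside such an element
        self._name = None   # last hotel name seen, waiting for its price

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "hotel-name" in classes:
            self._field = "name"
        elif "hotel-price" in classes:
            self._field = "price"

    def handle_data(self, data):
        text = data.strip()
        if not text or self._field is None:
            return
        if self._field == "name":
            self._name = text
        elif self._name is not None:
            # A price is paired with the most recent hotel name.
            self.hotels.append((self._name, text))
            self._name = None
        self._field = None


def extract_hotels(html):
    """Return a list of (hotel name, price text) tuples found in html."""
    parser = HotelPriceParser()
    parser.feed(html)
    parser.close()
    return parser.hotels
```

Running `extract_hotels` once a day on the saved page source and appending the result to a spreadsheet or database would give the daily price history the question asks about.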

Categories
Uncategorized

How to detect your site is being scraped?

In the age of the modern web there are a lot of data hunters: people who want to take the data that is on your website and re-use it. The reasons someone might want to scrape your site are incredibly varied, but regardless, it is important for website owners to know if it is happening. You need to be able to identify any illegal bots and take the necessary action to make sure they aren’t bringing down your site.

Categories
Development

Selenium using proxy gateway, how?

I am developing a web scraping project using Selenium. Since the project needs rotating proxies [in mass quantities], I’ve turned to proxy gateways (nohodo.com, charityengine.com and some others). The problem is how to incorporate those proxy gateways into Selenium for browsing the web.
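One common approach, sketched here for Chrome under the assumption that the gateway exposes a single host and port, is to pass Chrome's `--proxy-server` switch through Selenium's options. The helper names `proxy_argument` and `make_driver` and the gateway address in the usage note are placeholders for illustration, not values from any of the providers mentioned.

```python
def proxy_argument(host, port, scheme="http"):
    """Build the Chrome command-line switch that routes traffic through a proxy."""
    return "--proxy-server={}://{}:{}".format(scheme, host, port)


def make_driver(host, port):
    """Start Chrome through Selenium with its traffic sent to the gateway."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument(host, port))
    return webdriver.Chrome(options=options)
```

A call such as `make_driver("gateway.example.com", 8080)` would then return a driver whose requests go through the gateway, which does the rotation on its side. As far as I know, this switch does not accept a username and password, so a gateway that requires credentials usually needs IP whitelisting or a different mechanism.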

Categories
Web Scraping Software

Dexi.io – how to improve performance

Intro

Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that Dexi extractor robots behave like a full-blown modern browser and fetch all the resources that crawled pages load (CSS, JS, fonts, etc.).
In terms of performance, an extractor robot might not be as fast as a pure HTTP scraping script, but its advantage is the ability to extract data from dynamic websites that require running JavaScript code in order to generate user-facing content. It will also be harder for anti-bot mechanisms to detect and block it.

Categories
Development

Python – parameterized storing into db to prevent SQL injection example

test.py

import MySQLdb
import db_config as config  # credentials live in db_config.py

class Test:
    def connect(self):
        self.conn = MySQLdb.connect(host=config.db_credentials["mysql"]["host"],
                                    user=config.db_credentials["mysql"]["user"],
                                    passwd=config.db_credentials["mysql"]["pass"],
                                    db=config.db_credentials["mysql"]["name"])
        self.conn.autocommit(True)
        return self.conn

    def insert_parametrized(self, test_value="L'le-Perrot"):
        cur = self.connect().cursor()
        # the values are passed separately, so the driver escapes them;
        # they are never spliced into the SQL string by hand
        cur.execute("INSERT INTO a_table (name, city) VALUES (%s,%s)", ('temp', test_value))
        cur.close()

# run it: the injection attempt is stored as a plain string
Test().insert_parametrized("test city'; DROP TABLE a_table;")

db_config.py (place it in the same directory as the test.py file)

db_credentials = {
    "mysql": {
        "name": "db_name",
        "host": "db_host", # eg. '127.0.0.1'
        "user": "xxxx",
        "pass": "xxxxxxxx",
    }
}

Categories
Development

What are the ways of inserting web scraping results into an SQL server?

  1. Use a webhook service to request your target data and store it in the DB.
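
The storing half of the webhook approach might look like the sketch below. It assumes the webhook delivers a JSON body of the form {"items": [{"name": ..., "price": ...}]}, which is a made-up shape, and it uses Python's built-in sqlite3 as a stand-in for the real SQL server; the table and function names are likewise hypothetical.

```python
import json
import sqlite3


def init_db(conn):
    """Create the (hypothetical) destination table if it does not exist yet."""
    conn.execute("CREATE TABLE IF NOT EXISTS scraped_items (name TEXT, price TEXT)")


def store_webhook_payload(conn, body):
    """Parse a JSON webhook body and insert its rows; return the number stored."""
    payload = json.loads(body)
    rows = [(item["name"], item["price"]) for item in payload["items"]]
    # Placeholders keep the scraped text out of the SQL string itself.
    conn.executemany("INSERT INTO scraped_items (name, price) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)
```

In a real setup this function would be called from whatever HTTP endpoint receives the webhook, with the connection pointing at the production database and the placeholder style adapted to its driver.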