One of our readers is interested in whether there are any tools or algorithms to solve FunCaptcha.
If you have any ideas, or you're willing to take on this project, please comment below.
Octoparse is an easy-to-use and powerful visual web scraper that enables anyone, even those without much programming background, to collect and extract data from the web. Octoparse is designed to help users deal with complex website structures, such as those driven by JavaScript, and can be compared to other web scraping tools such as Import.io and Mozenda.
I want to test a proxy [gateway] service. What would be the simplest script to check the proxy's IP, speed, and performance? See the following script.
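A minimal sketch of such a check, using only the standard library: it requests an IP-echo endpoint through the proxy and times the round trip. The gateway address (`user:password@gateway.example.com:8080`) and the echo endpoint (httpbin.org/ip) are placeholder assumptions; substitute the real host, port, and credentials of the gateway under test.

```python
import json
import time
import urllib.request

# Hypothetical gateway endpoint; replace with the real host/port/credentials.
PROXY_URL = "http://user:password@gateway.example.com:8080"

def check_proxy(proxy_url, test_url="https://httpbin.org/ip", timeout=30):
    """Fetch an IP-echo endpoint through the proxy and time the request."""
    handler = urllib.request.ProxyHandler({"http": proxy_url,
                                           "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    start = time.time()
    with opener.open(test_url, timeout=timeout) as resp:
        body = json.load(resp)
    # httpbin echoes the caller's IP in the "origin" field
    return body.get("origin"), time.time() - start

if __name__ == "__main__":
    ip, seconds = check_proxy(PROXY_URL)
    print("Exit IP: %s, response time: %.2f s" % (ip, seconds))
```

Running it a few times in a loop gives a rough picture of the proxy's latency and whether the exit IP actually rotates.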
We want to show how one can make cURL download a file from a server. See the comments in the code for explanations.
// open a file descriptor for writing
$fp = fopen('image.png', 'w+') or die('Unable to write a file');
// file to download
$ch = curl_init('http://scraping.pro/ewd64.png');
// skip SSL certificate verification if needed
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// write the response to the file descriptor
curl_setopt($ch, CURLOPT_FILE, $fp);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
// set a large timeout to allow curl to run for a longer time
curl_setopt($ch, CURLOPT_TIMEOUT, 1000);
curl_setopt($ch, CURLOPT_USERAGENT, 'any');
// enable debug output
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_exec($ch);
curl_close($ch);
fclose($fp);
I want to extract the hotel name and the current room price of some hotels daily from https://www.expedia.ca/Hotel-
I am a small hotel owner and need this information quite often, and I hope I can collect it automatically with code. As an expert in this field, what is the easiest way to get it? Can you give me some example code?
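A minimal sketch of the scraping side, assuming the hotel page exposes the name and price in static HTML. The CSS selectors (`h1.hotel-name`, `span.price`) are hypothetical; inspect the real page in your browser's developer tools and adjust them. Note that Expedia renders much of its content with JavaScript, so for live pages a browser automation tool such as Selenium may be needed instead of a plain HTTP request.

```python
import requests
from bs4 import BeautifulSoup

def parse_hotel(html):
    """Pull the hotel name and current price out of a page's HTML.
    The selectors below are placeholders for the page's real markup."""
    soup = BeautifulSoup(html, "html.parser")
    name = soup.select_one("h1.hotel-name")
    price = soup.select_one("span.price")
    return (name.get_text(strip=True) if name else None,
            price.get_text(strip=True) if price else None)

def fetch_hotel(url):
    """Download the page and parse it; a browser-like User-Agent helps."""
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"},
                        timeout=30)
    resp.raise_for_status()
    return parse_hotel(resp.text)
```

To get daily figures, schedule the script with cron (or Windows Task Scheduler) and append each day's result to a CSV file.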
In the age of the modern web there are a lot of data hunters: people who want to take the data that is on your website and re-use it. The reasons someone might want to scrape your site vary widely, but regardless, it is important for website owners to know when it is happening. You need to be able to identify unwanted bots and take the necessary action to make sure they aren't bringing down your site.
I am developing a web scraping project using Selenium. Since I need rotating proxies [in mass quantities] for the project, I've turned to proxy gateways (nohodo.com, charityengine.com and some others). The problem is how to incorporate those proxy gateways into Selenium for browsing the web.
Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that Dexi extractor robots behave like a full-blown modern browser and fetch all the resources that crawled pages load (CSS, JS, fonts, etc.).
In terms of performance, an extractor robot might not be as fast as a pure HTTP scraping script, but its advantage is the ability to extract data from dynamic websites that require running JavaScript code in order to render user-facing content. It will also be harder for anti-bot mechanisms to detect and block it.
test.py
import MySQLdb

import db_config

class Test:
    def connect(self):
        self.conn = MySQLdb.connect(
            host=db_config.db_credentials["mysql"]["host"],
            user=db_config.db_credentials["mysql"]["user"],
            passwd=db_config.db_credentials["mysql"]["pass"],
            db=db_config.db_credentials["mysql"]["name"])
        self.conn.autocommit(True)
        return self.conn

    def insert_parametrized(self, test_value="L'Île-Perrot"):
        cur = self.connect().cursor()
        # parametrized query: the driver escapes the values, so the
        # injection attempt below is stored as harmless plain text
        cur.execute("INSERT INTO a_table (name, city) VALUES (%s, %s)",
                    ('temp', test_value))

# run it
t = Test().insert_parametrized("test city'; DROP TABLE a_table;")
db_config.py (place it in the same directory as the test.py file)
db_credentials = {
    "mysql": {
        "name": "db_name",
        "host": "db_host",  # e.g. '127.0.0.1'
        "user": "xxxx",
        "pass": "xxxxxxxx",
    }
}