Categories
Miscellaneous Web Scraping Software

Hotel: scrape prices, Q&A

 

Question

I want to extract the hotel name and the current room price of some hotels daily from https://www.expedia.ca/Hotel-Search?#&destination=Quebec,%20Quebec,%20Canada&startDate=06/11/2016&endDate=07/11/2016&regionId=&adults=2

I am a small hotel owner and need this info quite often, and I hope I can get it automatically with code in some way. You are an expert in this field; what is the easiest way to get this information? Can you give me some example code?
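A possible starting point (my sketch, not from the original post): the Expedia results page is rendered with JavaScript, so the code below drives a real browser via Selenium in Python rather than making a plain HTTP request. The CSS selectors for the hotel cards, names and prices are hypothetical placeholders that would need to be adapted to the live page markup.

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

URL = ("https://www.expedia.ca/Hotel-Search?#&destination=Quebec,%20Quebec,%20Canada"
       "&startDate=06/11/2016&endDate=07/11/2016&regionId=&adults=2")

driver = webdriver.Chrome()
driver.get(URL)
time.sleep(10)  # crude wait for the JS-rendered results; an explicit WebDriverWait is better

# ".hotel-card", "h3" and ".price" are hypothetical selectors, not Expedia's actual markup
for card in driver.find_elements(By.CSS_SELECTOR, ".hotel-card"):
    name = card.find_element(By.CSS_SELECTOR, "h3").text
    price = card.find_element(By.CSS_SELECTOR, ".price").text
    print(name, price)

driver.quit()

Scheduling such a script with cron or Windows Task Scheduler would cover the daily runs.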

Categories
Uncategorized

How to detect your site is being scraped?

In the age of the modern web there are a lot of data hunters: people who want to take the data that is on your website and re-use it. The reasons someone might want to scrape your site vary widely, but regardless, it is important for website owners to know if it is happening. You need to be able to identify any illegal bots and take the necessary action to make sure they aren’t bringing down your site.
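As one minimal illustration (my sketch, not part of the original post), a simple detection signal is request volume per IP address in the web server access log; the log path and threshold below are placeholders.

from collections import Counter

THRESHOLD = 1000  # requests per log window that look suspicious for a single visitor

hits = Counter()
with open("access.log") as log:      # hypothetical log location, common log format assumed
    for line in log:
        hits[line.split()[0]] += 1   # the client IP is the first field of each log line

for ip, count in hits.most_common(10):
    if count > THRESHOLD:
        print(f"possible scraper: {ip} made {count} requests")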

Categories
Development

Selenium using proxy gateway, how?

I am developing a web scraping project using Selenium. Since I need rotating proxies [in mass quantities] for the project, I’ve turned to proxy gateways (nohodo.com, charityengine.com and some others). The problem is how to incorporate those proxy gateways into Selenium for surfing the web.
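One way to do it, sketched below under the assumption that the gateway provider exposes a single unauthenticated HTTP endpoint that rotates exit IPs behind it (the host and port are placeholders), is to pass the gateway to Chrome through the --proxy-server argument:

from selenium import webdriver

PROXY = "gateway.example.com:8080"   # placeholder for the endpoint your gateway provider gives you

options = webdriver.ChromeOptions()
options.add_argument(f"--proxy-server=http://{PROXY}")

driver = webdriver.Chrome(options=options)
driver.get("https://httpbin.org/ip")  # should report the gateway's exit IP, not yours
print(driver.page_source)
driver.quit()

If the gateway requires username/password authentication, the --proxy-server flag alone is not enough; a browser extension or a local forwarding proxy is typically used in that case.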

Categories
Web Scraping Software

Dexi.io – how to improve performance

Intro

Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that Dexi extractor robots behave like a full-blown modern browser and fetch all the resources that crawled pages load (CSS, JS, fonts, etc.).
In terms of performance, an extractor robot might not be as fast as a pure HTTP scraping script, but its advantage is the ability to extract data from dynamic websites that require running JavaScript code to generate user-facing content. It will also be harder for anti-bot mechanisms to detect and block it.
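For contrast, a pure-HTTP fetch (a sketch of my own, not from the post) downloads only the HTML document itself, with no CSS, JS or fonts, which is why it is faster but blind to JavaScript-generated content:

import requests

html = requests.get("https://example.com", timeout=30).text   # raw HTML only, nothing executed
print(len(html), "bytes fetched; content rendered by JavaScript will be absent")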

Categories
Development

Python – parameterized insert into a database to prevent SQL injection (example)

test.py

import MySQLdb, db_config

class Test:
    def connect(self):
        # connect to MySQL using the credentials defined in db_config.py
        self.conn = MySQLdb.connect(host=db_config.db_credentials["mysql"]["host"],
                                    user=db_config.db_credentials["mysql"]["user"],
                                    passwd=db_config.db_credentials["mysql"]["pass"],
                                    db=db_config.db_credentials["mysql"]["name"])
        self.conn.autocommit(True)
        return self.conn

    def insert_parametrized(self, test_value="L'Île-Perrot"):
        # the %s placeholders let the driver escape the values, so quotes in
        # test_value cannot break out of the statement
        cur = self.connect().cursor()
        cur.execute("INSERT INTO a_table (name, city) VALUES (%s, %s)", ('temp', test_value))

# run it: the injection attempt below is stored as a harmless literal string
t = Test().insert_parametrized("test city'; DROP TABLE a_table;")

db_config.py (place it in the same directory as the test.py file)

db_credentials = {
    "mysql": {
        "name": "db_name",
        "host": "db_host", # eg. '127.0.0.1'
        "user": "xxxx",
        "pass": "xxxxxxxx",
    }
}
Categories
Development

What are the ways of inserting web scraping results into an SQL server?

  1. Use a webhook service to deliver your target data and store it in the DB (a minimal sketch follows below).
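A minimal sketch of that webhook approach (not from the original post): a Flask endpoint receives scraped records as JSON and writes them into SQL Server through pyodbc. The connection string, table and column names are placeholders.

import pyodbc
from flask import Flask, request, jsonify

app = Flask(__name__)
CONN_STR = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=localhost;DATABASE=scrape;UID=user;PWD=secret")   # placeholder credentials

@app.route("/webhook", methods=["POST"])
def store_results():
    record = request.get_json(force=True)     # e.g. {"name": "...", "price": "..."}
    conn = pyodbc.connect(CONN_STR)
    cur = conn.cursor()
    cur.execute("INSERT INTO scraped_items (name, price) VALUES (?, ?)",  # parameterized insert
                record["name"], record["price"])
    conn.commit()
    conn.close()
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(port=5000)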
Categories
Legal Miscellaneous

Is this a legal method of acquiring insurance leads?

Recently I received a question on insurance leads:

Is this a legal method of acquiring insurance leads [from the web]? Are there any agent testimonials on the efficiency of this type of service?

Legality issue in web scraping

With the matter of legality in web scraping, there should be a clear approach: it depends on the website and its privacy policy. There are at least two cases:

  1. Public info (prices, inventory info, public offers), i.e. everything that is not protected by copyright and is available for scraping.
  2. Copyright-protected info – here the website’s Terms of Use or Terms of Service restrictions make copying, and therefore web scraping, illegal.
The US Court of Appeals has affirmed that it is lawful for a certain data analytics company to scrape a data aggregator’s (LinkedIn’s) public profile info.

So far I have no insurance agent testimonials on the efficiency of any insurance lead scraping service. The websites I searched [on the insurance leads] have given me the impression that the customer info they gather is highly secured (not viewable). I doubt that any sites are going to expose insurance leads. On most of them the leads are available only through paid subscription plans.

If there are any such websites, like insurance lead directories (public insurance quotes), we might develop a scraper that consistently grabs fresh or new info for further analysis. It saves the agent’s time on re-searching, re-visiting and so on. One scraper might work through multiple directory pages.

A US district court has concluded that moderate scraping, even when it goes against the ToS, is legal.

You might find it interesting to read about web page change tracking if you only need to see updates (no data storing applied).

Categories
Miscellaneous

Death By Captcha Updated API clients

Death By Captcha is a reputable CAPTCHA solving service with more than 7 years in the CAPTCHA solving business. They have recently updated all their API clients, so users can experience maximum efficiency and faster solving times.

They enthusiastically recommend that users and software developers visit the API page and update their DBC API implementation in order to get the most out of it (the API and docs are available to registered users only). Free credits are provided for users to test or implement the new client API!
If you tell them you saw this info through the scraping.pro blog, they’ll give you an additional credit of 1K free CAPTCHAs!
For further info, you may contact them directly.

Categories
Development

Charles CA certificate with OpenSSL in Windows

Today I needed to enable the Charles proxy on my Windows PC. Later I also managed to get a Genymotion virtual device monitored by the Charles proxy.

Categories
Guest posting Web Scraping Software

UiPath PDF Data Extraction

UiPath, one of the big providers of robotic process automation software, has taken some very interesting positioning. Unlike the other players on the market, they provide a free and fully featured Community Edition of their product for anybody to test and develop with. The tool automates any application and is packed with web scraping and screen scraping capabilities for both desktop and web. The platform also has a lively community forum featuring jobs, automation contests and knowledge sharing between UiPath users: www.forum.uipath.com.