
Bright Data Proxy Manager with built-in scraping features

Web data extraction is critical to the online operations of companies across the globe. As more data is scraped every day, websites implement techniques to block extraction efforts.

Blocking techniques

Common IP-based blocking techniques include tracking IP location and blocking entire geolocations.

Some sites even block data center IPs altogether by purchasing lists of known data center IP addresses and denying or flagging them outright.  
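To make the idea concrete, here is a minimal sketch of how a site might check a visitor's address against such a list. The ranges below are reserved documentation ranges used as stand-ins, and the function name is hypothetical; real lists are far larger and come from commercial providers.

```python
import ipaddress

# Hypothetical blocklist. These are reserved documentation ranges,
# used here only as stand-ins for real data center ranges.
DATACENTER_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_datacenter_ip(address: str) -> bool:
    """Return True if the address falls inside any listed range."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in DATACENTER_RANGES)


if __name__ == "__main__":
    for candidate in ("203.0.113.7", "192.0.2.10"):
        verdict = "deny" if is_datacenter_ip(candidate) else "allow"
        print(candidate, verdict)
```

Because the check is a simple range lookup, an address that does not belong to a listed range, such as a residential IP, passes it.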

Other blocking techniques include rate limiting, which caps the number of requests allowed per IP per second, and bot detection based on user agents, which differentiates crawlers from real users.
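As an illustration, per-IP rate limiting can be sketched as a sliding-window counter. This is an assumed, simplified model of what a site might run, not a description of any specific product; the class and parameter names are hypothetical.

```python
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of accepted requests

    def allow(self, ip: str, now: float) -> bool:
        hits = self._hits[ip]
        # Forget requests that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # this IP is over its limit
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = PerIpRateLimiter(max_requests=2, window_seconds=1.0)
    print(limiter.allow("10.0.0.1", now=0.0))  # first request from this IP
    print(limiter.allow("10.0.0.1", now=0.1))  # second request, still within limit
    print(limiter.allow("10.0.0.1", now=0.2))  # third request in the same second
    print(limiter.allow("10.0.0.2", now=0.2))  # a different IP has its own counter
```

Each IP has its own counter, which is why spreading requests across many IPs keeps every individual address under the limit.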

To get past these obstacles, start by using the right proxy network

To solve IP tracking and data center IP blocks, begin by using a residential proxy network.

Residential IP: an IP address assigned by an Internet Service Provider to a user.

Data center IP: a static IP sold by companies that own servers containing many consecutive IP addresses.

Residential IPs

Bright Data provides residential IPs in any country or city in the world, allowing you to emulate a real user in any location and helping you overcome many common IP-based blocking techniques.

See more Proxy options of Bright Data

  • Residential Proxies: 72 million+ IPs rotated from real-peer devices in 195 countries
  • ISP Proxies: 700,000+ real home IPs across the globe, for long-term use
  • Datacenter Proxies: 770,000+ shared datacenter IPs from any geolocation
  • Mobile Proxies: 7,000,000+ IPs forming the largest real-peer 3G/4G mobile network

Software

The right software is the next step to overcome more sophisticated blocking techniques.

The Bright Data Proxy Manager is free, open-source software created with built-in scraping features. When set up correctly, these features automatically overcome common request-based blocking techniques, such as rate limiting, and bot-blocking techniques, such as fingerprint detection.

This is accomplished with built-in features that automate:

  • IP rotation
  • Auto Retry
  • Limiting requests
  • Routing Requests
  • Bandwidth reduction
  • Random User-Agent
  • Override headers

Installation for the Proxy Manager can be done using:

  1. Windows Installer file: exe file v.1.407.3 (github.com)
  2. BASH install script (Mac OS/Linux):

curl -L https://luminati.io/static/lpm/luminati-proxy-latest-setup.sh | bash

  3. NPM Package:

sudo npm install -g @luminati-io/luminati-proxy

  4. Docker Image:

docker pull luminati/luminati-proxy

  5. GitHub Source Code: https://github.com/luminati-io/luminati-proxy

To overcome geolocation blocking, send requests through the Residential proxy network with a country-targeted IP. If an issue arises, such as a 4xx or 5xx error code, the same request can be automatically retried through the Residential proxy network with a city-targeted IP.

An issue can be any unwanted:

  • status code
  • URL
  • Body element
  • Request time

All of these are handled automatically by Proxy Manager rules, which pair a trigger (the issue) with an action to take when the issue arises.

The actions that can be taken consist of:

  • Retry with new IP
  • Retry New Port
  • Ban IP
  • Ban IP per Domain
  • Refresh IP
  • Save IP to a reserve pool

Request-based blocking techniques

Rate-based blocking: limits the number of requests allowed per IP per second. A large network that allows continuous IP rotation is an easy way around this restriction. In the Proxy Manager (LPM), go to the IP control tab and set ‘Max Request’ to 1; the IP will then rotate automatically on every request.

Bot-based blocking: uses the user-agent to differentiate crawlers from real users. When a visitor enters a website, the site collects information in order to deliver the right language, operating system, screen size and more. Paying attention to the user-agent and the other request headers avoids this common blocking technique altogether. Under the ‘headers’ tab in the LPM are options to employ Random User-Agent and Override headers for every request. Click ‘yes’ beside these options; after each request, the session is terminated and all variables are changed.
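What these two options do conceptually can be sketched as follows. This is a simplified illustration under assumed names, not the Proxy Manager's code; the user-agent strings are shortened examples and `build_headers` is a hypothetical helper.

```python
import random

# Illustrative, shortened user-agent strings; a real pool would be larger.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]


def build_headers(base_headers: dict, overrides: dict, rng: random.Random) -> dict:
    """Return a copy of base_headers with a random User-Agent and overrides applied."""
    headers = dict(base_headers)
    headers["User-Agent"] = rng.choice(USER_AGENTS)  # replaces the client default
    headers.update(overrides)  # explicit overrides win over everything else
    return headers


if __name__ == "__main__":
    client_headers = {"Accept": "*/*", "User-Agent": "python-urllib/3.11"}
    print(build_headers(client_headers, {"Accept-Language": "en-US"}, random.Random()))
```

The point of the sketch is that a scraper's default client headers never reach the site unchanged, so they cannot be used to identify it as a crawler.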

Presets

The Proxy Manager also contains preset configurations for specific use cases.

Round-robin config

One of the most common LPM presets is the Round-robin configuration. To scrape specific data elements, simply set up a proxy port with the Round-Robin preset. The preset automatically creates a round-robin pool type, which rotates the request IP address with every request. It also disables the ‘multiply’ options, sets ‘Pool size’ to 10 and sets ‘max requests’ to 1.

Pool size refers to the group of IPs allocated to your port, which in this preset is 10 (and can be changed as needed). This group of 10 IPs is rotated continuously: each request uses one IP and then switches to the next. How often the IP switches is set by ‘max requests’, which here is 1 but can be set to any number required.
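The interaction between pool size and max requests can be sketched as a small simulation. The class below is a hypothetical model for illustration, not the Proxy Manager's implementation, and the IP labels are placeholders.

```python
from itertools import cycle


class RoundRobinPool:
    """Cycle through a fixed pool, switching IP after max_requests uses."""

    def __init__(self, ips, max_requests=1):
        if not ips or max_requests < 1:
            raise ValueError("need at least one IP and max_requests >= 1")
        self._ips = cycle(ips)
        self._max_requests = max_requests
        self._current = next(self._ips)
        self._used = 0  # requests sent through the current IP

    def next_ip(self):
        if self._used >= self._max_requests:
            # The current IP has served its quota; move to the next one.
            self._current = next(self._ips)
            self._used = 0
        self._used += 1
        return self._current


if __name__ == "__main__":
    # Placeholder labels standing in for a pool of 10 residential IPs.
    pool = RoundRobinPool([f"ip-{n}" for n in range(1, 11)], max_requests=1)
    print([pool.next_ip() for _ in range(12)])
```

With max requests set to 1, every request goes out on a different IP until the pool wraps around; raising it to 2 would send two consecutive requests through each IP before switching.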

In the examples below, the request is first tried using a residential country-targeted IP, and if the request fails (returns an error code), it is automatically retried with a city-targeted IP. This is referred to as the ‘Waterfall Method’: the same request is automatically resent on failure, using a different IP type.

Here is an example of the Manual Configuration file in the LPM for the round-robin configuration with the Waterfall method:

{
 "_defaults": {
   "customer": "lum<customer>",
   "password": "<password>",
   "token": "",
   "token_auth": "",
   "version": "1.124.317",
   "zone": "residential"
 },
 "proxies": [
   {
     "city": "Miami",
     "country": "us",
     "dns": "remote",
     "keep_alive": true,
     "last_preset_applied": "round_robin",
     "max_requests": 1,
     "override_headers": true,
     "password": [
       "<password>"
     ],
     "pool_size": 10,
     "pool_type": "round-robin",
     "port": 24002,
     "random_user_agent": true,
     "session": true,
     "ssl": true,
     "state": "fl",
     "zone": "residential"
   },
   {
     "country": "us",
     "dns": "remote",
     "keep_alive": true,
     "last_preset_applied": "round_robin",
     "max_requests": 1,
     "pool_size": 10,
     "pool_type": "round-robin",
     "port": 24001,
     "random_user_agent": true,
     "rules": [
       {
         "action": {
           "retry_port": 24002
         },
         "action_type": "retry_port",
         "status": "(4|5)..",
         "trigger_type": "status",
         "url": ""
       }
     ],
     "session": true,
     "ssl": true,
     "zone": "residential"
   }
 ]
}
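Since the Waterfall method depends on each retry_port pointing at a port that is actually defined, a quick sanity check before loading a configuration can help. The sketch below assumes only the JSON layout shown above; `missing_retry_ports` is a hypothetical helper, not part of the Proxy Manager, and the sample is a trimmed stand-in for the full file.

```python
import json

# Trimmed stand-in for the configuration above: only the fields the check reads.
SAMPLE_CONFIG = json.loads("""
{
  "proxies": [
    {"port": 24002, "city": "Miami", "country": "us"},
    {"port": 24001, "country": "us",
     "rules": [{"action": {"retry_port": 24002}, "action_type": "retry_port"}]}
  ]
}
""")


def missing_retry_ports(config: dict) -> list:
    """Return (source_port, target_port) pairs whose target port is not defined."""
    proxies = config.get("proxies", [])
    defined = {proxy["port"] for proxy in proxies}
    missing = []
    for proxy in proxies:
        for rule in proxy.get("rules", []):
            target = rule.get("action", {}).get("retry_port")
            if target is not None and target not in defined:
                missing.append((proxy["port"], target))
    return missing


if __name__ == "__main__":
    problems = missing_retry_ports(SAMPLE_CONFIG)
    print("config looks consistent" if not problems else problems)
```

An empty result means every retry rule has somewhere to send the request.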

Online Shopping Preset 

The Online Shopping Preset is configured for shopping pages and automatically creates a round-robin pool type, which rotates the request IP address with every request. It sets DNS to resolve remotely, generates a random user-agent for each request, creates an example rule for post-processing each response to scrape the required data, and enables SSL analysis.

Below is an example of the Manual Configuration file in the LPM for the Online Shopping preset with rules and the Waterfall method.

{
 "_defaults": {
   "customer": "lum<customer>",
   "password": "<password>",
   "token": "",
   "token_auth": "",
   "version": "1.124.317",
   "zone": "residential"
 },
 "proxies": [
   {
     "country": "us",
     "dns": "remote",
     "last_preset_applied": "shop",
     "max_requests": 0,
     "override_headers": true,
     "pool_size": 0,
     "pool_type": "round-robin",
     "port": 24001,
     "random_user_agent": true,
     "rules": [
       {
         "action": {
           "retry_port": 24002
         },
         "action_type": "retry_port",
         "status": "(4|5)..",
         "trigger_type": "status",
         "url": ""
       },
       {
         "action": {
           "process": {
             "bullets": "$('#featurebullets_feature_div li span').map(function(){ return $(this).text() }).get()",
             "price": "$('#priceblock_ourprice').text().trim()",
             "title": "$('#productTitle').text()"
           }
         },
         "action_type": "process",
         "trigger_type": "url",
         "url": "luminati.io|dp\\/[A-Z0-9]{10}"
       }
     ],
     "secure_proxy": true,
     "session": true,
     "session_duration": 0,
     "ssl": true,
     "zone": "res"
   },
   {
     "city": "Miami",
     "country": "us",
     "dns": "remote",
     "last_preset_applied": "shop",
     "max_requests": 0,
     "override_headers": true,
     "password": [
       "<password>"
     ],
     "pool_size": 0,
     "pool_type": "round-robin",
     "port": 24002,
     "random_user_agent": true,
     "rules": [
       {
         "action": {
           "process": {
             "bullets": "$('#featurebullets_feature_div li span').map(function(){ return $(this).text() }).get()",
             "price": "$('#priceblock_ourprice').text().trim()",
             "title": "$('#productTitle').text()"
           }
         },
         "action_type": "process",
         "trigger_type": "url",
         "url": "luminati.io|dp\\/[A-Z0-9]{10}"
       }
     ],
     "session": true,
     "session_duration": 0,
     "ssl": true,
     "state": "fl",
     "zone": "residential"
   }
 ]
}
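To see which URLs the "url" trigger in the process rule would match, the pattern can be tried out on its own. This sketch uses Python's re module, under the assumption that it behaves the same as the Proxy Manager's own regex engine for this particular pattern; the example.com URLs are made up for illustration.

```python
import re

# The pattern from the "url" field above, after JSON unescaping.
# Note that the unescaped dot in "luminati.io" matches any character.
URL_TRIGGER = re.compile(r"luminati.io|dp\/[A-Z0-9]{10}")


def should_process(url: str) -> bool:
    """Return True if the process rule's URL trigger would match this URL."""
    return URL_TRIGGER.search(url) is not None


if __name__ == "__main__":
    samples = [
        "https://www.example.com/Some-Product/dp/B07XYZ1234",  # product-style path
        "https://www.example.com/cart",                        # no product ID
        "https://luminati.io/",
    ]
    for url in samples:
        print(url, should_process(url))
```

Only URLs containing "dp/" followed by a 10-character uppercase alphanumeric ID, or the luminati.io domain, are passed to the post-processing step; everything else is returned untouched.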

Wrap up

The Bright Data Proxy Network provides 72 million+ residential IPs across the globe, and with the Bright Data Proxy Manager, collecting accurate worldwide pricing data is simple.
