
Bright Data residential proxy for extracting from a data aggregator

In this post I’d like to share my experience with scraping a data aggregator/business directory using the residential proxies of the Bright Data proxy provider in conjunction with its Proxy Manager.

Residential proxies’ advantages

The disadvantage of traditional proxies is that they are provided by data centers. Web services can easily recognize that such IPs originate from a data center and block them as web-robot (not regular-user) visits. Even with a decent proxy, websites may cloak or modify their data when they detect a bot visit.

A residential IP is an IP assigned to a home user by an Internet Service Provider (ISP). Users of partner web apps give their consent to share their residential IPs while a device is idle, connected to the internet and sufficiently charged. Bright Data does not collect any user data; it is interested only in the IPs. To date, Bright Data connects to over 72 million residential IPs located across the world.

See all the Bright Data proxy options

  • Residential Proxies: 72 million+ IPs rotated from real-peer devices in 195 countries
  • ISP Proxies: 700,000+ real home IPs across the globe, for long-term use
  • Datacenter Proxies: 770,000+ shared datacenter IPs from any geolocation
  • Mobile Proxies: 7,000,000+ IPs forming the largest real-peer 3G/4G mobile network

Bright Data Proxy Manager

What does the Bright Data Proxy Manager do? It is open-source software for managing multiple proxies seamlessly via an API and an admin UI.
To see all the ways to install it, use Tools->Proxy Manager on the left-side panel of the Bright Data dashboard. Read more on the PM here.

The advantages of using the Proxy Manager:

  • One entry point
  • Concurrent connections
  • Auto-retry rules
  • Real-time statistics

Besides, the Proxy Manager can even be used with a phone for scraping.
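To make the single-entry-point idea concrete, here is a minimal sketch. My assumptions: the PM runs locally with a zone listening on port 24000 (as in the setup below), and the retry loop is just a client-side stand-in for the PM’s own auto-retry rules:

import requests

# Every request targets the same local Proxy Manager port; the PM decides
# which Bright Data exit node actually carries it.
PM_ENTRY = {'http': 'http://127.0.0.1:24000',
            'https': 'http://127.0.0.1:24000'}

def fetch(url, attempts=3):
    # Simple client-side retry; the PM can also retry based on its own rules.
    last_error = None
    for _ in range(attempts):
        try:
            return requests.get(url, proxies=PM_ENTRY, timeout=30)
        except requests.RequestException as e:
            last_error = e
    raise last_error

r = fetch('https://example.com')
print(r.status_code, len(r.text))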

Note that residential proxies require special approval, so it took me three working days until I could get that zone working. The personal Bright Data manager (assistant) was helpful in getting acquainted with the PM, creating zones and making everything work.

Test

We decided to test the Bright Data service, particularly its residential proxies (a residential trial zone, limited to 7 days only).

Setup

First of all, we set up the local Proxy Manager (PM) and the residential proxies zone. Zones are a service’s custom configurations of parameters for proxied requests.

We’ve set up 4 zones inside the PM:

  • data-center proxies zone, port 24000 (the port number is assigned automatically or set manually)
  • residential proxies zone, port 24001
  • city ASN proxies zone, port 24002
  • gIP proxies zone, port 24003

In the test code we used port 24001, corresponding to the residential zone of my PM, whose admin console runs at http://127.0.0.1:22999/. Basically, the process looks like this: the scraper sends each request to a local zone port, and the PM forwards it to the target site through a Bright Data exit node.

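Each proxy port corresponds to a different proxy zone. A quick way to see which exit node a zone yields is to query Bright Data’s test endpoint (the same lumtest.com/myip.json used in the test code below) through every port. A minimal sketch, assuming all four zones are running:

import requests

# Zone ports as configured in the Proxy Manager above
zone_ports = {'data-center': 24000, 'residential': 24001,
              'city-asn': 24002, 'gip': 24003}

for zone, port in zone_ports.items():
    proxy = f'http://127.0.0.1:{port}'
    proxies = {'http': proxy, 'https': proxy}
    try:
        # lumtest.com/myip.json reports the IP and country of the exit node
        info = requests.get('http://lumtest.com/myip.json',
                            proxies=proxies, timeout=10).json()
        print(zone, '->', info.get('ip'), info.get('country'))
    except requests.RequestException as e:
        print(zone, '-> request failed:', e)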

Note: there is also a mobile IPs option [proxies zone]. Mobile IPs are almost unblockable. Typical mobile-proxy use cases are (1) checking website performance and (2) retail/travel: price fetching and app promotion (ad verification).

So, we started testing against YellowPages.com, gathering links through simple GET requests parameterized with the hotel keyword and a 2-letter state abbreviation (NY, CA, etc.):

https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}&page={2}

We performed consecutive requests to the YP site and extracted all US hotels, rotating through the array of US state abbreviations: as soon as the scraper stopped getting new hotel items for a given state, it switched to another one. For example, the first page of California hotels is requested as https://www.yellowpages.com/search?search_terms=hotel&geo_location_terms=CA&page=1.

Test code

import requests, json
import re, time, random

test_url = "http://lumtest.com/myip.json"   # Bright Data endpoint reporting the current exit node

# Direct superproxy access -- an alternative to going through the local Proxy Manager:
# proxies = {
#     'http': 'http://lum-customer-scrapingpro-zone-gen:[luminati password]@zproxy.lum-superproxy.io:22225',
# }

def get_content(url='',
                proxies={'http': 'http://127.0.0.1:24001',
                         'https': 'http://127.0.0.1:24001'},
                headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
                ):
    # First ask the test endpoint which exit node (IP, country) serves this proxy
    try:
        real_proxy = requests.get(test_url, proxies=proxies, headers=headers)
        real_proxy = json.loads(real_proxy.text)
    except (requests.RequestException, ValueError):
        real_proxy = {'ip': '', 'country': ''}
    try:
        r = requests.get(url, proxies=proxies, headers=headers)
        res_length = len(r.text)
        # A reCAPTCHA marker in the response means the visit was challenged
        captcha = '/recaptcha/' in r.text
        return {'text': r.text,
                'params': {'size': res_length, 'captcha': captcha,
                           'exit_node': {'ip': real_proxy['ip'], 'country': real_proxy['country']}
                           }
                }
    except requests.RequestException:
        print('Failed to get html by url:', url)
        return {'text': '', 'params': {'size': 0, 'captcha': False, 'exit_node': real_proxy}}

# US state, territory and armed-forces postal abbreviations
state_codes = ['AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA','HI','ID','IL','IN','IA','KS','KY','LA',
               'ME','MD','MA','MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH','OK',
               'OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA','WV','WI','WY','AS','DC','FM','GU',
               'MH','MP','PW','PR','VI','AA','AE','AP']
curr_state_code = 'CA'
state_codes_processed = []
p_type = 'residential'
proxy_type = {'data-center': '0', 'residential': '1', 'city-asn': '2', 'gip': '3'}
print('****************************\n' + p_type, 'proxies test:')
# Each zone listens on its own local port: 24000..24003
proxy_address = 'http://127.0.0.1:2400' + proxy_type[p_type]
proxies = {'http': proxy_address, 'https': proxy_address}
url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}&page={2}"
start = time.time()
total_links = set()
page_index = 1
for i in range(1, 5000):
    res = get_content(url.format('hotel', curr_state_code, page_index), proxies,
                      {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'})

    print('**************\nRequest', i, 'url:', url.format('hotel', curr_state_code, page_index), '.\n', res['params'])
    all_sets = re.findall(r"<h2 class=\"n\">.*?<\/h2>", res['text'])   # full <h2> blocks, kept for debugging
    all_links = re.findall(r'\"business-name\" href=\"(.*?)\"', res['text'])
    page_index += 1
    prev_total_links_amount = len(total_links)
    for link in all_links:
        total_links.add(link.split('?')[0])   # strip the query string so each hotel counts once

    # No new links on this page => the state is exhausted; switch to an unprocessed one
    if prev_total_links_amount >= len(total_links):
        state_codes_processed.append(curr_state_code)
        if len(set(state_codes_processed)) >= len(set(state_codes)):
            print('All state codes processed.')
            break
        while curr_state_code in state_codes_processed:
            curr_state_code = random.choice(state_codes)
        page_index = 1
        print('Processed state codes:', state_codes_processed)
        print('New state code:', curr_state_code)

    print('Found', len(all_links), 'links.')
    print('All links amount:', len(total_links))
    print('Total requests:', i)
    print('Process time (seconds):', round(time.time() - start, 1))
    # Checkpoint the accumulated links after every request
    with open('total_links.txt', 'w') as text_file:
        text_file.write('\n'.join(total_links))

Results

All hotel links/items amount: 7147
Total requests: 263
Process time (seconds): 1267.1

We counted unique links to make sure the aggregator does not expose the same hotel links repeatedly in order to spoof a scrape bot. The test shows that each request to a web page returned on average 7147/263 ≈ 27 items (out of up to 32 items on a page). The extraction time through the proxy was 1267/263 ≈ 4.8 seconds per request.
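These per-request averages follow directly from the logged totals; a quick check:

total_links_count = 7147    # unique hotel links collected
total_requests = 263
total_seconds = 1267.1

print(round(total_links_count / total_requests, 1))   # ~27.2 items per request
print(round(total_seconds / total_requests, 1))       # ~4.8 seconds per request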

Other service figures

Bright Data (formerly Luminati) has a 4-second timeout for DNS lookups.
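Since the service enforces this DNS-lookup timeout on its side, it is also worth capping timeouts on the client side so a stuck request fails fast. A sketch; the timeout values here are my own illustrative choices, not Bright Data’s recommendation:

import requests

proxies = {'http': 'http://127.0.0.1:24001',
           'https': 'http://127.0.0.1:24001'}
try:
    # (connect timeout, read timeout) in seconds
    r = requests.get('http://lumtest.com/myip.json', proxies=proxies, timeout=(5, 30))
    print(r.json())
except requests.Timeout:
    print('Request timed out')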

Network uptime is 99.9%, and it can be viewed live at https://luminati.io/cp/status.

Conclusion

The Bright Data Proxy Manager proved to be reliable for scraping a challenging aggregator site. Its residential proxies delivered high output, and the scraper ran seamlessly with them.

See more of Bright Data Scraping Solutions

  • Scraping Browser (New!): a Puppeteer, Playwright and Selenium-compatible scraping browser with built-in website unlocking actions
  • Web Scraper IDE: use JavaScript functions and existing code templates to build web scrapers
  • SERP API: collect parsed search engine results
  • Web Unlocker: automate unlocking of the toughest websites with unprecedented success rates

3 replies on “Bright Data residential proxy for extracting from a data aggregator”

Hey, do you have alternatives to the Luminati.io proxy? It seems to fail a bit for me.
Any info on the Smartproxy.io and stormproxies.com ones? Cheers!

Even if Luminati has pretty great services, first of all, their services are really expensive, and secondly, they sometimes lag, which is not great, especially when you’re in the middle of a web project or bot testing. I also tried some other providers, like Smartproxy and GeoSurf, since I needed an advanced pool of IPs for scraping, and I haven’t had any issues with those providers. Would be nice if you could check them as well and add your opinion.
