Node.js, Puppeteer, Apify for Web Scraping (Xing scrape) – part 2

In this post we share the practical implementation (code) of the Xing companies scrape project using Node.js, Puppeteer and the Apify library. The first post, describing the project objectives, algorithm and results, is available here.

You can review the scrape algorithm here.

Start Apify

To start an Apify actor locally, the quickest way is the  apify create  command, provided you have the Apify CLI installed. After that you can run git init. For a Quick Start refer to here.

The main.js file structure can be summed up as follows (a minimal skeleton sketch follows the outline):

  1. Init global vars
  2. Apify.main(async () => {
    • fetch from input
    • init settings
    • check deactivated accounts
    • compiling search urls
    • adding previously failed urls to queue
    • const crawler = new Apify.PuppeteerCrawler({
      • crawler settings
      • save each found url into dataset

      });

    • await crawler.run(); // launch crawler
    • get results from datasets

    });
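
A minimal sketch of that skeleton, assuming the pre-1.0 Apify SDK API used throughout this project (the handler bodies are placeholders, not the project code):

const Apify = require('apify');

// global vars (counters, settings) would be initialized here

Apify.main(async () => {
    // fetch settings from INPUT.json
    const input = await Apify.getValue('INPUT');

    // open the named request queue and fill it with search urls
    const requestQueue = await Apify.openRequestQueue(input.queue_name);
    // ...compile search urls and re-add previously failed ones here...

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        maxRequestsPerCrawl: input.max_requests_per_crawl,
        handlePageFunction: async ({ request, page }) => {
            // extract company links / data and push them into a dataset
        },
    });

    await crawler.run(); // launch crawler

    // read the results back from the datasets here
});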

PuppeteerCrawler

As the main engine of the scraping code we have chosen the Apify SDK PuppeteerCrawler. The code runs with a given INPUT and finishes when the request queue is exhausted.

const input = await Apify.getValue('INPUT'); 
var base_name = input.dataset_name;

Input is taken from an INPUT.json file. Its content looks approximately like the following:

{	
	"page_handle_max_wait_time" : 1,
	"concurrency" : 8,
	"max_requests_per_crawl" : 12000, 
	"retireInstanceAfterRequestCount": 7000,
	"cookieFile": "cookies.json",
	"account_index": 0,
	"account_exceptions":[],
	"account" : {  
	    "0" : { "username":"xxxxxxxxxx", "password":"xxxxxxxxx" },
	    "1" : { "username":"yyyyyyyyyy", "password":"yyyyyyyyy" }
	},
	"dataset_name": "test",
	"queue_name"  : "test",
	"letters": "a,b,c,d,e,f,g,h,i,j,k,l... wY,xA,xE,xI,xO,xU,xY",	 
	"crawl" : {
		"country":"2951839",		 
		"landern_with_letters" : "2953481,2951839,2950157,2945356,2822542",
		"landern_only" : "",
		"empl_range": "4",
	},
	"init_urls": "",
	"deleteQueue": 0,
	"output":"OUTPUT"
}

Here in the INPUT, most parameters are self-explanatory.
The page_handle_max_wait_time parameter is used to add a random time delay to the crawl process, so that the script does not hammer the site.
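
For example, a random pause based on page_handle_max_wait_time might be inserted into the page handler roughly like this (a sketch, assuming the value is given in seconds; not the exact project code):

// sketch: random delay of up to page_handle_max_wait_time seconds,
// so requests do not hit the site in a perfectly regular rhythm
const maxWaitMs = input.page_handle_max_wait_time * 1000;
await new Promise(resolve => setTimeout(resolve, Math.random() * maxWaitMs));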

The account_index parameter sets the initial account index to choose from the INPUT accounts.
In the INPUT file, the crawl parameter has sub-parameters that are needed for forming and compiling complex search urls at run time (see the sketch after this list). Those urls include the following:

  • country index (country, landern_with_letters, landern_only)
  • employee range (empl_range)
  • keywords (letters), also used for forming complex search urls.
Note: for each xing category I run separate script instances with different employee ranges.
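
The code samples below reference base_req and base_req_land base urls, which are not shown in the post; they might be composed from these crawl sub-parameters roughly like this (an assumption based on the xing search url format, not the project's exact code):

// assumed composition of the base search urls used when filling the queue
const search_root = 'https://www.xing.com/search/companies?';
// restricted by country and employee range
var base_req = search_root
    + 'filter.location[]=' + input.crawl.country
    + '&filter.size[]=' + input.crawl.empl_range;
// employee range only; the location filter is appended per Land later on
var base_req_land = search_root + 'filter.size[]=' + input.crawl.empl_range;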

Urls inside of crawler

The crawler handles 2 kinds of urls:

  1. Search url. One may call it a crawl url, as opposed to a scraping url. This is a service url that enables the crawler to gather as many page urls as possible. Such a url is compiled [artificially] based on the xing search filter parameters, which we get from xing. The parameter values are set in INPUT.json; the parameters are, e.g.:

    filter.size[]=4
    filter.location[]=2951839
    keywords=aH

    The mix of various GET parameters composes a unique search url. A search url might look like the following:
    https://www.xing.com/search/companies?filter.location%5B%5D=2951839&filter.size%5B%5D=4&keywords=aH
    With different parameters such a url returns a different search result, so we can query the xing companies database from every angle.
    These search urls are generated upon each PuppeteerCrawler startup and added to the queue:

    if (input.letters){
    	let letters = input.letters.split(','); //console.log('letters:', letters);
    	// landern composed with letters
    	if (input.crawl.landern_with_letters) {
    		let counter=0;
    		let i,j;
    		let landern = input.crawl.landern_with_letters.split(',');
    		for (i = 0; i < landern.length; i++) {
    			for (j = 0; j < letters.length; j++) {
    				//console.log('letter:', letters[j], 'land:', landern[i]);
    				let url = base_req_land + "&filter.location[]=" + landern[i].trim();
    				url += "&keywords=" + letters[j].trim();
    				await requestQueue.addRequest({ url: url });
    				counter+=1;
    			}
    		}
    		console.log(`\n${counter} url(s) have been added from 'landern_with_letters' input composed with 'letters' input.`);
    	} else { // only letters input (with given countries)
    		console.log(`\nAdding requests from input letters (${letters.length}).`);
    		let i;
    		for (i = 0; i < letters.length; i++) {
    			let url = base_req + "&keywords=" + letters[i].trim();
    			await requestQueue.addRequest({ url: url });
    		}
    		console.log(`${i} url(s) have been added from letters input.`);
    	}
    }
    if (input.crawl.landern_only){
    	let landern = input.crawl.landern_only.split(',');
    	let counter=0;
    	for (let i = 0; i < landern.length; i++) {
    		let url = base_req_land + "&filter.location[]=" + landern[i].trim();
    		await requestQueue.addRequest({ url: url });
    		counter+=1;
    	}
    	console.log(`\n${counter} url(s) have been added from 'landern_only' input.`);
    }

    If a url is already in the queue or has already been processed (handled), the Apify SDK requestQueue object filters it out by simply not adding it again (see the sketch after this list). That is part of what makes working with the Apify SDK so convenient: the requestQueue object manages handled and pending urls for us.

  2. Page url. Each page url is the url of a particular xing company. These are gathered from the search results of the search urls, inside the PuppeteerCrawler’s handlePageFunction. A page url looks like, e.g.:
    https://www.xing.com/companies/daimlerag
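
A small sketch of that de-duplication behaviour: addRequest() reports whether the url was already known to the queue (the field names shown are those of the Apify SDK's QueueOperationInfo; treat them as an assumption if you are on a different SDK version).

const requestQueue = await Apify.openRequestQueue(input.queue_name);

// addRequest() tells us whether the url was already present or already handled
const info = await requestQueue.addRequest({ url: 'https://www.xing.com/search/companies?keywords=aH' });
if (info.wasAlreadyPresent || info.wasAlreadyHandled) {
    console.log('Duplicate url, nothing new was added to the queue.');
}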

Handling xing pagination

A single search request might find many companies’ pages (more than the 10 shown per page), so pagination is required to gather them all. We spawn additional search urls by adding a page parameter to the initial search url, the new urls being added to the requestQueue:

if (companies_for_base_search_page > 10) { // we create paging sub-requests
      let max_page = companies_for_base_search_page > 300 ? 30 : Math.ceil(companies_for_base_search_page / 10);
      if (max_page*10 < companies_for_base_search_page) {
            console.log(`!!! Warning, for the request \
            ${request.url} the number of companies is \
            ${companies_for_base_search_page}`);
            oversise_req[request.url.split('?')[1]] = total_companies;
            try {
                await oversized_search_dataset.pushData({ url: request.url });
            } catch (e) {
                console.log(`Failure to write to "${input.oversized_search_file}"...\nPlease check if file exists.`);
            }
      }
      for (let i = 2; i <= max_page; i++) {
          let url = request.url + '&page=' + i.toString();
          await requestQueue.addRequest({ url: url });
          console.log(' - added a paging request: page=' + i.toString());
      }
}

Exclude non-company page links

In the crawling process we decided not to use Apify’s own Apify.utils.enqueueLinks() utility, in order to filter page urls more precisely. So we’ve composed our own check_link() procedure to extract only the companies’ page links (page urls).

var exclude_links_with = ['search', 'icons', 'industries', '/img',
	'scraping', 'application-', 'statistics-', 'draggable-'];
...
// returns true only for links that contain none of the excluded substrings
function check_link(elem) {
	let check_flag = true;
	exclude_links_with.forEach(function(item) {
		if (elem.includes(item)){
			check_flag = false;
		}
	});
	return check_flag;
}
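
Inside handlePageFunction the helper might then be applied to all anchors on a search results page, roughly like this (a sketch; the '/companies/' substring check is an assumption, and the project's real selector logic is not shown in the post):

// sketch: collect all hrefs on the search results page and keep only company page links
const hrefs = await page.$$eval('a', anchors => anchors.map(a => a.href));
const company_links = hrefs.filter(href => href.includes('/companies/') && check_link(href));
for (const url of company_links) {
    await requestQueue.addRequest({ url: url });
}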

Logging in

We created a separate file login_xing.js where several login procedures are stored. The login_page_simple() and check_if_logged_in() procedures are used to log in and to check whether an account is logged in, using an existing PuppeteerCrawler page instance.

const puppeteer = require('puppeteer');
async function check_if_logged_in(page, page_content = false){
	let user = await page.$('span.myxing-profile-name'); 
	let user_name = false;
	if (user){
		user_name = await (await user.getProperty('textContent')).jsonValue();
		return user_name;
	}
	if (!page_content) {
		page_content = await page.content();
	}
	if (page_content.includes('class="Me-Me')){ 
		return true;  
	} 
	return false;
}

async function login_page_simple(page, username="", password=""){	 
	try {      // regular login
		await page.goto('https://www.xing.com/signup?login=1', { waitUntil: 'networkidle0' }); // wait until page load
		await page.type('input[name="login_form[username]"]', username);
		await page.type('input[name="login_form[password]"]', password);
		await page.evaluate(() => {
				document.getElementsByTagName('button')[1].click();
			});
	} catch(e){  //trying to relogin, another form		
                // we click 3 times in that input to select any content that browser might have pre-filled in it
		await page.click('input[name="username"]', {clickCount: 3});
		await page.type('input[name="username"]',  username);
		await page.type('input[name="password"]',  password);
		await page.click('button[type="submit"]'); 
	} 
	await page.waitForNavigation({ waitUntil: 'networkidle0' });	 
	console.log('After login_page_simple()\n  account:', username, '\n  page url:', await page.url());
	var login_check = await check_if_logged_in(page);
	console.log('  login result:', login_check );
	return login_check;
}
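
Inside the crawler these helpers can be wired into the gotoFunction, roughly as follows (a sketch of the idea mentioned in the conclusion; account_index handling and error paths are simplified, and this is not the project's exact code):

// sketch: make sure the session is logged in before handling the requested url
const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    gotoFunction: async ({ request, page }) => {
        const response = await page.goto(request.url, { waitUntil: 'networkidle2' });
        if (!await check_if_logged_in(page)) {
            const account = input.account[account_index];
            await login_page_simple(page, account.username, account.password);
            return page.goto(request.url, { waitUntil: 'networkidle2' }); // retry once logged in
        }
        return response;
    },
    handlePageFunction: async ({ request, page }) => { /* ... */ },
});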

Work with accounts

Since xing bans an account after it has crawled a certain number of requests, we have developed a strategy to check and discard any account that fails to log in to xing. Initially, we give each account a credit (validity) of 3 points, which means the script will make at most 3 login attempts with it. If all 3 attempts fail, the account is marked as deactivated.

// fill out `account_validity` with init values
for(let i in input.account){
     account_validity[parseInt(i)] = 3; 
}
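
When a login attempt fails, the account’s credit is decreased and another account is picked; a sketch of that logic (switch_account, account_index and deactivated_accounts are illustrative names, not necessarily those used in the project):

// sketch: decrease the failing account's credit and switch to the next usable account
function switch_account() {
    account_validity[account_index] -= 1;
    if (account_validity[account_index] <= 0) {
        deactivated_accounts.add(account_index); // 3 failed attempts -> treated as deactivated
    }
    const indexes = Object.keys(input.account).map(Number);
    const next = indexes.find(i => !deactivated_accounts.has(i) && account_validity[i] > 0);
    if (next === undefined) {
        throw new Error('No usable xing accounts left.');
    }
    account_index = next;
}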

Since xing may ban/deactivate any account that we use, we need to be able to pick a new one; for that purpose we have a procedure for checking non-active (deactivated) xing accounts: accounts_check.js

require('./login-xing.js');
async function check_non_active_accounts(accounts){	
	let deactivated = new Set();
	//console.log('\nStart deactivated accounts check');
	for(let i in accounts) {
		//console.log('\n' + i +'.',  accounts[i].username );
		let result_page, login_check;		
		result_page = await login(accounts[i].username, accounts[i].password);
		if ( (typeof result_page) == 'string') { 
			//console.log('Result page:',result_page );
			if (result_page.includes('deactivated')) {				
				deactivated.add(parseInt(i));
			}
		} else { // result_page is a page object of puppeteer
			//console.log('Result page:', result_page.url());
			login_check = await check_if_logged_in(result_page);
			console.log('login result:', login_check);
			if (result_page.url().includes('deactivated') || !login_check) {
				deactivated.add(parseInt(i));
			}
		}		
	}
	console.log('deactivated set:', deactivated);
	return deactivated;
}
global.check_non_active_accounts=check_non_active_accounts;
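
In main.js this check might be invoked (inside Apify.main, before compiling the search urls) roughly like this; the merge with account_exceptions assumes that field holds account indexes to skip:

// sketch: determine which accounts must be skipped before the crawl starts
require('./accounts_check.js'); // registers global.check_non_active_accounts
const deactivated_accounts = await check_non_active_accounts(input.account);
// assumption: account_exceptions lists account indexes that should also be skipped
input.account_exceptions.forEach(i => deactivated_accounts.add(parseInt(i)));
console.log('Accounts excluded from the crawl:', [...deactivated_accounts]);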

Datasets

The crawler pushes scraped data into several datasets.

dataset = await Apify.openDataset(base_name); // main dataset to store data
wrong_website_dataset = await Apify.openDataset('wrong-website-' + base_name);
no_links_search_url_dataset = await Apify.openDataset('no-links-searches-' + base_name);
oversized_search_dataset = await Apify.openDataset('oversized-searches-' + base_name); // needs its own name, distinct from the no-links dataset

The wrong_website_dataset stores urls of the main dataset items whose website parameter is equal to https://www.xing.com/.

The no_links_search_url_dataset and oversized_search_dataset accumulate search requests that have either zero results or a total number of results over 300 (oversized). Before the Apify.PuppeteerCrawler launch, main.js fetches the urls from the corresponding datasets and re-runs those [search] requests on the next launch.
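
A sketch of that re-queuing step, using the dataset's forEach() helper from the Apify SDK (the project's exact code is not shown):

// sketch: before the crawl, re-add previously failed search urls from the helper datasets
for (const failed_dataset of [no_links_search_url_dataset, oversized_search_dataset]) {
    await failed_dataset.forEach(async (item) => {
        await requestQueue.addRequest({ url: item.url });
    });
}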

But the requestQueue [Apify object] is smart enough to exclude repeating urls, the filter key being the url itself. Therefore, before we rerun main.js (which adds the urls from those datasets), we have to remove those urls from the handled requests located in the apify_storage/request_queues/<queue_name>/handled folder. For that I used separate Python scripts.

First we save the empty_searches_dataset items into a corresponding *.txt file:

# save dataset items into txt file
import os
import json
lines=[]
empty_searches_dataset_name = 'empty_searches_xxx'
path = './' + empty_searches_dataset_name
print 'Opening dataset:', path
for x in os.listdir(path):
    with open(os.path.join(path, x)) as f:
        content = f.read()
    item = json.loads(content)
    lines.append(item['url'])
print 'Found', len(lines), 'urls.'
with open(empty_searches_dataset_name + '.txt', 'w') as f:
    for item in lines:
        f.write(item + '\n')
print 'They have been written into', empty_searches_dataset_name + '.txt'

Now we need to copy that text file from apify_storage/datasets/<dataset_name> into apify_storage/request_queues/<queue_name>

and run the following script:

# removing empty searches from handled urls
import os
empty_searches_file = 'empty_searches_xxx.txt'
with open(empty_searches_file, 'r') as f:
    zero_search_pages_list = f.readlines()

zero_search_list=[]   
for line in zero_search_pages_list:
    line = line.lstrip("\n").rstrip()
    #line = line.split('keywords')[1];
    if line:
        zero_search_list.append(line)
# remove duplicates
zero_search_list = list(dict.fromkeys(zero_search_list))
print '\nZero search list length:', len(zero_search_list),'\n'
path_to_handled_requests ='./handled'
remove_counter=0
found_counter=0
to_remove=[]
for x in os.listdir(path_to_handled_requests):
    remove = False
    with open(os.path.join( path_to_handled_requests, x ) ) as f:
        content = f.read()          
        if any(line in content for line in zero_search_list):     
            #print 'found :', content[50:].split('"uniqueKey":')[1].split('"method":')[0] 
            found_counter+=1
            to_remove.append(x)
            
#print 'List of file to remove:', to_remove
print '\nFound', found_counter, 'files to remove.' 
y = raw_input('\nYou want to remove them (r)?') 
if y=='r':
    for x in to_remove:                
        print 'removing', x
        remove_counter+=1
        os.remove( os.path.join( path_to_handled_requests, x ) )
        
print '\nFound', found_counter, 'files.'
print 'Removed', remove_counter, 'files.'

You can get the project files here.

Conclusion

Scraping the xing business directory was a decent challenge for me. I managed to develop a crawler of medium difficulty, covering custom link gathering, login handling inside the crawler’s gotoFunction, and storing and re-running failed searches that found no links, among other things.

Note: If you need a step-by-step guide on how to install Puppeteer and Apify and start the project – let me know in the comments.
