In this post we share how to scrape a JS-rendered website. The tools, as seen in the header, are Java with the Selenium library driving headless Chrome instances (download the driver) and JSoup as the parser to extract data from the acquired HTML.
Try the code
See the scripts below.
Check if cookies are enabled
function areCookiesEnabled() {
    var cookieEnabled = Boolean(navigator.cookieEnabled);
    // Older browsers may not expose navigator.cookieEnabled at all,
    // so fall back to writing a test cookie and reading it back.
    if (typeof navigator.cookieEnabled == "undefined" && !cookieEnabled) {
        document.cookie = "test";
        cookieEnabled = (document.cookie.indexOf("test") != -1);
    }
    return cookieEnabled;
}
Navigator is the interface that represents the state and the identity of the user agent. It allows scripts to query it and to register themselves to carry on some activities. A Navigator object can be retrieved using the read-only window.navigator property.
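For instance, you can query window.navigator directly in the browser console; a minimal illustration of our own:

// Read a few Navigator properties; values vary by browser.
console.log(window.navigator.userAgent);      // identity of the user agent
console.log(window.navigator.cookieEnabled);  // state: are cookies enabled?
console.log(window.navigator.language);       // preferred UI language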
Check if sessionStorage is enabled
function isStorageEnabled() {
    try {
        // Writing and reading back a test value throws in browsers
        // where sessionStorage is disabled (e.g. some private modes).
        sessionStorage.setItem("test", "value");
        if (sessionStorage.getItem("test") == "value") {
            sessionStorage.removeItem("test");
            return true;
        } else {
            return false;
        }
    } catch (err) {
        return false;
    }
}
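Both helpers can then gate any logic that depends on browser storage; a short usage sketch (the messages are ours):

if (areCookiesEnabled() && isStorageEnabled()) {
    console.log('Cookies and sessionStorage are available.');
} else {
    console.log('Browser storage is restricted.');
}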
Problem
I’ve made a simple Node.js server on a VDS:
var http = require('http');
var port = 9999;
http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World\n');
}).listen(port, '0.0.0.0');
console.log('Server running at port ' + port);
It works, outputting:
Server running at port 9999
yet I can’t reach it at the VPS/VDS IP where the code resides: http://webscraping.pro:9999/ How can I solve that?
We recently composed a scraper that extracts data from a static site. By a static site we mean one that does not use JS scripting to load or transform on-site data.
Technologies stack
- Node.js, the server-side JS environment. The main characteristic of Node.js is asynchronous code execution.
- Apify SDK, the scalable web scraping and crawling library for JavaScript/Node.js. Let’s highlight its strongest features:
- automatically scales a pool of headless Chrome/Puppeteer instances
- maintains queues of URLs to crawl (handled, pending), which makes it possible to recover from crawler failures and resume crawls
- saves crawl results to a convenient [json] dataset (local or in the cloud)
- allows proxy rotation
We’ll use Apify’s CheerioCrawler to crawl and extract data from the target site, as sketched below. The target is https://www.ebinger-gmbh.com/.
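Before we dive into the queue handling, here is a minimal sketch of a CheerioCrawler, assuming the classic Apify SDK v1 API; the selector and the dataset fields are illustrative:

const Apify = require('apify');

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://www.ebinger-gmbh.com/' });

    const crawler = new Apify.CheerioCrawler({
        requestQueue,
        // $ is a Cheerio handle over the downloaded, parsed HTML
        handlePageFunction: async ({ request, $ }) => {
            await Apify.pushData({
                url: request.url,
                title: $('title').text(),  // illustrative field
            });
        },
    });

    await crawler.run();
});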
When working with Apify crawlers, it’s necessary to initialize a RequestQueue. How do we fill the RequestQueue from a txt file?
Given
A text file with URLs to crawl. In our case it’s categories.txt. We’ll use the line-reader Node package to open and iterate over the file line by line.
To install line-reader:
npm i --save line-reader
Since requestQueue methods return a Promise, when iterating over the lines of the file we need to apply an async function so that each line is added as a URL into the requestQueue.
The code
const Apify = require('apify');
const lineReader = require('line-reader');

const queue_name = 'ebinger';
const base_url = 'https://www.ebinger.com/';

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue(queue_name);
    lineReader.eachLine('categories.txt', async function (line) {
        //console.log('adding ', line);
        let url = base_url + line.trim();
        await requestQueue.addRequest({ url: url });
    });
    var { totalRequestCount, handledRequestCount, pendingRequestCount, name } = await requestQueue.getInfo();
    console.log(`RequestQueue "${name}" with requests:`);
    console.log(' handledRequestCount:', handledRequestCount);
    console.log(' pendingRequestCount:', pendingRequestCount);
    console.log(' totalRequestCount:', totalRequestCount);
...
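One caveat: lineReader.eachLine does not wait for an async iteratee to finish, so getInfo() may run before every URL has been queued. A sketch of one way around this, wrapping the iteration in a Promise (the addAllLines helper is our own, not part of Apify or line-reader):

// Resolve only after line-reader has visited every line and
// every addRequest() promise has settled.
function addAllLines(file, requestQueue, base_url) {
    const pending = [];
    return new Promise((resolve, reject) => {
        lineReader.eachLine(file, (line) => {
            pending.push(requestQueue.addRequest({ url: base_url + line.trim() }));
        }, (err) => {
            if (err) reject(err);
            else resolve(Promise.all(pending));
        });
    });
}

// Usage inside Apify.main, before calling requestQueue.getInfo():
// await addAllLines('categories.txt', requestQueue, base_url);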
Recently I noticed a question about extracting emails, phones, and links (URLs) from text fragments, and I immediately decided to write this short post.
Regex comes to the rescue
Emails, phones, and links each form a category that matches a certain text pattern. What are these text patterns? They are regexes, aka regex patterns, short for regular expressions. E.g. most emails fit the following regex pattern:
^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
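The anchored pattern above validates a whole string as an email. To extract emails from a larger fragment, drop the anchors and use the global flag; a minimal sketch in JavaScript (the sample text is ours):

const emailRegex = /[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+/g;

const text = 'Contact us at info@example.com or sales@example.org.';
const emails = text.match(emailRegex) || [];
console.log(emails); // [ 'info@example.com', 'sales@example.org' ]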
I’ve run into a problem where Composer failed to load a Guzzle library file:
The "https://packagist.org/p/guzzlehttp/guzzle%241f150aaa79afd8bc5d6f08f730634a0d60f5dfcd1dd4a6fc5263fb4b1cefeb16.json" file could not be downloaded (HTTP/1.1 404 Not Found)
The solution was the following:
composer clear-cache
Let me tell you what you already know: Octoparse is a great web scraping tool! But like every great tool, it has its limitations. At times, you may wonder whether there are any alternatives to Octoparse. We wondered the same and put together this blog post to provide you with a short list of Octoparse alternatives along with their features and distinguishing factors. Let’s get started!