Month: October 2020

2 coding-free ways to extract content from websites to boost web traffic

Post author By admin
Post date October 27, 2020

Content is the most basic way to attract traffic. Without a certain amount of quality content, neither Google nor visitors will be interested in your website, because they get little value from browsing it.

There are 2 main coding-free solutions for extracting content from websites to build your content base. Choose one or a combination of them and give it a try!

Tags Octoparse

Development

Fast scrape of a simple website using Node.js, Apify & Cheerio scraper

Post author By admin
Post date October 26, 2020

We recently built a scraper that extracts data from a static site. By a static site, we mean a site that does not rely on JavaScript to load or transform its on-page data.

If you are interested in scraping a JS-rendered site, please read the following: Scraping a Javascript-dependent website with puppeteer.

Technology stack

Node.js, the server-side JS environment. Its main characteristic is asynchronous code execution.
Apify SDK, the scalable web scraping and crawling library for JavaScript/Node.js. Let's highlight its most useful features:

automatically scales a pool of headless Chrome/Puppeteer instances
maintains queues of URLs to crawl (handled, pending), which lets a crawler recover from failures and resume where it stopped
saves crawl results to a convenient [JSON] dataset (local or in the cloud)
supports proxy rotation
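
To illustrate why tracking handled and pending URLs makes resuming possible, here is a toy in-memory sketch. It is not how Apify stores requests; the class ToyRequestQueue and its method names are hypothetical and exist only for this example.

```javascript
// Toy in-memory queue; illustrative only, not the Apify implementation.
class ToyRequestQueue {
  constructor() {
    this.pending = [];        // URLs waiting to be crawled
    this.handled = new Set(); // URLs already processed
  }

  // Enqueue a URL unless it is already pending or handled.
  add(url) {
    if (this.handled.has(url) || this.pending.includes(url)) return false;
    this.pending.push(url);
    return true;
  }

  // Look at the next pending URL without removing it.
  next() {
    return this.pending.length > 0 ? this.pending[0] : null;
  }

  // Move a URL from pending to handled once its page was processed.
  markHandled(url) {
    const index = this.pending.indexOf(url);
    if (index !== -1) this.pending.splice(index, 1);
    this.handled.add(url);
  }
}
```

Because a URL leaves the pending list only after it has been marked as handled, a crawler that fails halfway through a page finds the same URL still pending when it restarts.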

We’ll use Apify’s Cheerio crawler to crawl and extract data from the target site. The target is https://www.ebinger-gmbh.com/.

Tags Node.js

Development

Node.js, Apify, how to fill in RequestQueue from txt file

Post author By admin
Post date October 21, 2020

When working with Apify crawlers, you need to initialize a RequestQueue. How do you fill a RequestQueue from a txt file?

Given

A text file with URLs to crawl. In our case it’s categories.txt. We’ll use the line-reader Node package to open the file and iterate over it line by line.
To install line-reader:

npm i --save line-reader

Since requestQueue methods return a Promise, each line of the file has to be added to the requestQueue as a URL asynchronously, and those calls need to finish before the queue is inspected.

The code

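const Apify = require('apify');
const lineReader = require('line-reader');
const { promisify } = require('util');

const queue_name = 'ebinger';
const base_url = 'https://www.ebinger.com/';
// eachLine accepts a Node-style completion callback (per the line-reader README),
// so it can be promisified and awaited.
const eachLine = promisify(lineReader.eachLine);

Apify.main(async () => {

	const requestQueue = await Apify.openRequestQueue(queue_name);
	const pending = [];
	await eachLine('categories.txt', (line) => {
		//console.log('adding ', line);
		if (!line.trim()) return; // skip blank lines
		const url = base_url + line.trim();
		// Collect the promises: eachLine does not wait for them itself.
		pending.push(requestQueue.addRequest({ url: url }));
	});
	// Wait until every URL is stored before reading the queue statistics.
	await Promise.all(pending);
	const { totalRequestCount, handledRequestCount, pendingRequestCount, name } = await requestQueue.getInfo();
	console.log(`RequestQueue "${name}" with requests:`);
	console.log(' handledRequestCount:', handledRequestCount);
	console.log(' pendingRequestCount:', pendingRequestCount);
	console.log(' totalRequestCount:', totalRequestCount);
...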
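
The same idea can be sketched without any extra package. This is a minimal example under stated assumptions: the helper names buildUrls and fillQueue are hypothetical, and `queue` stands for any object with an async addRequest({ url }) method, such as the requestQueue above.

```javascript
// Turn the raw text of a URL list file into absolute URLs.
// Blank lines are skipped and surrounding whitespace is trimmed.
function buildUrls(baseUrl, text) {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => baseUrl + line);
}

// Add every URL to a queue-like object, awaiting each call in turn,
// so the queue is complete by the time this function resolves.
async function fillQueue(queue, urls) {
  for (const url of urls) {
    await queue.addRequest({ url });
  }
  return urls.length;
}
```

In a real script the text would come from something like fs.readFileSync('categories.txt', 'utf8'). Awaiting each addRequest call in a plain for...of loop keeps the order of the file and guarantees that later statistics reflect every added URL.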

Tags Node.js

Uncategorized

Chromium Command Line switches

Post author By admin
Post date October 20, 2020

When we use Selenium or Node.js + Puppeteer to run [headless] Chrome/Chromium, we might need to launch the browser with extra features or conditions. Below you’ll find all kinds of conditions (switches) and their explanations.

How to use command line switches?

The Chromium Team has made a page on which they briefly explain how to use these switches.
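
As a rough illustration of how switches are usually assembled before launching a browser, here is a small sketch. The helper buildChromiumArgs is hypothetical (it is not part of Puppeteer or Selenium), and the three switches are just commonly used examples.

```javascript
// Build a Chromium argument list from a plain object.
//   true                 -> bare switch, e.g. --no-sandbox
//   string or number     -> switch with a value, e.g. --window-size=1280,800
//   false/null/undefined -> the switch is left out
function buildChromiumArgs(switches) {
  return Object.entries(switches)
    .filter(([, value]) => value !== false && value !== null && value !== undefined)
    .map(([name, value]) => (value === true ? `--${name}` : `--${name}=${value}`));
}

const args = buildChromiumArgs({
  'no-sandbox': true,
  'window-size': '1280,800',
  'disable-gpu': false,
});
console.log(args); // [ '--no-sandbox', '--window-size=1280,800' ]
```

With Puppeteer, such an array is passed as the `args` option of puppeteer.launch(). With Selenium's Chrome options, the same strings go through addArguments().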

Development

How to extract emails, phones, links (urls) from text fragments?

Post author By admin
Post date October 6, 2020

Recently I noticed a question about extracting emails, phones, and links (URLs) from text fragments, and I immediately decided to write this short post.

Regex comes to the rescue

Emails, phones, and links each form a category that matches a certain text pattern. What are these text patterns? They are regexes, aka regex patterns, short for regular expressions. E.g., most emails fit the following regex pattern:

^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
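
The pattern above is anchored with ^ and $, so it checks a whole string rather than finding addresses inside a longer fragment. Below is a minimal sketch of extraction, assuming an unanchored variant with the global flag. The function name extractEmails and the slightly tightened domain part are assumptions of this example, not part of the original pattern.

```javascript
// Unanchored, global variant of the email pattern.
// The domain part requires at least one character after each dot
// (an assumption of this sketch), so a sentence-ending dot is not captured.
const EMAIL_RE = /[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+/g;

// Return every email-like substring found in the text, in order of appearance.
function extractEmails(text) {
  return text.match(EMAIL_RE) || [];
}

const sample = 'Write to john.doe@example.com or sales@shop.co.uk.';
console.log(extractEmails(sample));
```

Like the original pattern, this is a pragmatic approximation: it covers most everyday addresses but is not a full implementation of the email address grammar.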

Tags email, Regex