Category: Development

Fast scrape of a simple website using Node.js, Apify & Cheerio scraper

Post author By admin
Post date October 26, 2020

We recently built a scraper that extracts data from a static site. By a static site, we mean one that does not rely on JavaScript to load or transform its on-page data.

If you are interested in scraping a JS-rendered site, please read the following: Scraping a Javascript-dependent website with puppeteer.

Technology stack

Node.js, the server-side JS runtime environment. Its main characteristic is asynchronous code execution.
Apify SDK, the scalable web scraping and crawling library for JavaScript/Node.js. Let’s highlight its key features:

automatically scales a pool of headless Chrome/Puppeteer instances
maintains queues of URLs to crawl (handled, pending), which lets the crawler recover from failures and resume
saves crawl results to a convenient [json] dataset (local or in the cloud)
supports proxy rotation

We’ll use Apify’s Cheerio crawler to crawl the target site and extract data from it. The target is https://www.ebinger-gmbh.com/.

Tags Node.js

Development

Node.js, Apify, how to fill in RequestQueue from txt file

Post author By admin
Post date October 21, 2020

When working with Apify crawlers, you need to initialize a RequestQueue. How do you fill a RequestQueue from a txt file?

Given

A text file with URLs to crawl. In our case it’s categories.txt. We’ll use the line-reader Node package to open the file and iterate over it line by line.
To install line-reader:

npm i --save line-reader

Since the requestQueue methods return a Promise, each addRequest call that adds a line of the file as a URL to the requestQueue has to be awaited inside an async function.

The code

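const Apify = require('apify');
const lineReader = require('line-reader');
const { promisify } = require('util');

const queue_name = 'ebinger';
const base_url = 'https://www.ebinger.com/';

// Promisified eachLine resolves only after the whole file has been read.
const eachLine = promisify(lineReader.eachLine);

Apify.main(async () => {

	const requestQueue = await Apify.openRequestQueue(queue_name);

	// Collect the URLs first: an async callback passed to eachLine is not awaited,
	// so getInfo() below could otherwise run before the requests are added.
	const urls = [];
	await eachLine('categories.txt', (line) => {
		const path = line.trim();
		if (path) urls.push(base_url + path); // skip blank lines
	});
	for (const url of urls) {
		await requestQueue.addRequest({ url });
	}

	const { totalRequestCount, handledRequestCount, pendingRequestCount, name } = await requestQueue.getInfo();
	console.log(`RequestQueue "${name}" with requests:`);
	console.log(' handledRequestCount:', handledRequestCount);
	console.log(' pendingRequestCount:', pendingRequestCount);
	console.log(' totalRequestCount:', totalRequestCount);
...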

Tags Node.js

Development

How to extract emails, phones, links (urls) from text fragments?

Post author By admin
Post date October 6, 2020

Recently I came across a question about extracting emails, phones and links (URLs) from text fragments, so I decided to write this short post.

Regex comes to the rescue

Emails, phones and links each form a category that matches a certain text pattern. What are these text patterns? They are regexes, aka regex patterns, short for regular expressions. E.g. most emails fit the following regex pattern:

^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
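The pattern above is anchored with ^ and $, so it checks whether a whole string is an email. To pull emails out of a larger text fragment, the anchors can be dropped. Below is a minimal Python sketch of that idea. The function name extract_contacts, the phone pattern and the URL pattern are illustrative assumptions, not part of the original post, and they are deliberately loose; real-world phone and URL formats vary a lot.

```python
import re

# Same email pattern as above, without the ^ and $ anchors, so it can match
# inside a longer text (the hyphen is moved to the end of the last class).
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

# Assumption: a loose URL pattern, http(s):// followed by non-space characters.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# Assumption: a loose phone pattern, an optional "+", then at least 9 digits
# in total, possibly separated by spaces, dots, hyphens or parentheses.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contacts(text):
    """Return all emails, URLs and phone-like strings found in `text`."""
    return {
        "emails": EMAIL_RE.findall(text),
        "urls": URL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Write to info@example.com or call +1 (555) 010-9999, docs at https://example.com/docs"
    print(extract_contacts(sample))
```

Note that unanchored patterns may also capture trailing punctuation (for example a full stop right after an email), so the matches usually need a cleanup pass.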

Tags email, Regex

Development

Composer failed to download guzzle package

Post author By admin
Post date September 8, 2020

I ran into the problem that Composer failed to download the guzzle package file:

"https://packagist.org/p/guzzlehttp/guzzle%241f150aaa79afd8bc5d6f08f730634a0d60f5dfcd1dd4a6fc5263fb4b1cefeb16.json" file could not be downloaded (HTTP/1.1 404 Not Found)

The solution was to clear the Composer cache:

composer clear-cache

Development Guest posting Web Scraping Software

Octoparse Alternatives

Let me tell you what you already know! Octoparse is a great web scraping tool! But like every great tool, it has its limitations. At times, you may wonder if there are any alternatives to Octoparse. We wondered the same and put together this blog post to provide you with a short list of Octoparse alternatives along with their features and distinguishing factors. Let’s get started!

Tags Octoparse, scraper, web scraping

Development

Selenium Web Scraping in simple words

Post author By admin
Post date August 14, 2020

Question: What is Selenium web scraping?

Answer: A picture is worth a thousand words: [image: Selenium main diagram]

So, you write a program in Python, PHP, Java, Ruby or whatever language you use in order to browse(), select(), click(), submit(), save(), etc., on the target web pages.

Tags Selenium, web scraping

Development

Linkedin scrape guide lines

Post author By admin
Post date August 4, 2020

The LinkedIn crawl success rate is low; a single request made by a bot might need several retries to succeed. So, here we share the crucial LinkedIn scraping guidelines.

Rate limit
Limit the crawling rate for LinkedIn. An acceptable approximate frequency is 1 request per second, i.e. 60 requests per minute.
Public pages only
LinkedIn allows bots to access public pages only; private pages cannot be crawled.
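The rate limit guideline can be sketched as a small helper that spaces requests at least one second apart. This is a minimal Python illustration under stated assumptions: the RateLimiter class and its parameters are hypothetical names, not taken from the original post or from any library, and the clock and sleep functions are injectable only to keep the sketch easy to test.

```python
import time


class RateLimiter:
    """Spaces successive calls to wait() at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # 1.0 s is roughly 60 requests per minute
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        """Block until at least `min_interval` has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a crawler loop, calling limiter.wait() right before each HTTP request keeps the rate at or below one request per second; retries of failed requests should go through the same limiter.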

Tags LinkedIn, Node.js, web scraping

Development

Get and pass CSRF token using python requests library

Post author By admin
Post date July 20, 2020

import requests

URL = 'https://portal.bitcasa.com/login'
EMAIL = 'user@example.com'  # placeholder (assumption): your login email
PASSWORD = 'secret'         # placeholder (assumption): your password

# A session keeps cookies between requests
client = requests.session()

# Retrieve the CSRF token first
client.get(URL)  # sets cookie
if 'csrftoken' in client.cookies:
    # Django 1.6 and up
    csrftoken = client.cookies['csrftoken']
else:
    # older versions
    csrftoken = client.cookies['csrf']

# Pass CSRF token both in login parameters (csrfmiddlewaretoken)
# and in the session cookies (sent automatically by the session)
login_data = dict(username=EMAIL, password=PASSWORD, csrfmiddlewaretoken=csrftoken, next='/')
r = client.post(URL, data=login_data, headers=dict(Referer=URL))

Tags cookie, CSRF, Python

Challenge Development

Business directory simple scraper (python) at pythonanywhere

Post author By admin
Post date July 3, 2020

My goal was to retrieve data from a web business directory.

Since scraping business directories is one of the most challenging tasks (besides SERP scraping), there are some basic questions for me to answer:

Is there any scrape protection set at that site?
How much data is in that web business directory?
What kind of queries can I run to find all the directory’s items?

Tags business directory

Challenge Development

Scraping a Javascript-dependent website with puppeteer

Post author By admin
Post date June 25, 2020

Support us by purchasing the book (under $5) on this topic.

In today’s web 2.0, many business websites use JavaScript to protect their content from web scraping and other undesired bot visits. In this article we share the theory and practice of scraping JS-dependent/JS-protected websites.

Tags Javascript, Node.js, scrape protection