Category: Development

My experience of choosing web scraping platform for company critical data feed

Post author By admin
Post date February 23, 2022

Service	Residential	Cost/month	Traffic/month	$ per GB	Rotating	IP whitelisting	Performance and more	Notes
MarsProxies		N/A	N/A	3.5	yes	yes	500K+ IPs, 190+ locations; Test results	SOCKS5 supported; Proxy grey zone restrictions
Oxylabs.io		N/A	25 GB	9 - 12 "pay-as-you-go" - 15	yes	yes	100M+ IPs, 192 countries - 30K requests - 1.3 GB of data - 5K pages crawled	Does not allow scraping some grey-zone targets, incl. LinkedIn.
Smartproxy		Link to the price page	N/A	5.2 - 7 "pay-as-you-go" - 8.5	yes	yes	65M+ IPs, 195+ countries	Free trial. Does not allow scraping some grey-zone targets, incl. LinkedIn.
Infatica.io		N/A	N/A	3 - 6.5 "pay-as-you-go" - 8	yes	yes	Over 95% success; Cloudflare bans are also rare, less than 5%.	Blacklist of sites that the proxies do not work with; 1000 ports per proxy list; up to 20 proxy lists at a time; usable via API tool; ISP-level targeting; rotation time selection
Mango Proxy		N/A	1-50 GB	3 - 8 "pay-as-you-go" - 8	yes	yes	90M+ IPs, 240+ countries
IPRoyal		N/A	N/A	4.55	yes	yes	32M+ IPs, 195 countries	Does not allow scraping some grey-zone targets, incl. Facebook. List of blocked sites.
Rainproxy.io	yes	$4	from 1 GB	4	yes
BrightData	yes			15
ScrapeOps Proxy Aggregator	yes	API Credits per month	N/A	N/A		yes	Supports multithreading; the service provides browsers on its servers, so N cloud browsers can be run from a local machine. The number of threads depends on the subscription (minimum 5).	An all-in-one proxy API that gives access to 20+ proxy providers through a single API
Lunaproxy.com	yes	from $15	X GB per 90 days	0.85 - 5				Each plan includes a fixed traffic allowance valid for 90 days.
LiveProxies.io	yes	from $45	4-50 GB	5 - 12	yes	yes		E.g. 200 IPs with 4 GB for $70.00, valid for 30 days.
Charity Engine (docs)	yes	-	-	from 3.6; additionally, CPU computing from $0.01 per avg CPU core-hour and from $0.10 per GPU-hour (source)			Failed to connect so far
proxy-sale.com	yes	from $17	N/A	3 - 6 "pay-as-you-go" - 7	yes	yes	10M+ IPs, 210+ countries	30-day limit for a single proxy batch
Tabproxy.com	yes	from $15	N/A	0.8 - 3 (lowest price is for a chunk of 1000 GB)	yes	yes	200M+ IPs, 195 countries	30-180 day limit for a single proxy batch (e.g. 5 GB)
proxy-seller.com	yes	N/A	N/A	4.5 - 6 "pay-as-you-go" - 7	yes	yes	15M+ IPs, 220 countries	Up to 1000 proxy ports per proxy list; HTTP / SOCKS5 support; an unlimited number of proxies can be generated by assigning unique parameters to each list
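
When comparing plans it helps to reduce each offer to the same unit as the "$ per GB" column. Below is a minimal sketch of that arithmetic; the helper names `costPerGb` and `monthlyCost` are made up for illustration, and the 25 GB monthly volume is an assumed figure, not a measurement.

```javascript
// Effective price per GB for a fixed-price plan.
// Hypothetical helper for illustration; not part of any provider's API.
function costPerGb(planPriceUsd, includedGb) {
    if (!(includedGb > 0)) {
        throw new RangeError('includedGb must be a positive number');
    }
    return planPriceUsd / includedGb;
}

// Cost of moving `trafficGb` of data at a flat $/GB rate.
function monthlyCost(ratePerGb, trafficGb) {
    return ratePerGb * trafficGb;
}

// The LiveProxies example from the table: 4 GB for $70.00.
console.log(costPerGb(70, 4)); // 17.5

// A flat 4 $/GB rate at an assumed 25 GB of monthly traffic.
console.log(monthlyCost(4, 25)); // 100
```

Note that a small fixed-price plan can work out well above a provider's headline "$ per GB" figure, so it is worth running the numbers for the traffic volume you actually expect.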

Tags service, web scraping

Development

Simple Apify Puppeteer crawler

Post author By admin
Post date February 21, 2022

const Apify = require('apify');

// Results collected from all crawled pages.
const totalData = [];

// Two capitalized words followed by '.', ',', whitespace, '!' or '?'.
const regexName = /[A-Z][a-z]+\s[A-Z][a-z]+(?=\.|,|\s|!|\?)/gm;
// Text after "stand:" (optionally followed by a closing </strong> tag):
// two words, an optional comma, then optionally two more words.
const regexAddress = /stand:(<\/strong>)?\s+(\w+\s+\w+),?\s+(\w+\s+\w+)?/gm;
// Email pattern; it has no `g` flag, so it only ever yields the first match.
const regexEmail = /(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))/i;

Apify.main(async () => {
    const requestQueue = await Apify.openRequestQueue('123');
    await requestQueue.addRequest(new Apify.Request({ url: 'https://www.freeletics.com/de/pages/imprint/' }));
    await requestQueue.addRequest(new Apify.Request({ url: 'https://di1ara.com/pages/impressum' }));
    console.log('\nStart PuppeteerCrawler\n');

    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        handlePageFunction: async ({ request, page }) => {
            const title = await page.title();
            console.log(`Title of ${request.url}: ${title}`);
            const pageContent = await page.content();
            console.log('Page content size:', pageContent.length);
            const obj = { url: request.url };

            // Names: collect every match. matchAll() works on a copy of the
            // regex, so no manual lastIndex handling is needed.
            console.log('Names:');
            const names = [];
            for (const m of pageContent.matchAll(regexName)) {
                console.log(`Found match: ${m[0]}`);
                names.push(m[0]);
            }
            if (names.length > 0) {
                obj.names = names.join(', ');
            }

            // Address: the last match wins; drop the leading </strong> part
            // and the first stray '<', as in the original logic.
            console.log('\nAddress:');
            for (const m of pageContent.matchAll(regexAddress)) {
                console.log(`Found match: ${m[0]}`);
                const address = m[0].includes('</strong>') ? m[0].split('</strong>')[1] : m[0];
                obj.address = address.replace('<', '');
            }

            // Email: only the first match is kept.
            console.log('\nEmail:');
            const emailMatch = pageContent.match(regexEmail);
            if (emailMatch) {
                console.log(`Found match: ${emailMatch[0]}`);
                obj.email = emailMatch[0];
            }

            totalData.push(obj);
            console.log(obj);
        },
        maxRequestsPerCrawl: 2000000,
        maxConcurrency: 20,
    });

    await crawler.run();
    console.log('Total data:');
    console.log(totalData);
});
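
The extraction step of the crawler can be tried in isolation, without Apify or a browser, by running the same name and email patterns against a plain string. This is only a sketch: the `extractContacts` helper and the sample text are invented for illustration, and real imprint pages will be messier.

```javascript
// Same name and email patterns as in the crawler above.
const nameRegex = /[A-Z][a-z]+\s[A-Z][a-z]+(?=\.|,|\s|!|\?)/g;
const emailRegex = /(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))/i;

// Hypothetical helper: returns all name-like matches and the first email.
function extractContacts(text) {
    // matchAll() iterates over a copy of the global regex,
    // so there is no shared lastIndex state to reset.
    const names = [...text.matchAll(nameRegex)].map((m) => m[0]);
    const emailMatch = text.match(emailRegex);
    return { names, email: emailMatch ? emailMatch[0] : null };
}

// Made-up sample text.
const sample = 'Managing director: Jane Doe, contact: jane.doe@example.com';
console.log(extractContacts(sample));
// { names: [ 'Jane Doe' ], email: 'jane.doe@example.com' }
```

Testing the patterns this way makes it easier to see false positives (any two capitalized words count as a "name") before running a full crawl.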

Tags crawling, Javascript, Puppeteer

Development

Hoppscotch – API ecosystem

Post author By admin
Post date January 28, 2022

Hoppscotch is an open-source API development ecosystem.

Tags API

Development

Add a composite constraint and view it, MySQL/MariaDB

Post author By admin
Post date December 24, 2021

Add

Add a UNIQUE constraint named NetAppToken composed of 3 columns: network, application, token.
Note: make sure the right database is selected first:
use <database_name>;

ALTER TABLE crypto   
    ADD CONSTRAINT NetAppToken UNIQUE (network, application, token);

View

SELECT 
   table_schema,  
   table_name,    
   constraint_name
FROM information_schema.table_constraints
WHERE table_name = 'crypto';

Result

+--------------+------------+-----------------+
| table_schema | table_name | constraint_name |
+--------------+------------+-----------------+
| admin_crypto | crypto     | PRIMARY         |
| admin_crypto | crypto     | NetAppToken     |
+--------------+------------+-----------------+
2 rows in set (0.06 sec)
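
To make clear what the composite constraint enforces: only the combination of the three columns has to be unique, while each column on its own may repeat. The sketch below imitates that rule in memory with JavaScript; it is an illustration only, not MySQL, and the class name and sample values are hypothetical.

```javascript
// In-memory imitation of a composite UNIQUE key (illustration only).
class CompositeUniqueIndex {
    constructor(columns) {
        this.columns = columns;
        this.keys = new Set();
    }

    // Returns true if the row is accepted, false if its key already exists.
    insert(row) {
        // Serializing the values as a JSON array keeps column boundaries unambiguous.
        const key = JSON.stringify(this.columns.map((c) => row[c]));
        if (this.keys.has(key)) {
            return false;
        }
        this.keys.add(key);
        return true;
    }
}

const netAppToken = new CompositeUniqueIndex(['network', 'application', 'token']);
console.log(netAppToken.insert({ network: 'terra', application: 'terraswap', token: 'LUNA' })); // true
console.log(netAppToken.insert({ network: 'terra', application: 'terraswap', token: 'UST' })); // true, only the token differs
console.log(netAppToken.insert({ network: 'terra', application: 'terraswap', token: 'LUNA' })); // false, full duplicate
```

In the database itself, the third insert would be rejected with a duplicate-entry error instead of returning false.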

Development

CURL request into Curl PHP code

Post author By admin
Post date December 22, 2021

Recently I needed to convert a cURL request into PHP cURL code, with binary data and the --compressed option involved. Here is the request:

curl 'https://terraswap-graph.terra.dev/graphql' \
-H 'Accept-Encoding: gzip, deflate, br' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-H 'Connection: keep-alive' \
-H 'DNT: 1' \
-H 'Origin: https://terraswap-graph.terra.dev' \
--data-binary '{"query":"{\n  pairs {\n    pairAddress\n    latestLiquidityUST\n    token0 {\n      tokenAddress\n      symbol\n    }\n    token1 {\n      tokenAddress\n      symbol\n    }\n    commissionAPR\n    volume24h {\n      volumeUST\n    }\n  }\n}\n"}' \
--compressed

Tags Curl, PHP

Development

Auth proxy with JAVA

In this post we’ll show how to use an authenticated proxy (with login/password) in a Java application.

Tags JAVA

Development

Vesta CP install SSL certificate for a subdomain

Post author By admin
Post date November 16, 2021

In this post I’ll share how I added a Let's Encrypt SSL certificate to a subdomain on a VPS running CentOS 7 with Vesta CP.

Tags HTTP

Development

Subdomain at Centos 7 with Laravel project

Post author By admin
Post date November 11, 2021

This post walks through how to create a subdomain (CentOS 7 and Vesta CP) and map a Laravel project folder to it.

Tags Centos, HTTP

Development

How to add Git Personal Access Token (PAT) into git console

Post author By admin
Post date October 1, 2021

Remove the previous git origin:

git remote remove origin

Add a new origin with the PAT (<TOKEN>):

git remote add origin https://<TOKEN>@github.com/<USERNAME>/<REPO>.git

Push once with --set-upstream:

git push --set-upstream origin main

Now you can push changes to the remote repo without supplying the PAT in the push command every time.

If you need to create a PAT, use the following tutorial.

Development

Cheerio.js, get items from html table into object

Post author By admin
Post date April 16, 2021

Suppose there is a table like the one below (a single data row):

Blows Minute (BPM)	Speed (RPM)	Power, PSI	Flow, PSI	Tool Sys
0-2500	0-250	1.8 HP	2.6-13.2 GPM	SDS Max

How to scrape it using cheerio.js as a parser?

Case 1 (1 row only)

Tags Cheerio, HTML, parse