Categories
Miscellaneous

What information do internet services collect about users?

We use your personal data so we can provide the best service, tell you about products and services you may be interested in…

These or similar statements often appear in fine print on most modern sites as part of their Terms of Service (ToS). Below we describe what particular data are collected from web and app users.

Categories
Development Miscellaneous

Extracting sequential HTML elements with XPath and Regex

Often, we need to extract HTML elements in sequential (document) order rather than in hierarchical order.
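To illustrate what "sequential rather than hierarchical" can mean in practice, here is a minimal sketch using a regular expression. The markup, the tag names (h2, p) and the function name group_by_heading are assumptions made up for this example, not taken from the original post. It handles only simple, non-nested tags; for real pages a proper HTML parser with XPath sibling axes is usually the safer choice.

```python
import re

# Hypothetical flat markup: headings and paragraphs are siblings, so the
# grouping is implied only by their order in the document.
HTML = """
<div>
  <h2>Fruits</h2>
  <p>Apple</p>
  <p>Pear</p>
  <h2>Vegetables</h2>
  <p>Carrot</p>
</div>
"""

# Match <h2> or <p> elements in document order; the backreference \1
# makes sure the closing tag matches the opening one.
TAG_RE = re.compile(r"<(h2|p)\b[^>]*>(.*?)</\1>", re.IGNORECASE | re.DOTALL)


def group_by_heading(html):
    """Return {heading text: [texts of the paragraphs that follow it]}."""
    groups = {}
    current = None
    for match in TAG_RE.finditer(html):
        tag = match.group(1).lower()
        text = match.group(2).strip()
        if tag == "h2":
            # A new heading starts a new group.
            current = text
            groups[current] = []
        elif current is not None:
            # A paragraph belongs to the most recent heading seen so far.
            groups[current].append(text)
    return groups


sections = group_by_heading(HTML)
print(sections)
```

The loop relies purely on the order in which the elements occur, which is the point: nothing in the markup hierarchy says that "Apple" belongs to "Fruits".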

Categories
Development Miscellaneous

Cookie browser-server workflow

Categories
Guest posting Miscellaneous

Data extraction: web crawling vs. web scraping in E-commerce

Nowadays, when we have a question, it comes almost naturally to just type it into a search bar and get helpful answers. But we rarely wonder how all that information becomes available and how it appears as soon as we start typing. Search engines provide easy access to information, but web crawling and scraping tools, which are less well-known players, have a crucial role in gathering online content.

Categories
Miscellaneous

Huge JSON files view and search tool with excellent performance

The results of scraping activities are most often stored as JSON data, which has many advantages over the .xml or .csv formats. Recently, in one of my projects, I had to deal with JSON files of over 6 MB. Even though I managed them in Notepad++, proper search and count could have been better.

Categories
Miscellaneous

Endcaptcha now solving Recaptcha V2!

So far the latest developments of the services that develop captchas (Google, NuCaptcha, etc.) are no match for the captcha bypassers, and Endcaptcha is living proof of it.
Endcaptcha developers have been working hard to make this new feature possible – they’re finally releasing Recaptcha V2 support!

Categories
Miscellaneous

What are the best online resources to acquire data?

Recently I received this question: What are the best online resources to acquire data from?

The top sites for data scraping are data aggregators. Why are they top in data extraction?
They are top because they provide the fullest, most comprehensive data [sets], and the data in them are highly categorized. Therefore you do not need to crawl and fetch other resources and then combine data from multiple sources.

Those sites fall into 2 categories:

1. Goods and services aggregators, e.g. AliExpress, Amazon, Craigslist.
2. Personal and company data aggregators, e.g. LinkedIn, Xing, YellowPages. Another name for such aggregators is business directories.

The first category of sites and services is quite widespread. These sites and services promote their goods with the goal of becoming well known online and gaining as many backlinks as possible.

The second category, the business directories, tends not to reveal its data to the public. These directories instead promote their brand and give scraping bots minimal opportunity for data acquisition*.

Consider the following picture, where a company data aggregator gives the user only 2 input fields: what and where.

[Image: data aggregator search form with "what" and "where" input fields]
You can find more on how to scrape data aggregators in this post.

————–
*You have to adhere to the ToS of each particular website or web service when you scrape its data.

Categories
Miscellaneous

Phantombuster API list

I’ve categorized Phantombuster’s scraping APIs for my own use. Yet the list might be a good reference point for others too.

Categories
Miscellaneous

Bypass distil network, the anti-scraper protection

For details on how to bypass distil-network, the anti-scraper protection, please contact me by email: igor [dot] savinkin [at] gmail [dot] com.

Categories
Miscellaneous

Scraping HTML graphic elements: possibilities and limits

Question: “How do I set up a daily automatic scraping of www.pollen.com data into a Google sheet?” (link)

Answer: Originally I doubted whether svg HTML elements are scrapable. After some trial and error I realized that svg elements are indeed scrapable; one can get their XPath and their child nodes. However, importXML() can scrape them only when they are part of the static HTML.
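As a rough illustration of reaching svg elements and their child nodes by XPath, here is a minimal Python sketch. It is not importXML() itself, and the page fragment, element names and values are made up for the example. It assumes the svg is present in the static, well-formed markup and declares the standard SVG namespace; content drawn later by JavaScript would not be visible to this kind of parsing.

```python
import xml.etree.ElementTree as ET

# Hypothetical static page fragment containing an inline svg element.
PAGE = """<html><body>
<svg xmlns="http://www.w3.org/2000/svg" width="100" height="50">
  <rect x="0" y="0" width="40" height="20"/>
  <text x="5" y="15">High</text>
</svg>
</body></html>"""

# svg elements live in their own XML namespace, so the XPath needs a prefix.
NS = {"svg": "http://www.w3.org/2000/svg"}

root = ET.fromstring(PAGE)

# Locate the svg element anywhere in the document.
svg = root.find(".//svg:svg", NS)

# List the local names of its direct children (namespace part stripped).
children = [child.tag.split("}")[1] for child in svg]

# Read the text of the <text> child.
label = svg.find("svg:text", NS).text

print(children, label)
```

ElementTree supports only a subset of XPath, which is enough here; a fuller XPath engine would be needed for axes such as following-sibling.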