Categories
Miscellaneous

What information do internet services collect about users?

We use your personal data so we can provide the best service, tell you about products and services you may be interested in…

Statements like this often appear in the fine print of the Terms of Service (ToS) on most modern sites. Below we describe which particular data are collected from web and app users.

Categories
Development Miscellaneous

Extracting sequential HTML elements with XPath and Regex

Often, we need to extract HTML elements that are ordered sequentially rather than hierarchically.
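
To make the problem concrete, here is a minimal sketch of the regex side of the idea. It assumes a hypothetical flat markup in which `h2` headings and `p` paragraphs are siblings, so each paragraph belongs to the heading that precedes it. The sample HTML and the `group_by_heading` name are illustrative assumptions, not taken from the post, and the post itself may take a different route.

```python
import re

# Hypothetical flat markup: headings and paragraphs are siblings, not nested.
HTML = """
<div>
  <h2>Apples</h2>
  <p>Red</p>
  <p>Green</p>
  <h2>Pears</h2>
  <p>Yellow</p>
</div>
"""


def group_by_heading(html):
    """Map each <h2> text to the <p> texts that follow it, up to the next <h2>."""
    groups = {}
    # Each match captures a heading plus everything up to the next heading
    # (or the end of the string), so document order defines the grouping.
    pattern = r"<h2>(.*?)</h2>(.*?)(?=<h2>|\Z)"
    for title, body in re.findall(pattern, html, flags=re.S):
        paragraphs = re.findall(r"<p>(.*?)</p>", body, flags=re.S)
        groups[title.strip()] = [p.strip() for p in paragraphs]
    return groups


if __name__ == "__main__":
    for heading, items in group_by_heading(HTML).items():
        print(heading, items)
```

A regex like this only holds up on simple, predictable markup; for real pages, a parser with XPath sibling axes is usually the safer choice.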

Categories
Development Miscellaneous

Cookie browser-server workflow

Categories
Guest posting Miscellaneous

Data extraction: web crawling vs. web scraping in E-commerce

Nowadays, when we have a question, it comes almost naturally to just type it into a search bar and get helpful answers. But we rarely wonder how all that information is available and how it appears as soon as we start typing. Search engines provide easy access to information, but web crawling and scraping tools, which are less well-known players, play a crucial role in gathering online content.

Categories
Miscellaneous

Huge JSON files view and search tool with excellent performance

Dadroit JSON Viewer Logo
The results of scraping activities are most often stored as JSON data, which has many advantages over the .xml or .csv formats. Recently, in one of my projects, I had to deal with JSON files of over 6 MB. Even though I managed them in Notepad++, search and count could have been better.

Categories
Miscellaneous

Endcaptcha now solving Recaptcha V2!

So far the latest developments of the services that develop captchas (Google, NuCaptcha, etc.) are no match for the captcha bypassers, and Endcaptcha is living proof of it.
Endcaptcha developers have been working hard to make this new feature possible – they’re finally releasing Recaptcha V2 support!

Categories
Miscellaneous SEO and Growth Hacking

LinkedIn lost in court to a data analytics company that scrapes LinkedIn’s public profile info

On September 9th, 2019, the United States Court of Appeals affirmed the district court’s determination that a certain [data] analytics company may lawfully scrape [perform automated gathering of] LinkedIn’s public profile info. This is a historic event: a court is protecting a data extractor’s right to mass-gather openly presented business directory information.

Categories
Miscellaneous

What are the best online resources to acquire data?

Recently I received this question: What are the best online resources to acquire data from?

The top sites for data scraping are data aggregators. Why are they top in data extraction?
They are top because they provide the fullest, most comprehensive data [sets]. The data in them are highly categorized. Therefore you do not need to crawl and fetch other resources and then combine multiple-resource data.

Those sites fall into 2 categories:

1. Goods and services aggregators, e.g. AliExpress, Amazon, Craigslist.
2. Personal and company data aggregators, e.g. LinkedIn, Xing, YellowPages. Another name for such aggregators is business directories.

The first category of sites and services is quite widespread. These sites and services promote their goods with the goal of being well known online and having as many backlinks as possible.

The second category, the business directories, does not tend to reveal its data to the public. These directories rather promote their brand and give scraping bots minimal opportunity to acquire data*.

Consider the following picture, where a company data aggregator gives the user only 2 input fields: what and where.

what_where_data_aggregator
You can find more on how to scrape data aggregators in this post.

————–
*You have to adhere to the ToS of each particular website or web service when you scrape its data.

Categories
Miscellaneous

Meet Phantombuster – an awesome tool for creating your own APIs and extending your audience via social networks

As you know, huge social networks are very useful instruments for improving a business, especially an IT business. Developers, designers, CEOs, HR and product managers share useful information and look for useful acquaintances, business partners and co-workers. But how does one automate the process of finding new people and attracting them to one's resource? With Phantombuster it’s not a problem at all. In today’s article we will consider how to use the Phantombuster APIs in different areas.

Categories
Miscellaneous

Phantombuster API list

I’ve categorized Phantombuster’s scraping APIs for my own sake, yet the list might be a good reference point for others too.