Web scraping, also known as web harvesting or web data extraction, is the process of automatically extracting data from websites. The extracted data can be used for many purposes, such as market research, price monitoring, sentiment analysis, and more. However, web scraping has both advantages and disadvantages. In this article, we will discuss the five main pros and cons of web scraping.
AirTable scrape challenge
The problem is that the data is loaded highly dynamically: the HTML contains only the information currently visible on the browser screen.

If there are a lot of records, it is difficult to collect such a table. One possible way is to calculate the screen size and the number of visible rows, then use browser automation to write a script that scrolls through the table bit by bit and collects the data.
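As a rough illustration, here is a minimal sketch of that scroll-and-collect approach using Selenium in Python. The shared-view URL, the scroll-pane selector, and the row selector are placeholder assumptions, since AirTable's actual markup may differ.

```python
# Minimal sketch: scroll an AirTable-like view bit by bit and collect the rows
# that become visible. URL and CSS selectors are placeholder assumptions.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://airtable.com/shrXXXXXXXXXXXXXX")  # placeholder shared-view URL
time.sleep(5)  # give the dynamic grid time to render

collected = set()
pane = driver.find_element(By.CSS_SELECTOR, ".antiscroll-inner")  # assumed scroll pane
for _ in range(50):  # scroll in small steps
    for row in driver.find_elements(By.CSS_SELECTOR, ".dataRow"):  # assumed row class
        collected.add(row.text)  # dedupe rows by their visible content
    driver.execute_script(
        "arguments[0].scrollTop += arguments[0].clientHeight;", pane
    )
    time.sleep(0.5)  # let the next batch of rows load

print(len(collected), "rows collected")
driver.quit()
```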
Is there any other feasible way to get the table data? For example, is there a way to fetch the dynamic data directly through HTTP requests?
JS infinite scroll techniques do not work for AirTable either.
Please comment below if you have any tips or hints.
We have already shared some Tips and Tricks for scraping business directories and data aggregator sites. Yet recently someone asked us about scraping aggregators in the context of Google Sheets and/or MS Excel.
Recently we performed a Yelp business directory scrape to acquire high-quality B2B leads (company + CEO info). This forced us to apply many techniques, such as proxying, scraping external company sites, email verification, and more.
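Of those techniques, proxying is the simplest to illustrate. Below is a minimal sketch of routing requests through a proxy in Python; the proxy endpoint, credentials, and target URL are placeholders, not the setup we actually used.

```python
# Minimal sketch: sending scrape requests through a proxy with the requests
# library. Proxy URL, credentials, and the target page are placeholders.
import requests

PROXY = "http://user:pass@proxy.example.com:8000"  # placeholder proxy endpoint
proxies = {"http": PROXY, "https": PROXY}

response = requests.get(
    "https://www.yelp.com/biz/some-business",   # placeholder target page
    proxies=proxies,
    headers={"User-Agent": "Mozilla/5.0"},      # a browser-like UA reduces blocking
    timeout=30,
)
print(response.status_code, len(response.text))
```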
| Feature | Octoparse | Dexi.io | Mozenda | Sequentum SaaS | Import.io |
|---|---|---|---|---|---|
| Able to set up robot/agent | 3 min | 3 failures in a row | "For some insight, we are working with customers in managed service engagements for large scale, mission critical web integration requirements - so we no longer have a SaaS tool offering. We have a heavy focus in digital commerce and work with customers on use cases in ecomm/retail, travel/hospitality, and tickets/events." - customer service | | |
| Support response | 12 hours. It does an excellent job. | 12 hours | 12 hours | 12 hours | |
| Base64 encoding | no | Using a JavaScript step; btoa() is a function that takes a string and encodes it to Base64. | yes, one can encode the given value in the Transformation Script of any command | | |
| Robot/agent development assistance | yes | | | | |
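For reference, outside of these tools the same Base64 step is a one-liner in most languages; here is a rough Python equivalent of what a JavaScript btoa() step or a Transformation Script would do (the value is a placeholder).

```python
# Rough Python equivalent of the tools' Base64 encoding step.
import base64

value = "captured text"  # placeholder for a scraped field
encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
print(encoded)  # Y2FwdHVyZWQgdGV4dA==
```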

Recently we encountered a powerful new scraping service called Web Scraper IDE [of Bright Data]. A live test and a thorough drill-down are coming soon. For now, we want to highlight the main features that have strongly impressed us.

Let me tell you what you already know: Octoparse is a great web scraping tool! But like every great tool, it has its limitations. At times, you may wonder if there are any alternatives to Octoparse. We wondered the same and put together this blog to provide you with a short list of Octoparse alternatives, along with their features and distinguishing factors. Let's get started!
Selenium Web Scraping in simple words
Question: What is Selenium web scraping?
Answer: A picture is better than 1000 words:
So, you write a program in Python, PHP, Java, Ruby, or whatever language you use, in order to browse(), select(), click(), submit(), save(), etc., the target web pages.
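To make that concrete, here is a minimal Selenium sketch in Python; the target URL, the CSS selector, and the output file are placeholders, not taken from the original post.

```python
# Minimal sketch of Selenium web scraping: drive a real browser to open a page,
# select elements, and save their text. URL and selector are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                         # browse(): start a browser
try:
    driver.get("https://example.com/products")      # open the target page
    items = driver.find_elements(By.CSS_SELECTOR, ".product-title")  # select()
    with open("titles.txt", "w", encoding="utf-8") as f:             # save()
        for item in items:
            f.write(item.text + "\n")
finally:
    driver.quit()
```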
LinkedIn scrape guidelines
The LinkedIn crawl success rate is low; a single bot request might require several retries to succeed. So here we share the crucial LinkedIn scraping guidelines.
- Rate limit: limit the crawling rate for LinkedIn. The acceptable approximate frequency is 1 request per second, or 60 requests per minute (see the sketch after this list).
- Public pages only: LinkedIn allows bots to access only public pages; private pages cannot be crawled.
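Here is a minimal sketch of that rate limit in Python: roughly one request per second. The URLs are placeholder public pages, and a real crawler would also need retry logic and session handling, as noted above.

```python
# Minimal sketch: throttle LinkedIn requests to ~1 per second (60 per minute).
# URLs are placeholders; real crawls also need retries for failed requests.
import time
import requests

urls = [
    "https://www.linkedin.com/company/example-one/",   # placeholder public pages
    "https://www.linkedin.com/company/example-two/",
]

for url in urls:
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    print(url, response.status_code)
    time.sleep(1)  # ~1 request per second, 60 requests per minute
```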
- Online marketplaces: sites where people offer their products for sale, similar to garage sales but online (e.g. eCrater, www.1188.no). They are easy to scrape since they are usually free and do not tend to protect their data.
- Business directories: usually huge online directories targeted at a general audience (e.g. Yellow Pages). They do protect their data to avoid duplication and loss of audience. See some posts on this.