Categories
Development

Scrapy to get dynamic business directory data thru API

In this post I want to share how one may scrape business directory data (real estate) using the Scrapy framework.

Categories
Challenge Development

How do I get past the dynamic “load more” button?

Recently I got a question: How do I get past the dynamic “load more” button using a Python web scraper?
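
One common way around a “load more” button, though not necessarily the approach the post takes, is to skip the button and request the paged data behind it directly. The sketch below assumes the site exposes numbered pages and signals the end with an empty page. `fetch_page` and `fake_fetch` are hypothetical stand-ins for a real HTTP call.

```python
from typing import Callable, Iterator, List


def iter_load_more(
    fetch_page: Callable[[int], List[dict]], max_pages: int = 100
) -> Iterator[dict]:
    """Yield items page by page until an empty page comes back.

    fetch_page is a hypothetical callable: it takes a 1-based page number
    and returns the list of items that "load more" would have appended.
    """
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:  # assumption: an empty page means there is no more data
            break
        yield from items


def fake_fetch(page: int) -> List[dict]:
    """Stand-in for a real request; returns canned data for two pages."""
    data = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}]}
    return data.get(page, [])


results = list(iter_load_more(fake_fetch))
```

In a real scraper, `fake_fetch` would be replaced by a function that calls the endpoint seen in the browser's network tab when the button is clicked. The `max_pages` cap guards against a site that never returns an empty page.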

Categories
Challenge Development

CloudFlare – a limited feature anti-content-duplicate tool

Here we come to the next anti-scrape tool, called CloudFlare, formerly ScrapeShield. The CloudFlare app has been developed by CloudFlare to guard a site’s content. Its features are limited in number, but it’s still an interesting tool to look at for anyone interested in web scraping.

Categories
Review Web Scraping Software

Web Content Extractor Review

Web Content Extractor is a visual, user-oriented tool that scrapes typical pages. Its simplicity makes for a quick start in data ripping.

Categories
Challenge Development

Python, Selenium for custom browser automation scraper

Recently we got a tricky website whose data is dynamic in nature. Yet we applied modern-day scraping tools to fetch the data: we developed an effective Python scraper using the Selenium library for browser automation. About the project: we were asked to have a look at a retailer website, and our task was to gather […]

Categories
Development

Puppeteer async scraper with browsers number to be tuned based on CPU capacity

Recently we got a tricky website with dynamic content to scrape. The data is loaded through XHRs into each part of the DOM (HTML markup). So the task was to develop an effective scraper that works asynchronously while using reasonable CPU resources.
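
The idea of tuning the number of parallel browsers to the CPU can be sketched as follows. The post itself uses Puppeteer; this is a Python `asyncio` illustration of the same pattern and not the post's code. `browser_count`, `scrape_all`, and `fake_scrape`, along with the `per_core` and `cap` defaults, are assumptions made for the example.

```python
import asyncio
import os


def browser_count(cpu_count=None, per_core=1, cap=8):
    """Pick how many browsers to run at once from the number of CPU cores.

    per_core and cap are illustrative knobs, not values from the post.
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cap, cores * per_core))


async def scrape_all(urls, scrape_one, workers):
    """Run scrape_one over urls with at most `workers` running concurrently."""
    sem = asyncio.Semaphore(workers)

    async def bounded(url):
        async with sem:  # wait here while `workers` scrapes are in flight
            return await scrape_one(url)

    # gather returns results in the same order as the input urls
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_scrape(url):
    """Stand-in for driving a real headless browser page."""
    await asyncio.sleep(0)
    return url.upper()


pages = asyncio.run(scrape_all(["a", "b", "c"], fake_scrape, browser_count()))
```

The semaphore keeps the event loop from opening more browser pages than the machine can serve, so a larger URL list queues up instead of exhausting the CPU.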

Categories
Uncategorized

Pros and Cons of using Selenium WebDriver for Website Scraping

Since Selenium WebDriver was created for browser automation, it can easily be used for scraping data from the web. In this post we will consider some advantages and drawbacks of using WebDriver for web scraping.

Categories
Development SaaS

Dexi Pipes: multi-threaded web scraping of site aggregators

Today I want to share my experience with Dexi Pipes. Pipes is a new kind of robot introduced by Dexi.io to integrate web data extraction and web data processing into a single seamless workflow. The main focus of the testing is to show how Dexi might leverage multi-threaded jobs for extraction of data from a […]

Categories
Web Scraping Software

Dexi.io – how to improve performance

Some may argue that extracting 3 records per minute is not fast enough for an automated scraper (see my last post on Dexi multi-threaded jobs). However, you should realize that Dexi extractor robots behave like a full-blown modern browser and fetch all the resources that crawled pages load (CSS, JS, fonts, etc.). In terms […]

Categories
Development

HTTP vs HTTPS

In this post we will cover the most vital facts and the pros and cons of HTTPS vs HTTP. Besides the security advantage, we will consider the main things that make a difference: caching, performance, virtual hosting, and others.