Web Page Change Tracking

Often, you want to detect changes in eBay listings or be notified of the latest items of interest on Craigslist in your area. Or you want to monitor updates on a website (a competitor’s, for example) where no RSS feed is available. Would you do it by visiting the site over and over again? No: there are now handy tools for website change monitoring. We’ve evaluated several of them and recommend the most useful ones to make your monitoring job easy. These tools nicely complement web scraping software, services and plugins.
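The core idea behind most change monitors can be illustrated with a minimal Python sketch: store a fingerprint of the page and compare it with a fingerprint taken later. The function names and the sample pages below are hypothetical, and real tools usually compare only a selected region of the page rather than the whole document.

```python
import hashlib


def fingerprint(page_text: str) -> str:
    """Return a SHA-256 hex digest of the text, ignoring whitespace differences."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, page_text: str) -> bool:
    """Report whether the page text no longer matches the stored fingerprint."""
    return fingerprint(page_text) != previous_fingerprint


# Hypothetical snapshots of the same page taken at two different times.
old_page = "<html><body><p>Price: $10</p></body></html>"
new_page = "<html><body><p>Price: $12</p></body></html>"

baseline = fingerprint(old_page)
print(has_changed(baseline, old_page))
print(has_changed(baseline, new_page))
```

In practice the page text would be downloaded on a schedule and the baseline kept in a file or database between runs.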

About XPath

XPath is a formal language for navigating and querying elements and attributes in XML documents. Although the notation is used within XSL and XQuery, it is also very useful on its own for DOM data access and extraction. With XPath, both XML and HTML/XHTML documents can be parsed into a DOM and queried.
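As a small illustration, the sketch below queries a made-up XML catalog with Python's standard `xml.etree.ElementTree` module. The catalog and its element names are assumptions for the example. Note that ElementTree supports only a subset of XPath; a library such as lxml is needed for full XPath 1.0 expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this example.
CATALOG = """
<catalog>
  <item id="1" category="book"><title>XPath Basics</title><price>10</price></item>
  <item id="2" category="tool"><title>Scraper Kit</title><price>25</price></item>
  <item id="3" category="book"><title>DOM in Depth</title><price>18</price></item>
</catalog>
""".strip()

root = ET.fromstring(CATALOG)

# Select the <title> of every <item> whose category attribute equals "book".
book_titles = [title.text for title in root.findall("./item[@category='book']/title")]

# Read an attribute from the first matching element.
first_id = root.find("./item").get("id")

print(book_titles)
print(first_id)
```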

Python: submit authenticated form using cookie and session

Recently I was challenged to do bulk submits through an authenticated form on a website that requires login. While there are plenty of examples of how to use POST and GET in Python, I want to share with you how I handled the session along with a cookie and an authenticity token (CSRF-like protection).

In this post we are going to cover the crucial techniques needed for scripted web scraping:

  • persistent session usage
  • cookie finding and storing [in session]
  • “auth token” finding, retrieving and submitting in a form
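The three techniques above can be sketched with the Python standard library alone. This is a minimal sketch and not the code from the post: the field name `authenticity_token`, the helper names and the sample HTML are all assumptions, and a real site may name or place its token differently.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser

TOKEN_FIELD = "authenticity_token"  # assumed name of the hidden form field


class TokenFinder(HTMLParser):
    """Remember the value of the <input> whose name matches field_name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == self.field_name:
            self.token = attributes.get("value")


def extract_token(html, field_name=TOKEN_FIELD):
    """Return the token value found in the HTML, or None if it is absent."""
    finder = TokenFinder(field_name)
    finder.feed(html)
    return finder.token


def make_session():
    """Build an opener that stores cookies and resends them on every request."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def submit_form(opener, form_url, fields):
    """GET the form to obtain cookie and token, then POST the fields with the token."""
    with opener.open(form_url) as response:
        page = response.read().decode("utf-8", errors="replace")
    token = extract_token(page)
    if token is None:
        raise ValueError("no token field found in the form page")
    payload = dict(fields)
    payload[TOKEN_FIELD] = token
    data = urllib.parse.urlencode(payload).encode("utf-8")
    with opener.open(form_url, data=data) as response:
        return response.status


# Offline demonstration of the token lookup on hypothetical form markup.
sample = '<form><input type="hidden" name="authenticity_token" value="abc123"></form>'
print(extract_token(sample))
```

Because the same opener is used for both requests, any session cookie set by the first response is sent back automatically with the POST. `submit_form` is defined but not called here, since it needs a real form URL.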

UiPath – Robotic Process Automation Software

UiPath is Enterprise Robotic Process Automation (RPA) software designed to empower companies to automate repetitive, manual, rules-based business processes. Any repetitive task a user performs on their computer, including data entry, legacy application integration, data or content migration, screen scraping and testing, can be automated with UiPath.

Pros and Cons of using Selenium WebDriver for Website Scraping

Since Selenium WebDriver was created for browser automation, it can easily be used for scraping data from the web. In this post we will consider some advantages and drawbacks of using WebDriver for web scraping.

What is Web Scraping?

Web scraping (a.k.a. web data mining, web data processing, web data extraction, web content extraction, web harvesting, web screen scraping, web crawling, web ripping, etc.) is the process of extracting useful information from the web.

Software for Web Scraping

There are many web data extraction applications and some cloud services available, and they vary widely in cost and features. Here we’ve summarized them to help you make your choice. All of these programs and services have either been tested by us or have been in general use for web ripping. We hope these brief overviews and the reviews that follow will help you choose the best web scraper for your purposes.

Bypass Distil

The Distil scrape protection is prominent among modern anti-scraping techniques, so we want to share with you some tips on how to bypass it. If you are interested, please send an inquiry to the following email: igor[dot]savinkin[at]gmail[dot]com


How to scrape Yellow Pages with ScreenScraper Chrome Extension

Recently I was asked to help with the job of scraping company information from the Yellow Pages website using the ScreenScraper Chrome Extension. After working with this simple scraper, I decided to create a tutorial on how to use this Google Chrome Extension for scraping pages similar to this one. Hopefully, it will be useful to many of you.

Scraping JavaScript protected content

Here we reach a new milestone: scraping JavaScript-driven (JS-rendered) websites.

Recently a friend of mine got stumped trying to get the content of a website using the PHP simplehtmldom library. He kept failing and finally found out that the site was saturated with JavaScript code. The anti-scrape JavaScript insertions perform a tricky check to see whether the page is requested and processed by a real browser, and only if that is true do they render the rest of the page’s HTML code.