Often, we need to extract HTML elements in sequential order rather than in hierarchical order.
Recently I was asked to look at a brand-new online regex tester, regviz.org, developed as a collaboration between VISUS, the University of Stuttgart and the University of Trier. Though there are a lot of online regex testers on the market today, and many of them are quite good, let’s look at what is special about regviz.org and what it lacks.
Debuggex is an online regex testing tool that visualizes how a regex is matched. The visualization is useful both for learners exploring regex and for experienced users who want to step forward or backward through a match. It also highlights matches instantly, so there is no need to press a button to run a pattern. This tool is one of a dozen online regex testers.
In this post, I’ll explain how to do a simple web page extraction in PHP using cURL, the ‘Client URL library’.
PHP’s cURL extension is built on libcurl, a library that lets you connect to servers over many different protocols, including HTTP and HTTPS. Getting data from the web this way is more robust than a simple file_get_contents() call, because cURL gives you control over headers, cookies and error handling. If the cURL extension is not installed, you can read here for Windows or here for Linux.
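As a minimal sketch of the header, cookie and error handling mentioned above, a page fetch with cURL might look like the following. The function name `fetch_page()`, its return shape and the timeout values are assumptions made for illustration, not code from the post; the `curl_*` calls and `CURLOPT_*` constants are the standard ones from PHP’s cURL extension.

```php
<?php
// Hypothetical helper (the name and return shape are illustrative assumptions).
// Fetches a URL with PHP's cURL extension and reports errors explicitly.
function fetch_page(string $url, int $timeoutSeconds = 10): array
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,      // seconds allowed for connecting (assumed value)
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole request
        CURLOPT_COOKIEFILE     => '',     // empty string turns on in-memory cookie handling
        CURLOPT_HTTPHEADER     => ['Accept: text/html'], // custom request header
    ]);

    $body = curl_exec($ch); // false on failure, the response body otherwise

    return [
        'body'   => $body === false ? null : $body,
        'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE), // 0 if no response arrived
        'error'  => $body === false ? curl_error($ch) : '',
    ];
}
```

Calling `fetch_page('https://example.com/')` (a placeholder URL) returns the HTML in `body` and the HTTP status in `status`. When the request fails, `body` is `null` and `error` holds cURL’s message, which is the kind of detail a bare file_get_contents() call does not readily give you.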
We’d like to introduce the new SCRAPER TEST DRIVE stage, called ‘Text list’. This seemingly simple test case hides an unusual structure. This time the HTML DOM structure is so plain that it leaves you scratching your head, wondering how to approach it. Yet the off-the-shelf products have shown their best features, extracting even the smallest thing from seemingly plain content.