Scraping in PHP with cURL

In this post, I’ll explain how to extract data from a web page in PHP using cURL, the ‘Client URL library’.

PHP's cURL extension is built on libcurl, a library that lets you connect to servers over many different protocols, including HTTP and HTTPS. Compared with a plain file_get_contents() call, it gives you more control over headers, cookies, and error handling. If the cURL extension is not installed, install and enable it first; the steps differ between Windows and Linux.
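Before going further, it can help to confirm that the extension is actually available. The following is a minimal sketch, not part of the original scraper: extension_loaded() and curl_version() are standard PHP functions, while the helper name curlIsAvailable is only illustrative.

```php
<?php
// Hypothetical helper: reports whether PHP's cURL extension is loaded.
function curlIsAvailable(): bool
{
    return extension_loaded('curl');
}

if (curlIsAvailable()) {
    // curl_version() returns an array describing the linked libcurl build.
    $info = curl_version();
    echo 'libcurl version: ' . $info['version'] . PHP_EOL;
    echo 'Protocols: ' . implode(', ', $info['protocols']) . PHP_EOL;
} else {
    echo 'The cURL extension is not loaded; enable it in php.ini first.' . PHP_EOL;
}
```

If http and https appear in the protocol list, the rest of this post should work as written.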

Setting Up cURL

First, we need to initialize the cURL handle:

$curl = curl_init("http://testing-ground.scraping.pro/textlist");

Then set CURLOPT_RETURNTRANSFER to TRUE so that curl_exec() returns the page as a string instead of printing it directly:

curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
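When a scraper needs more than one option, curl_setopt_array() can set them in a single call. The sketch below is an illustration only: the timeout values and the user agent string are assumptions, not settings the test site requires.

```php
<?php
$curl = curl_init('http://testing-ground.scraping.pro/textlist');

// Set several options at once; the values below are illustrative assumptions.
$ok = curl_setopt_array($curl, [
    CURLOPT_RETURNTRANSFER => true,  // return the body from curl_exec()
    CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
    CURLOPT_CONNECTTIMEOUT => 10,    // seconds to wait for the connection
    CURLOPT_TIMEOUT        => 30,    // seconds allowed for the whole request
    CURLOPT_USERAGENT      => 'ExampleScraper/1.0', // hypothetical user agent
]);

if (!$ok) {
    echo 'Could not set one or more cURL options';
    exit;
}
```

curl_setopt_array() returns false as soon as one option cannot be set, so checking its result catches typos in option values early.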

Executing the Request & Checking for Errors

Now, start the request and perform an error check:

$page = curl_exec($curl);

if(curl_errno($curl)) // check for execution errors
{
	echo 'Scraper error: ' . curl_error($curl);
	exit;
}
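Note that curl_errno() only reports transport-level problems. A page that answers with 404 or 500 still counts as a successful transfer, so it may be worth checking the HTTP status as well. A minimal sketch, assuming a hypothetical helper named fetchPage and treating any status of 400 or above as a failure:

```php
<?php
// Hypothetical helper: fetch a URL, failing on transport errors or HTTP status >= 400.
function fetchPage(string $url): string
{
    $curl = curl_init($url);
    if ($curl === false) {
        throw new RuntimeException('Could not initialize cURL');
    }
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

    $page = curl_exec($curl);

    if ($page === false) {
        // Transport-level failure: DNS, connection, timeout, unsupported protocol...
        throw new RuntimeException('Scraper error: ' . curl_error($curl));
    }

    // curl_errno() stays 0 for responses such as 404, so inspect the status too.
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException('Unexpected HTTP status: ' . $status);
    }

    return $page;
}
```

Calling fetchPage() inside a try/catch block for RuntimeException lets the caller decide whether to retry, log, or stop, instead of exiting from inside the helper.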

Closing the Connection

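To close the cURL handle, call the following (since PHP 8.0 the call has no effect, because handles are objects that are freed automatically, but it is harmless):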

curl_close($curl);

Extracting Only the Needed Part and Printing It

Once we have the page content, we can extract just the part we need, the div with id="case_textlist":

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';
if ( preg_match($regex, $page, $list) )
    echo $list[0];
else
    echo "Not found";
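The snippet above prints $list[0], which is the whole match including the div tags. If only the text between the tags is wanted, the first capture group $list[1] holds it. A small sketch of the difference, using a made-up sample string in place of the downloaded page:

```php
<?php
// Sample markup standing in for the downloaded page (an assumption for illustration).
$page = '<html><body><div id="case_textlist">First item, second item</div></body></html>';

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';

if (preg_match($regex, $page, $list)) {
    echo $list[0] . PHP_EOL; // whole match, including the <div> tags
    echo $list[1] . PHP_EOL; // first capture group: only the text inside the div
} else {
    echo 'Not found' . PHP_EOL;
}
```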

The Whole Scraper Listing

<?php
$curl = curl_init('http://testing-ground.scraping.pro/textlist');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
 
$page = curl_exec($curl);
 
if(curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
 
curl_close($curl);

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';
if ( preg_match($regex, $page, $list) )
    echo $list[0];
else
    echo "Not found";
?>
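One limitation of the regex in this listing is that the non-greedy (.*?) stops at the first closing div tag, so a target div that contains nested divs would be cut short. As an alternative technique, not the method used in this post, PHP's DOM extension can select the element by id. A sketch under the assumption that the DOM extension is enabled, with a made-up sample string in place of the downloaded page:

```php
<?php
// Sample markup with a nested div, standing in for the downloaded page.
$page = '<html><body><div id="case_textlist"><div>Inner</div> tail</div></body></html>';

$dom = new DOMDocument();
// Real-world HTML is often invalid; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();

// Select the div by its id attribute.
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//div[@id="case_textlist"]');

if ($nodes !== false && $nodes->length > 0) {
    // textContent concatenates the text of the element and all its descendants.
    $text = trim($nodes->item(0)->textContent);
    echo $text;
} else {
    echo 'Not found';
}
```

For simple, flat markup the regex is shorter; the DOM approach tends to pay off once the page structure gets nested or irregular.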

Use this sample as a starting point for your own day-to-day scraping tasks.

22 replies on “Scraping in PHP with cURL”

Great article. I found it very helpful. I am having an issue pulling information from a website and am convinced it is a problem with the regex, but not sure where the issue is. What is a good venue to seek assistance?

Thanks.

I am writing this regular expression, but I am also getting the div tags in the output. I only want to get the text between the tags.
That is, I am getting ALLERGY SCREEN together with its div tags.
What should I do if I only want "Allergy screen" from it?

Greetings,

I am trying to get cURL to print the page for whichever link is clicked. For example, if I visit localhost.com it should print the home page, and if I click localhost.com/test (or any other link) I want it to print that page instead of the home page. However, the URL changes to the link I clicked, but the page shown is still the index page. I posted the same question here and hope someone can help, as it is driving me mad (https://stackoverflow.com/questions/47962419/curl-php-always-returns-to-homepage)
