Scraping in PHP with cURL

In this post, I’ll explain how to extract content from a web page in PHP using cURL, the ‘Client URL library’.

PHP's cURL extension is built on libcurl, a library that lets you connect to servers over many different protocols, including HTTP and HTTPS. Fetching data this way gives you more control over headers, cookies and error handling than a plain file_get_contents() call. If the cURL extension is not installed, you will need to install and enable it first; the steps differ between Windows and Linux.
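
Before running the examples, it can help to confirm that the extension is actually loaded. The following is a minimal sketch; the helper name curlAvailable is purely illustrative and not part of PHP or cURL.

```php
<?php
// Hypothetical helper: reports whether PHP's cURL extension is usable.
function curlAvailable(): bool
{
    // function_exists() also catches the case where curl_init was disabled in php.ini.
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curlAvailable()) {
    // curl_version() returns an array describing the underlying libcurl build.
    $info = curl_version();
    echo 'cURL is available, libcurl version ' . $info['version'] . PHP_EOL;
} else {
    echo 'cURL is not available; install or enable the extension first.' . PHP_EOL;
}
```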

Setting Up cURL

First, we need to initiate the cURL handle:

$curl = curl_init("http://testing-ground.scraping.pro/textlist");

Then set CURLOPT_RETURNTRANSFER to TRUE so that the page is returned as a string rather than printed directly:

curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
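
Real pages often redirect or respond slowly, so a few more options are commonly set alongside CURLOPT_RETURNTRANSFER. The sketch below applies them in one call with curl_setopt_array(). The option values (redirect limit, timeouts, user agent string) and the helper name scraperOptions are assumptions for illustration; the basic example does not require them.

```php
<?php
// Hypothetical helper returning a set of commonly used options.
function scraperOptions(): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,  // return the page as a string
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,     // ...but only up to 5 of them (assumed limit)
        CURLOPT_CONNECTTIMEOUT => 10,    // seconds to wait for a connection (assumed)
        CURLOPT_TIMEOUT        => 30,    // seconds allowed for the whole transfer (assumed)
        CURLOPT_USERAGENT      => 'ExampleScraper/1.0', // assumed value
    ];
}

$curl = curl_init('http://testing-ground.scraping.pro/textlist');

// curl_setopt_array() returns false as soon as one option cannot be set.
if (!curl_setopt_array($curl, scraperOptions())) {
    echo 'Could not set cURL options' . PHP_EOL;
}
```

Keeping the options in one array makes it easy to reuse the same settings for every page the scraper visits.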

Executing the Request & Checking for Errors

Now, start the request and perform an error check:

$page = curl_exec($curl);

if(curl_errno($curl)) // check for execution errors
{
	echo 'Scraper error: ' . curl_error($curl);
	exit;
}
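
By default, curl_errno() only reports failures of the transfer itself (DNS lookup, connection, timeout). A server that answers with 404 or 500 still counts as a completed transfer, so it can be worth checking the HTTP status as well. The following is a minimal sketch: the helper names describeFailure and fetchPage are hypothetical, and treating every status of 400 or above as a failure is an assumption.

```php
<?php
// Returns an error message, or null when the transfer looks fine.
function describeFailure(int $errno, string $error, int $status): ?string
{
    if ($errno !== 0) {
        return "Scraper error ($errno): $error";
    }
    if ($status >= 400) {
        return "HTTP error: server responded with status $status";
    }
    return null;
}

// Fetches a page and returns its body, or null after printing the problem.
function fetchPage(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $page = curl_exec($curl);

    $failure = describeFailure(
        curl_errno($curl),
        curl_error($curl),
        (int) curl_getinfo($curl, CURLINFO_HTTP_CODE)
    );
    if ($failure !== null) {
        echo $failure . PHP_EOL;
        return null;
    }
    return $page;
}
```

Calling fetchPage('http://testing-ground.scraping.pro/textlist') would then return either the page body or null.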

Closing the Connection

To close the connection, type the following:

curl_close($curl);

Extracting Only the Needed Part and Printing It

Once we have the page content, we can extract just the part we need, the div with id="case_textlist":

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';
if ( preg_match($regex, $page, $list) )
    echo $list[0];
else
    echo "Not found";
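
preg_match() fills $list with the full match at index 0 and each parenthesised group after it, so $list[0] includes the surrounding div tags while $list[1] holds only what lies between them. Here is a small sketch on a hard-coded sample string; the markup is made up for illustration and is not the real page.

```php
<?php
// Sample markup standing in for the downloaded page (an assumption).
$page = '<html><body><div id="case_textlist"><p>First item</p> <p>Second item</p></div></body></html>';

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';
if (preg_match($regex, $page, $list)) {
    $withTags = $list[0];            // full match, div tags included
    $inner    = $list[1];            // first capture group: inner HTML only
    $textOnly = strip_tags($inner);  // drop the remaining tags, keep visible text
    echo $inner . PHP_EOL;
    echo $textOnly . PHP_EOL;
} else {
    echo "Not found";
}
```

Use $list[1] when the wrapping div is not wanted, and add strip_tags() when only the visible text matters.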

The Whole Scraper Listing

<?php
$curl = curl_init('http://testing-ground.scraping.pro/textlist');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
 
$page = curl_exec($curl);
 
if (curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
 
curl_close($curl);

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';
if ( preg_match($regex, $page, $list) )
    echo $list[0];
else
    print "Not found";
?>

Use this sample as a starting point for your own everyday web scraping tasks.
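
One limitation worth knowing: the non-greedy pattern stops at the first closing </div>, so the result is cut short if the target block contains nested div elements. A possible alternative is PHP's DOM extension (assumed to be enabled here), sketched on a made-up sample string. The helper name extractById is hypothetical.

```php
<?php
// Hypothetical helper: returns the outer HTML of the element with the given id,
// or null if there is no such element. Assumes $id contains no double quotes.
function extractById(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query('//*[@id="' . $id . '"]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // Passing a node to saveHTML() serialises just that node and its children.
    return $doc->saveHTML($nodes->item(0));
}

// Sample markup with nested divs (an assumption, not the real page).
$page = '<html><body><div id="case_textlist"><div class="row">One</div><div class="row">Two</div></div></body></html>';

$block = extractById($page, 'case_textlist');
echo $block === null ? 'Not found' : $block;
```

Unlike the regular expression, this keeps both nested rows, because the parser tracks which closing tag belongs to which element.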

24 replies on “Scraping in PHP with cURL”

Great article. I found it very helpful. I am having an issue pulling information from a website and am convinced it is a problem with the regex, but not sure where the issue is. What is a good venue to seek assistance?

Thanks.

I am writing this regular expression, but I am also getting the div tags in the output. I only want to get the text between the tags.
I.e. I am getting ALLERGY SCREEN.
What should I do if I only want Allergy screen from it?

Greetings,

I am trying to get cURL to print whichever page was clicked. For example, if I visit localhost.com it should print the home page, and if I click localhost.com/test (or any other link) I want it to print that page instead of the home page. The URL changes to the link I clicked, but the page shown is still the index page. I posted the same question here and hope someone can help me, because it is driving me mad (https://stackoverflow.com/questions/47962419/curl-php-always-returns-to-homepage)

/* include your libcurl files at the top of your php file */
include('./libcurl/LIB_parse.php');
include('./libcurl/LIB_http.php');

/* set the page you want to scrape */
$web_page = http_get($target="https://page-to-scrape.com/", $ref="www.google.com");

/* set the parse variables for the data you want to get/scrape */
$parse_item = parse_array($web_page['FILE'], 'left side of item i want to scrape', 'right side of item i want to scrape');

/* If you're having trouble filtering the data you have parsed, you can often make it easier on yourself by using strip_tags().

strip_tags() will remove any HTML tags, along with the attributes inside them (e.g. title="" or alt=""), but it leaves the visible data alone. For example, in <b>data here</b> the bold tags will be removed but the "data here" will make it through with your $parse_item var. */

$var = 10;  /* this var is for how many times you want to run that particular parse.  
OR
if you plan on scraping every item of that type on the page and you don't want to count them you can do something like... (below) */

$count = count($parse_item); /* counts the parsed/scraped items; change $var to $count in the for() loop below to process every item automatically. */

for ($x = 0; $x < $var; $x++) {
     $parse_var = strip_tags($parse_item[$x]);
     echo $parse_var;
}

I just wrote this but if you copy and paste it in a php file it should work. Just edit the $target at the top and the parse_array left and right side values i pointed out.

include('./libcurl/LIB_parse.php'); 
include('./libcurl/LIB_http.php'); 

$web_page = http_get($target='https://page-to-scrape.com/', $ref='www.google.com'); 

$parse_item = parse_array($web_page['FILE'], 'left side of item i want to scrape', 'right side of item i want to scrape');

$var = 10;
$count = count($parse_item);

for ($x = 0; $x < $var; $x++) { 
    $parse_var = strip_tags($parse_item[$x]); 
    echo $parse_var; 
} 

Same as my post above without my tips/comments. I didn’t realize this comment box doesn’t format properly.
