Scraping in PHP with cURL

Post author: admin
Post date: 24.11.2012

In this post, I’ll explain how to do a simple web page extraction in PHP using cURL, the ‘Client URL library’.

PHP's cURL extension is built on libcurl, a library that lets you connect to servers over many different protocols, including HTTP and HTTPS. Fetching data this way gives you more control over headers, cookies and error handling than a plain file_get_contents() call. If the cURL extension is not installed, consult the installation instructions for Windows or for Linux.
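Before going further, it can help to confirm that the extension is actually loaded. The following is a minimal sketch that assumes nothing beyond a stock PHP install; curlIsAvailable() is a helper name made up for this example.

```php
<?php
// Hypothetical helper: report whether the cURL extension is usable.
function curlIsAvailable(): bool
{
    return extension_loaded('curl') && function_exists('curl_init');
}

if (curlIsAvailable()) {
    // curl_version() returns an array describing the linked libcurl.
    $info = curl_version();
    echo 'cURL is available, libcurl version ' . $info['version'] . "\n";
} else {
    echo "cURL extension is not loaded; enable it in php.ini or install it.\n";
}
```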

Setting Up cURL

First, we need to initialize the cURL handle:

$curl = curl_init("http://testing-ground.scraping.pro/textlist");

Then set CURLOPT_RETURNTRANSFER to TRUE so that curl_exec() returns the page as a string instead of printing it directly:

curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
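When a scraper needs more than one option, they can be set in a single call with curl_setopt_array(). The sketch below is only an illustration: the timeout values and the user agent string are assumptions, not settings required by the test page.

```php
<?php
$curl = curl_init('http://testing-ground.scraping.pro/textlist');

// Assumed values for illustration; tune them for the target site.
$options = [
    CURLOPT_RETURNTRANSFER => true,  // return the page as a string
    CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
    CURLOPT_CONNECTTIMEOUT => 10,    // seconds to wait for the connection
    CURLOPT_TIMEOUT        => 30,    // seconds allowed for the whole transfer
    CURLOPT_USERAGENT      => 'ExampleScraper/1.0', // hypothetical name
];

// curl_setopt_array() returns false if any option could not be set.
$optionsApplied = curl_setopt_array($curl, $options);
if (!$optionsApplied) {
    echo "Could not apply cURL options\n";
}
```

Nothing is sent over the network at this point; the request only starts when curl_exec() is called.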

Executing the Request & Checking for Errors

Now, start the request and perform an error check:

$page = curl_exec($curl);

if(curl_errno($curl)) // check for execution errors
{
	echo 'Scraper error: ' . curl_error($curl);
	exit;
}
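Note that curl_errno() reports only transport-level problems such as a DNS failure or a timeout. A server that answers with 404 or 500 still counts as a successful transfer, so it may be worth checking the HTTP status as well. Here is a hedged sketch; fetchPage() and isSuccessfulStatus() are hypothetical helper names, and treating only 2xx codes as success is an assumption.

```php
<?php
// Hypothetical helper: treat only 2xx responses as success.
function isSuccessfulStatus(int $status): bool
{
    return $status >= 200 && $status < 300;
}

// Hypothetical helper: fetch a URL, returning the body or null on failure.
function fetchPage(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

    $page = curl_exec($curl);
    if ($page === false || curl_errno($curl)) {
        // Transport-level failure (DNS, timeout, refused connection, ...)
        echo 'Scraper error: ' . curl_error($curl) . "\n";
        return null;
    }

    // curl_getinfo() exposes the status code of the last response.
    $status = (int) curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if (!isSuccessfulStatus($status)) {
        echo "Unexpected HTTP status: $status\n";
        return null;
    }

    return $page;
}
```

With these helpers, $page = fetchPage('http://testing-ground.scraping.pro/textlist'); yields the page body, or null if either check fails.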

Closing the Connection

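Once the transfer is done, close the handle to free its resources (since PHP 8.0, curl_close() has no effect and the handle is released automatically, so the call matters only on older versions):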

curl_close($curl);

Extracting Only the Needed Part and Printing It

Once we have the page content, we can extract just the part we need, the div with id="case_textlist":

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';
if ( preg_match($regex, $page, $list) )
    echo $list[0];
else
    echo "Not found";
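Keep in mind that $list[0] holds the whole match, including the surrounding div tags, while $list[1] holds only what the (.*?) group captured. The sketch below runs the same regular expression against a made-up sample string (not the real test page) to show the difference, and uses strip_tags() to reduce the captured content to plain text.

```php
<?php
// Sample markup standing in for a fetched page (made up for this example).
$page = '<div id="case_textlist"><b>ALLERGY SCREEN</b></div>';

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';
if (preg_match($regex, $page, $list)) {
    $whole = $list[0];            // full match, including the <div> tags
    $inner = $list[1];            // capture group: content between the tags
    $text  = strip_tags($inner);  // drop nested tags, keep the visible text
    echo $text . "\n";
} else {
    echo "Not found\n";
}
```

Because the group is lazy, the match stops at the first closing div tag, so a block that contains nested div elements would be cut short by this pattern.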

The Whole Scraper Listing

<?php
$curl = curl_init('http://testing-ground.scraping.pro/textlist');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
 
$page = curl_exec($curl);
 
if(curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
 
curl_close($curl);

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';
if ( preg_match($regex, $page, $list) )
    echo $list[0];
else
    echo "Not found";
?>

Use this sample as a starting point for your own day-to-day web scraping tasks.

Tags Curl, PHP, Regex

19 replies on “Scraping in PHP with cURL”

Hello, what if I want to scrape from multiple urls? Sorry for the question but I am a newbie to this. Thanks in advance.

Handling multiple URLs is not hard. Wrap this code in a function that takes the URL as an input parameter, then call it in a loop over your list of URLs.
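A rough sketch of that idea follows. The function names extractTextList(), scrapeUrl() and scrapeAll() are invented for this example, and it assumes every page in the list contains the same case_textlist block as the test page.

```php
<?php
// Hypothetical helper: pull the case_textlist block out of a fetched page.
function extractTextList(string $page): ?string
{
    $regex = '/<div id="case_textlist">(.*?)<\/div>/s';
    return preg_match($regex, $page, $list) ? $list[0] : null;
}

// Hypothetical helper: the scraper from the post with the URL as a parameter.
function scrapeUrl(string $url): ?string
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    $page = curl_exec($curl);

    if ($page === false) { // request failed: report it and move on
        echo "Scraper error for $url: " . curl_error($curl) . "\n";
        return null;
    }
    return extractTextList($page);
}

// Loop the function over a list of URLs; results are keyed by URL.
function scrapeAll(array $urls): array
{
    $results = [];
    foreach ($urls as $url) {
        $results[$url] = scrapeUrl($url);
    }
    return $results;
}
```

Calling scrapeAll() with an array of page addresses returns one entry per URL; a null value means the request failed or the block was not found on that page.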

Nice article, but I want to use a loop. How does it work?

Very Clear and Neat Explanation of curl. 🙂 Thank you Very Much.
if You have time please visit this website give review. http://www.manaresults.com

what will be regex for <span class=”*******”> inside a ;
Note: contains multiple .

“contains multiple” what? a tags or span tags? and what inside what? Better you give an example.

I am writing this regular expression, but I am also getting the div tags in the output. I only want the text between the tags.
i.e. I am getting ALLERGY SCREEN.
What should I do if I only want "Allergy screen" from it?

Raghav Sikri, I’m not clear on what you mean. Can you explain more? Any screenshots?

I want to parse the div > ul > li > div > strong > a data (LINK and LINK 1 in the code below) from a website, but I can’t get it to work. Can you suggest a regex to get the values of LINK and LINK 1?

HTML is:

icon

Link 1

Regex is:

/\s*([^<]+?)<\/a>/g

See this link where regex works.

Thank u very much for this code.
This code is very effective.

I want to show some specific data from a table and match it against my database table using a WHERE condition.
Website: http://www.cse.com.bd/current_share_price_ltp.php
Class: RightColInner

Can you suggest how I can do this?
Thanks

thanks alot 😉

Really helpful script! Thanks for sharing that.

Thanks for this simple tutorial.

Greetings,

I’m trying to get cURL to print whichever page was clicked. For example, if I visit localhost.com it prints the home page, and if I click localhost.com/test (or any other link) I want it to print that page instead of the home page. The URL changes to the link I clicked, but the page shown is still the index page. I posted the same question here, and I hope someone can help me with it because it’s driving me mad (https://stackoverflow.com/questions/47962419/curl-php-always-returns-to-homepage)

How can I get the links without the domain name? I’m getting domain.com/link.html, but I just want /link.html.

Just cut the domain name off each harvested link.
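One way to do that is with PHP's parse_url(). The following is a sketch under assumptions: stripDomain() is a made-up helper name, and it expects links shaped like domain.com/link.html, with or without a scheme in front.

```php
<?php
// Hypothetical helper: keep only the path (plus query and fragment) of a link.
function stripDomain(string $link): string
{
    // Already a root-relative path such as "/link.html": nothing to strip.
    if (strpos($link, '/') === 0 && strpos($link, '//') !== 0) {
        return $link;
    }

    // parse_url() only recognises the host after a scheme or a leading "//".
    $candidate = $link;
    if (!preg_match('#^([a-z][a-z0-9+.-]*:)?//#i', $candidate)) {
        $candidate = '//' . $candidate;
    }

    $parts = parse_url($candidate);
    if ($parts === false) {
        return $link; // leave malformed input untouched
    }

    $path = $parts['path'] ?? '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    if (isset($parts['fragment'])) {
        $path .= '#' . $parts['fragment'];
    }
    return $path;
}

echo stripDomain('domain.com/link.html') . "\n"; // prints "/link.html"
```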

/* include your libcurl files at the top of your php file */
include('./libcurl/LIB_parse.php');
include('./libcurl/LIB_http.php');

/* set the page you want to scrape */
$web_page = http_get($target="https://page-to-scrape.com/", $ref="www.google.com");

/* set the parse variables for the data you want to get/scrape */
$parse_item = parse_array($web_page['FILE'], 'left side of item i want to scrape', 'right side of item i want to scrape');

/* if you're having trouble trying to filter the data you have parsed you can make it a bit easier on yourself a lot of times by using strip_tags();. 

strip_tags() will remove any HTML tags, and data within the tags (ie: title="" or alt="") but it leaves the visible data alone. For example, in <b>data here</b> the bold tags will be removed but the "data here" will make it through with your $parse_item var. */

$var = 10;  /* this var is for how many times you want to run that particular parse.  
OR
if you plan on scraping every item of that type on the page and you don't want to count them you can do something like... (below) */

$count = count($parse_item); /* count the array's parsed/scraped and it will automatically scrape all the items you're trying to get by changing the $var to $count in the for() loop below. */

for ($x = 0; $x < $var; $x++) {
     $parse_var = strip_tags($parse_item[$x]);
     echo $parse_var;
}

I just wrote this but if you copy and paste it in a php file it should work. Just edit the $target at the top and the parse_array left and right side values i pointed out.

include('./libcurl/LIB_parse.php'); 
include('./libcurl/LIB_http.php'); 

$web_page = http_get($target='https://page-to-scrape.com/', $ref='www.google.com'); 

$parse_item = parse_array($web_page['FILE'], 'left side of item i want to scrape', 'right side of item i want to scrape');

$var = 10;
$count = count($parse_item);

for ($x = 0; $x < $var; $x++) { 
    $parse_var = strip_tags($parse_item[$x]); 
    echo $parse_var; 
}

Same as my post above without my tips/comments. i didn’t realize this comment box doesn’t format properly.
