In this post, I’ll explain how to do simple web page extraction in PHP using cURL, the ‘Client URL library’.
PHP’s cURL functions are built on libcurl, a library that lets you connect to servers over many different protocols, including HTTP and HTTPS. This way of getting data from the web handles headers, cookies, and errors far more robustly than a simple file_get_contents() call. If cURL is not installed, you can read here for Windows or here for Linux.
Setting Up cURL
First, we need to initiate the cURL handle:
$curl = curl_init("http://testing-ground.scraping.pro/
textlist");
Then, set CURLOPT_RETURNTRANSFER to TRUE so that curl_exec() returns the page as a string rather than outputting it directly:
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
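As mentioned in the intro, cURL also gives you fine-grained control over headers, cookies, redirects, and timeouts. None of this is required for the demo page, but a minimal sketch of a more defensive setup might look like this (the user-agent string and cookie file path are just placeholder values):

// Optional hardening of the request; not needed for this simple example.
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);           // follow HTTP redirects
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);             // give up connecting after 10 s
curl_setopt($curl, CURLOPT_TIMEOUT, 30);                    // abort the whole request after 30 s
curl_setopt($curl, CURLOPT_USERAGENT, 'MyScraper/1.0');     // placeholder user-agent string
curl_setopt($curl, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // store cookies here (placeholder path)
curl_setopt($curl, CURLOPT_COOKIEFILE, '/tmp/cookies.txt'); // and send them on later requests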
Executing the Request & Checking for Errors
Now, start the request and perform an error check:
$page = curl_exec($curl);
if (curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
Closing the Connection
To close the connection, type the following:
curl_close($curl);
Extracting Only the Needed Part and Printing It
After we have the page content, we can extract just the snippet we need, the one under id="case_textlist":
$regex = '<div id="case_textlist">(.*?)<\/div>/s';
if ( preg_match($regex, $page, $list) )
echo $list[0];
else
echo "Not found";
The Whole Scraper Listing
<?php
$curl = curl_init('http://testing-ground.scraping.pro/textlist');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE); // return the page as a string

$page = curl_exec($curl);
if (curl_errno($curl)) // check for execution errors
{
    echo 'Scraper error: ' . curl_error($curl);
    exit;
}
curl_close($curl);

$regex = '/<div id="case_textlist">(.*?)<\/div>/s';
if (preg_match($regex, $page, $list))
    echo $list[0]; // use $list[1] for the inner content only
else
    echo "Not found";
?>
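A regular expression is fine for a quick demo like this, but it gets fragile on real-world HTML (nested divs, changed attribute order, extra whitespace). As an alternative sketch, not part of the original tutorial, the same extraction could be done with PHP’s built-in DOM extension:

<?php
$curl = curl_init('http://testing-ground.scraping.pro/textlist');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$page = curl_exec($curl);
curl_close($curl);

$doc = new DOMDocument();
@$doc->loadHTML($page); // suppress warnings about imperfect real-world markup
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="case_textlist"]');

if ($nodes->length > 0)
    echo $doc->saveHTML($nodes->item(0)); // the matching <div> and its contents
else
    echo "Not found";
?>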
These samples should guide you and give you a starting point for further practice in day-to-day web scraping.
19 replies on “Scraping in PHP with cURL”
Hello, what if I want to scrape multiple URLs? Sorry for the question, but I am a newbie at this. Thanks in advance.
Multiple URLs are not a hard case. Just turn this code into a procedure that takes the URL as an input parameter, then loop it over your list of URLs.
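For instance, here is a minimal sketch of that idea, reusing the code from the article (the function name and the second URL are just placeholders):

function scrape_textlist($url)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    $page = curl_exec($curl);
    if (curl_errno($curl)) // check for execution errors
    {
        echo 'Scraper error: ' . curl_error($curl);
        curl_close($curl);
        return;
    }
    curl_close($curl);

    $regex = '/<div id="case_textlist">(.*?)<\/div>/s';
    if (preg_match($regex, $page, $list))
        echo $list[0];
    else
        echo "Not found";
}

$urls = array(
    'http://testing-ground.scraping.pro/textlist',
    'http://example.com/another-page', // placeholder URL
);
foreach ($urls as $url)
    scrape_textlist($url);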
Nice article, but I want to use a loop. How does that work?
Very clear and neat explanation of cURL. 🙂 Thank you very much.
If you have time, please visit this website and give a review: http://www.manaresults.com
What will be the regex for <span class="*******"> inside a ;
Note: contains multiple .
“contains multiple” what? a tags or span tags? And what is inside what? It would be better if you gave an example.
I am writing this regular expression, but I am also getting the div tags in the output. I only want to get the text between the tags.
I.e., I am getting ALLERGY SCREEN.
What should I do if I only want Allergy screen from it?
Raghav Sikri, I’m not clear on what you mean. Can you explain more? Any screenshots?
I want to parse div > ul > li > div > strong > a (LINK and LINK 1 from the code below) from a website, but I can’t get it to work. Can you suggest a regex to get the values of LINK and LINK 1?
HTML is:
icon
Link 1
icon
Link 1
icon
Link 1
Regex is:
See this link, where the regex works.
Thank you very much for this code. It is very effective.
I want to show some specific data from a table and match it against my database table using a WHERE condition.
Website: http://www.cse.com.bd/current_share_price_ltp.php
Class: RightColInner
Can you suggest how I can do this?
Thanks
Thanks a lot 😉
Really helpful script! Thanks for sharing it.
Thanks for this simple tutorial.
Greetings,
I’m trying to get cURL to print the newly clicked target page. For example, if I visit localhost.com it prints the home page, but if I click localhost.com/test (or any other link) I want it to print that page instead of the home page. The URL changes to the link I clicked, but the output is still the index page. I posted the same question here and hope someone can help me, as it’s driving me mad (https://stackoverflow.com/questions/47962419/curl-php-always-returns-to-homepage).
How can I get the links without the domain name? I’m getting domain.com/link.html, but I just want /link.html.
Just cut the domain name off each harvested link.
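For example, a small sketch using PHP’s built-in parse_url() (the sample link is just an illustration):

$link = 'http://domain.com/link.html';    // a harvested link
$path = parse_url($link, PHP_URL_PATH);   // gives "/link.html"
echo $path;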
I just wrote this, but if you copy and paste it into a PHP file it should work. Just edit the $target at the top and the parse_array left and right side values I pointed out.
Same as my post above, but without my tips/comments. I didn’t realize this comment box doesn’t format properly.