Many developers get stuck on cookie handling in web scraping. It is a tricky thing, and it was once a stumbling block for me too. So, mainly for new scraping engineers, I’d like to share how to handle cookies in web scraping with PHP. We’ve already published a post on scraping with cURL in PHP, so here we’ll focus only on the cookie side. A cookie is a small piece of data sent from a website and stored in the user’s web browser while the user is browsing that website. When a browser requests a page and the server returns cookies along with the web content, the browser does all the dirty work: it stores the cookies and sends them back in subsequent requests to the server that set them.
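To make that exchange concrete, here is a minimal sketch of the bookkeeping a browser does for us: it reads the Set-Cookie response headers and builds the Cookie request header for the next request. The helper names extract_cookie_pairs() and build_cookie_header() and the sample headers are made up for illustration, and the sketch ignores cookie attributes such as Path, Domain and Expires.

```php
<?php
// Hypothetical helper: collect name => value pairs from Set-Cookie response headers.
function extract_cookie_pairs(string $rawHeaders): array
{
    $pairs = array();
    // Header names are case-insensitive; attributes (Path, Expires, ...) follow the first ";".
    preg_match_all('/^Set-Cookie:\s*([^=\r\n]+)=([^;\r\n]*)/mi', $rawHeaders, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $pairs[trim($m[1])] = trim($m[2]);
    }
    return $pairs;
}

// Hypothetical helper: build the value of the Cookie request header.
function build_cookie_header(array $pairs): string
{
    $parts = array();
    foreach ($pairs as $name => $value) {
        $parts[] = $name . '=' . $value;
    }
    return implode('; ', $parts);
}

// Sample response headers, invented for this example.
$responseHeaders = "HTTP/1.1 200 OK\r\n"
    . "Set-Cookie: sid=abc123; Path=/; HttpOnly\r\n"
    . "Set-Cookie: lang=en; Path=/\r\n"
    . "Content-Type: text/html\r\n\r\n";

$cookies = extract_cookie_pairs($responseHeaders);
// The string a client would send back in its next request as "Cookie: ..."
echo build_cookie_header($cookies), "\n";
```

In a real scraper you rarely need to do this by hand, since cURL can take over the whole job.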
Cookie Jar
With server-side web scraping, the script must handle all of the above itself. The cURL library handles cookies in requests by means of a “cookie jar”. A “cookie jar” is a plain text file stored on the scraping server; cURL saves received cookies to it and reads cookies from it for HTTP requests. With such a cookie handler, scraping jobs run seamlessly. After setting the other cURL options, we designate a file as the “cookie jar” that will hold the cookies:
// additionally, to store cookies: COOKIEJAR is the file cURL writes to, COOKIEFILE is the file it reads from
$tmpfname = dirname(__FILE__).'/cookie.txt';
curl_setopt($ch, CURLOPT_COOKIEJAR, $tmpfname);
curl_setopt($ch, CURLOPT_COOKIEFILE, $tmpfname);
Now the library will create the file automatically and handle all cookies through it. An example of a cookie file:
# Netscape HTTP Cookie File
# This file was generated by libcurl! Edit at your own risk.
.auto.com TRUE / FALSE 1452087781 ___suid 2ecfe4287cbeacd8399eaf98bec9ce0b.59089b9d033bc7c6dce8ea2fca139920
.auto.com TRUE / FALSE 1452865380 all7_user_region_confirmed 1
.auto.com TRUE / FALSE 1452865380 geo_location a%3A3%3A%7Bs%3A7%3A%22city_id%22%3Ba%3A0%3A%7B%7Ds%3A9%3A%22region_id%22%3Ba%3A1%3A%7Bi%3A0%3Bi%3A89%3B%7Ds%3A10%3A%22country_id%22%3Ba%3A0%3A%7B%7D%7D
.auto.com TRUE / FALSE 1423921380 autoru_sid ee094d60fa32eada_daf2da69dc79a59b7c8702a29554abbc
.auth.auto.com TRUE / FALSE 1421329026 autoru_sid
.auth.auto.com TRUE / FALSE 1421329026 autoru_sid_key
.auto.com TRUE / FALSE 1421329026 cc6882cb6b6f0c912cf9589734fcc1e6
.auto.com TRUE / FALSE 1452865027 user_name igor.savinkin5%40gmail.com
.auto.com TRUE / FALSE 1452865027 username igor.savinkin5%40gmail.com
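If you want to inspect such a file from a script, a minimal sketch of a parser might look like the one below. It assumes the file on disk is tab-separated with up to seven fields per line (domain, subdomain flag, path, secure flag, expiry timestamp, name, value), as in the Netscape cookie format, and that libcurl marks HttpOnly cookies with a “#HttpOnly_” prefix. The function name parse_cookie_file() and the sample data are made up for illustration.

```php
<?php
// Hypothetical helper: parse the text of a Netscape-format cookie file into arrays.
function parse_cookie_file(string $text): array
{
    $cookies = array();
    foreach (preg_split('/\r\n|\n|\r/', $text) as $line) {
        $httpOnly = false;
        if (strpos($line, '#HttpOnly_') === 0) {
            // HttpOnly cookies carry this prefix in front of the domain.
            $httpOnly = true;
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif (trim($line) === '' || $line[0] === '#') {
            continue; // skip blank lines and comments
        }
        $fields = explode("\t", $line);
        if (count($fields) < 6) {
            continue; // not a cookie line
        }
        $cookies[] = array(
            'domain'             => $fields[0],
            'include_subdomains' => $fields[1] === 'TRUE',
            'path'               => $fields[2],
            'secure'             => $fields[3] === 'TRUE',
            'expires'            => (int) $fields[4], // Unix timestamp
            'name'               => $fields[5],
            'value'              => isset($fields[6]) ? $fields[6] : '',
            'http_only'          => $httpOnly,
        );
    }
    return $cookies;
}

// Sample data, invented for this example; in practice use file_get_contents() on the jar.
$sample = "# Netscape HTTP Cookie File\n"
    . ".example.com\tTRUE\t/\tFALSE\t1452087781\tsid\tabc123\n"
    . "#HttpOnly_.example.com\tTRUE\t/\tTRUE\t1452865380\ttoken\txyz\n";

foreach (parse_cookie_file($sample) as $cookie) {
    echo $cookie['domain'], ' ', $cookie['name'], '=', $cookie['value'], "\n";
}
```

The parser is only for viewing or debugging; cURL itself reads and writes the file without any help.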
Code
Here is the whole function (also available on GitHub), which uses the cURL library with cookie handling:
function get_web_page( $url ){
    $options = array(
        CURLOPT_RETURNTRANSFER => true,  // to return web page
        CURLOPT_HEADER         => true,  // to return headers in addition to content
        CURLOPT_FOLLOWLOCATION => true,  // to follow redirects
        CURLOPT_ENCODING       => "",    // to handle all encodings
        CURLOPT_AUTOREFERER    => true,  // to set referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,   // set a timeout on connect
        CURLOPT_TIMEOUT        => 120,   // set a timeout on response
        CURLOPT_MAXREDIRS      => 10,    // to stop after 10 redirects
        CURLINFO_HEADER_OUT    => true,  // to track the request headers that were sent
        CURLOPT_SSL_VERIFYPEER => false, // to disable SSL cert checks (insecure, for testing only)
        CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_1_1,
    );
    $handle = curl_init( $url );
    curl_setopt_array( $handle, $options );
    // additional options for storing cookies
    $tmpfname = dirname(__FILE__).'/cookie.txt';
    curl_setopt($handle, CURLOPT_COOKIEJAR, $tmpfname);  // file to write cookies to
    curl_setopt($handle, CURLOPT_COOKIEFILE, $tmpfname); // file to read cookies from
    $raw_content = curl_exec( $handle );
    $err = curl_errno( $handle );
    $errmsg = curl_error( $handle );
    $header = curl_getinfo( $handle );
    curl_close( $handle );
    // curl_exec() returns false on failure
    if ($raw_content === false) {
        $raw_content = '';
    }
    // split the raw response into headers and body by the reported header size
    $header_content = substr($raw_content, 0, $header['header_size']);
    $body_content = trim(substr($raw_content, $header['header_size']));
    // let's extract cookies from the raw headers for viewing purposes
    $cookiepattern = "#^Set-Cookie:\\s*(?<cookie>[^=\\r\\n]+=[^;\\r\\n]*)#mi";
    preg_match_all($cookiepattern, $header_content, $matches);
    $cookiesOut = implode("; ", $matches['cookie']);
    $header['errno'] = $err;
    $header['errmsg'] = $errmsg;
    $header['headers'] = $header_content;
    $header['content'] = $body_content;
    $header['cookies'] = $cookiesOut;
    return $header;
}
$header = get_web_page('http://www.example.com');
print_r($header);
Update
For the second and subsequent calls, just execute the same function: it refers to the same cookie.txt file, passing the stored cookies to the server and saving any new values from the server in the same file. That’s the convenience of the cURL cookie jar.
$header=get_web_page('http://www.example.com');
print_r($header);
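If the cookie file never shows up after a call, a likely cause is that PHP cannot write to the target folder. Here is a minimal sketch of a pre-flight check; the helper name cookie_jar_problem() is made up for illustration, and the path is assumed to follow the same "cookie.txt next to the script" convention as above.

```php
<?php
// Hypothetical helper: return a message describing why the cookie jar
// cannot be written, or null if the path looks usable.
function cookie_jar_problem(string $path): ?string
{
    $dir = dirname($path);
    if (!is_dir($dir)) {
        return "Directory does not exist: $dir";
    }
    if (file_exists($path)) {
        // The file already exists, so it must be writable itself.
        return is_writable($path) ? null : "Cookie file is not writable: $path";
    }
    // The file will be created by cURL, so the directory must be writable.
    return is_writable($dir) ? null : "Directory is not writable: $dir";
}

$jar = __DIR__ . '/cookie.txt';
$problem = cookie_jar_problem($jar);
echo $problem === null ? "Cookie jar path looks usable\n" : $problem . "\n";
```

Keep in mind that libcurl writes the jar when the handle is closed or freed, so the file may not exist yet while the request is still in progress.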
That’s it. Your comments and questions are welcome.
7 replies on “Handling HTTP Cookies in cURL”
Thanks for the nice description and sample. How can i reuse the cookies on 2nd call?
Appreciate your help in advance.
Regards,
Zuhaib Khan
Zuhaib, see my update.
It doesn’t create cookies.txt file for me. I’m using PHP. I also tried to put a file with this name on the root path but still it doesn’t write anything in it and the website which I’m crawling does set cookies in browser. Can you help?
I can help. Just expose the code that you have. Also please check the folder’s rights for writing in it.
Hello can you help me? I need to open a ‘cookie.txt’ file and load the cookies in the visited url.
I’ve tried your code, but in the header it does not load cookies, look at an image.
It needs to load a cookies.txt file in a url.
Great Post
I copied & pasted your sample code to my PHP file
and just changed $url = ‘https://www.nordstromrack.com/shop/Women/Handbags/Backpacks’;
With this URL,
whole source page was not coming up like it does with PC browser.
my cookie file was recorded like below, after executing your code
# Netscape HTTP Cookie File
# https://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.
#HttpOnly_.nordstromrack.com TRUE / FALSE 1542543683 bm_sz 1785C6257A98E651C6B7D820EFBF34DD~QAAQnTP+pSRfJhZnAQAAXvnoJQPXgtSXKGSAXSMS8WuNZeZfW6yCmiSgXsxKScPXmKfxresxgLdCGLudpxbEk5skimT3v5egnv1PoAxcPqn00fZU5LmgE4DMNSoGJI3xNHOc4WlhxoOuP0ljz+HHVJOqc4gAUNDmJ5hPSw0QWxk68F5nxbcciw1wG2Gppngl2HSPnOKT
.nordstromrack.com TRUE / FALSE 1574065285 _abck FBD604AE57A80F8F08BE37DDAC680058A5FE339D1D5000000421F15BC67C3776~-1~w8wN4k8sniduCxM2p2lTWY+plNpmVvWfT0+sbS71jFs=~-1~-1