Handling HTTP Cookies in cURL

Many developers get stuck on cookie handling in web scraping. It is a tricky thing, and it was once my stumbling block too. So here, mainly for new scraping engineers, I’d like to share how to handle cookies in web scraping when using PHP. We’ve already done a post on scraping with cURL in PHP, so here we’ll focus only on the cookie side. A cookie is a small piece of data sent from a website and stored in the user’s web browser while the user is browsing that website. When a browser requests a page and a cookie is returned along with the web content, the browser does all the dirty work: it stores the cookie and sends it back, in subsequent requests, to the server that set it.
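To make the round trip concrete, here is a minimal sketch of what the browser does by hand: it collects the `Set-Cookie` response headers and builds the `Cookie` request header for the next request. The function name `build_cookie_header()` and the sample header lines are assumptions made for illustration. The sketch ignores cookie attributes such as Path, Domain and expiry, which a real client has to honor.

```php
<?php
// Hypothetical helper: turn the Set-Cookie lines of a response into the
// value of a Cookie request header. Attributes (Path, Max-Age, HttpOnly...)
// are ignored here; a real client must respect them.
function build_cookie_header(array $responseHeaderLines): string
{
    $pairs = [];
    foreach ($responseHeaderLines as $line) {
        // Header names are case-insensitive; keep only the "name=value" part.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            // Keyed by name, so a later cookie replaces an earlier one.
            $pairs[$m[1]] = $m[1] . '=' . trim($m[2]);
        }
    }
    return implode('; ', $pairs);
}

// Made-up response headers, for illustration only.
$responseHeaders = [
    'HTTP/1.1 200 OK',
    'Set-Cookie: session_id=abc123; Path=/; HttpOnly',
    'Set-Cookie: lang=en; Max-Age=3600',
];
echo build_cookie_header($responseHeaders), PHP_EOL;
```

The cURL cookie jar described next does this bookkeeping for you, including the attribute handling that this sketch leaves out.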

Cookie Jar

With server-side web scraping, the script must handle all of the above itself. The cURL library handles cookies through a “cookie jar”, a simple text file stored on the scraping server that saves cookies and supplies them in HTTP requests. With such a cookie handler, scraping jobs run seamlessly. CURLOPT_COOKIEJAR names the file that cURL writes cookies to, and CURLOPT_COOKIEFILE names the file it reads them from. After setting the other cURL options, we point both at the same file, which becomes the “cookie jar”:

// additionally to store cookie 
$tmpfname = dirname(__FILE__).'/cookie.txt';
curl_setopt($ch, CURLOPT_COOKIEJAR, $tmpfname);
curl_setopt($ch, CURLOPT_COOKIEFILE, $tmpfname);

Now the library will automatically create the file and handle all cookies through it. An example of a cookie file:

# Netscape HTTP Cookie File
# This file was generated by libcurl! Edit at your own risk.
.auto.com	TRUE	/	FALSE	1452087781	___suid	2ecfe4287cbeacd8399eaf98bec9ce0b.59089b9d033bc7c6dce8ea2fca139920
.auto.com	TRUE	/	FALSE	1452865380	all7_user_region_confirmed	1
.auto.com	TRUE	/	FALSE	1452865380	geo_location	a%3A3%3A%7Bs%3A7%3A%22city_id%22%3Ba%3A0%3A%7B%7Ds%3A9%3A%22region_id%22%3Ba%3A1%3A%7Bi%3A0%3Bi%3A89%3B%7Ds%3A10%3A%22country_id%22%3Ba%3A0%3A%7B%7D%7D
.auto.com	TRUE	/	FALSE	1423921380	autoru_sid	ee094d60fa32eada_daf2da69dc79a59b7c8702a29554abbc
.auth.auto.com	TRUE	/	FALSE	1421329026	autoru_sid	
.auth.auto.com	TRUE	/	FALSE	1421329026	autoru_sid_key	
.auto.com	TRUE	/	FALSE	1421329026	cc6882cb6b6f0c912cf9589734fcc1e6	
.auto.com	TRUE	/	FALSE	1452865027	user_name	igor.savinkin5%40gmail.com
.auto.com	TRUE	/	FALSE	1452865027	username	igor.savinkin5%40gmail.com
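Each cookie line in this file holds seven tab-separated fields: domain, a subdomain flag, path, a secure flag, expiry as a Unix timestamp, name, and value. To inspect the jar from a script, the file can be parsed by hand. Below is a minimal sketch, where `parse_cookie_file()` is a hypothetical helper and not part of cURL. The array keys are names chosen for this example.

```php
<?php
// Hypothetical helper: read Netscape-format cookie file contents into an array.
function parse_cookie_file(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\R/', $contents) as $line) {
        $httpOnly = false;
        // libcurl marks HttpOnly cookies with a "#HttpOnly_" prefix.
        if (strpos($line, '#HttpOnly_') === 0) {
            $httpOnly = true;
            $line = substr($line, strlen('#HttpOnly_'));
        }
        // Skip blank lines and ordinary comment lines.
        if (trim($line) === '' || $line[0] === '#') {
            continue;
        }
        $fields = explode("\t", $line);
        if (count($fields) < 6) {
            continue; // not a cookie line
        }
        // An empty value may lose its trailing tab in some editors.
        $fields = array_pad($fields, 7, '');
        $cookies[] = [
            'domain'             => $fields[0],
            'include_subdomains' => $fields[1] === 'TRUE',
            'path'               => $fields[2],
            'secure'             => $fields[3] === 'TRUE',
            'expires'            => (int) $fields[4], // Unix timestamp
            'name'               => $fields[5],
            'value'              => $fields[6],
            'http_only'          => $httpOnly,
        ];
    }
    return $cookies;
}

// One line taken from the cookie file above.
$sample = "# Netscape HTTP Cookie File\n"
        . ".auto.com\tTRUE\t/\tFALSE\t1452865380\tall7_user_region_confirmed\t1\n";
print_r(parse_cookie_file($sample));
```

In a real script you would pass in `file_get_contents()` of the cookie file instead of the sample string.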

Code

Here is the whole function (also available on GitHub) that uses the cURL library with cookie handling:

function get_web_page( $url ){
    $options = array(
        CURLOPT_RETURNTRANSFER => true,  // return the response instead of printing it
        CURLOPT_HEADER         => true,  // return headers in addition to content
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_ENCODING       => "",    // accept all encodings cURL supports
        CURLOPT_AUTOREFERER    => true,  // set Referer on redirect
        CURLOPT_CONNECTTIMEOUT => 120,   // timeout on connect, in seconds
        CURLOPT_TIMEOUT        => 120,   // timeout on the whole request, in seconds
        CURLOPT_MAXREDIRS      => 10,    // stop after 10 redirects
        CURLINFO_HEADER_OUT    => true,  // keep the request headers for curl_getinfo()
        CURLOPT_SSL_VERIFYPEER => false, // disables SSL cert checks (insecure, testing only)
        CURLOPT_HTTP_VERSION   => CURL_HTTP_VERSION_1_1,
    );

    $handle = curl_init( $url );
    curl_setopt_array( $handle, $options );

    // additionally, store cookies in a file and read them back from it
    $tmpfname = dirname(__FILE__).'/cookie.txt';
    curl_setopt($handle, CURLOPT_COOKIEJAR, $tmpfname);
    curl_setopt($handle, CURLOPT_COOKIEFILE, $tmpfname);

    $raw_content = curl_exec( $handle );
    $err = curl_errno( $handle );
    $errmsg = curl_error( $handle );
    $header = curl_getinfo( $handle );
    curl_close( $handle );

    // curl_exec() returns false on failure; see errno and errmsg in that case
    if ($raw_content === false) {
        $raw_content = '';
    }

    // split the raw response into headers and body at header_size
    $header_content = substr($raw_content, 0, $header['header_size']);
    $body_content = trim(substr($raw_content, $header['header_size']));

    // let's extract cookies from the response headers for viewing purposes
    $cookiepattern = "#Set-Cookie:\\s+(?<cookie>[^=\\r\\n]+=[^;\\r\\n]+)#mi";
    preg_match_all($cookiepattern, $header_content, $matches);
    $cookiesOut = implode("; ", $matches['cookie']);

    $header['errno']   = $err;
    $header['errmsg']  = $errmsg;
    $header['headers'] = $header_content;
    $header['content'] = $body_content;
    $header['cookies'] = $cookiesOut;
    return $header;
}

$header=get_web_page('http://www.example.com');
print_r($header);

Update

For the second and subsequent calls, you just execute the same function. It refers to the same cookie.txt file, passes the stored cookies to the server, and saves new values (if any) from the server in the same cookie file. That’s the convenience of the cURL cookie jar.

$header=get_web_page('http://www.example.com');
print_r($header);
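Reusing cookies this way only works if the jar file can actually be written. Keep in mind that libcurl writes the jar when the handle is closed (on PHP 8 and later, when the handle object is destroyed), not during curl_exec(). A possible reason for the file never appearing is that the PHP process has no write permission for the target directory. Here is a minimal sketch of a pre-flight check. `cookie_jar_is_writable()` is a hypothetical helper, not part of cURL, and the temp-directory path is only an example.

```php
<?php
// Hypothetical helper: report whether a cookie jar at $path could be
// created or updated by the current PHP process.
function cookie_jar_is_writable(string $path): bool
{
    if (file_exists($path)) {
        // An existing jar must be a regular, writable file.
        return is_file($path) && is_writable($path);
    }
    // Otherwise the parent directory must exist and be writable.
    $dir = dirname($path);
    return is_dir($dir) && is_writable($dir);
}

// Example path (an assumption); the function above uses dirname(__FILE__).
$jar = sys_get_temp_dir() . '/cookie.txt';
var_dump(cookie_jar_is_writable($jar));
```

If the check fails for the script directory, pointing the jar at a writable location such as the system temp directory is one way around it.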

That’s it. Your comments and questions are welcome.

7 replies on “Handling HTTP Cookies in cURL”

It doesn’t create cookies.txt file for me. I’m using PHP. I also tried to put a file with this name on the root path but still it doesn’t write anything in it and the website which I’m crawling does set cookies in browser. Can you help?

Hello can you help me? I need to open a ‘cookie.txt’ file and load the cookies in the visited url.
I’ve tried your code, but in the header it does not load cookies, look at an image.
It needs to load a cookies.txt file in a url.

Zuhaib Khan
AUG 04, 2015 @ 01:19
Thanks for the nice description and sample. How can i reuse the cookies on 2nd call?

Appreciate your help in advance.

Regards,
Zuhaib Khan

Great Post
I copied & pasted your sample code to my PHP file
and just changed $url = ‘https://www.nordstromrack.com/shop/Women/Handbags/Backpacks’;

With this URL,
whole source page was not coming up like it does with PC browser.

my cookie file was recorded like below, after executing your code

# Netscape HTTP Cookie File
# https://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

#HttpOnly_.nordstromrack.com TRUE / FALSE 1542543683 bm_sz 1785C6257A98E651C6B7D820EFBF34DD~QAAQnTP+pSRfJhZnAQAAXvnoJQPXgtSXKGSAXSMS8WuNZeZfW6yCmiSgXsxKScPXmKfxresxgLdCGLudpxbEk5skimT3v5egnv1PoAxcPqn00fZU5LmgE4DMNSoGJI3xNHOc4WlhxoOuP0ljz+HHVJOqc4gAUNDmJ5hPSw0QWxk68F5nxbcciw1wG2Gppngl2HSPnOKT
.nordstromrack.com TRUE / FALSE 1574065285 _abck FBD604AE57A80F8F08BE37DDAC680058A5FE339D1D5000000421F15BC67C3776~-1~w8wN4k8sniduCxM2p2lTWY+plNpmVvWfT0+sbS71jFs=~-1~-1
