Crawler vs Scraper vs Parser

In this post we explain the differences between a crawler, a scraper and a parser.

A crawler is a web bot that visits a set of web pages (one might call them nodes) and accumulates the links (URLs) of those nodes, deriving new URLs from each new web page [HTML] that it visits. A crawler may or may not save the pages' content to data storage. It does not go deep (e.g. into detail pages) unless explicitly programmed to.
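The link-accumulating loop described above might be sketched as follows. This is only a minimal illustration built on the Python standard library, not a production crawler: the names `LinkExtractor` and `crawl`, the `max_pages` limit, and the idea of passing in a `fetch` callable (anything that maps a URL to HTML text, or `None` on failure) are assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns discovered URLs in order of discovery.

    `fetch` is a hypothetical callable: URL -> HTML text, or None on failure.
    """
    discovered = [start_url]
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        html = fetch(url)
        visited += 1
        if html is None:
            continue  # page could not be fetched; nothing to derive from it
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                discovered.append(absolute)
                queue.append(absolute)
    return discovered
```

Because `fetch` is injected, the same loop can be tried against an in-memory dict of pages (`crawl(start, pages.get)`) or against the live web by wrapping an HTTP client of your choice.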

A scraper is a bot that visits the web pages of a given set of URLs. It does not collect new URLs (as a crawler does). Instead, it visits pre-collected URLs and retrieves the relevant data to save in data storage.
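In contrast to a crawler, a scraper only walks a fixed list of URLs. A minimal sketch, under the same assumption of an injected `fetch` callable (URL to HTML text, or `None` on failure), could look like this. The names `scrape` and `extract_title` are hypothetical, the page title merely stands in for "relevant data", and the returned list stands in for a real data storage.

```python
import re

# Matches the contents of the first <title> element, if any.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Returns the page title, or None when the page has no <title>."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def scrape(urls, fetch, extract=extract_title):
    """Visits each pre-collected URL once and returns one record per page.

    No new URLs are derived from the pages; that would be a crawler's job.
    """
    records = []
    for url in urls:
        html = fetch(url)
        if html is None:
            continue  # skip pages that could not be retrieved
        records.append({"url": url, "data": extract(html)})
    return records
```

Swapping `extract` for a site-specific function is where a scraper usually takes on parser duties.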

A parser is an [offline] robot that processes or analyzes given data to turn it into proper data structures. It extracts information from [unstructured] data, whether taken from data storage or directly from the web (e.g. HTML). Consider the following HTML fragment, supposedly scraped from a web page at url="https://battery-store.com/Batteries+Plus+Calcium-f4d67gh":

<form id="form-2345609">
   <div id="item-2345609" >Batteries Plus Calcium 12V 74Ah 680A battery AK-ZP57412</div>
   <label name="price" currency="US" >48.08</label> 
   <label name="price" currency="CA" hidden >53.00</label> 
   <input type="hidden" id="sku-YU23809" name="SKU" > 
   <input type="hidden" id="csrf" value="dca4545878573fe5de89ddffaba5aa051a3b" > 
   <input type="submit" value="Order" name="submit" > 
</form>

A parser may turn it into a useful data item:

[{
    "id":2345609,
    "name": "Batteries Plus Calcium 12V 74Ah 680A battery AK-ZP57412",
    "sku":"YU23809",
    "price_us":"48.08",
    "price_ca":"53.00",
    "url":"https://battery-store.com/Batteries+Plus+Calcium-f4d67gh"
}]
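One possible way to get from the HTML fragment to that data item is sketched below with Python's standard `html.parser` module. The class `ProductParser`, the function `parse_product` and the field-mapping rules (`item-` and `sku-` id prefixes, `price_` plus the lower-cased `currency` attribute) are assumptions read off this single example, not a general recipe.

```python
from html.parser import HTMLParser

SNIPPET = """<form id="form-2345609">
   <div id="item-2345609" >Batteries Plus Calcium 12V 74Ah 680A battery AK-ZP57412</div>
   <label name="price" currency="US" >48.08</label>
   <label name="price" currency="CA" hidden >53.00</label>
   <input type="hidden" id="sku-YU23809" name="SKU" >
   <input type="hidden" id="csrf" value="dca4545878573fe5de89ddffaba5aa051a3b" >
   <input type="submit" value="Order" name="submit" >
</form>"""


class ProductParser(HTMLParser):
    """Builds one product dict from the form markup shown above."""

    def __init__(self, url):
        super().__init__()
        self.item = {"url": url}
        self._field = None  # key that the next text node should fill

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element_id = attrs.get("id") or ""
        if tag == "div" and element_id.startswith("item-"):
            self.item["id"] = int(element_id[len("item-"):])
            self._field = "name"
        elif tag == "label" and attrs.get("name") == "price":
            self._field = "price_" + (attrs.get("currency") or "").lower()
        elif tag == "input" and attrs.get("name") == "SKU":
            self.item["sku"] = element_id[len("sku-"):]

    def handle_data(self, data):
        if self._field and data.strip():
            self.item[self._field] = data.strip()
            self._field = None

    def handle_endtag(self, tag):
        self._field = None  # never carry a pending field past its element


def parse_product(html, url):
    parser = ProductParser(url)
    parser.feed(html)
    parser.close()
    return parser.item


item = parse_product(
    SNIPPET, "https://battery-store.com/Batteries+Plus+Calcium-f4d67gh"
)
```

The CSRF token and the submit button are ignored on purpose: they belong to the page's mechanics, not to the product data.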

Often a scraper includes the parser functionality itself.

See the examples of simple email crawlers (Python, Java) and a scraping project where the scraper and crawler functionality go side by side. In that project a crawler gathers the [domain] URLs and processes each one depending on whether it is a detail page or a search result.
