Here are some basic principles I follow when I do web scraping. All of them come from my personal experience, and I hope they help others avoid many mistakes and difficulties.
1. Load data piecemeal
Typically, we treat the web as a somewhat unreliable environment in which the connection may be lost (or become very slow) at any time. Because of that, it’s a good idea to split the data you need into separate pieces and load each piece independently (of course, they may be loaded concurrently). If some pieces can’t be received because the website or connection is overloaded, the other pieces are unaffected, and you can safely save them for further processing. This approach is especially useful when you load a huge amount of data and leave your computer working for hours (or even days). Needless to say, it’s also a good idea to save every portion of data to disk as soon as you receive it, and to make several attempts to load a piece in case of failure.
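A minimal sketch of this idea in Python: each piece is fetched on its own, written to disk the moment it arrives, and retried a few times on failure. The `fetch_piece` function, the injectable `get` downloader, and the retry/back-off parameters are my own illustrative choices, not part of any particular library:

```python
import time
from pathlib import Path
from typing import Callable
from urllib.request import urlopen

def fetch_piece(
    url: str,
    dest: Path,
    attempts: int = 3,
    delay: float = 1.0,
    # Downloader is a parameter so it can be swapped or stubbed out.
    get: Callable[[str], bytes] = lambda u: urlopen(u, timeout=30).read(),
) -> bool:
    """Download one independent piece, saving it to disk as soon as
    it is received. A failure here leaves every other piece untouched."""
    for attempt in range(1, attempts + 1):
        try:
            data = get(url)
            dest.write_bytes(data)        # persist immediately
            return True
        except OSError:                   # URLError is a subclass of OSError
            time.sleep(delay * attempt)   # simple linear back-off
    return False  # give up on this piece only; the caller can log and move on
```

Passing the downloader in as a parameter also makes the retry logic easy to exercise without touching the network.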
2. Take maximum
When you scrape a new website (especially one with a variable structure), it’s good practice to download all the pages and save them to disk as-is, as opposed to trying to process them as they arrive from the server. This can save you a lot of time and protect you from being banned.
For example, suppose you need to scrape 1000 pages with structured information, parse them, and save the results in your database. If you scrape a page, process it, and move on to the next one, you may discover that the 900th page has a structural difference that breaks your parsing algorithm, so you have to adjust it and start scraping again from the beginning. That can cost you a significant amount of time.
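One way to sketch this as a two-phase pipeline (the function names and directory layout here are my own invention):

```python
from pathlib import Path

RAW_DIR = Path("raw_pages")  # hypothetical on-disk cache of raw pages

def save_raw(page_id: int, html: str) -> Path:
    """Phase 1: store the page exactly as received; no parsing yet."""
    RAW_DIR.mkdir(exist_ok=True)
    path = RAW_DIR / f"page_{page_id:04d}.html"
    path.write_text(html, encoding="utf-8")
    return path

def parse_all(parse):
    """Phase 2: parse everything from disk. A parser bug discovered on
    page 900 means re-parsing local files, not re-downloading 1000 pages."""
    results, failed = [], []
    for path in sorted(RAW_DIR.glob("page_*.html")):
        try:
            results.append(parse(path.read_text(encoding="utf-8")))
        except ValueError:
            failed.append(path.name)  # fix the parser, then rerun just these
    return results, failed
```

Because phase 2 never touches the network, you can rerun it as many times as your parser needs fixing.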
3. Parse pessimistically
As you parse the scraped data, my advice is to be quite “pessimistic”, “narrow” and “strict”. For example, if you expect a value from the scraped page to be a number, check that it really is a number. This may seem overly strict, but it can save you from the surprise of getting a page whose structure is different from, though very similar to, the one you expected.
I would even advise doing a semantic check of all text values where possible (for example, you can validate country names and the abbreviations of US states). This can greatly increase the quality of the output data, especially when you scrape an amount of data that is too large to check manually. And the problem is not merely typos: data that is not what you expected may indicate that you got a page in a different format and your parsing algorithm needs to be adjusted.
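A small illustration of such pessimistic parsing, with a deliberately truncated whitelist of state abbreviations (a real one would list all fifty) and made-up field names:

```python
import re

# Illustrative whitelist; extend to the full set in real code.
US_STATES = {"AL", "AK", "AZ", "CA", "NY", "TX"}

def parse_record(raw_age: str, raw_state: str) -> dict:
    """Refuse anything that is not exactly what we expect: a surprising
    value usually means the page layout changed, not that the data did."""
    if not re.fullmatch(r"\d{1,3}", raw_age.strip()):
        raise ValueError(f"age is not a number: {raw_age!r}")
    state = raw_state.strip().upper()
    if state not in US_STATES:  # semantic check, not just a format check
        raise ValueError(f"unknown state abbreviation: {raw_state!r}")
    return {"age": int(raw_age), "state": state}
```

Raising on the first unexpected value, rather than silently storing it, is the point: it surfaces format changes immediately.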
4. Gather statistics
When you scrape megabytes and gigabytes of information, statistics will help you. I always suggest defining some metrics and evaluating the quality of the output against them. For example, if you scrape user information, you may check how many males and females you got; if the ratio seems strange – check your algorithm. Likewise, if only one value among a thousand is non-empty, it’s a good idea to check your parser – the other values are probably specified somewhere else, on a different page.
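A sketch of this kind of sanity metric, assuming the scraped records are plain dicts with hypothetical `gender` and `email` fields:

```python
from collections import Counter

def profile(records: list[dict]) -> dict:
    """Aggregate a few simple metrics over scraped records
    so that oddities in the output stand out at a glance."""
    return {
        "total": len(records),
        # Distribution of a categorical field: a wildly skewed
        # ratio is a hint to re-check the scraping algorithm.
        "gender": Counter(r.get("gender") for r in records),
        # Near-total emptiness of a field usually means the value
        # lives somewhere else on the page than the parser assumes.
        "empty_email": sum(1 for r in records if not r.get("email")),
    }
```

The metrics themselves are cheap; the habit of looking at them after every run is what catches the broken parser.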