We often see the terms “invalid data”, “clean data”, and “normalize data”. What do they mean for practical data extraction, and how does one deal with them? One screenshot is better than a thousand words, though:
Invalid data
Since we talk about invalid data in the context of web scraping, invalid data here simply means data scraped from invalid HTML, which causes the extracted data to be misrepresented or malformed, and thus invalid. Consider the testing ground with invalid HTML, and look at the URL inside the link anchor with the unmatched quote:
<a href="http://scrapetools.com'>scrapetools.com</a>" <span/>
The original URL http://scrapetools.com now becomes http://scrapetools.com%27%3Escrapetools.com%3C/a%3E, leading to nowhere…
So, when choosing a scraping tool (software or service), do pay attention to whether it handles invalid HTML.
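If you post-process scraped links yourself, a quick sanity check can also catch URLs that have swallowed pieces of broken markup. Here is a minimal sketch in Python; the looks_valid helper and its rejection rules are illustrative assumptions, not a fixed recipe:

from urllib.parse import urlparse, unquote

def looks_valid(url: str) -> bool:
    """Reject URLs that carry HTML debris from unterminated attributes."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False
    # Stray quotes or angle brackets in the decoded URL are a strong sign
    # that an unmatched quote swallowed part of the surrounding markup.
    return not any(ch in unquote(url) for ch in "'\"<>")

print(looks_valid("http://scrapetools.com"))                               # True
print(looks_valid("http://scrapetools.com%27%3Escrapetools.com%3C/a%3E"))  # False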
Non-valid, abnormal data
When speaking of data normalization, we mean that data should be consistent: conforming to one norm, one common denominator.
1. Wrong types
Data scraped for storage should be of one consistent type (number, string, JSON). The abnormality can be seen in the screenshot we showed above.
The data shown in the screenshot are perfectly fine as long as they stay inside a CSV file. Yet if one tries to store them in a database, integrate them into a data flow process, or something similar, that inconsistency might play a bad joke. It is better to deal with the data by validating them right after you scrape.
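As a minimal sketch of such post-scrape validation (the field names and coercion rules below are assumptions for illustration, not taken from the screenshot):

def normalize_record(raw):
    """Coerce one scraped record to the expected types, or reject it."""
    try:
        return {
            "name": str(raw["name"]).strip(),
            "price": float(str(raw["price"]).replace(",", ".").strip("$ ")),
        }
    except (KeyError, ValueError):
        return None  # the record cannot be coerced to the expected types

rows = [{"name": "Widget", "price": "$3.50"},
        {"name": "Gadget", "price": "n/a"}]
clean = [r for r in (normalize_record(r) for r in rows) if r is not None]
print(clean)  # [{'name': 'Widget', 'price': 3.5}]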
2. Out-of-range data
Another type of abnormality is when data fall out of their expected range. Consider the data row: [3, 2, 2, 5, 2, 6, 4, 1, 6, 1, 1, 1, 1, 5, 5, 3, 2, 2, 3, 6, 6, 2, 6, 4, 4, 2, 2, 4, 4, 1, 5, 1, 3, 6, 2, 6, 4, 3, 3, 5, 6, 6, 5, 6, 3, 122, 6, 5, 4]
The value 122 falls far outside the expected range of values {1–6} and therefore can skew the statistical figures during data processing.
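To see the impact, here is a quick Python sketch that computes the mean of the row above with and without the out-of-range value:

values = [3, 2, 2, 5, 2, 6, 4, 1, 6, 1, 1, 1, 1, 5, 5, 3, 2, 2, 3, 6, 6, 2,
          6, 4, 4, 2, 2, 4, 4, 1, 5, 1, 3, 6, 2, 6, 4, 3, 3, 5, 6, 6, 5, 6,
          3, 122, 6, 5, 4]

# Keep only values inside the known valid range {1..6}.
clean = [v for v in values if 1 <= v <= 6]

print(sum(values) / len(values))  # mean with the outlier
print(sum(clean) / len(clean))    # mean after dropping 122

Running it shows that the single value 122 pushes the mean from roughly 3.6 up to about 6.1, which is outside the valid range itself.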
Application
This kind of data abnormality can happen when you analyze web page visits, namely average time on page. Suppose a visitor left the tab open on a page for hours. Obviously, a simple post like this one does not require hours to read. Such an abnormal time on page, say 5 hours, will surely skew the overall statistical analysis. So one needs to normalize the data by removing out-of-threshold values before storing them or feeding them into data flows (see the sketch at the end of this section).
See the same data array without the abnormal value:
[3, 2, 2, 5, 2, 6, 4, 1, 6, 1, 1, 1, 1, 5, 5, 3, 2, 2, 3, 6, 6, 2, 6, 4, 4, 2, 2, 4, 4, 1, 5, 1, 3, 6, 2, 6, 4, 3, 3, 5, 6, 6, 5, 6, 3, 6, 5, 5, 4]
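To make the time-on-page application concrete, here is a small sketch; the session durations and the 30-minute cutoff are made-up assumptions, not measured figures:

# Hypothetical time-on-page values in seconds; 18000 s is the 5-hour tab
# left open from the example above.
times_on_page = [45, 120, 90, 300, 60, 18000, 150, 75]

# Drop sessions above a known threshold (here: 30 minutes) before storing.
THRESHOLD_SECONDS = 30 * 60
normalized = [t for t in times_on_page if t <= THRESHOLD_SECONDS]

print(sum(times_on_page) / len(times_on_page))  # average skewed by the 5-hour outlier
print(sum(normalized) / len(normalized))        # a realistic average time on page

When the expected range is not known in advance, a percentile-based cutoff can be used instead of a fixed threshold.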