Categories
Data Mining

Data Mining with Google Refine

Google Refine is a free tool for data processing, it standing in line with some other free Google data analysis tools. Because of its close association with web scraping, we want to shed some light on it. 

“Google Refine is a desktop application in that you download it, install it, and run it on your own computer. However, unlike most other desktop applications, it runs as a small web server on your own computer and you point your web browser at that web server in order to use Refine. So, think of Refine as a personal and private web application.”

What does it do?

Google Refine (new brand is OpenRefine) features these 3 major data transform areas:

  1. Data refine by joining the records of the similar values in a field into one group. It’s very good in cleaning up inconsistent formats and transforming them from one format into another. Those techniques are requisite when we perform multiple source web scraping.  (video 1)

  1. Data filtering. Filters and facets allow you to isolate rows to act upon and invoke a command to all these rows, changing them all in one shot. It does take effort to master the commands and the GREL (Google Refine Expression Language) to easily operate it for data transforming. This is an example of a GREL command: not (value.startsWith (‘**’)). Watch video 2.
  2. Data augmentation with FreeBase (video 3). My extracted raw/unclear data could be enriched with sources from FreeBase open data base. Standardization, reconciliation and data normalization are issues Google Refine takes care of. Simply stated, it takes you from ambiguous text name to precisely identified entities (taken from FreeBase). For example, ‘Apple’ might be -> (1) Apple, Inc., (2) a fruit or (3) New York City (‘Big Apple’). For every word there I might stick a ‘details tag’ from Google affiliated FreeBase, the free successor of Metaweb.

For developers, the standard reconciliation service API is the HTTP-based RESTful JSON-formatted API.

Data Processing Tools

Google Refine owns some common data processing tools:

  • programmatic column splitting
  • transpose cells in columns into rows and vice versa
  • collapsing of internal whitespaces
  • regular expression transformations
  • faceting (finding all unique values in a column)

Also, more sophisticated tools:

And, finally, reconciliation to FreeBase types (read the details above).

Compared to Excel data transforming, Google Refine allows you to save and re-apply the undo/redo transformation history to multiple data sets. The list of editing operations could be extracted (in JSON format) and be applied to any similar data set.

The developers claim Google Refine works fine for up to a few hundred thousand rows, if a computer memory allows.

Google Refine works for a wide variety of data sets. TSV, CSV, *SV, Excel (.xls and .xlsx), JSON, XML, RDF as XML, and Google Data documents are all supported. Support for other formats can be added with Google Refine extensions (RDF and Stat extensions).

Conclusion

Data extraction, data transformation and mining (Google Refine) steadily migrate from the professional’s objectives into the common user hands. The traditional way of using the transforming algorithms would require writing some code, with that implementation inevitably tied to a particular data set. Google Refine works well for a wide variety of data sets without the input of any code at all (don’t forget GREL). The resulting operations are portable, so the process used to clean up one data set works fine for similar format data sets.

No doubt, Google Refine has already been noticed by investigative journalists (see Scraping for journalists) and people who have to deal with often messy government data sets. One example of it is a group of investigative journalists who took the open web data from several pharmaceutical companies and assembled them into a single, comprehensive database with help of Google Refine (read details here).

Leave a Reply

Your email address will not be published. Required fields are marked *


The reCAPTCHA verification period has expired. Please reload the page.

This site uses Akismet to reduce spam. Learn how your comment data is processed.