How to scrape CSV data files

This short post in to guide you in how to scrape CSV data files. You may ask, why do we need this scrape if those data are already in files? The answer is that you might need to spend quite a lot of time in downloading the files into one place and sorting or merging them.

Basic steps

Python’s CSV library is well able to do a lot of the work for you.For test purposes I’ve published in the web a simple CSV file. In Python we need to use ‘urllib2′ library:

import urllib2
import csv
url = 'https://docs.google.com/spreadsheet/pub?key=0AmNIZgbwy5TmdENjMmZ2cm5VQXJJMWlQVENIek5Ta2c&output=csv'
data = urllib2.urlopen(url)

The whole script

import urllib2
import csv
url = 'https://docs.google.com/spreadsheet/pub?key=0AmNIZgbwy5TmdENjMmZ2cm5VQXJJMWlQVENIek5Ta2c&output=csv'
data = urllib2.urlopen(url)
reader = csv.DictReader(data)
for record  in reader:
    print record

Storing data in database

For the storing the data, whether in MySql database or in SQLite, you need to take heed of data encoding that you scrape. For storing in the database, the most fitting encoding is UTF-8. Non-ANSII characters might be misencoded when inserted into/retrieved from the database.

If the data source is from Western Europe or the Americas, it might be fitting to decode from ‘cp1252‘ or ‘latin-1‘ encodings directly into UTF-8 encoding. Do it by adding the method .decode(‘<encoding name>’) to the field in question. For example, if the field <name> is in ‘latin-1‘ encoding try to add this line inside the loop prior to storing into database:

record['name'] = record['name'].decode('latin-1')

Read a referrence to non-UTF-8 encodings handling in Python.

Dictionary reader for CSV library

The .DictReader method (instead of .reader) can be used to create a dictionary of your CSV data, the values in the first row of the CSV file will be used as keys. This eliminates a need for naming each field in the code.

Additional tip:

If you want to manually to copy all CSV files into one file in Windows, go to the command line (Start->Run), move to the folder where files are located: cd <path to a folder> and execute the command in console: copy *.csv <name>.csv . Instead of <name> paste a name of the new CSV file.

2 replies on “How to scrape CSV data files”

Thats good for getting data that is html with .cvs at end, but how do you parse a page with multiple csv’s and select the one you want to open?

Cliff, the secret is file unique uri. Find them (for example by BeautifulSoup library) and apply the article’s method to extract data from multiple files or any.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))