My goal was to retrieve data from a web business directory.
Since scraping business directories is one of the most challenging tasks (besides SERP scraping), there were some basic questions for me to answer:
- Is there any scrape protection set at that site?
- How much data is in that web business directory?
- What kind of queries can I run to find all the directory’s items?
Scrape protection
Let’s answer the first question: is there any scrape protection set on this site?
- The first scrape protection measure I found was a restriction on making recurring paginated search requests, e.g. www.firmas.lv?srch=object&page=1, then www.firmas.lv?srch=object&page=2, etc. The service detects such repeated requests and returns junk data as the search result. This is a kind of tar pit.
- The second scrape protection appeared as a limit on requests per day from a single IP. When doing regular scraping from the same IP, I got a warning from the service: “You’ve exceeded the publicly allowed profiles limit for a single IP address.” The limit turned out to be about 180 profiles/companies a day. No ban, just a hard limit.
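Both measures can be softened on the client side by pacing the requests. Below is a minimal sketch (not part of the original script) that throttles each request with a randomized delay and stops once the limit warning shows up; the warning marker text is an assumption based on the message quoted above.
import time, random
import requests

LIMIT_WARNING = 'exceeded the publicly allowed profiles limit'  # assumed marker text

def polite_get(url, min_delay=5, max_delay=15):
    # wait a randomized interval so the requests do not look like a burst
    time.sleep(random.uniform(min_delay, max_delay))
    response = requests.get(url)
    if LIMIT_WARNING in response.text:
        # the per-IP daily limit is reached: stop scraping until tomorrow
        raise RuntimeError('Daily per-IP limit reached, resume later')
    return response.text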
You might want to jump to the whole code right away.
How big?
Initially, the size of the business directory is not clear. The limits described above make it impossible to query consecutive pages of the directory, e.g. www.firmas.lv?page=1. The server responds with “You’ve reached the limit of publicly available query results.”
So another approach is needed. Since the directory allows a context search and reports the total number of results, I decided to query the service for each vowel of the alphabet: a, o, e, i, u, y. I assumed that a company name must fit well in the market and thus be easy to pronounce, so it should contain at least one vowel. The per-vowel result counts overlap (a name may contain several different vowels), so their sum overestimates the total; still, with only six query letters, we can estimate the order of magnitude of the directory items contained in this service.
Vowel | Search results count
a | 10772
o | 1532
i | 5215
e | 2803
u | 712
y | 38
So the total directory pool might be estimated at 12K to 18K records, 15K on average. Eventually I succeeded in scraping over 16K records.
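A quick sanity check of that estimate (my own sketch, not part of the original workflow): the true total has to lie between the largest single vowel count (reached if every other result set were a subset of it) and the plain sum of all counts (reached if no name contained two different vowels).
counts = {'a': 10772, 'o': 1532, 'i': 5215, 'e': 2803, 'u': 712, 'y': 38}
lower_bound = max(counts.values())  # 10772 - full overlap with the 'a' results
upper_bound = sum(counts.values())  # 21072 - no overlap between the vowel results
print(lower_bound, upper_bound)     # the scraped total of ~16.4K falls in between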
After scraping 16.4K company names and links, there were only 2 company names without any vowel. The error of the vowel assumption is therefore 2/16400*100% ≈ 0.012%, negligibly small.
How to query
Now we come to our third question: What kind of queries can I run to find the directory’s items?
1. Preparation
Due to the first limit mentioned above, it is not possible to make many consecutive paginated requests for the same search word. So let us build a pool of different query words and query the service with them. I grabbed some wiki articles (you might use any online dictionary), stripped them of HTML markup and special characters, and stored all the words (of length 4 and more) in a service db table called query_word.
CREATE TABLE query_word (
    id INT NOT NULL AUTO_INCREMENT,
    word VARCHAR(255) NOT NULL,
    total_links INT(11) DEFAULT NULL,
    scraped_links INT(11) DEFAULT NULL,
    scraped SMALLINT(6) NOT NULL,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (id),
    UNIQUE (word)
);
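A rough sketch of that preparation step (illustrative only; the article URL and db credentials are placeholders): download a page, strip the markup with Beautiful Soup, keep the words of 4+ letters, and let the UNIQUE(word) constraint drop duplicates via INSERT IGNORE.
import re
import requests
import MySQLdb
from bs4 import BeautifulSoup

def collect_words(page_url):
    # fetch the page and strip it down to plain text
    text = BeautifulSoup(requests.get(page_url).text, 'html.parser').get_text()
    # keep only latin-letter words of length 4 and more (a simplification)
    return set(w.lower() for w in re.findall(r'[A-Za-z]{4,}', text))

db = MySQLdb.connect(host='localhost', user='user', passwd='password', db='firma')  # placeholder credentials
cursor = db.cursor()
words = collect_words('https://en.wikipedia.org/wiki/Latvia')  # any article or online dictionary will do
cursor.executemany("INSERT IGNORE INTO query_word (word, scraped) VALUES (%s, 0)", [(w,) for w in words])
db.commit()
db.close()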
Now, using this service table, I can go over it on a schedule and fetch the next query word to query the web service with.
Note that this service accepts search queries via the HTTP GET method, so a simple request URL is enough. In many cases though, a web service can only be queried via HTTP POST, which involves CSRF protection and other additional security measures. I plan to review HTTP POST querying in later posts.
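For comparison, a POST-based search typically looks like the sketch below (a generic illustration, not this service's actual form; the URL and field names are made up): load the search form with a session to pick up the CSRF token and cookies, then send it back together with the query.
import requests
from bs4 import BeautifulSoup

session = requests.Session()  # keeps the cookies between the two requests
form_page = session.get('https://example.com/search')  # hypothetical search form url
token = BeautifulSoup(form_page.text, 'html.parser').find('input', {'name': 'csrf_token'})['value']
results = session.post('https://example.com/search',
                       data={'csrf_token': token, 'srch': 'object'})
print(results.status_code)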
2. Get companies data from their detail pages
The resulting html contains the companies’ links to their corresponding detail pages. We’ll grab them using the Beautiful Soup library.
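A minimal extraction sketch, using the same selectors as the full script further below; it assumes html_doc already holds a search results page. Each company name sits in an <h1> wrapped by the link to its detail page.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
for h1 in soup.select("a > h1"):     # company name is an <h1> inside its link
    link = h1.parent['href']         # the parent <a> holds the detail page url
    if link != '#':                  # skip placeholder links
        print(h1.text, link)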
Scraping algorithm
The scraping policy for fetching search results pages is the following:
- Query all words (from the query_word db table) with [pagination] offset = 1, i.e. the first 20 links (1 to 20). When that scrape is complete, we update the scraped_links table field to record how many links we have already scraped. The only pagination step the service provides is 20 links per page.
- Then query all words with offset = 2, second 20 links (21 to 40).
- Then query all words with offset = 3, third 20 links (41 to 60) and so on.
- When the scraped_links value is equal to or greater than the total_links value for a given query word, we set the scraped flag to 1 to exclude this query word from further use in search queries. A small sketch of this bookkeeping follows the list.
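The bookkeeping behind this policy can be sketched with a couple of helpers (illustration only; in the real script below the counters live in the query_word table and the offset is chosen per scheduled run).
PAGE_SIZE = 20  # the only pagination step the service provides

def next_offset(scraped_links):
    # which pagination window a word still needs: 1 for links 1-20, 2 for 21-40, ...
    return scraped_links // PAGE_SIZE + 1

def is_done(scraped_links, total_links):
    # a word is exhausted once we hold as many links as the service reported for it
    return total_links is not None and scraped_links >= total_links

print(next_offset(0), next_offset(40))    # -> 1 3
print(is_done(37, 37), is_done(40, 95))   # -> True False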
See the code of the first file, scrape_firmas_links.py, which makes a search request to the firmas.lv service.
import MySQLdb, sys
import requests
from bs4 import BeautifulSoup
import re
import smtplib # Import smtplib for the actual sending function
from email.mime.text import MIMEText

db = MySQLdb.connect(
    host='igorsavinkin.mysql.pythonanywhere-services.com',
    user='igorsavinkin',
    passwd='mysqlgodman',
    db='igorsavinkin$firma')
pointer = db.cursor()

def query_web(url):
    try: # this code works
        firmas_list = requests.get(url)
        print('web query status code: {}.'.format(firmas_list.status_code))
        if (firmas_list.status_code != 200):
            send_mail('igor.savinkin@gmail.com', 'igor.savinkin@gmail.com',
                      'Unsuccessful web request', 'Unsuccessful web request to ' + url)
        return firmas_list.text
    except: # catch *all* exceptions
        e = sys.exc_info()[0]
        print("Web query error: %s" % e)
        return 'false'

url = 'https://www.firmas.lv/results?of=4&srch='
bulk_query = "INSERT INTO firma(name,link) VALUES (%s,%s) ON DUPLICATE KEY UPDATE link=link"
range_number_regex = re.compile("no\s(\d+)\s.*?\s(\d+)")

def fetch_word():
    try:
        pointer.execute("""SELECT word FROM query_word WHERE scraped=0 and scraped_links < 81 LIMIT 1""")
        db.commit()
        #print('Successful selection from db of {}'.format(pointer.fetchone()))
        #for row in pointer:
        #    print('we query "'.join(row) + '"')
        word = pointer.fetchone()
        if (word):
            return ''.join(word)
        else:
            return ''
    except:
        db.rollback()

def send_mail(to_, from_, subject, content):
    msg = MIMEText(content)
    msg['Subject'] = subject
    msg['From'] = from_
    msg['To'] = to_
    # Send the message via google SMTP server
    # s = smtplib.SMTP('smtp.gmail.com')
    # s = smtplib.SMTP_SSL("smtp.gmail.com", 465)
    s = smtplib.SMTP('smtp.gmail.com:587')
    s.ehlo()
    s.starttls()
    s.ehlo()
    s.login('igor.savinkin', 'godmanig')
    s.sendmail(from_, [to_], msg.as_string())
    #print ('Successfully sent the mail to %s' % to_)
    s.quit()

def save_item(name, link):
    try:
        pointer.execute("""INSERT INTO firma (name, link) VALUES (%s,%s)""", (name, link))
        db.commit()
        #print('Successful commit')
    except:
        db.rollback()
        #print('Commit failed')
        db.close()

word = fetch_word()
if (word == ''):
    send_mail('igor.savinkin@gmail.com', 'igor.savinkin@gmail.com',
              'Empty query word',
              'Empty query word in /home/igorsavinkin/firmas.lv/scrape_firmas_links.py for offset = 4 and scraped_links < 81')
    sys.exit()
print('we query "{}"'.format(word))
url = url + word # the url already contains an offset
#print('url: '+url)
html_doc = query_web(url)
3. Some pitfalls
Encoding
The problem in the above-mentioned code was with encoding. When I tried to store the company name along with the company detail page link, the script could not store it properly into the utf-8 encoded db table, even though the page is technically utf-8 encoded, as declared in its meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
The workaround suggested by the pythonanywhere support was to store the data first into a temporary file and then feed it from the file into the db.
Beautiful Soup's encoding docs are also helpful: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
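A sketch of that workaround idea (Python 3 style, illustrative only; the file path, sample rows, and db credentials are placeholders): dump the scraped rows into a utf-8 encoded file first, then read them back and feed them into the db.
import csv
import MySQLdb

# placeholder rows; in the real script these come from the parsing step
items_list = [('Firma Example SIA', '/profile/example')]

# 1) dump the scraped rows into a utf-8 csv file
with open('/tmp/firma_items.csv', 'w', encoding='utf-8', newline='') as f:
    csv.writer(f).writerows(items_list)

# 2) feed the file back into the db
db = MySQLdb.connect(host='localhost', user='user', passwd='password', db='firma', charset='utf8')  # placeholder credentials
cursor = db.cursor()
with open('/tmp/firma_items.csv', encoding='utf-8', newline='') as f:
    cursor.executemany("INSERT INTO firma (name, link) VALUES (%s, %s) ON DUPLICATE KEY UPDATE link=link",
                       list(csv.reader(f)))
db.commit()
db.close()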
4. Restrictions and more
This code was written against one fairly simple data aggregator service. A similar business directory service may apply further restrictions to a web scraper:
- banning the requesting IP (so proxies need to be applied; see the sketch after this list)
- a login requirement, which makes the requester transparent to the service
- query frequency being watched/tracked by the service through cookies
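If the service bans the requesting IP, requests can be routed through a rotating proxy pool. A minimal sketch (the proxy addresses are placeholders; in practice they come from a proxy provider):
import random
import requests

PROXIES = ['http://111.111.111.111:8080', 'http://122.122.122.122:3128']  # placeholder pool

def get_through_proxy(url):
    proxy = random.choice(PROXIES)  # rotate the outgoing IP per request
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=30)

# usage: response = get_through_proxy('https://www.firmas.lv/results?srch=object')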
5. Whole code
import MySQLdb, sys
import requests
import random, string
from bs4 import BeautifulSoup
import re
import smtplib # Import smtplib for the actual sending function
from email.mime.text import MIMEText

def randomword(length):
    # helper that builds a random lowercase query word (not used in the main flow)
    return ''.join(random.choice(string.ascii_lowercase) for i in range(length))

db = MySQLdb.connect(
    host='igorsavinkin.mysql.pythonanywhere-services.com',
    user='igorsavinkin',
    passwd='mysqlgodman',
    db='igorsavinkin$firma')
pointer = db.cursor()

def query_web(url):
    # request a search results page and return its html text
    try:
        firmas_list = requests.get(url)
        print('web query status code: {}.'.format(firmas_list.status_code))
        if (firmas_list.status_code != 200):
            send_mail('igor.savinkin@gmail.com', 'igor.savinkin@gmail.com',
                      'Unsuccessful web request', 'Unsuccessful web request to ' + url)
        return firmas_list.text
    except: # catch *all* exceptions
        e = sys.exc_info()[0]
        print("Web query error: %s" % e)
        return 'false'

url = 'https://www.firmas.lv/results?srch='
bulk_query = "INSERT INTO firma(name,link) VALUES (%s,%s) ON DUPLICATE KEY UPDATE link=link"
# matches the results range expression in the page header: group(1) is the from-index, group(2) the until-index
range_number_regex = re.compile(r"no\s(\d+)\s.*?\s(\d+)")

def fetch_word():
    # fetch the next not-yet-scraped query word from the query_word table
    try:
        pointer.execute("""SELECT word FROM query_word WHERE scraped=0 and scraped_links < 1 LIMIT 1""")
        db.commit()
        word = pointer.fetchone()
        if (word):
            return ''.join(word)
        else:
            return ''
    except:
        db.rollback()
        return ''

def send_mail(to_, from_, subject, content):
    msg = MIMEText(content)
    msg['Subject'] = subject
    msg['From'] = from_
    msg['To'] = to_
    # Send the message via google SMTP server
    # s = smtplib.SMTP('smtp.gmail.com')
    # s = smtplib.SMTP_SSL("smtp.gmail.com", 465)
    s = smtplib.SMTP('smtp.gmail.com:587')
    s.ehlo()
    s.starttls()
    s.ehlo()
    s.login('igor.savinkin', 'godmanig')
    s.sendmail(from_, [to_], msg.as_string())
    #print ('Successfully sent the mail to %s' % to_)
    s.quit()

def save_item(name, link):
    # store a single company name and link (the bulk insert below is used in the main flow instead)
    try:
        pointer.execute("""INSERT INTO firma (name, link) VALUES (%s,%s)""", (name, link))
        db.commit()
        #print('Successful commit')
    except:
        db.rollback()
        #print('Commit failed')

word = fetch_word()
print('we query "{}"'.format(word))
url = url + word
#print('url: '+url)
html_doc = query_web(url)
if (html_doc):
    soup = BeautifulSoup(html_doc, 'html.parser')
    try:
        total = soup.select("div.found > b")[0].text
        print('total links: ' + total)
        # update the total number of links for a word
        pointer.execute("""UPDATE query_word SET total_links={} WHERE word ='{}' """.format(total, word))
        db.commit()
        found_expression = soup.select("div.found")[0].text
        #print('found_expression: '+found_expression)
        range_number = range_number_regex.search(found_expression)
        if range_number:
            #print(range_number.group(0)) # whole match
            #print(range_number.group(1)) # links index from
            #print(range_number.group(2)) # links index until
            # update the scraped links for a given word
            pointer.execute("""UPDATE query_word SET scraped_links=scraped_links+{} WHERE word ='{}' """.format(int(range_number.group(2)), word))
            db.commit()
            if (total == range_number.group(2)):
                # we set scraped to '1' since this is the maximum number of links we get for this word
                pointer.execute("""UPDATE query_word SET scraped=1 WHERE word ='{}' """.format(word))
                db.commit()
        # here we parse the company names and links from the soup
        company_names = soup.select("a > h1")
        items_list = []
        for name in company_names:
            if (name.parent['href'] != '#'): # we store the item only if its link is not '#'
                single_tuple = name.text.encode('utf8'), name.parent['href']
                items_list.append(single_tuple)
        print('list: ', items_list)
        try:
            pointer.executemany(bulk_query, items_list)
            db.commit()
            print("Affected rows = {}".format(pointer.rowcount))
        except ValueError as e:
            print('Error: ', e)
    except IndexError:
        # no results block found - mark the word as scraped with zero links
        print('None of firmas links for the word "{}"'.format(word))
        pointer.execute("""UPDATE query_word SET scraped=1, total_links=0 WHERE word ='{}' """.format(word))
        db.commit()
        print('Now "{}" is marked as scraped.'.format(word))
db.close()