A Simple Email Crawler in Python

I often receive requests asking about email crawling. The topic clearly interests people who want to scrape contact information from the web (direct marketers, for example), and we have previously mentioned GSA Email Spider as an off-the-shelf solution for email crawling. In this article I want to demonstrate how easy it is to build a simple email crawler in Python. The crawler is simple, but you can learn a lot from this example, especially if you’re new to scraping in Python.

I purposely simplified the code as much as possible to distill the main idea and let you add features yourself later if necessary. Despite its simplicity, the code is fully functional and can extract many emails from the web. Note also that this code is written for Python 3.

If you want a Java email crawler, please check out this.

OK, let’s move from words to deeds. I’ll go through the code portion by portion, commenting on what’s going on. If you need the whole code, you can get it at the bottom of the post.

Let’s import all the necessary libraries first. In this example I use BeautifulSoup and Requests as third-party libraries, and urllib, collections and re from the standard library. BeautifulSoup provides a simple way to search an HTML document, and the Requests library lets you easily perform web requests.

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

The following piece of code defines the urls to start crawling from. As an example I chose “The Moscow Times” website, since it exposes a nice list of emails. You can add any number of urls to start the scraping from. Though this collection could be a plain Python list, I chose a deque, since it better fits the way we will use it (taking urls from the front and adding new ones at the back):

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/'])
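
To see why a deque suits a crawl queue, here is a small sketch (the example.com urls are placeholders, not part of the crawler). New links are appended on the right, and the next url to crawl is taken from the left with popleft, which gives first-in, first-out order without shifting the remaining items the way list.pop(0) would:

```python
from collections import deque

# placeholder urls, used only for illustration
queue = deque(["http://example.com/a", "http://example.com/b"])

# newly discovered links are appended on the right
queue.append("http://example.com/c")

# the oldest url is taken from the left (first in, first out)
first = queue.popleft()
print(first)
print(list(queue))
```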

Next, we need to store the processed urls somewhere so as not to process them twice. I chose a set type, since we need to keep unique values and be able to search among them:

# a set of urls that we have already crawled
processed_urls = set()
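
One possible refinement, sketched here and not used in the rest of this post: checking whether a url is already in a deque scans the whole queue, while a set lookup takes roughly constant time. Keeping a single set of every url ever enqueued would cover both the “already queued” and the “already processed” checks. The names seen and enqueue are hypothetical:

```python
from collections import deque

new_urls = deque()
seen = set()  # every url that was ever enqueued (still queued or already processed)

def enqueue(url):
    """Add url to the queue unless it has been seen before."""
    if url in seen:
        return False
    seen.add(url)
    new_urls.append(url)
    return True

first_try = enqueue("http://example.com/")   # placeholder url
second_try = enqueue("http://example.com/")  # duplicate, rejected
print(first_try, second_try, len(new_urls))
```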

In the emails collection we will keep the collected email addresses:

# a set of crawled emails
emails = set()

Let’s start scraping. We’ll keep going until there are no urls left in the queue. As soon as we take a url out of the queue, we add it to the set of processed urls, so that we do not visit it again later:

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

Then we need to extract some base parts of the current url; this is necessary for converting relative links found in the document into absolute ones:

# extract base url and path to resolve relative links
parts = urlsplit(url)
base_url = "{0.scheme}://{0.netloc}".format(parts)
path = url[:url.rfind('/')+1] if '/' in parts.path else url
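
As an illustration of what these lines compute, the sketch below runs them on a made-up url (the example.com address is an assumption for the demo):

```python
from urllib.parse import urlsplit

url = "http://example.com/news/contact.html"  # hypothetical url
parts = urlsplit(url)

# scheme and host only, used for links that start with a slash
base_url = "{0.scheme}://{0.netloc}".format(parts)
# everything up to and including the last slash, used for other relative links
path = url[:url.rfind('/') + 1] if '/' in parts.path else url

print(base_url)
print(path)
```

Here base_url is http://example.com and path is http://example.com/news/.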

The following code gets the page content from the web. If it encounters an error it simply goes to the next page:

# get url's content
print("Processing %s" % url)
try:
    response = requests.get(url)
except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
    # ignore pages with errors 
    continue

Once we have the page, we can search it for new emails and add them to our set. For email extraction I use a simple regular expression that matches email addresses:

# extract all email addresses and add them into the resulting set 
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
emails.update(new_emails)
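
To get a feel for what this pattern does and does not match, here is a small sketch run on a made-up piece of html (the sample text and the EMAIL_RE name are assumptions for the demo):

```python
import re

# the same pattern as in the crawler, compiled once for reuse
EMAIL_RE = re.compile(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", re.I)

sample = ('Write to <a href="mailto:Editor@example.com">the editor</a> '
          'or ads@example.co.uk, not to user@localhost.')

found = set(EMAIL_RE.findall(sample))
print(found)

# the pattern is permissive: an image file name like this one also matches
false_positive = EMAIL_RE.findall("logo@2x.png")
print(false_positive)
```

The pattern requires a dot followed by letters after the @ sign, so user@localhost is skipped, but it cannot tell a real address from something like logo@2x.png. If that matters for your pages, you may want to filter the results afterwards.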

After we have processed the current page, let’s find links to other pages and add them to our url queue (this is what the crawling is about). Here I use the BeautifulSoup library for parsing the page’s html:

# create a BeautifulSoup object for the html document
# (the parser is named explicitly so that bs4 does not have to guess one)
soup = BeautifulSoup(response.text, "html.parser")

The find_all method of this library extracts page elements according to the tag name (<a> in our case):

# find and process all the anchors in the document
for anchor in soup.find_all("a"):

Some <a> tags may not contain a link at all, so we need to take this into account:

# extract link url from the anchor
link = anchor.attrs["href"] if "href" in anchor.attrs else ''

If the link address starts with a slash, we treat it as a relative link and prepend the base url to it:

# add base url to relative links
if link.startswith('/'):
    link = base_url + link

Now, if we have gotten a valid link (starting with “http”) and we don’t have it in our url queue, and we haven’t processed it before, then we can add it to the queue for further processing:

# add the new url to the queue if it's of HTTP protocol, not enqueued and not processed yet
if link.startswith('http') and link not in new_urls and link not in processed_urls:
    new_urls.append(link)
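
The manual prefixing above only handles links that start with a slash. As an alternative sketch (not the approach used in this post), the standard library’s urljoin resolves any relative form against the page url, and checking the scheme afterwards filters out mailto: and similar links. The helper name resolve_link and the example.com urls are assumptions for illustration:

```python
from urllib.parse import urldefrag, urljoin, urlsplit

def resolve_link(page_url, href):
    """Return an absolute http(s) url for href, or None if it is not crawlable."""
    absolute = urljoin(page_url, href)
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None
    # drop the fragment so page.html and page.html#top count as one url
    return urldefrag(absolute)[0]

page = "http://example.com/news/index.html"  # placeholder page url
print(resolve_link(page, "/contact"))
print(resolve_link(page, "staff.html"))
print(resolve_link(page, "#top"))
print(resolve_link(page, "mailto:editor@example.com"))
```

A link for which the helper returns None would simply not be added to the queue.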

That’s it. Here is the complete code of this simple email crawler.

A Simple Email Crawler (full code)
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/index.php'])

# a set of urls that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process urls one by one until we exhaust the queue
while len(new_urls):
	# move next url from the queue to the set of processed urls
	url = new_urls.popleft()
	processed_urls.add(url)

	# extract base url to resolve relative links
	parts = urlsplit(url)
	base_url = "{0.scheme}://{0.netloc}".format(parts)
	path = url[:url.rfind('/')+1] if '/' in parts.path else url

	# get url's content
	print("Processing %s" % url)
	try:
		response = requests.get(url)
	except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
		# ignore pages with errors
		continue

	# extract all email addresses and add them into the resulting set
	new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
	emails.update(new_emails)

	# create a BeautifulSoup object for the html document
	# (the parser is named explicitly so that bs4 does not have to guess one)
	soup = BeautifulSoup(response.text, "html.parser")

	# find and process all the anchors in the document
	for anchor in soup.find_all("a"):
		# extract link url from the anchor
		link = anchor.attrs["href"] if "href" in anchor.attrs else ''
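		# skip empty links, fragments and non-web schemes such as mailto: or tel:
		if not link or link.startswith(('#', 'mailto:', 'tel:', 'javascript:')):
			continue
		# resolve relative links
		if link.startswith('/'):
			link = base_url + link
		elif not link.startswith('http'):
			link = path + link
		# add the new url to the queue if it was not enqueued nor processed yet
		if link not in new_urls and link not in processed_urls:
			new_urls.append(link)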

This crawler is simple and lacks several features (like saving the found emails to a file), but it demonstrates the basic principles of email crawling. I leave further improvement to you.
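
For example, saving the collected addresses could look like the sketch below. The function name save_emails, the file name emails.txt and the sample addresses are placeholders. In the crawler you would pass the emails set once the while loop finishes, or call it periodically inside the loop, since the crawl may run for a long time:

```python
def save_emails(emails, filename="emails.txt"):
    """Write the addresses to a text file, one per line, in sorted order."""
    with open(filename, "w", encoding="utf-8") as f:
        for email in sorted(emails):
            f.write(email + "\n")

# placeholder data standing in for the crawler's emails set
save_emails({"b@example.com", "a@example.com"})
```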

If you want to study more on handling sessions, cookies and auth token in the requests library in Python, please refer to here.

And of course, if you have any questions, suggestions or corrections feel free to comment on this post below.

Have a nice day!

64 replies on “A Simple Email Crawler in Python”

Hi, what would you do regarding project honeypot: http://www.projecthoneypot.org/about_us.php

“We also work with law enforcement authorities to track down and prosecute spammers. Harvesting email addresses from websites is illegal under several anti-spam laws, and the data resulting from Project Honey Pot is critical for finding those breaking the law.”

Do you think rotating proxies is sufficient?

Great blog by the way

If you put legality issues aside, then yes, rotating proxies is a good means of keeping privacy in this case as well. Also it’s a good thing to make a pause between requests to the same website.

For now the best option is Python, because it has multiple web scraping libraries available.
As far as speed is concerned, it’s not the language but rather the server requesting the web pages (incl. its configuration) that plays the main role in fast content extraction.

Is there a way I can use something like raw_input variable to deque such as:
URL = raw_input()
new_urls = deque([URL])

When I do it says:
Processing URL — whatever the user inputs
set([])

After that it just stops. I’m assuming I need to add something after ([URL]) to tell it to process further.

Any thoughts?

Thanks,

Does anybody know how to build a script to extract emails from eBay? For example, from ebay.co.uk, collecting only the domains I need, e.g. @btinternet.com? Something like an eBay scraper script for Linux. If anyone has any idea, please respond to my comment. Thanks

Hello,

I have no programming knowledge, but I am looking for a tool to obtain specific email addresses.
I have tried a great many programs and tools so far, but the results were not exactly a success.
My question is: is it possible to obtain the email address from a particular website?

Thank you for any answer.
Regards,
Matthias

Hello, can you please explain to me how to use the code? I ran it with Python, following the post, and it says:
line 4, in
from urllib.parse import urlsplit
ImportError: No module named parse
Please help me understand this. I am still very new to coding. Thanks.

2 import requests
3 import requests.exceptions
—-> 4 from urllib.parse import urlsplit
5 from collections import deque
6 import re

ImportError: No module named parse

Thanks for providing this code–it’s exactly what I was looking for!! I have some very newbish questions for you guys:

(1) Where are the emails saved once it’s done crawling?
(2) It seems to have a never-ending set of URLs. Is there a way I can stop it once it gets off the ‘path’ I want it to be on and still collect the emails?

(1) You might save the emails in a database. If you do not need a fast associative search, you might save them in a file.
(2) You might stop by checking whether each received email matches a criterion.
Does that make it clear?

Sorry, no. I’m completely new to this. When I run the code I see that it is processing sites, but I have absolutely no clue where the results populate (command, text, csv, etc.) Do I need to add additional code to print it somewhere?

Sure. The results are collected into a set here:

# extract all email addresses and add them into the resulting set
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
emails.update(new_emails)

So you can process the new_emails set as you wish.
emails.update(new_emails) only adds its contents to the emails set.
If you want the whole html text of each processed url, use response.text.

Hi, I want to gather only the email addresses of those who make posts, ask questions, etc. in various forums on online gambling related websites. Will the above web crawler do this? Please advise. Don

Is there a way to speed up the process by going straight to the contact page? Also, how would you extract social profiles in addition to the emails?

Regards

“mailto”, “tel” and other prefixes contained in anchor tags cause this crawler to loop infinitely. It mistakes them for part of a relative URL path and adds them to the queue, then keeps appending these broken URLs to the wrong paths.

Hello, I really like the code you have put together. It has been processing for the last 12 hours lol! Erm, how do I get the emails after processing has finished? I’m a bit confused.
