I often receive requests asking about email crawling. This topic is clearly of interest to those who want to scrape contact information from the web (direct marketers, for example), and we have previously mentioned GSA Email Spider as an off-the-shelf solution for email crawling. In this article I want to demonstrate how easy it is to build a simple email crawler in Python. The crawler is simple, but you can learn many things from this example (especially if you’re new to scraping in Python).
I purposely simplified the code as much as possible to distill the main idea and let you add any extra features yourself later if necessary. Despite its simplicity, though, the code is fully functional and can extract plenty of email addresses from the web for you. Note also that this code is written in Python 3.
Ok, let’s move from words to deeds. I’ll walk through the code portion by portion, commenting on what’s going on. If you need the whole program, you can find it at the bottom of the post.
Jump to the full code.
Let’s import all the necessary libraries first. In this example I use BeautifulSoup and Requests as third-party libraries (both can be installed with pip) and urllib, collections and re as built-in libraries. BeautifulSoup provides a simple way of searching an HTML document, and the Requests library allows you to easily perform web requests.
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re
The following piece of code defines a list of urls to start crawling from. As an example I chose “The Moscow Times” website, since it exposes a nice list of emails. You can add any number of urls you want to start scraping from. Although this collection could be a plain list (in Python terms), I chose a deque, since it better fits the way we will use it:
# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/'])
Next, we need to store the processed urls somewhere so as not to process them twice. I chose a set type, since we need to keep unique values and be able to search among them:
# a set of urls that we have already crawled
processed_urls = set()
In the emails collection we will keep the collected email addresses:
# a set of crawled emails
emails = set()
Let’s start scraping. We’ll keep going until there are no urls left in the queue. As soon as we take a url out of the queue, we add it to the set of processed urls, so that we do not forget about it in the future:
# process urls one by one until we exhaust the queue
while len(new_urls):
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)
Then we need to extract some base parts of the current url; this is necessary for converting relative links found in the document into absolute ones:
    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url
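To make this concrete, here is what these expressions produce for the start url used in the full code below:

from urllib.parse import urlsplit

url = 'http://www.themoscowtimes.com/contact_us/index.php'
parts = urlsplit(url)
print(parts.scheme)                             # http
print(parts.netloc)                             # www.themoscowtimes.com
print(parts.path)                               # /contact_us/index.php
print("{0.scheme}://{0.netloc}".format(parts))  # http://www.themoscowtimes.com
print(url[:url.rfind('/') + 1])                 # http://www.themoscowtimes.com/contact_us/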
The following code gets the page content from the web. If it encounters an error it simply goes to the next page:
    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue
Once we have the page content, we can search it for email addresses and add any new ones to our set. For the extraction I use a simple regular expression that matches email addresses:
    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
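Just to illustrate what this regular expression catches, here is a quick standalone check on a made-up string (the addresses are, of course, fictional):

import re

sample = "Write to info@example.com or Sales@Example.org for details."
print(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", sample, re.I))
# ['info@example.com', 'Sales@Example.org']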
After we have processed the current page, let’s find links to other pages and add them to our url queue (this is what crawling is all about). Here I use the BeautifulSoup library to parse the page’s HTML:
    # create a BeautifulSoup object for the html document
    soup = BeautifulSoup(response.text, "html.parser")
The find_all method of this library extracts page elements by tag name (<a> in our case):
    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
Some <a> tags may not contain an href attribute at all, so we need to take this into account:
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
If the link address starts with a slash, it is a relative link, and we need to prepend the base url to it:
        # add base url to relative links
        if link.startswith('/'):
            link = base_url + link
Now, if the link is a valid absolute url (it starts with “http”), is not already in our url queue, and has not been processed before, we can add it to the queue for further processing:
        # add the new url to the queue if it's of HTTP protocol, not enqueued and not processed yet
        if link.startswith('http') and link not in new_urls and link not in processed_urls:
            new_urls.append(link)
That’s it. Here is the complete code of this simple email crawler.
A Simple Email Crawler (full code)
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re

# a queue of urls to be crawled
new_urls = deque(['http://www.themoscowtimes.com/contact_us/index.php'])

# a set of urls that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process urls one by one until we exhaust the queue
while len(new_urls):
    # move next url from the queue to the set of processed urls
    url = new_urls.popleft()
    processed_urls.add(url)

    # extract base url and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # get url's content
    print("Processing %s" % url)
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # ignore pages with errors
        continue

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)

    # create a BeautifulSoup object for the html document
    soup = BeautifulSoup(response.text, "html.parser")

    # find and process all the anchors in the document
    for anchor in soup.find_all("a"):
        # extract link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it was not enqueued nor processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)
This crawler is simple and lacks several features (like saving the found emails to a file), but it shows you the basic principles of email crawling. I leave it to you for further improvement.
And of course, if you have any questions, suggestions or corrections feel free to comment on this post below.
Have a nice day!
Replies on “A Simple Email Crawler in Python”
Hi, what would you do regarding project honeypot: http://www.projecthoneypot.org/about_us.php
“We also work with law enforcement authorities to track down and prosecute spammers. Harvesting email addresses from websites is illegal under several anti-spam laws, and the data resulting from Project Honey Pot is critical for finding those breaking the law.”
Do you think rotating proxies is sufficient?
Great blog by the way
If you put legality issues aside, then yes, rotating proxies is a good means of keeping privacy in this case as well. It’s also a good idea to pause between requests to the same website.
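For example, a minimal sketch of both ideas with the Requests library (the proxy addresses below are placeholders, not real servers):

import time
import requests

# placeholder proxy addresses; replace them with proxies you actually have access to
proxies_pool = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:3128'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:3128'},
]

urls = ['http://www.example.com/', 'http://www.example.com/contact/']
for i, url in enumerate(urls):
    proxy = proxies_pool[i % len(proxies_pool)]  # rotate proxies in round-robin order
    try:
        response = requests.get(url, proxies=proxy, timeout=10)
        print(url, response.status_code)
    except requests.exceptions.RequestException:
        continue  # skip urls that fail through the current proxy
    time.sleep(2)  # pause before the next request to the same website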
What is the fastest or best language for a web scraping?
For now the best option is Python, since it has multiple web scraping libraries available.
As far as speed is concerned, it is not the language that plays the main role in fast content extraction, but rather the server (including its configuration) that requests the web pages.
Can’t read the ends of the longer lines of code.
@Joe fixed
Is there a way I can use something like a raw_input variable with the deque, such as:
URL = raw_input()
new_urls = deque([URL])
When I do it says:
Processing URL — whatever the user inputs
set([])
After that it just stops. I’m assuming I need to add something after ([URL]) to tell it to process further.
Any thoughts?
Thanks,
N/M I just didn’t read the code correctly.
I changed it to:
print "Enter a file name:",
URL = raw_input()
new_urls = deque(['http://' + URL])
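For reference, since the article’s code targets Python 3, the same idea there would use input() instead of raw_input() and print() as a function:

from collections import deque

# Python 3 version of the snippet above
URL = input("Enter a domain name: ")
new_urls = deque(['http://' + URL])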
Amazing code! How would I export the results into an Excel table with the emails + urls?
I’d recommend saving the results into a CSV file, which can be read by Excel (and converted into .xlsx format from there).
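For example, a minimal sketch with the standard csv module, assuming you also collect (url, email) pairs into a rows list inside the crawl loop (rows is an addition, not part of the article’s code):

import csv

rows = []  # collect (url, email) pairs here

# ... inside the crawl loop, right after emails.update(new_emails):
#     rows.extend((url, email) for email in new_emails)

# after the queue is exhausted, write everything to a CSV file that Excel can open
with open('emails.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'email'])
    writer.writerows(rows)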
How do I save this?
Where do I insert this code to start it working?
Can I have a list of websites, and then get this code to crawl them for phone numbers?
Thanks,
Si
Does anybody know how to build a script to extract emails from eBay? For example ebay.co.uk, collecting only the domains I need, e.g. @btinternet.com? Something like an eBay scraper Linux script. If anyone has any idea, please respond to my comment. Thanks
Hello,
I have no programming knowledge, but I am looking for a tool to get hold of specific email addresses.
I have tried a great many programs and tools so far, but the results were not exactly a success.
My question would be: is it possible to obtain the email address from a particular website?
Many thanks in advance for an answer.
Regards,
Matthias
Sure, you can obtain email addresses from a particular website. Take a look at this article. Web Data Extractor is a good piece of software for that.
This will be faster if you have multiple threads running simultaneously. This is kind of a joke project, but here’s how I did it: https://github.com/gojefferson/email-crawler
Hello, can you please explain to me how to use the code? I tried Python, following the code, and it says:
line 4, in
from urllib.parse import urlsplit
ImportError: No module named parse
Please help me understand this more, I am still very new to coding. Thanks.
Graham, if you get *ImportError: No module named parse*, it means there is no *parse* module in your *urllib*.
What version of Python do you use? *Note also that this code is written in Python 3*; in Python 2 there is no urllib.parse module.
2 import requests
3 import requests.exceptions
----> 4 from urllib.parse import urlsplit
5 from collections import deque
6 import re
ImportError: No module named parse
Thanks for providing this code, it’s exactly what I was looking for!! I have some very newbish questions for you guys:
(1) Where do the e-mails save once it’s done crawling?
(2) It seems to have a never-ending set of URLs. Is there a way I can stop it once it gets off the ‘path’ I want it to be on and still collect the emails?
(1) You might save the emails in a database. If you do not need a fast associative search, you might save them in a file.
(2) You might stop by checking every received email to see if it matches a criterion.
Does that make it clear?
Sorry, no. I’m completely new to this. When I run the code I see that it is processing sites, but I have absolutely no clue where the results populate (command, text, csv, etc.) Do I need to add additional code to print it somewhere?
Sure. Results are populated into the emails set here:
# extract all email addresses and add them into the resulting set
new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
emails.update(new_emails)
So you might process the new_emails set as you wish.
emails.update(new_emails) only adds them to the emails set.
If you want the whole HTML text for each processed url, use response.text.
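In other words, nothing is saved anywhere by default; once the loop finishes you can print or save the emails set yourself, for example (the addresses below are placeholder data standing in for the crawled results):

# placeholder data standing in for the set the crawler fills in
emails = {'editor@example.com', 'ads@example.com'}

# print the results to the console
for email in sorted(emails):
    print(email)

# or save them to a plain text file, one address per line
with open('emails.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sorted(emails)) + '\n')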
Hi, I want to gather the email addresses only of those who make posts, ask questions etc. in various forums on online gambling related websites. Will the above web crawler do this? Please advise. Don
Dear Don, we do not do any illegal activity.
Hi!
Great script! Just one question, how to tell the script not to follow external links?
Have a good day 🙂
H
You might find this comment useful.
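One simple approach, sketched below, is to compare each link’s host with the host of the start url and only enqueue links that share it (the is_internal helper is an illustration, not part of the article’s code):

from urllib.parse import urlsplit

def is_internal(link, start_url):
    """Return True if link points to the same host as start_url."""
    return urlsplit(link).netloc == urlsplit(start_url).netloc

start = 'http://www.themoscowtimes.com/contact_us/'
print(is_internal('http://www.themoscowtimes.com/about/', start))  # True
print(is_internal('http://twitter.com/moscowtimes', start))        # False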
Is there a way to speed up the process so it goes straight to the contact page? Also, how would you achieve extracting social profiles too, apart from the email?
Regards
“mailto” and “tel” or other prefixes contained in anchor tags cause this crawler to loop infinitely: it mistakes them for part of the relative URL path, adds them to the queue, and then keeps combining these malformed URLs with the wrong paths.
I edited your script: https://pastebin.com/0pJ2MdgV. It doesn’t get stuck on non-URL anchors and it doesn’t crawl outside your domain.
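The core of such a fix is a small guard that drops non-web hrefs before they ever reach the queue, for example (the is_crawlable helper is an illustration, not the pastebin version):

def is_crawlable(link):
    """Return False for hrefs that should not go into the url queue."""
    return not link.startswith(('mailto:', 'tel:', 'javascript:', '#'))

print(is_crawlable('mailto:info@example.com'))  # False
print(is_crawlable('tel:+1234567890'))          # False
print(is_crawlable('/contact_us/'))             # True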
A solution of 5 lines:
http://buklijas.info/blog/2018/03/15/find-emails-on-web-page/
Hello, I really like the code you have just put together. I have been processing for the last 12 hours lol! Erm, how do I get the emails after processing has finished? I’m a bit confused.
How about if we search the emails by a keyword or multiple keywords?
Hi,
I tried to run this program to test it on my web site (jerusalemprogrammer.com), which only has one e-mail.
Unfortunately, the program does not work.
Woah! I thought it would work but it somehow brings no results. Can anybody lend some help?