Categories
Development

How to extract emails, phones, links (urls) from text fragments?

Recently I noticed a question about extracting emails, phone numbers, and links (URLs) from text fragments, and I immediately decided to write this short post.

Regex comes to the rescue

Emails, phone numbers, and links each form a category of text that matches a certain pattern. How do we describe such patterns? With regexes, short for regular expressions. For example, most emails fit the following regex pattern:

^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
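As written, the pattern is anchored with ^ and $, so it only matches a string that consists entirely of one email address. To pull addresses out of a longer text fragment, the anchors have to go. Below is a minimal Python sketch of that idea. The email pattern reuses the character classes from above; the URL and phone patterns, and the function name extract_contacts, are my own loose assumptions for illustration and will not cover every real-world format.

```python
import re

# Email: same character classes as the pattern above, minus the ^ and $
# anchors, so it can match inside a larger text.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")

# URL (assumed, deliberately loose): http:// or https:// followed by
# anything that is not whitespace, an angle bracket, or a quote.
URL_RE = re.compile(r"https?://[^\s<>\"']+")

# Phone (assumed, deliberately loose): optional +, a digit, at least seven
# digits/spaces/brackets/dots/dashes, and a final digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def extract_contacts(text):
    """Return all emails, URLs, and phone-like strings found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "urls": URL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


if __name__ == "__main__":
    sample = ("Write to john.doe@example.com or call +1 555-123-4567. "
              "See https://example.com/contact for details.")
    print(extract_contacts(sample))
```

Loose patterns like these tend to over-match (for example, a trailing dot after an address at the end of a sentence can be swallowed into the email match), so the results usually need a cleanup pass.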


A Simple Email Crawler in Python

I often receive questions about email crawling. The topic clearly interests people who want to scrape contact information from the web (direct marketers, for example), and we have previously mentioned GSA Email Spider as an off-the-shelf solution. In this article I want to demonstrate how easy it is to build an email crawler in Python. The crawler is simple, but you can learn a lot from it, especially if you’re new to scraping in Python.
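To give a rough idea of what such a crawler involves, here is a minimal sketch using only the Python standard library. It is not necessarily the crawler this article goes on to build: the names fetch and crawl_emails, the max_pages limit, and the regex-based link extraction are all assumptions made for illustration. It walks the pages of a single host breadth-first and collects anything that looks like an email address.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlsplit
from urllib.request import urlopen

# Unanchored email pattern, so it can match inside page source.
EMAIL_RE = re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+")
# Naive link extraction (assumption): grab the value of every href attribute.
HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)


def fetch(url):
    """Hypothetical helper: download a page and return it as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_emails(start_url, max_pages=20, fetch_page=fetch):
    """Breadth-first crawl of one host, returning the set of emails found."""
    host = urlsplit(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    emails = set()
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        try:
            html = fetch_page(url)
        except Exception:
            continue  # skip pages that fail to download
        emails.update(EMAIL_RE.findall(html))
        for href in HREF_RE.findall(html):
            # Resolve relative links and drop any #fragment part.
            link = urljoin(url, href).split("#")[0]
            # Stay on the starting host and visit each URL only once.
            if urlsplit(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return emails
```

Calling crawl_emails("http://example.com/") would crawl up to 20 pages of that host. The fetch_page parameter exists so the download step can be swapped out, for instance with a function that serves canned HTML while testing. A real crawler should also respect robots.txt and rate limits, which this sketch leaves out.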