Recently I noticed a question about extracting emails, phone numbers, and links (URLs) from text fragments, so I decided to write this short post.
Regex comes to the rescue
Emails, phone numbers, and links each form a category that matches a certain text pattern. How are such text patterns described? With regexes (regex patterns), short for regular expressions. E.g., most emails fit the following regex pattern:
^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$
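The `^` and `$` anchors mean the pattern above only matches when the whole string is an email, which suits validation more than extraction. Below is a minimal sketch of the difference using PHP's PCRE functions. The variable names and sample strings are made up for illustration, and the hyphen is written last in the final character class, which keeps the same character set.

```php
<?php
// Anchored: the entire string must be an email (validation).
$anchored = '/^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+$/';
// Unanchored: finds emails anywhere inside a longer text (extraction).
$unanchored = '/[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+/';

// Hypothetical sample inputs.
$single   = 'iron.sane56@gmail.com';
$sentence = 'Write to iron.sane56@gmail.com today';

$validSingle   = preg_match($anchored, $single);   // 1: the whole string is an email
$validSentence = preg_match($anchored, $sentence); // 0: there are extra words around it
preg_match_all($unanchored, $sentence, $found);    // $found[0] holds the extracted email
```

So for pulling emails out of running text, dropping the anchors is usually what you want.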
Phone numbers match somewhat simpler patterns, but international notations vary, so you have to pay attention to them. Phone regex:
^[+]*[(]{0,1}[0-9]{1,4}[)]{0,1}[-\s\./0-9]*$
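To get a feel for what this phone pattern accepts, here is a rough sketch that checks a few candidate strings against it. The sample numbers are invented, and `~` is used as the regex delimiter (an arbitrary choice) so that the `/` inside the character class needs no escaping.

```php
<?php
// The phone pattern from above, wrapped in "~" delimiters.
$phonePattern = '~^[+]*[(]{0,1}[0-9]{1,4}[)]{0,1}[-\s\./0-9]*$~';

// Hypothetical sample inputs.
$candidates = ['+1 555 010 9999', '(044) 123-4567', '044.123.4567', 'not a phone'];

$valid = [];
foreach ($candidates as $candidate) {
    // preg_match() returns 1 when the whole string fits the anchored pattern.
    if (preg_match($phonePattern, $candidate) === 1) {
        $valid[] = $candidate;
    }
}
print_r($valid); // the three numeric candidates are kept
```

Two limits are worth knowing. A notation like `+1 (555) 010-9999` is rejected, because parentheses are only allowed around the first digit group. A bare string like `42` is accepted, so an extra length check may be worthwhile.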
URLs are usually wrapped inside an a tag: <a href="https://www.test-site.com">Test site</a>. For a simple link, consider the following regex:
(https?:\/\/)?([\w\-])+\.{1}([a-zA-Z]{2,63})([\/\w-]*)*\/?\??([^#\n\r]*)?#?([^\n\r]*)
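When links sit inside a tags, one option is to capture the href value directly. The sketch below uses a deliberately simplified pattern of my own, not the URL regex above, and assumes the href value is quoted. The HTML fragment is invented.

```php
<?php
// Hypothetical HTML fragment with two anchor tags.
$html = <<<'HTML'
<p><a href="https://www.test-site.com" >Test site</a>
<a class="ext" href='http://example.org/page?id=1'>Other</a></p>
HTML;

// Illustrative pattern: capture the quoted href value of each <a> tag.
$hrefPattern = '~<a\s[^>]*href=["\']([^"\']+)["\']~i';

preg_match_all($hrefPattern, $html, $matches);
$links = $matches[1]; // group 1 holds the captured URLs

foreach ($links as $link) {
    echo $link . PHP_EOL;
}
```

For messy real-world HTML, a DOM parser such as PHP's DOMDocument is likely to be more robust than a regex.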
View the separate post on email regexes.
1. Identify regex patterns
So your first step is to identify the right regexes for extracting emails, phones, and links from text.
2. The tools to extract
Once you have the regexes you need, you can apply them using one of the following options.
Option A. Scripting language.
Note that, as the post on email regexes shows, regex patterns differ depending on the language. Since PHP is widely used, here is a short example of extracting emails in PHP:
<?php
// Sample text containing two email addresses.
$text = "...some text with a couple of emails: iron.sane56@gmail.com and toom-kelly@none.tv";
// No ^ and $ anchors here, so the pattern can match inside a longer text.
$pattern = '/[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+/';
// $matches[0] receives every full match found in $text.
preg_match_all($pattern, $text, $matches);
foreach ($matches[0] as $match) {
    echo $match . PHP_EOL;
}
See an extensive post on using regexes in PHP.
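If you need several categories at once, the same approach extends naturally. Here is a minimal sketch: `extract_all` is a hypothetical helper name, and the link pattern is a simplified stand-in assumed only for this example.

```php
<?php
/**
 * Hypothetical helper: run several named patterns over one text
 * and return the unique matches per category.
 */
function extract_all(array $patterns, string $text): array
{
    $result = [];
    foreach ($patterns as $name => $pattern) {
        preg_match_all($pattern, $text, $matches);
        // array_unique() drops repeats; array_values() reindexes from 0.
        $result[$name] = array_values(array_unique($matches[0]));
    }
    return $result;
}

$patterns = [
    'emails' => '/[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+/',
    // Simplified URL pattern assumed for this sketch only.
    'links'  => '~https?://[^\s"\'<>]+~',
];

$text = 'Mail toom-kelly@none.tv or toom-kelly@none.tv, see https://www.test-site.com/about for details';
print_r(extract_all($patterns, $text));
```

Each category comes back as its own list, with the repeated email reported only once.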
Option B. Online extraction tools
The simplest extraction tools are regex testers.
Their main purpose is to help users compose and refine regexes, but you can also use them for a one-off extraction task.
Read the Ten+ best online regex testers post to view some of them.