
Simple Java email crawler

In this post we share the code of a simple Java email crawler. It collects email addresses from a given website, crawling to unlimited depth. A previous post covered a simple email crawler in Python.

Init settings

Our crawler uses the following imports. For fetching and parsing web pages we use the third-party jsoup library (available from Maven Central as org.jsoup:jsoup); it provides many methods for extracting and modifying web data. We also use classes from the standard java.io, java.util, and java.net packages.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;
import java.net.URL;

Init page and its host name

We will use the host name of the init page so that our code only visits pages of the given site.

String givenURL = "https://www.xing.com/companies";
String authority; // host name		
{
        URL mainURL = new URL(givenURL);
        authority = mainURL.getAuthority();
}
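
The getAuthority() method returns the host (and port, if any) of the URL. A quick sketch, reusing the init URL above, shows what ends up in the authority variable:

URL mainURL = new URL("https://www.xing.com/companies");
System.out.println(mainURL.getAuthority()); // prints: www.xing.com
System.out.println(mainURL.getHost());      // also www.xing.com, since this URL has no explicit port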

Before the main loop starts, we compile a regex pattern for email addresses and create lists for storing the data (emails and URLs).

String regex = "[a-zA-Z0-9\\.\\-\\_]+@[a-zA-Z]+[\\.]{1}[a-zA-Z]{2,4}";
Pattern pattern = Pattern.compile(regex);
		
ArrayList<String> listOfURL = new ArrayList<String>();
ArrayList<String> listOfEmail = new ArrayList<String>();
listOfURL.add(givenURL);
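
To see what this simple pattern actually catches, here is a small illustration with made-up addresses. Note that the pattern only allows letters in the domain name and a top-level domain of 2-4 letters, so some valid addresses are missed or truncated:

String sample = "Write to john.doe@example.com, user@my-site.co.uk or info@example.photography";
Matcher m = pattern.matcher(sample);
while(m.find())
	System.out.println(m.group());
// prints:
// john.doe@example.com   (matched in full)
// info@example.phot      (TLD cut off at 4 letters)
// user@my-site.co.uk is skipped entirely, because '-' is not allowed in the domain part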

Main loop

In the main loop we iterate over the list of links to pages of the given site. The initial link was added to this list in the previous piece of code.

for(int i = 0; i < listOfURL.size(); i++)
{
       //...
}

Note that the listOfURL list grows during the crawl: newly found links are appended to it on each iteration. An index-based loop is used (rather than a for-each loop) because the list is modified while it is being traversed; an Iterator would throw a ConcurrentModificationException.
The loop body consists of three parts:

  1. The URL is taken from the list and the HTML page at that address is fetched. The page is stored as a Document object of the org.jsoup.nodes.Document class.
    givenURL = listOfURL.get(i);
    Document doc = Jsoup.connect(givenURL).get();
  2. The following piece of code finds and saves email addresses. The page text is handed to a Matcher built from the compiled pattern; every match is added to the list, skipping duplicates.
    String siteText = doc.text();
    Matcher matcher = pattern.matcher(siteText);
    while(matcher.find())
    {
    	if(!listOfEmail.contains(matcher.group()))
    	       listOfEmail.add(matcher.group());
    }
  3. The last part of the code finds links and stores them in the listOfURL list. Only links to pages on the same host are added, and only if they are not already in the list. First we query the Document for 'a' tags that carry an 'href' attribute; the select method accepts CSS-style selectors and returns the matching tags as an Elements collection. We then iterate over the found elements and store their URLs as plain strings in listOfURL (not the whole 'a' tag).
    To get the full address (i.e. to resolve relative paths), the attr method is called with the 'abs:' prefix, 'abs:href', rather than plain 'href'. The resulting address is wrapped in a URL instance so its host can be compared with the host name of the init page. If the hosts match, the URL is checked for duplicates before being saved to listOfURL. A caveat about links that cannot be resolved follows right after the snippet.

    Elements scrapedUrls = doc.select("a[href]");
    for(Element tag_a : scrapedUrls)
    {
    	String str = tag_a.attr("abs:href");
    	URL url = new URL(str);
    	if(authority.equals(url.getAuthority()))
    	{
    		if(!listOfURL.contains(str))
    			listOfURL.add(str);
    	}
    }
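
One caveat about this part: attr("abs:href") returns an empty string for links that cannot be resolved to an absolute URL (for example javascript: links), and new URL("") then throws a MalformedURLException that ends the whole crawl. A minimal, optional guard, not part of the original code, might look like this as the loop body:

String str = tag_a.attr("abs:href");
if(str.isEmpty() || !(str.startsWith("http://") || str.startsWith("https://")))
	continue; // skip unresolvable, javascript:, mailto: and similar links
URL url = new URL(str);
if(authority.equals(url.getAuthority()) && !listOfURL.contains(str))
	listOfURL.add(str);

Restricting the scheme to http(s) also keeps mailto: and tel: links out of the crawl queue.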

Full code

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.util.ArrayList;
import java.net.URL;

class JEmailCrawler
{
	public static void main(String[] args) throws IOException
	{
		String givenURL = "https://www.xing.com/companies";
		String authority; // host name
		
		{
			URL mainURL = new URL(givenURL);
			authority = mainURL.getAuthority();
		}
		
		String regex = "[a-zA-Z0-9\\.\\-\\_]+@[a-zA-Z]+[\\.]{1}[a-zA-Z]{2,4}";
		Pattern pattern = Pattern.compile(regex);
		
		ArrayList<String> listOfURL = new ArrayList<String>();
		ArrayList<String> listOfEmail = new ArrayList<String>();
		listOfURL.add(givenURL);
		
		// main process/crawler loop
		for(int i = 0; i < listOfURL.size(); i++)
		{
			givenURL = listOfURL.get(i);
			System.out.print("Connect to " + givenURL + " ");
				
			Document doc = Jsoup.connect(givenURL).get();
			
			// search and save email addresses (without duplication)
			String siteText = doc.text();
			Matcher matcher = pattern.matcher(siteText);
			while(matcher.find())
			{
				if(!listOfEmail.contains(matcher.group()))
					listOfEmail.add(matcher.group());
			}
			
			// search and save URLs without duplication
			Elements scrapedUrls = doc.select("a[href]");
			for(Element tag_a : scrapedUrls)
			{
				String str = tag_a.attr("abs:href");
				URL url = new URL(str);
				if(authority.equals(url.getAuthority()))
				{
					if(!listOfURL.contains(str))
						listOfURL.add(str);
				}
			}
			System.out.print(" --- Found links (" + scrapedUrls.size() + ") ");
			System.out.println("Saved links (" + listOfURL.size() + ") ");
		}
		System.out.println("Total links : " + listOfURL.size());
		System.out.println("Total Email Address : " + listOfEmail.size());
	}
}
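
As written, the program stops at the first page that fails to load, because Jsoup.connect(givenURL).get() throws an IOException for HTTP error statuses, timeouts, and non-HTML responses. If you would rather have the crawl continue past such pages, one possible variation (our own sketch, not part of the original code) is to catch the exception per page inside the loop:

Document doc;
try {
	doc = Jsoup.connect(givenURL).get();
} catch (IOException e) {
	System.out.println(" --- Skipped (" + e.getMessage() + ")");
	continue; // move on to the next saved URL
}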

For those who want to check the result, a small piece of code can be added to the end of the main method. The collected links are sorted and only the first 50 are displayed, followed by the first 50 email addresses. For the sorting we import the Arrays class.

//import java.util.Arrays;

// sort the collected links so duplicates (if any) are easy to spot; print the first 50
Object[] checkList = listOfURL.toArray();
Arrays.sort(checkList);
for(int i = 0; i < checkList.length && i < 50; i++)
	System.out.println(checkList[i]);
			
// print the first 50 collected email addresses
for(int i = 0; i < listOfEmail.size() && i < 50; i++)
	System.out.println(listOfEmail.get(i));

Now you can use the code, and we welcome your feedback. You may fork the project on GitHub.
