In this post we show how to scrape a JavaScript-rendered website. The tools, as seen in the header, are Java with the Selenium library driving headless Chrome instances (download driver) and JSoup as the parser that extracts data from the acquired HTML.
You can view the code on GitHub.
ChromeDriver initialization
System.setProperty("webdriver.chrome.driver", pathToDriver); // pathToDriver: path to the chromedriver executable
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setBinary(pathToBrowser); // pathToBrowser: path to the Chrome binary
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--enable-javascript");
chromeOptions.addArguments("lang=en");
ChromeDriver driver = new ChromeDriver(chromeOptions);
driver.manage().timeouts().implicitlyWait(20, TimeUnit.SECONDS);
I also added the following arguments to chromeOptions; without them the driver threw exceptions.
chromeOptions.addArguments("--disable-dev-shm-usage"); // overcome limited resource problems
chromeOptions.addArguments("start-maximized"); // open Browser in maximized mode
chromeOptions.addArguments("disable-infobars"); // disabling infobars
chromeOptions.addArguments("--disable-extensions"); // disabling extensions
chromeOptions.addArguments("--disable-gpu"); // applicable to Windows only
chromeOptions.addArguments("--no-sandbox"); // Bypass OS security model
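The flags above can be kept in one place so that every driver instance starts with the same configuration. A minimal sketch, assuming a hypothetical helper class ChromeArguments that is not part of the original project; the flag strings are copied verbatim from the listings above:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not in the original project): keeps the Chrome flags in one place.
final class ChromeArguments {

    private ChromeArguments() { }

    // Returns the flags used in this post as a read-only list.
    static List<String> headlessArguments() {
        List<String> args = new ArrayList<>();
        args.add("--headless");
        args.add("--enable-javascript");
        args.add("lang=en");
        args.add("--disable-dev-shm-usage");
        args.add("start-maximized");
        args.add("disable-infobars");
        args.add("--disable-extensions");
        args.add("--disable-gpu");
        args.add("--no-sandbox");
        return Collections.unmodifiableList(args);
    }

    public static void main(String[] arguments) {
        // Print the flags, one per line, to check the configuration.
        headlessArguments().forEach(System.out::println);
    }
}
```

The list can then be handed to the options object in a single call: chromeOptions.addArguments(ChromeArguments.headlessArguments());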
Next, we get an instance of the Document class using JSoup:
Document html = Jsoup.parse(HTML);
// or
Document html = Jsoup.parse(driver.getPageSource()); // getting HTML code from ChromeDriver
We can load a page in ChromeDriver with either of the following methods:
driver.navigate().to(String url);
driver.navigate().to(URL url);
Main class (ScrapeData class)
The main work is done in the ScrapeData class, which implements the Runnable interface. The run method performs these steps:
- visit category pages
- get links to program pages
- scrape data from program pages
- save data in database
The class constructor accepts a link to a site category page and a page number to start from.
private static class ScrapeData implements Runnable
{
    private final String subCategory;
    private int page;

    public ScrapeData(String subCategory) { this(subCategory, 0); }

    public ScrapeData(String subCategory, int page)
    {
        this.subCategory = subCategory;
        this.page = page;
    }

    @Override
    public void run()
    {
        int fail = 0; // fail counter
        final int maxFail = 30;
        RemoteWebDriver driver = getNewDriver();
        StudyPortalsData data; // the class for storing scraped data
        DataBase.Status status;
        while (fail < maxFail)
        {
            // get links to program pages
            List<String> links = null;
            try
            {
                driver.navigate().to(subCategory + "&start=" + page);
                log.info("SCRAPE LINKS on page: \n\t" + subCategory + "&start=" + page);
                links = ScrapeStudyPortals.scrapeLinksOnUniversityPage(driver);
            }
            catch (Exception e)
            {
                log.log(Level.SEVERE, "", e);
                fail += 3;
            }
            // follow the links and get the data
            if (links != null && links.size() > 0)
            {
                for (String l : links)
                {
                    log.info("Scrape page: " + l);
                    try
                    {
                        // skip the program if it already exists in the database
                        if (DataBase.isExistUniversityByLink(l))
                        {
                            log.info(l + " %%% This page is exist in DB %%%");
                            continue;
                        }
                        driver.navigate().to(l);
                        data = ScrapeStudyPortals.scrapeAllDataJSoup(driver);
                        // save the data
                        status = DataBase.insertStudyPortalsData(data);
                        if (status != DataBase.Status.SUCCESSFUL)
                        {
                            if (status != DataBase.Status.UNIVERSITY_ALREADY_IN_BD)
                            {
                                log.warning(subCategory + "\n\t[" + status + "]");
                                saveFaleLink(l + " [" + status + "]");
                                fail++;
                            }
                            else if (fail > 0)
                                fail--;
                        }
                        else if (fail > 0)
                            fail--;
                    }
                    catch (Exception e)
                    {
                        log.log(Level.WARNING, "", e);
                        saveFaleLink(l + " Exception: " + Arrays.toString(e.getStackTrace()));
                        fail += 1;
                    }
                }
                page += 10;
            }
            else
            {
                log.warning("<<<<<<<<<< Links don't exist >>>>>>>>>>");
                fail += 5;
            }
        }
        driver.quit();
        // write down the last page
        saveFinalPage(subCategory, page);
    }
}
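The fail counter in run works like a budget: each failure adds a penalty, each successfully processed page pays one unit back, and the loop stops once the counter reaches maxFail. A minimal sketch of that idea, isolated from Selenium; the class name FailureBudget and its methods are hypothetical and do not appear in the original project:

```java
// Hypothetical helper (not in the original project) that mirrors the fail counter used in run().
final class FailureBudget {

    private final int maxFail;
    private int fail;

    FailureBudget(int maxFail) {
        if (maxFail <= 0) {
            throw new IllegalArgumentException("maxFail must be positive");
        }
        this.maxFail = maxFail;
    }

    // Records a failure with the given weight (the listing above uses 1, 3 and 5).
    void penalize(int weight) {
        fail += weight;
    }

    // Records a success: the counter goes down by one but never below zero.
    void reward() {
        if (fail > 0) {
            fail--;
        }
    }

    // True once the counter has reached the limit, i.e. when the while loop would stop.
    boolean exhausted() {
        return fail >= maxFail;
    }

    int current() {
        return fail;
    }

    public static void main(String[] args) {
        FailureBudget budget = new FailureBudget(30);
        budget.penalize(3); // the category page failed to load
        budget.reward();    // a program page was saved successfully
        System.out.println(budget.current()); // prints 2
    }
}
```

Weighting the penalties this way lets a run survive occasional bad pages while still stopping quickly when a whole category stops returning links.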
Class StudyPortalsData
The StudyPortalsData class stores the data scraped from a single page.
public class StudyPortalsData
{
    private final String link;
    private String programName;
    private List<String> disciplines;
    private List<String> attendance;
    ...

    public StudyPortalsData(String link) { this.link = link; }

    // getters and setters
    // methods toString(), hashCode(), equals()
}
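The listing mentions toString(), hashCode() and equals() only in a comment. One possible way to write them is sketched below on a reduced class. It assumes that the link uniquely identifies a program page, which is my assumption because the original implementation is not shown; the class name ProgramRecord is hypothetical as well:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical, reduced stand-in for StudyPortalsData (not the author's class).
final class ProgramRecord {

    private final String link;
    private String programName;
    private List<String> disciplines = new ArrayList<>();

    ProgramRecord(String link) {
        this.link = Objects.requireNonNull(link, "link");
    }

    String getLink() { return link; }

    String getProgramName() { return programName; }

    void setProgramName(String programName) { this.programName = programName; }

    List<String> getDisciplines() { return disciplines; }

    // Stores a copy so later changes to the caller's list do not leak in.
    void setDisciplines(List<String> disciplines) { this.disciplines = new ArrayList<>(disciplines); }

    // Assumption: two records describe the same program when their links match.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ProgramRecord)) return false;
        return link.equals(((ProgramRecord) o).link);
    }

    @Override
    public int hashCode() {
        return link.hashCode();
    }

    @Override
    public String toString() {
        return "ProgramRecord{link='" + link + "', programName='" + programName
                + "', disciplines=" + disciplines + "}";
    }
}
```

Basing equality on the link alone matches the way the scraper checks for duplicates by link before visiting a page.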
Class ScrapeStudyPortals
The ScrapeStudyPortals class and its main method, scrapeAllDataJSoup, retrieve data from the current page.
public class ScrapeStudyPortals
{
    public static StudyPortalsData scrapeAllDataJSoup(RemoteWebDriver driver)
    {
        if (driver == null)
            throw new NullPointerException("RemoteWebDriver = null");
        StudyPortalsData result = new StudyPortalsData(driver.getCurrentUrl());
        Document html = Jsoup.parse(driver.getPageSource());
        /*
            Get data using the select method of the Document class, which accepts CSS selectors:
                public Elements select(String cssQuery)
            Save the received data into the StudyPortalsData instance.
        */
        return result;
    }
}
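To illustrate the select step described in the comment, here is a minimal sketch that parses a hard-coded HTML string instead of driver.getPageSource(). The markup, the CSS selectors (h1.program-title, ul.disciplines li) and the class name SelectSketch are invented for the example and do not reflect the real page structure:

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical example class (not in the original project).
final class SelectSketch {

    // Made-up markup standing in for driver.getPageSource().
    static final String SAMPLE_HTML =
            "<html><body>"
            + "<h1 class=\"program-title\"> Applied Physics </h1>"
            + "<ul class=\"disciplines\"><li>Physics</li><li>Mathematics</li></ul>"
            + "</body></html>";

    // Returns the text of the first matching element, or null when nothing matches.
    static String programName(Document html) {
        Element title = html.selectFirst("h1.program-title");
        return title == null ? null : title.text();
    }

    // Returns the text of every matching list item, in document order.
    static List<String> disciplines(Document html) {
        return html.select("ul.disciplines li").eachText();
    }

    public static void main(String[] args) {
        Document html = Jsoup.parse(SAMPLE_HTML);
        System.out.println(programName(html));
        System.out.println(disciplines(html));
    }
}
```

Returning null for a missing element keeps the later required-field check simple, because an absent value can be detected with a plain null test.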
Class DataBase
The DataBase class saves data from an instance of StudyPortalsData to the database using the insertStudyPortalsData method. The third-party MySQL Connector/J library is used to connect to the database.
public class DataBase
{
    private static Connection connection;
    private static Statement statement;

    static
    {
        try
        {
            // url, user and password are the connection settings of your database
            connection = DriverManager.getConnection(url, user, password);
            statement = connection.createStatement();
        }
        catch (SQLException e) { e.printStackTrace(); }
    }

    public static synchronized Status insertStudyPortalsData(StudyPortalsData studyPortalsData) throws SQLException
    {
        if (!checkConnection())
            return Status.ERROR_CONNECT_TO_DB;
        // check that all required fields are filled
        Status result;
        if ((result = isDataFull(studyPortalsData)) != Status.SUCCESSFUL)
            return result;
        String link = studyPortalsData.getLink().trim();
        ...
        StringBuilder request = new StringBuilder("INSERT INTO data (link, program_name, attendance_id) VALUES (");
        request.append("'").append(link).append("', ");
        request.append("'").append(programName).append("', ");
        request.append(attendance).append(", ");
        ...
        request.append(")");
        statement.executeUpdate(request.toString());
    }
}
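Concatenating values into the INSERT string breaks as soon as a program name contains a single quote, and it leaves the query open to SQL injection. A hedged alternative is a PreparedStatement with placeholders. The sketch below uses only the three columns visible in the listing; the class InsertSketch, its method, and the assumption that attendance_id is an integer are mine, not the author's:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Hypothetical example class (not in the original project).
final class InsertSketch {

    // One placeholder per column; the driver handles quoting and escaping.
    static final String INSERT_SQL =
            "INSERT INTO data (link, program_name, attendance_id) VALUES (?, ?, ?)";

    private InsertSketch() { }

    // Binds the values and runs the INSERT; returns the number of affected rows.
    static int insert(Connection connection, String link, String programName, int attendanceId)
            throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(INSERT_SQL)) {
            ps.setString(1, link.trim());
            ps.setString(2, programName);
            ps.setInt(3, attendanceId); // assumption: attendance_id is a numeric key
            return ps.executeUpdate();
        }
    }
}
```

The try-with-resources block closes the statement after each call, so no shared Statement field is needed for this query.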
Most methods of the DataBase class return a DataBase.Status value as their result.
public enum Status
{
    SUCCESSFUL,
    NOT_HAVE_PROGRAM_NAME,
    NOT_HAVE_DISCIPLINES,
    NOT_HAVE_ATTENDANCE,
    ...
    UPDATE_RETURN_NEGATIVE_NUMB,
    ERROR_CONNECT_TO_DB;
}
Many of the Status values are used in the isDataFull(StudyPortalsData studyPortalsData) method, which checks that all required fields are filled.
private static Status isDataFull(StudyPortalsData studyPortalsData)
{
    if (studyPortalsData.getProgramName() == null)
        return Status.NOT_HAVE_PROGRAM_NAME;
    if (studyPortalsData.getDisciplines() == null || studyPortalsData.getDisciplines().size() == 0)
        return Status.NOT_HAVE_DISCIPLINES;
    ...
    return Status.SUCCESSFUL;
}