
Java, Selenium, headless Chrome, and JSoup to scrape data from the web

In this post we show how to scrape a JS-rendered website. The tools, as listed in the title, are Java with the Selenium library driving headless Chrome instances (download the driver) and JSoup as the parser that extracts data from the fetched HTML.

You can view the code on GitHub

ChromeDriver initialization

System.setProperty("webdriver.chrome.driver", "path/to/chromedriver");
ChromeOptions chromeOptions = new ChromeOptions();
chromeOptions.setBinary("path/to/chrome");
chromeOptions.addArguments("--headless");
chromeOptions.addArguments("--enable-javascript");
chromeOptions.addArguments("--lang=en");
ChromeDriver driver = new ChromeDriver(chromeOptions);
driver.manage().timeouts().implicitlyWait(20, TimeUnit.SECONDS);

I added a few more arguments to chromeOptions as well; without them the driver threw exceptions.

chromeOptions.addArguments("--disable-dev-shm-usage"); // overcome limited resource problems
chromeOptions.addArguments("start-maximized"); // open Browser in maximized mode
chromeOptions.addArguments("disable-infobars"); // disabling infobars
chromeOptions.addArguments("--disable-extensions"); // disabling extensions
chromeOptions.addArguments("--disable-gpu"); // applicable to windows os only
chromeOptions.addArguments("--no-sandbox"); // Bypass OS security model

Getting an instance of the Document class using JSoup.

Document html = Jsoup.parse(htmlString); // parse an HTML string
// or
Document html = Jsoup.parse(driver.getPageSource()); // getting HTML code from ChromeDriver

We can load a page in ChromeDriver with one of the navigate().to overloads:

driver.navigate().to(String url);
driver.navigate().to(URL url);
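
Putting these pieces together, a minimal end-to-end sketch might look like the following; the URL and the CSS selector are hypothetical and only illustrate the flow (navigate, take the rendered source, parse, select):

// Hypothetical URL and selector; flow: navigate, grab the rendered page
// source, parse it with JSoup, query it with a CSS selector.
RemoteWebDriver driver = getNewDriver(); // helper sketched above
try
{
    driver.navigate().to("https://example.com/some-js-rendered-page");
    Document html = Jsoup.parse(driver.getPageSource());
    for (Element link : html.select("a.program-link"))
        System.out.println(link.attr("abs:href") + " -> " + link.text());
}
finally
{
    driver.quit(); // always release the browser instance
}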

Main class (ScrapeData class)

The main work is done in the ScrapeData class, which implements the Runnable interface. The run method performs these basic steps:

  • visit category pages
  • get links to program pages
  • scrape data from program pages
  • save data in database

The class constructor accepts a link to a site category page and a page number to start from.

private static class ScrapeData implements Runnable
{
    private final String subCategory;
    private int page;

    public ScrapeData(String subCategory) { this(subCategory, 0); }

    public ScrapeData(String subCategory, int page) 
    {
        this.subCategory = subCategory;
        this.page = page;
    }

    @Override
    public void run() 
    {
        int fail = 0; // fail counter
        final int maxFail = 30;
        RemoteWebDriver driver = getNewDriver();
        StudyPortalsData data; // holds the scraped data for one program page
        DataBase.Status status;

        while(fail < maxFail) 
        {
            // get links to program pages
            List<String> links = null;
            try
            {
                driver.navigate().to(subCategory + "&start=" + page);
                log.info("SCRAPE LINKS on page: \n\t" + subCategory + "&start=" + page);
                links = ScrapeStudyPortals.scrapeLinksOnUniversityPage(driver);
            }
            catch (Exception e) 
            {
                log.log(Level.SEVERE, "", e);
                fail += 3;
            }

            // follow the links and get the data
            if(links != null && links.size() > 0) 
            {
                for (String l : links) 
                {
                    log.info("Scrape page: " + l);
                    try {
                        // check whether the program already exists in the database
                        if(DataBase.isExistUniversityByLink(l))
                        {
                            log.info(l + " %%% This page already exists in DB %%%");
                            continue;
                        }

                        driver.navigate().to(l);
                        data = ScrapeStudyPortals.scrapeAllDataJSoup(driver);
                        status = DataBase.insertStudyPortalsData(data); // save the data
                        if (status != DataBase.Status.SUCCESSFUL) 
                        {
                            if(status != DataBase.Status.UNIVERSITY_ALREADY_IN_BD) 
                            {
                                log.warning(subCategory + "\n\t[" + status + "]");
                                saveFaleLink(l + " [" + status + "]");
                                fail++;
                            }
                            else if(fail > 0)
                                fail--;
                        } 
                        else if (fail > 0)
                            fail--;
                    } 
                    catch (Exception e) 
                    {
                        log.log(Level.WARNING, "", e);
                        saveFaleLink(l + " Exception: " + Arrays.toString(e.getStackTrace()));
                        fail += 1;
                    }
                }
                page += 10;
            }
            else 
            {
                log.warning("<<<<<<<<<< Links don't exist >>>>>>>>>>");
                fail += 5;
            }
        }

        driver.quit();
        // record the last page reached
        saveFinalPage(subCategory, page);
    }
}
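
Since ScrapeData implements Runnable, several categories can be scraped in parallel. A minimal launching sketch, with made-up category URLs, might submit one task per category to a small thread pool:

// Hypothetical category URLs; each task drives its own headless Chrome instance.
List<String> categories = Arrays.asList(
        "https://www.example.com/search?category=computer-science",
        "https://www.example.com/search?category=economics");

ExecutorService pool = Executors.newFixedThreadPool(categories.size());
for (String category : categories)
    pool.submit(new ScrapeData(category)); // starts from page 0
pool.shutdown();
pool.awaitTermination(1, TimeUnit.DAYS); // wait for the scrapers; may throw InterruptedException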

Class StudyPortalsData

The StudyPortalsData class stores the data of a single program page.

public class StudyPortalsData 
{
    private final String link;
    private String programName;
    private List<String> disciplines;
    private List<String> attendance;
    ...

    public StudyPortalsData(String link) { this.link = link; }

    // getters and setters

    // methods toString(), hashCode(), equals()
}

Class ScrapeStudyPortals

The ScrapeStudyPortals class and its main method scrapeAllDataJSoup retrieve the data from the page currently loaded in the driver.

public class ScrapeStudyPortals 
{
    public static StudyPortalsData scrapeAllDataJSoup(RemoteWebDriver driver) 
    {
        if(driver == null)
            throw new NullPointerException("RemoteWebDriver = null");

        StudyPortalsData result = new StudyPortalsData(driver.getCurrentUrl());
        Document html = Jsoup.parse(driver.getPageSource());

        /*
        Extract the data with the select method of the Document class, which accepts CSS selectors:
            public Elements select(String cssQuery)
        The extracted values are saved into the StudyPortalsData instance.
        */

        return result;
    }
}
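
What the extraction itself looks like depends entirely on the site's markup. A hedged sketch of the method body, with invented CSS selectors and assuming the usual setters on StudyPortalsData, could be:

// Invented selectors; the real ones have to match the target site's markup.
result.setProgramName(html.select("h1.program-title").text());

List<String> disciplines = new ArrayList<>();
for (Element e : html.select("ul.disciplines li"))
    disciplines.add(e.text());
result.setDisciplines(disciplines);

List<String> attendance = new ArrayList<>();
for (Element e : html.select("div.attendance span"))
    attendance.add(e.text());
result.setAttendance(attendance);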

Class DataBase

The DataBase class saves the data from a StudyPortalsData instance to the database via the insertStudyPortalsData method. The third-party MySQL Connector/J library is used to connect to the database.

public class DataBase 
{
    private static Connection connection;
    private static Statement statement;

    static 
    {
        try 
        {
            connection = DriverManager.getConnection(url, user, password); // JDBC URL, user name and password
            statement = connection.createStatement();
        } 
        catch (SQLException e) { e.printStackTrace(); }
    }

    public static synchronized Status insertStudyPortalsData(StudyPortalsData studyPortalsData) throws SQLException
    {
        if(!checkConnection())
            return Status.ERROR_CONNECT_TO_DB;

        // check that all required fields are filled
        Status result;
        if((result = isDataFull(studyPortalsData)) != Status.SUCCESSFUL)
            return result;

        String link = studyPortalsData.getLink().trim();
        ...

        StringBuilder request = new StringBuilder("INSERT INTO data (link, program_name, attendance_id) VALUES (");
        request.append("'").append(link).append("', ");
        request.append("'").append(programName).append("', ");
        request.append(attendance).append(", ");

        ...

        request.append(")");

        statement.executeUpdate(request.toString());
        return Status.SUCCESSFUL;
    }
}
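
Because the query above is assembled by string concatenation, any quote character in the scraped text would break it. A safer variant of the same insert, sketched with a PreparedStatement (column names taken from the query above; attendance is assumed to be a numeric id):

// Same insert with a PreparedStatement: the driver handles quoting and escaping.
String sql = "INSERT INTO data (link, program_name, attendance_id) VALUES (?, ?, ?)";
try (PreparedStatement ps = connection.prepareStatement(sql))
{
    ps.setString(1, link);
    ps.setString(2, programName);
    ps.setInt(3, attendance); // assumed numeric foreign key
    ps.executeUpdate();
}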

Most methods of the DataBase class return a DataBase.Status value.

public enum Status 
{
    SUCCESSFUL,
    NOT_HAVE_PROGRAM_NAME,
    NOT_HAVE_DISCIPLINES,
    NOT_HAVE_ATTENDANCE,
    ...
    UPDATE_RETURN_NEGATIVE_NUMB,
    ERROR_CONNECT_TO_DB;
}

Many of these Status values are used in the isDataFull(StudyPortalsData studyPortalsData) method, which checks that all required fields are filled.

private static Status isDataFull(StudyPortalsData studyPortalsData) 
{
    if(studyPortalsData.getProgramName() == null) return Status.NOT_HAVE_PROGRAM_NAME;
    if(studyPortalsData.getDisciplines() == null || studyPortalsData.getDisciplines().size() == 0) return Status.NOT_HAVE_DISCIPLINES;
    ...
    return Status.SUCCESSFUL;
}

