Mozenda web scraping and publishing of data to cloud storage

Mozenda is a cloud web scraping service (SaaS) that we’ve already reviewed. Since our last review, Mozenda has added more useful features for data extraction. Besides multi-threaded extraction and smart data aggregation, Mozenda now lets users publish extracted data to cloud storage such as Dropbox, Amazon S3, and Microsoft Azure. In this post we explain the new Mozenda extraction and integration capabilities.

Workflow

The general testing workflow is as follows:

  1. Scrape data from two sites having different structures (aliexpress.com & boohoo.com)
  2. Combine data sets
  3. Upload results to two cloud storage services: AWS and Dropbox

Cloud Web Console vs Desktop Agent Builder

You have to understand the nature of the Mozenda scraping service (SaaS). The Mozenda service consists of 2 parts, its “two wings”.

  • The first part is the cloud part, where the actual data extraction (based on the agents you create) takes place, extracted data are stored (as collections) and all the other chemistry happens. It is presented to you as a browser-based Web Console, which you access after a successful Mozenda login.
  • The second part is the Agent Builder, which must be installed on your PC (Windows only). Its name speaks for itself: it is where you build agents.

You may read more on those “wings” on the vendor site: Web Console, Agent Builder.

Get Mozenda account

To create a Mozenda account, fill in the “new customer” form and wait for Mozenda customer support to reply. Once your account is created, the credentials are sent to your mailbox.

Web Console

Now you have access to the Mozenda Web Console with your profile available. Choose to install the Agent Builder on your PC.

Creating Mozenda scraping agent

After you’ve installed the Agent Builder, right-click the Mozenda icon in your PC tray and select Start Mozenda Agent Builder.

Inside the Agent Builder, navigate to the target web page and choose Start a new agent from this page.

The new agent now opens. The left-side pane is the agent workflow designer.

Start to build an agent through the point-&-click Mozenda interface. It is easy to add or modify actions for the scraping agent.

For each click on the loaded web content, you get an interactive option menu where you choose what to do with the item or a group of items. The popup window titled Create Action (in the screenshot below) lets you choose the action to associate with the content you want to scrape.

Mozenda Agent Builder offers a rich set of actions that an agent can perform. Two of them are commonly used when building agents to extract product detail information: Capture list and Click item.

Capture list is used for creating a column for a list and Click item is mainly used for navigating to a product detail page. In this way, you will be able to extract price, description, images and so on.

You can also use the Add Action dropdown list to add actions (see the image at right).

Note: a trial account lets you create up to 5 agents. If you’ve reached the limit and want to create a new agent, either delete a previous agent or upgrade to a paid plan.

Running agent

After you’ve built and tested an agent in the Agent Builder, it is time to run it. Switch to the Mozenda Web Console, which lists all your agents and scraped results (if present). Tick the checkbox next to an agent and press the run button.

If your agent ran successfully, you will see a green sign in the top panel, and the “time of last run” of the agent appears at the top right corner of the page. After a successful run, all scraped data are stored on Mozenda’s own servers and available to you; the amount of data is shown at the bottom of the page.

Refining Data Sets

To refine scraped data in the Web Console, you can either remove all blank fields or create a new view (described later in this post) that shows only the rows with no empty cells.

Refine in Agent Builder

You can also keep only non-empty items at the agent building stage. In the workflow designer (left pane), select Item List and choose Refine List -> Only capture items that are not empty.
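If you post-process an exported collection outside Mozenda, the same “no empty cells” rule is easy to reproduce. The sketch below is only an illustration in Python, not Mozenda’s own logic: the function name keep_non_empty, the column names and the sample rows are all made up for this example.

```python
def keep_non_empty(rows, required_fields):
    """Return only the rows whose required fields all contain non-blank text."""
    def is_filled(value):
        # Treat None, empty strings and whitespace-only strings as empty cells.
        return value is not None and str(value).strip() != ""

    return [
        row for row in rows
        if all(is_filled(row.get(field)) for field in required_fields)
    ]


# Hypothetical scraped rows (not real Mozenda output).
rows = [
    {"Title": "Red dress", "Price": "19.99"},
    {"Title": "", "Price": "5.00"},
    {"Title": "Blue jeans", "Price": None},
]

clean = keep_non_empty(rows, ["Title", "Price"])
print(clean)
```

Only the first row survives, because the other two have an empty Title or a missing Price.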

Combining Data Sets

After scraping with the aliexpress and boohoo agents, we have 2 data sets, one per agent. The agents might have different column headers, but we still want to combine the two data sets into one. Mozenda provides the means to do it.

If the column titles of the agents are similar, you can combine the data sets seamlessly and keep all the data in one place. To do this, follow the steps illustrated below:

  • On your Mozenda account page, open your existing agents. Each agent has its own data set, scraped during the agent’s last run. Tick the boxes in the left column for the agents whose data sets you want to combine.
  • In the top right corner, click the “Combine data” button.
  • A pop-up window appears where you name the collection and add its description.
  • Now define the fields of the new combined collection by adding them in the field editor.

We assumed both agents have similar column names. Therefore, after combining, the corresponding titles, prices, images and descriptions form one ordered collection. This way we created a new combined data set from several agents’ collections. A service field such as ItemId is automatically added to each data set.
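To make the idea of mapping differently named columns onto one set of fields more concrete, here is a minimal Python sketch. It does not show how Mozenda implements “Combine data”; the function combine_collections, the field_map layout, the Source column, the ItemId numbering and the sample rows are all assumptions made for illustration.

```python
def combine_collections(collections, field_map):
    """Merge rows from several collections into one list with unified fields.

    collections: dict of source name -> list of row dicts
    field_map:   dict of combined field -> {source name: source column}
    """
    combined = []
    for source, rows in collections.items():
        for row in rows:
            # ItemId here is a simple running number (an assumption).
            merged = {"ItemId": len(combined) + 1, "Source": source}
            for target, per_source in field_map.items():
                # Fall back to the combined field name if no mapping is given.
                column = per_source.get(source, target)
                merged[target] = row.get(column, "")
            combined.append(merged)
    return combined


# Hypothetical data sets with different column headers.
collections = {
    "aliexpress": [
        {"Title": "Red dress", "Price": "19.99"},
        {"Title": "Blue jeans", "Price": "25.00"},
    ],
    "boohoo": [
        {"Name": "Floral top", "Cost": "12.00"},
    ],
}

field_map = {
    "Title": {"aliexpress": "Title", "boohoo": "Name"},
    "Price": {"aliexpress": "Price", "boohoo": "Cost"},
}

combined = combine_collections(collections, field_map)
for row in combined:
    print(row)
```

Each combined row carries the unified Title and Price fields regardless of what the column was called in its source data set.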

I named my collection AliExpress-BooHoo.

Rebuilding collection

If you are going to scrape more data with your agents, rebuilding the collection is what you need. Open the collection of interest, click the Tools icon next to it, then choose Rebuild from the Tools menu dropdown.

View from Collection

For any data set you can create a “view”, a visual representation of a given collection. To do this:

  • Checkmark a data collection from which you want to make a “view”
  • Choose the gear button in the upper-right corner of the Mozenda Web Console
  • Create a new view and set it up.
  • Criteria settings can help you to refine results or filter according to a certain condition

See below how simple it is:

A “view” can instantly be turned into an RSS feed hosted at Mozenda. This way you get a Mozenda API endpoint that anyone can access. See an example feed.
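A consumer of such a feed could read it with a few lines of Python. The sketch below assumes a plain RSS 2.0 layout with title and description elements in each item; the actual structure of a Mozenda-hosted feed is not described in this post, so SAMPLE_FEED and the function read_feed_items are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real view feed may use different fields.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>AliExpress-BooHoo view</title>
    <item><title>Red dress</title><description>Price: 19.99</description></item>
    <item><title>Blue jeans</title><description>Price: 25.00</description></item>
  </channel>
</rss>"""


def read_feed_items(feed_xml):
    """Parse RSS text and return a list of {title, description} dicts."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


items = read_feed_items(SAMPLE_FEED)
for entry in items:
    print(entry["title"], "|", entry["description"])
```

To read a live feed, you would fetch the feed URL (for example with urllib.request.urlopen) and pass the decoded response body to read_feed_items.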

Publishing to cloud

AWS

First of all, you must register with AWS and create a bucket for the data. In the main AWS web console, go to the Storage & Content Delivery section, choose the S3 option and create a new bucket.

Now back to the Mozenda Web Console. Choose the appropriate collection and open the Tools menu to unfold the actions for the collection. Select the Publish option. A self-explanatory interface appears.

Fill in all the form fields and choose Region from the drop-down. ‘Access key ID’ and ‘Secret access key’ are your AWS credentials. To get them, go to your AWS Security Credentials page and generate a new set of them.

Note: for finer control over access to your AWS resources, you might want to create an IAM user in your AWS account, but that is outside this post’s scope.

After filling in all the fields of the Amazon publishing settings, I hit ‘Save & publish now’ and soon my data was uploaded to my AWS S3 bucket. The service showed no indicator of a successful or failed transfer though :-(.

Dropbox

First of all, log in to your Dropbox account. Then open the collection’s Publish Data window from the Tools menu again. As you press Save & publish, the data set is published to your linked Dropbox account :-). From then on, the data from that Mozenda collection are automatically published to the linked Dropbox account.

Support

Mozenda support turned out to be really responsive. Whenever I sent a request to the Mozenda support team, it responded within an hour. The support staff encourage the user to ask again if the problem isn’t solved.

To Mozenda’s credit, even after my trial period was over, the account was still up. I was allowed to manage my collections in the Web Console, but not to run agents. That was enough for me to finish creating views and publish previously scraped data to cloud storage.

Conclusion

As a whole, Mozenda provides usable, easily learned functionality for managing extraction, joining data sets, publishing to the cloud and creating API data access points. This web service provides a “gentleman’s set” and even more for seamless data hunting and storage. The twofold nature of the Mozenda scraper (desktop Agent Builder and cloud scraping part) and the excellent support make Mozenda stand out from the other cloud scraping platforms.
