Octoparse is an easy-to-use yet powerful visual web scraper that enables anyone, even those without much programming background, to collect and extract data from the web. It is designed to help users easily deal with complex website structures, such as those rendered with JavaScript, and compares well with other web scraping tools such as Import.io and Mozenda.
While Octoparse has already been reviewed in an earlier post, here we'll look at a few of the key features that substantially ease scraping pains for non-programmers. Watch this 2-min video of how to extract data with Octoparse.
Get an account to scrape for free
Sign up for a new account on Octoparse’s website. There is no need to choose a premium plan, since the free plan is pretty generous, with unlimited pages to scrape and unlimited storage. Once you sign up, download the app from the activation page or from the website (keep an eye on the coupons sent to your account if you are interested in upgrading).
Visual workflow designer – Scrape easily with ‘click’ and ‘drag’
The visual operation pane in the app enables users to create workflows with “point” and “click”. Like many other automated scrapers, Octoparse works by simulating a real person interacting with the web pages.
Dealing with dynamic websites
For more complicated scraping, e.g. when data is loaded with JavaScript on an interactive site, Octoparse covers all of the following cases:
- Scrape behind a login
- Search-based extraction
- Scrape data loaded with Ajax
- Infinite scroll
- Pagination with no ‘Next’ button
- Nested dropdowns
- Fill out forms
- Capture data hidden in HTML
- And more…
While designed to make data crawling accessible to all users, Octoparse lets developers and non-developers alike gain full control over every single element on a web page through its built-in XPath and RegEx tools.
XPath and RegEx Tools – Taking web scraping to the next level
Regular expressions and XPath are essential techniques for handling complicated web scraping, yet they can be a little tricky for newbies to apply. The Octoparse team has been considerate enough to provide an XPath tool and a RegEx tool to help anyone easily generate the XPath expressions and regular expressions needed for accurate web scraping.
The XPath Tool
The Octoparse XPath Tool consists of four parts:
- The Browser – Enter the target URL in the built-in browser and click on the “Go” button. The content of the web page will be displayed here.
- The Source Code – View the source code of the web page.
- The XPath Settings – Check the options and fill in a few parameters, then hit the “Generate” button to generate the XPath expression.
- The XPath Result – After the XPath is generated, click on the “Match” button to see whether the current XPath matches elements on the web page.
For more information about Octoparse XPath, find out here.
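To make the idea concrete outside of Octoparse’s GUI, here is a minimal sketch of what an XPath expression matches, using only Python’s standard library (ElementTree supports a limited XPath subset); the HTML snippet and class names below are made up for illustration, not taken from Octoparse:

```python
# Illustrative only: the markup and class names are invented for this sketch.
import xml.etree.ElementTree as ET

html = """
<div>
  <ul>
    <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
    <li class="product"><span class="name">Gadget</span><span class="price">19.99</span></li>
  </ul>
</div>
"""

root = ET.fromstring(html)
# Equivalent in spirit to the XPath //li[@class='product']/span[@class='price']
prices = [el.text for el in root.findall(".//li[@class='product']/span[@class='price']")]
print(prices)  # ['9.99', '19.99']
```

The expression Octoparse generates for you plays the same role: it pins down exactly which elements on the page to capture.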
The RegEx tool
Regular expressions are patterns used to match character combinations in strings. In scraping scenarios where CSS selectors or XPath fail to work, you can quickly target the information you need with a regular expression. Similar to the XPath tool, Octoparse provides a built-in RegEx tool, so users won’t need to struggle with character or string matching: simply input a few conditions, and the regular expression is generated automatically.
For more information about Octoparse RegEx Tool, find out here.
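As a rough illustration of what such a generated pattern does, here is the same idea expressed with Python’s `re` module (the sample text and pattern are made up, not produced by Octoparse):

```python
# Illustrative only: extract email addresses from messy text with a regex.
import re

snippet = "Contact us at support@octoparse.com or sales@example.com"
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", snippet)
print(emails)  # ['support@octoparse.com', 'sales@example.com']
```

Octoparse’s RegEx tool spares you from writing patterns like this by hand; you describe the conditions and it produces the expression.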
Data Re-format Tool
Okay, now you have successfully extracted the data you wanted, but it is not in the form you’d like, e.g. an incorrect date format, unwanted spaces between words, an unwanted prefix/suffix, etc. Octoparse makes the required data transformation easy via a built-in data re-format tool. Eight different transformations are supported:
- Replace: replace strings or keywords of extracted data.
- Replace with regular expression: replace the content matched by a certain regular expression.
- Match with regular expression: select the aimed keywords among messy words.
- Trim spaces: delete the spaces before and after the extracted data.
- Add prefix: add what you need (numbers, characters, symbols, etc.) to the beginning of the extracted data.
- Add suffix: add something to the end of the data, which is just the opposite of “Add prefix”.
- Re-format extracted date/time: get date/time in the form you want.
- HTML transcoding: decode HTML-encoded characters into plain text when extracting the HTML source.
For more information about re-formatting captured data in Octoparse, find out here.
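To see what these transformations actually do to a captured value, here is a rough sketch mapping them onto plain Python operations (the sample string is made up; Octoparse performs the equivalent steps through its GUI):

```python
# Illustrative only: each line mirrors one of the re-format transformations.
import html
import re
from datetime import datetime

raw = "  Price:&nbsp;$1,299 &amp; up -- posted 03/15/2021  "

step = html.unescape(raw)                  # HTML transcoding
step = step.strip()                        # Trim spaces
step = step.replace(",", "")               # Replace
step = re.sub(r"\s*--.*$", "", step)       # Replace with regular expression
price = re.search(r"\$\d+", step).group()  # Match with regular expression
date = datetime.strptime("03/15/2021", "%m/%d/%Y").strftime("%Y-%m-%d")  # Re-format date
labeled = "USD " + price.lstrip("$")       # Add prefix
labeled = labeled + " (list)"              # Add suffix
print(price, date, labeled)  # $1299 2021-03-15 USD 1299 (list)
```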
The Octoparse Cloud Service
So, after the crawler has been properly set up, it’s time to run it.
Octoparse provides a cloud service (for paid users) to further enhance the scraping experience. The cloud service offers the following four options.
- Automatic data scraping on schedule
Users can schedule the crawlers to run at any time, or even in real time.
- Connect via API for real time extraction
Connect to the RESTful API to retrieve extracted data at any desired frequency, including in real time.
- IP Rotation prevents IP Blocking
Has it ever driven you crazy that your IP address was banned and you could no longer access a website because you scraped it too frequently? This happens especially with high-profile websites such as social platforms or business directories. Octoparse lets you scrape these websites by rotating anonymous HTTP proxy servers, minimizing the chances of being blocked.
- Automatically export data into database
Octoparse’s cloud service also supports automatic export to databases, including SQL Server, MySQL, and Oracle. Read the instructions here and follow the steps to connect your database to Octoparse.
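The IP-rotation option above is handled for you inside Octoparse’s cloud, but the underlying technique is simple round-robin proxy selection. Here is a conceptual sketch using only Python’s standard library; the proxy addresses are placeholders, not real servers, and no request is actually sent:

```python
# Conceptual sketch of round-robin proxy rotation (the general technique;
# not Octoparse's internal implementation). Proxy addresses are placeholders.
import itertools
import urllib.request

PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]
_rotation = itertools.cycle(PROXIES)

def opener_with_next_proxy():
    """Return (proxy, opener) routing the next request through the next proxy."""
    proxy = next(_rotation)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)

# Each call picks the next proxy in round-robin order; nothing is fetched here.
used = [opener_with_next_proxy()[0] for _ in range(4)]
print(used)
```

Because each request leaves through a different proxy, no single IP address hits the target site often enough to trigger a ban.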
Octoparse support
The Octoparse team is a friendly bunch and open to any questions, including task-configuration problems. Users can reach the support team at support@octoparse.com.
Conclusion
Octoparse is a feature-rich visual web scraping tool that definitely deserves credit for making web scraping easy for non-technical users. The software itself is powerful and versatile enough to handle the scraping of most dynamic sites in rather straightforward ways. And its pricing is notably “friendly” too, with the free plan supporting unlimited web page scraping – definitely worth trying out.