Web Scraper Shortcode Plugin Issue

Having reviewed the Web Scraper Shortcode, we now look at some issues with this WordPress plugin. The plugin extracts a web page, or a part of it, and inserts it into a custom WordPress-driven page.

Some users have pointed out that the plugin cannot extract certain specific elements of a web page. They wanted to get some finance info from this page http://ca.finance.yahoo.com/q?s=rab.v&ql=1, and particularly this element: <div class=yfi_rt_quote_summary>.

To scrape this element, I inserted element='div#yfi_rt_quote_summary' into the plugin shortcode, and it worked fine.

But another DOM (Document Object Model) element from the same page, <span id="yfs_j10_rab.v">62.94M</span>, was not scraped by the plugin.

The figure below shows the element in question inspected through Chrome Developer Tools:

So when I defined element='span#yfs_j10_rab.v' inside the web scraper shortcode, the plugin’s logic did not find the corresponding element.

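I suppose that’s because of the dot in yfs_j10_rab.v. In CSS selector syntax a dot introduces a class name, so the plugin likely reads span#yfs_j10_rab.v as consecutive parts, like node1.node2: a span with the id yfs_j10_rab and the class v, rather than a span with the single id yfs_j10_rab.v.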
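
To illustrate the suspected behavior, here is a minimal Python sketch of how a CSS-style selector is typically split into tag, id, and class parts. It is not the plugin's code; the function name parse_compound_selector is hypothetical and only handles simple selectors. It shows why an unescaped dot cuts the id short, and how a backslash-escaped dot (standard CSS escaping, which the plugin may or may not support) would keep the id whole.

```python
import re

def parse_compound_selector(selector):
    """Split a simple selector such as 'span#main.big' into tag, id, classes.

    Hypothetical helper for illustration only. A backslash escapes the next
    character, so 'a\\.b' is kept as the single name 'a.b'.
    """
    # Each token is an optional '#' or '.' prefix followed by a name built
    # from escaped characters or characters other than '#', '.', and '\'.
    tokens = re.findall(r"([#.]?)((?:\\.|[^#.\\])+)", selector)
    parsed = {"tag": None, "id": None, "classes": []}
    for prefix, name in tokens:
        name = re.sub(r"\\(.)", r"\1", name)  # drop the escaping backslashes
        if prefix == "#":
            parsed["id"] = name
        elif prefix == ".":
            parsed["classes"].append(name)
        else:
            parsed["tag"] = name
    return parsed

# Unescaped dot: the id stops at the dot and 'v' becomes a class.
print(parse_compound_selector("span#yfs_j10_rab.v"))
# Escaped dot: the whole string after '#' is kept as the id.
print(parse_compound_selector("span#yfs_j10_rab\\.v"))
```

With the unescaped selector the sketch yields the id yfs_j10_rab and the class v, which matches no element like the span in question.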

Because of this, the Web Scraper Shortcode is not suitable for extracting DOM elements whose class or id name is a compound one, consisting of several parts delimited by a dot (.).
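
For comparison, matching the id attribute as a plain string avoids the selector parsing problem altogether. The sketch below is an assumption-laden illustration outside the plugin, using only Python's standard html.parser; the class IdTextExtractor, the function text_by_id, and the sample markup are all hypothetical, and void tags such as <br> inside the target element are not handled.

```python
from html.parser import HTMLParser

class IdTextExtractor(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # nesting depth inside the matched element
        self.done = False   # set once the matched element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth:
            self.depth += 1  # a nested tag inside the matched element
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # exact string comparison, dots included

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def text_by_id(html, target_id):
    parser = IdTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()

# Sample markup modeled on the span quoted in the article.
sample = '<div><span id="yfs_j10_rab.v">62.94M</span><span id="other">x</span></div>'
print(text_by_id(sample, "yfs_j10_rab.v"))
```

Because the id is compared as a whole string, the dot has no special meaning here.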
