This short post is on the WP-plugin called Web Scraper Shortcode, that enables one to retrieve a portion of a web page or a whole page and insert it directly into a post. This plugin might be used for getting fresh data or images from web pages for your WordPress driven page without even visiting it. More scraping plugins and sowtware you can find in here.
To install it in WordPress go to Plugins -> Add New.
The plugin scrapes the page content and applies parameters to this scraped page if specified. To use the plugin just insert the
shortcode into the HTML view of the WordPress page where you want to display the excerpts of a page or the whole page. The parameters are as follows:
- url (self explanatory)
- element – the dom navigation element notation, similar to XPath.
- limit – the maximum number of elements to be scraped and inserted if the element notation points to several of them (like elements of the same class).
The use of the plugin is of the dom (Data Object Model) notation, where consecutive dom nodes are stated like node1.node2; for example: element = ‘div.img’. The specific element scrape goes thru ‘#notation’. Example: if you want to scrape several ‘div’ elements of the class ‘red’ (<div class=’red’>…<div>), you need to specify the element attribute this way: element = ‘div#red’.
How to find DOM notation?
But for inexperienced users, how is it possible to find the dom notation of the desired element(s) from the web page? Web Developer Tools are a handy means for this. I would refer you to this paragraph on how to invoke Web Developer Tools in the browser (Google Chrome) and select a single page element to inspect it. As you select it with the ‘loupe’ tool, on the bottom line you’ll see the blue box with the element’s dom notation:
The plugin content
As one who works with web scraping, I was curious about the means that the plugin uses for scraping. As I looked at the plugin code, it turned out that the plugin acquires a web page through ‘simple_html_dom‘ class:
$html = file_get_html($url);
- then the code performs iterations over the designated elements with the set limit
- Be careful if you put two or more [web-scraper] shortcodes on your website, since downloading other pages will drastically slow the page load speed. Even if you want only a small element, the PHP engine first loads the whole page and then iterates over its elements.
- You need to remember that many pictures on the web are indicated by shortened URLs. So when such an image gets extracted it might be visible to you in this way: , since the URL is shortened and the plugin does not take note of its base URL.
- The error “Fatal error: Call to a member function find() on a non-object …” will occur if you put this shortcode in a text-overloaded post.
I’d recommend using this plugin for short posts to be added with other posts’ elements. The use of this plugin is limited though.