
Testing the Filter by TheWebMiner for advanced web content filtering

Recently I came across an interesting new tool from TheWebMiner called Filter. The Filter is an attempt by TheWebMiner to sort (categorize) indexed websites and deliver them to users as a content filtering service.
In a nutshell, the Filter project is something between search engine results and business directories. TheWebMiner does its own web crawling (similar to search engines), yet it has crawled only a little over 17M web pages (as of Nov. 2019). Compared to regular business directories, the Filter categorizes crawled websites by their tech, web and social parameters (besides business parameters):

  • LOCALISATION (Top level domain and language)
  • TECH – technologies used to build the site, e.g. jQuery, Google Analytics, etc.
  • CALL TO ACTION (email presence on a web page – very useful for web marketing)
  • SOCIAL PRESENCE (FB or Twitter page presence)
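
To make the four parameter groups above more concrete, here is a minimal sketch of how such filtering could work over a set of crawled site records. Everything in it is an assumption made for illustration: the `SiteRecord` fields, the `matches` function and the sample data are hypothetical and are not TheWebMiner's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class SiteRecord:
    """Hypothetical record for one crawled site (not TheWebMiner's schema)."""
    url: str
    tld: str                      # LOCALISATION: top level domain
    language: str                 # LOCALISATION: page language
    technologies: set = field(default_factory=set)  # TECH
    has_email: bool = False       # CALL TO ACTION: email found on a page
    social: set = field(default_factory=set)        # SOCIAL PRESENCE


def matches(site, tld=None, language=None, tech=None, need_email=False, social=None):
    """Return True if the site satisfies every criterion that was given."""
    if tld is not None and site.tld != tld:
        return False
    if language is not None and site.language != language:
        return False
    if tech and not set(tech) <= site.technologies:
        return False
    if need_email and not site.has_email:
        return False
    if social and not set(social) <= site.social:
        return False
    return True


# Made-up sample data, for illustration only.
sites = [
    SiteRecord("https://example.com", "com", "en", {"jquery", "wordpress"}, True, {"twitter"}),
    SiteRecord("https://example.de", "de", "de", {"jquery"}, False, {"facebook"}),
]

selected = [s.url for s in sites if matches(s, tld="com", tech=["jquery"], need_email=True)]
print(selected)
```

Criteria that are left unset are simply ignored, which mirrors the way a search form lets you fill in only the fields you care about.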


I’ve tested this tool on web scraping companies (since I actively interact with most of them), so I set up a simple search (see the settings in the figure).

The results for the keyword “web scraping” were interesting.

Results analysis

My conclusions from this particular filtered search are:

  1. These are all companies, which is better than Google’s search results, which deliver everything in one miscellaneous basket.
  2. I couldn’t filter out the real software or SaaS web scraping companies, which was annoying because more than 50% of the results were just custom data delivery companies.
  3. The results are few in number (only 24), probably because of the nature of this service (only 17M sites indexed, as opposed to the billions indexed by large search engines).
  4. The Filter service is very good at identifying site technologies. Even though I did not specify a technology in my search settings, the Filter still listed them. For example, for one of the results the Filter recognised the following technologies: Google Font API, jQuery, PHP, Twitter Bootstrap, W3 Total Cache, WordPress, Font Awesome, WooCommerce.
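
I don’t know how the Filter detects technologies internally. A common general approach is to look for known fingerprints in a page’s HTML, and the sketch below illustrates that idea only. The `SIGNATURES` table, the `detect_technologies` function and the sample HTML are all hypothetical, and the patterns are deliberately simplistic.

```python
import re

# Hypothetical signatures: technology name -> regex to look for in page HTML.
# Real detectors use far richer rules (headers, cookies, script contents).
SIGNATURES = {
    "jQuery": re.compile(r"jquery[.\-\w]*\.js", re.I),
    "WordPress": re.compile(r"/wp-content/|/wp-includes/", re.I),
    "Google Font API": re.compile(r"fonts\.googleapis\.com", re.I),
    "Font Awesome": re.compile(r"font-?awesome", re.I),
    "WooCommerce": re.compile(r"woocommerce", re.I),
}


def detect_technologies(html):
    """Return the sorted names of all signatures found in the HTML."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern.search(html))


# Made-up page fragment, for illustration only.
sample_html = """
<link href="https://fonts.googleapis.com/css?family=Roboto" rel="stylesheet">
<script src="/wp-includes/js/jquery/jquery.min.js"></script>
<link rel="stylesheet" href="/wp-content/plugins/woocommerce/assets/css/woocommerce.css">
"""

print(detect_technologies(sample_html))
```

Server-side technologies such as PHP or a caching plugin usually cannot be seen in markup alone, so a real service would presumably also inspect HTTP response headers and other signals.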


I like this simple tool for finding things on the web by particular tech or business parameters. The big drawback so far is the limited amount of indexed data, yet they keep accumulating it. Let’s see what it’ll be like in a couple of years. Good luck, TheWebMiner.


