Are you thinking of protecting your website content from theft and illegal scraping? Do you suspect that some ‘innocent bots’ are continually visiting your web pages to retrieve data? That brings us to anti-scraping bot software and services. In this post we briefly review a new anti-scraping service called Distil.
Distil started out helping digital publishers protect their content from scraping, and it now claims to have broadened its scope into a global service in this niche. Illegal scraping does create problems of data leakage and content farming. The latter is closely tied to web publishing, but other industries might also need protection from data theft: e-commerce, airlines, housing, classifieds and others.
Distil is devoted to real-time protection against scraping bots, guarding multiple users with multiple domains. Once you plug in your domains, the service starts monitoring HTTP traffic to tell legitimate visitors from malicious ones. When a malicious bot is detected, Distil may block it, serve it a CAPTCHA or drop its request. The service lets you manage from as few as 5 domains up to an unlimited number with the Enterprise plan.
In the next post we review the results of our live testing of Distil.
Set up
After registering, you’ll get an email with your username/email and password. You should also read the instructions at the bottom of the email for placing your domain under protection from scrapers… but initially I could not find the side panel item it mentions. So, once you log in, go to Configuration -> Edit Subdomains and Server IP in the left side panel and follow the instructions. (Make sure the IP address in the “Origin Server IP” field is the correct IP address for your server. After that, just change the A and CNAME records of your domain with your DNS provider as stated in the email.) In this configuration option you can also add subdomains for bot protection.
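After changing the DNS records, it can be handy to confirm that the domain really resolves to the address given in the email. Below is a minimal sketch of such a check; the function name points_at and the example addresses are my own placeholders, not anything Distil provides.

```python
import socket


def points_at(hostname, expected_ips, resolver=socket.gethostbyname_ex):
    """Return True if hostname resolves to at least one of expected_ips.

    expected_ips is a set of IPv4 strings, e.g. the addresses listed in the
    setup email (the values used below are placeholders, not real ones).
    The resolver argument exists only so the check can be tried offline.
    """
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:
        # Covers socket.gaierror and socket.herror: the name did not resolve.
        return False
    return any(ip in expected_ips for ip in addresses)


if __name__ == "__main__":
    # Placeholder domain and address; replace with your own values.
    print(points_at("www.example.com", {"203.0.113.10"}))
```

DNS changes can take a while to propagate, so a False result right after the change does not necessarily mean the records are wrong.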
Protection
Distil takes a very professional approach to scrape bot protection. It profiles the guarded site by date and time, bandwidth load and geographic access. This makes it possible to build a custom profile for the business site under guard and thus filter out suspected malicious access.
If your classifieds portal gets loaded (not crushed :-)) from Friday night all the way until Monday morning with millions of queries (even from various IP addresses), why not suspect a scraper assault? Distil’s strategy adds a layer of data mining and machine learning to protect websites effectively.
The service tracks each user’s pages-per-minute rate, session length and pages per session for bot analysis. Detected scrapers are stored in a database, and Distil bans them from all of its clients’ sites.
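To make these three metrics concrete, here is a rough sketch of how they could be computed from the request timestamps of a single session and compared against limits. This is only an illustration: the names and the threshold values are my assumptions, and Distil's actual analysis (which reportedly involves machine learning) is not public.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SessionStats:
    pages: int               # pages per session
    length_seconds: float    # session length
    pages_per_minute: float  # browsing speed


def session_stats(timestamps):
    """Summarize one visitor session from its page-request timestamps."""
    if not timestamps:
        raise ValueError("session has no requests")
    ordered = sorted(timestamps)
    length = (ordered[-1] - ordered[0]).total_seconds()
    # Count very short sessions as one second long to avoid dividing by zero.
    minutes = max(length, 1.0) / 60.0
    return SessionStats(len(ordered), length, len(ordered) / minutes)


def looks_like_bot(stats, max_pages_per_minute=60.0, max_pages=500,
                   max_length_seconds=4 * 3600):
    """Flag a session that exceeds any limit (limits are made-up examples)."""
    return (stats.pages_per_minute > max_pages_per_minute
            or stats.pages > max_pages
            or stats.length_seconds > max_length_seconds)


if __name__ == "__main__":
    start = datetime(2013, 1, 1, 12, 0, 0)
    # 120 page requests, one every half second.
    fast = [start + timedelta(seconds=0.5 * i) for i in range(120)]
    print(looks_like_bot(session_stats(fast)))
```

Fixed thresholds like these are easy for a patient scraper to stay under, which is presumably why the service also builds per-site profiles instead of relying on static limits alone.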
Some Bot Protection features applied
- Uses statistics for bot behavior detection
- Good Bot, Bad Bot and browser identity checks (it tries to run a JavaScript test to check whether an actual browser is hitting your site). As we know, though, scrapers can use Selenium to automate otherwise innocent browsers for web scraping.
- Sets time thresholds for pages and maximum session length
- Tracks usage patterns by individual users to better highlight bots
- Blocks access to the protected site from selected regions of the world
- SSL Encryption
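The checks in this list can be pictured as an ordered chain of rules that ends in one of the actions mentioned earlier (allow, CAPTCHA or block). The sketch below is purely hypothetical: the Visitor fields, the rule order and the sample values are my assumptions, not Distil's implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"
    CAPTCHA = "captcha"
    BLOCK = "block"


@dataclass(frozen=True)
class Visitor:
    ip: str
    country: str           # two-letter country code from a geo lookup
    user_agent: str
    passed_js_check: bool  # did the client run the JavaScript test?


def decide(visitor, known_bad_ips, blocked_countries, good_bot_agents):
    """Apply the rules in order; the first matching rule wins."""
    if visitor.ip in known_bad_ips:
        return Action.BLOCK    # already in the shared scraper database
    if visitor.country in blocked_countries:
        return Action.BLOCK    # geographic blocking
    if any(token in visitor.user_agent for token in good_bot_agents):
        # Simplification: a user agent can be faked, so a real service
        # would need a stronger way to verify good bots.
        return Action.ALLOW
    if not visitor.passed_js_check:
        return Action.CAPTCHA  # probably not a real browser
    return Action.ALLOW


if __name__ == "__main__":
    visitor = Visitor("198.51.100.7", "US", "python-requests/2.0", False)
    print(decide(visitor, set(), {"ZZ"}, ["Googlebot"]))
```

As noted in the list, a scraper that drives a real browser through Selenium would pass the JavaScript rule, so a chain like this only works together with the behavioral statistics.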
Performance change while being protected
Distil joins your domains to its cloud-based Content Delivery Network (CDN); thus they claim you’ll maintain or even improve site performance while in scrape protection mode. With this system your site’s content or data gets spread (cached) geographically over multiple data centers for users’ convenience. They even claim to provide backup data center support (about 15 worldwide) when your primary data center goes offline (it looks like a global backup server).
The service offers dynamic content caching, storing frequently accessed content for increased performance. Automatic content compression for faster delivery is another performance feature. Such options are presumably possible because the startup is partnered with one of the worldwide internet infrastructure leaders (Dyn).
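For readers unfamiliar with automatic compression, here is a small sketch of the general idea using gzip from the Python standard library. It shows the technique in general, not how Distil or its CDN does it; the function name and the 256-byte cutoff are arbitrary choices of mine.

```python
import gzip


def compress_body(body: bytes, min_size: int = 256):
    """Return (payload, encoding) for a response body.

    Small bodies are sent as-is, because the gzip header overhead can make
    them larger. The 256-byte cutoff is an arbitrary example value.
    """
    if len(body) < min_size:
        return body, "identity"
    compressed = gzip.compress(body)
    if len(compressed) >= len(body):
        # Already-compressed data (images, archives) may not shrink at all.
        return body, "identity"
    return compressed, "gzip"


if __name__ == "__main__":
    # Repetitive HTML, like a long listing page, compresses very well.
    html = b"<li class='ad'>Two-bedroom flat for rent</li>\n" * 200
    payload, encoding = compress_body(html)
    print(encoding, len(html), len(payload))
```

The encoding value corresponds to what a server would put in the Content-Encoding response header so that the browser knows to decompress the payload.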
Market opportunities
The anti-bot market around web data scraping does not seem to have been very developed until now. So the startup’s progress largely depends on whether it can kindle companies’ interest in this anti-scraping service. Elana Fine of the Dingman Center for Entrepreneurship recently gave Distil CEO Rami Essaid this advice: “…These markets are very new and require quite a bit of education for your potential clients” (source). Essaid’s reply: “We’ve been educating potential clients on the danger of bots through such activities as sending informative e-mails and holding educational webinars.” So market growth will provide more opportunities for this company and other similar services.