As the biggest scraper of all, Google itself doesn’t like it when somebody scrapes it, which makes life difficult for Google scrapers.
In this post I offer several hints on how to scrape Google safely (if you have still decided to do this).
The first thing Google scrapers need is a reliable proxy source, which lets you change your IP address. It goes without saying that any proxy you choose needs to be of the high-anonymity (elite) variety. You also need to be certain that the proxy is fast and has never been involved in any Google abuse before.
You can use any proxy solution, or read reviews of proxy services, to find providers that deliver quality IPs which have never been used to access Google before.
Use anywhere from 50 to 150 proxies for continuous scraping. The exact number depends on the average result set across your individual search queries; some projects will inevitably require more proxies.
Choose the right time to change your IP; this is critical for successful scraping. Change your IP after every keyword switch if you are receiving 300–1,000 results per keyword. If you are receiving fewer than 300 results, a single IP can scrape several keywords, though you may need to add a delay or increase the number of proxies you are using.
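The rotation rule above can be sketched roughly as follows. The 300-result threshold comes from the post; the batch size for "light" keywords is my own assumption:

```python
def rotation_batches(keyword_results, light_batch=5):
    """Group (keyword, expected_results) pairs into per-IP batches.

    A 'heavy' keyword (>= 300 expected results) gets a dedicated IP,
    per the rule above; 'light' keywords share one IP, up to
    light_batch of them (light_batch is an assumed value).
    """
    batches, current = [], []
    for keyword, results in keyword_results:
        if results >= 300:
            if current:                 # flush any pending light keywords
                batches.append(current)
                current = []
            batches.append([keyword])   # heavy keyword: its own IP
        else:
            current.append(keyword)
            if len(current) == light_batch:
                batches.append(current)
                current = []
    if current:
        batches.append(current)
    return batches
```

Each returned batch is then scraped through one proxy before rotating to the next.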
Be certain to clear all of your cookies after every IP change, or disable them entirely.
Google scrapers should never use threads unless they are needed. Threads are multiple scraping processes running at the same time. It is possible to scrape millions of results every day without them.
Add &num=100 to the search URL to set the maximum number of search results per page to 100.
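As a minimal sketch, such a URL can be assembled with Python's standard library; the parameter names follow Google's classic search-URL format:

```python
from urllib.parse import urlencode

def search_url(query, start=0, num=100):
    """Build a Google search URL requesting up to `num` results per page."""
    params = {"q": query, "num": num, "start": start}
    return "https://www.google.com/search?" + urlencode(params)
```

Combining num=100 with the start parameter pages through results 100 at a time, which means far fewer requests per keyword.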
Append additional keywords to your main search. Google makes it difficult to obtain more than 1,000 results for a single query, but by narrowing the query with extra keywords it is possible to collect almost all of the URLs for a topic.
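One way to do this is to fan a seed query out into many narrower variants; the modifier list below is purely illustrative, not from the post:

```python
def expand_query(seed, modifiers=("review", "price", "forum",
                                  "download", "alternative")):
    """Return the seed query plus one narrower variant per modifier.

    Each variant is scraped separately, so a topic capped at ~1,000
    results per query can yield many more URLs overall (duplicates
    across variants still need to be removed afterwards).
    """
    return [seed] + [f"{seed} {m}" for m in modifiers]
```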
To scrape reliably and avoid being graylisted or blacklisted, never send more than 500 requests per IP address in any 24-hour period.
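A minimal per-IP budget tracker for this 500-requests-per-24-hours rule might look like the following sketch (timestamps are passed in explicitly so the sliding-window logic is easy to test):

```python
import time
from collections import defaultdict, deque

class IpBudget:
    """Sliding-window request budget per IP (500 per 24h by default)."""

    def __init__(self, limit=500, window=24 * 3600):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)   # ip -> timestamps of past requests

    def allow(self, ip, now=None):
        """Record a request for `ip` if under budget; return True if allowed."""
        now = time.time() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] >= self.window:
            q.popleft()                  # drop requests older than the window
        if len(q) >= self.limit:
            return False                 # budget exhausted for this IP
        q.append(now)
        return True
```

Before each request, check `budget.allow(ip)` and rotate to another proxy when it returns False.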
If you get a captcha or a virus warning, stop what you are doing right away: a captcha means Google has detected your scraping activity. Increase the number of proxies; if you are already using more than 100, it may be necessary to switch to a different source for your IPs. Use a private proxy source such as those mentioned above. It is possible to scrape Google continuously without ever being detected.
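Reacting in time means spotting the block page in the first place. The markers below ("unusual traffic", the /sorry/ redirect path) are commonly reported signatures of Google's block page, but treat them as assumptions that can change:

```python
# Assumed block-page signatures; not guaranteed to be exhaustive or stable.
BLOCK_MARKERS = ("unusual traffic", "/sorry/", "captcha")

def looks_blocked(html):
    """Return True if the response body looks like Google's block/captcha page."""
    text = html.lower()
    return any(marker in text for marker in BLOCK_MARKERS)
```

When this returns True, retire the current IP immediately rather than retrying through it.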
5 replies on “Google Scraper Hints”
Very nice post. It gives us more information on Google scraping, and thanks for explaining a safe way to do it. Thanks for sharing valuable information.
Google doesn’t follow its own rules on scraping. Nice post, thanks for sharing.
Good post Michael.
Often when reading your posts I’d like to just vote up/down quickly, or rank a previous comment, versus this comment system here.
For garnering more feedback from readers – have you looked at something like http://disqus.com/websites/ or similar as well?
It seems to spur wider community contribution, and I’d like to see more feedback on your nicely constructed reviews.
Well, if you need to scrape google a lot, you should check scrapebox.
Now it is useless; Google detects that you are spamming your web page. Try it for free and you will see your page drop to the 80–90 position immediately.