Search queries in a search engine for scraping

I recently received a question about running search engine queries through web scraping software.

“I’m looking for a scraper program that can initiate search queries in a search engine automatically, using proxies would be an added benefit if possible.”  – Mike

Answer:

Search engine queries are ordinary URLs sent to the search engine. For example, a query for white doves on Google is the following URL:

https://www.google.ru/search?sourceid=chrome-psyapi2&ion=1&espv=2&ie=UTF-8&q=white%20doves&oq=white%20doves

So you can compose those URLs according to your needs, store them in one or more files (since there could be millions of them), and then feed them into a scraper program. When choosing a scraper, check that it can take its URL list from an external source.
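
As a rough illustration of composing such URLs and storing them in a file, here is a minimal Python sketch. It assumes that only the q parameter is needed and drops the browser-specific parameters from the example URL above; the function names and the output file name are hypothetical, not part of any scraper's API.

```python
from urllib.parse import quote, urlencode

# Assumption: the same host as in the example URL, with only the "q" parameter.
BASE_URL = "https://www.google.ru/search"


def build_search_url(query, base_url=BASE_URL):
    """Return a search URL for one query, percent-encoding spaces as %20."""
    return f"{base_url}?{urlencode({'q': query}, quote_via=quote)}"


def write_url_file(queries, path):
    """Write one search URL per line to `path` and return how many were written."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for query in queries:
            fh.write(build_search_url(query) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # "search_urls.txt" is a hypothetical file name to feed into a scraper.
    written = write_url_file(["white doves", "black swans"], "search_urls.txt")
    print(f"wrote {written} URLs")
```

Since the queries are processed one at a time, the same function works with a generator that yields millions of queries without holding them all in memory.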

Most scraping software accepts multiple input URLs. You just have to pick one you like: VWR, Mozenda, Content Grabber and Import.io all support it. With some scrapers you might need to use masks (regular expressions) to generate those URLs.

Scrape limit

When querying search engines, the main thing to watch out for is the limit they impose on search queries. If, for example, Google detects that someone is requesting too many pages at a time (thousands per second), it will eventually present a CAPTCHA. So using proxies is essential. See the post on using scraping software in conjunction with a proxy service.
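
If you script the requests yourself rather than use a ready-made scraper, the idea of spreading queries over proxies and slowing them down might look roughly like the Python sketch below. It is only an illustration under assumptions: the function names are hypothetical, the proxy addresses would come from your own proxy service, and the 5-second delay is an arbitrary placeholder, not a known safe rate for any search engine.

```python
import itertools
import time
import urllib.request


def assign_proxies(urls, proxies):
    """Pair each URL with a proxy, cycling through the proxy list round-robin."""
    if not proxies:
        raise ValueError("at least one proxy is required")
    return list(zip(urls, itertools.cycle(proxies)))


def fetch_via_proxy(url, proxy, timeout=10.0):
    """Download one URL through the given proxy (e.g. "http://host:port")."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, proxies, delay_seconds=5.0, fetch=fetch_via_proxy):
    """Fetch every URL through its assigned proxy, pausing between requests."""
    pages = {}
    for url, proxy in assign_proxies(urls, proxies):
        pages[url] = fetch(url, proxy)
        time.sleep(delay_seconds)  # assumed pause; tune to your own limits
    return pages
```

The fetch argument is replaceable so that the rotation logic can be tried out with a stub before any real request is sent.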

Good luck!
