Recently I received a note asking how to run search engine queries through web scraping software.
Search engine queries are just standard URLs pointed at the search engines. For example, a query for white doves on Google would be the following URL: https://www.google.com/search?q=white+doves
So you can compose those URLs according to your needs, store them in one or more files (since there could be millions of them) and then feed them into a scraper program. Look for an external URL-feed feature when choosing a scraper.
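As a minimal sketch of that workflow, the snippet below builds Google search URLs from a list of query terms and writes them one per line to a file a scraper could consume (the query terms and the `urls.txt` filename are just illustrative):

```python
from urllib.parse import urlencode

def google_query_url(terms):
    # Build a Google search URL; urlencode handles spaces and special chars.
    return "https://www.google.com/search?" + urlencode({"q": terms})

# Example query terms -- in practice this list could hold millions of entries.
queries = ["white doves", "grey pigeons"]
urls = [google_query_url(q) for q in queries]

# One URL per line, ready to feed into a scraper's external-URL input.
with open("urls.txt", "w") as f:
    f.write("\n".join(urls))
```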
Most scraping software can accept multiple URLs as input, so you just have to pick the tool you like: VWR, Mozenda, Content Grabber and Import.io all support it. With some scrapers you may need to use masks (regex-style templates) to generate those URLs.
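If your scraper lacks a mask feature, you can expand a template yourself. Here is a hedged sketch: a hypothetical mask with two placeholder slots is expanded into every combination of values, the same way a scraper's mask generator would:

```python
from itertools import product
from urllib.parse import quote_plus

# Hypothetical placeholder values for a mask like "{color} {bird}".
colors = ["white", "grey"]
birds = ["doves", "pigeons"]

# Expand the mask into one search URL per combination.
urls = [
    f"https://www.google.com/search?q={quote_plus(f'{c} {b}')}"
    for c, b in product(colors, birds)
]
```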
When querying search engines, the main thing to watch out for is the rate limit they impose on search queries. If, for example, Google detects that someone is requesting too many pages at a time (thousands per second), it will eventually serve a CAPTCHA. So using proxies is essential. See the post on using a software scraper in conjunction with a proxy service.
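The usual way to stay under those limits is to rotate through a proxy pool and pause between requests. A minimal sketch, assuming a hypothetical pool of three proxy addresses (substitute the ones from your proxy service):

```python
import time
from itertools import cycle

# Hypothetical proxy pool; replace with addresses from your proxy service.
proxies = cycle([
    "http://10.0.0.1:8080",
    "http://10.0.0.2:8080",
    "http://10.0.0.3:8080",
])

def next_proxy(delay=1.0):
    # Pause to stay under the search engine's rate limit,
    # then hand back the next proxy in round-robin order.
    time.sleep(delay)
    return next(proxies)
```

Each scrape request would then be sent through `next_proxy()`, so consecutive queries come from different IP addresses.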