I want to share how I’ve done the audio captcha recognize-er. The audio captcha recognize-er was designed to solve captcha at xbox.com back in 2012.
Category: Development
Web scraping with JavaScript
Is it possible to scrape an HTML page with JavaScript from inside of a web browser?
To be perfectly honest I wasn’t sure so I decided to try it out.
Full disclaimer here, I didn’t actually succeed. However, it was a great learning experience for me and I think you guys could benefit from seeing what I did and where I went wrong. Who knows, maybe you can take what I’ve done and figure it out for yourself!
How to parse messy encoded HTML
Let’s suppose you want to extract a price with a currency sign from a web page (eg. £220.00), but its HTML code is this:
which is obviously encoded HTML.
In this post, I’d like to demonstrate how to leverage the Dexi.io (CloudScrape) API along with its PHP Client library (also avail in Ruby and C#).
Some of you may be wondering if it’s possible to extract a web browser’s local storage by web scraping?
I’ve already written about how the new No CAPTCHA ReCaptcha works, and even had some success breaking it with an iMacros’ browser automation. But, the latest scraping tools are – for most part – driven by Python, so now I want to try the same experiment with Selenium + Python.
Recently I’v been getting requests for a tutorial showing how to solve Google’s No CAPTCHA ReCaptcha. I’ve introduced it before and promised to work out a script to automate solving it. And here’s what I’ve come up with.
Professional data extraction requires adequate proxying to keep anonymity of scraping robots. When attempting to extract large data sets (over 1M records, ex. business directories) reliable and fast proxy service is needed.
Sequentum has released the Nohodo proxy service integration for Content Grabber. Nohodo provides a free account for Content Grabber users (up to 5000 requests monthly for free). The feature is available for both trial users and regular customers. Here’s how it works…
Suppose I run a query to import.io API:
$url = "https://query.import.io/store/connector/" . $connectorGuid . "/_query?_user=" . urlencode($userGuid) . "&_apikey=" . urlencode($apiKey);
“HI there can you please tell me that what are connector-guid, user-guid and api key in below given code and how to get them for any website?”
I came across this question on StackOverflow, and as an avid import.io user I thought I’d answer it here as well, in case any of you have the same issue.
Recently Import.io introduced a new extraction technique called Magic. The Magic scraping method works be attempting to scrape all the information off the page automatically and in one shot. We covered it in another post early last year. When we covered it back then, we noted a few issues:
- The scraper only works on pages with more than one row of data like a search results page, category pages and etc.
- It seems to have trouble with some javascript pages.
But now Import.io has released a second version of Magic which seems to have dealt with those obstacles. Not only that, but they have released an API for Magic that lets you see what’s going on behind the scenes.