We’ve already shared some Tips and Tricks for scraping business directories and data aggregator sites. Recently, someone asked us about scraping aggregators from within Google Sheets and/or MS Excel.
Indeed, both Google Sheets and MS Excel provide built-in functions for extracting custom data from external web pages and even applying regex to it (for example, IMPORTXML and REGEXEXTRACT in Google Sheets). Here is the inquiry we received:
I need a REGEX for Google sheets to extract Social Media URLs starting with:
Both LinkedIn and Xing.com are business directories. They don’t expose their data that easily. Both Google Sheets and Excel only allow you to fetch data with simple queries. These queries cannot set a custom User-Agent, store cookies for persistent sessions, or route requests through a proxy… so they will not return any real results from such sites.
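To make the gap concrete, here is a minimal Python sketch (standard library only) of the things a spreadsheet query cannot do: send a custom User-Agent, keep cookies between requests, and optionally go through a proxy. The URL pattern is our own assumption, since the exact prefixes from the inquiry are not shown; it simply matches links hosted on linkedin.com or xing.com. The names `make_opener`, `extract_profile_urls` and `fetch_profile_urls` are hypothetical helpers, not part of any library.

```python
import re
import urllib.request
from http.cookiejar import CookieJar

# Assumed pattern: any http(s) link on linkedin.com or xing.com,
# with an optional subdomain such as "www." or "de.".
PROFILE_URL = re.compile(
    r"https?://(?:[a-z0-9-]+\.)?(?:linkedin|xing)\.com/[^\s\"'<>]*",
    re.IGNORECASE,
)


def make_opener(user_agent, proxy=None):
    """Build an opener with a custom User-Agent, a cookie jar and an optional proxy."""
    handlers = [urllib.request.HTTPCookieProcessor(CookieJar())]
    if proxy:
        # Route both plain and TLS traffic through the same proxy address.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    opener = urllib.request.build_opener(*handlers)
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


def extract_profile_urls(html):
    """Return unique LinkedIn/Xing URLs in order of first appearance."""
    return list(dict.fromkeys(PROFILE_URL.findall(html)))


def fetch_profile_urls(page_url, opener):
    """Download a page through the opener and pull the matching URLs out of it."""
    with opener.open(page_url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_profile_urls(html)
```

Even with these options in place, a plain HTTP client may still be blocked or receive a page whose content is filled in by JavaScript, so treat this as an illustration of the missing features rather than a working scraper for these sites.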
Browser automation with Selenium or Puppeteer comes to the rescue.
We recommend reading some of our posts on browser automation:
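As a rough illustration of the browser-automation route, the sketch below loads a page in headless Chrome and collects the links that point to LinkedIn or Xing. It assumes Selenium 4 with a local Chrome installation; `TARGET_HOSTS`, `is_profile_link` and `collect_links` are hypothetical names chosen for this example, and the host list is taken from the two sites mentioned above.

```python
from urllib.parse import urlparse

# Assumed target hosts: the two directories named in the text.
TARGET_HOSTS = ("linkedin.com", "xing.com")


def is_profile_link(url):
    """True if the URL's host is one of the target hosts or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in TARGET_HOSTS)


def collect_links(page_url):
    """Render the page in headless Chrome and return the matching link targets."""
    # Imported lazily so is_profile_link() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(page_url)
        anchors = driver.find_elements(By.CSS_SELECTOR, "a[href]")
        hrefs = [a.get_attribute("href") for a in anchors]
    finally:
        driver.quit()  # always release the browser, even on errors
    return [h for h in hrefs if h and is_profile_link(h)]
```

Filtering on the parsed hostname rather than on a substring avoids false positives such as a link to another site that merely contains "linkedin.com" in its path.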
- Scraping a JS dependent website with Puppeteer
- JAVA, Selenium, headless Chrome, JSoup to scrape data of the web
- Xing.com scrape
- Linkedin scrape
- Linkedin scrape legal issue
- Selenium posts