Categories
Challenge Development

Scraping a Javascript-dependent website with puppeteer

Support us by purchasing the book (under $5) on this topic.

In today’s web 2.0 many business websites utilize JavaScript to protect their content from web scraping or any other undesired bot visits. In this article we share with you the theory and practical fulfillment of how to scrape js-dependent/js-protected websites.

Categories
Development

JAVA library to scrape Linkedin & its data affiliates

In this post we want to share with you a new useful JAVA library that helps to crawl and scrape Linkedin companies. Get business directories scraped!

If you are considering the Linkedin data scrape legal issues, please refer to the following post: Linkedin lost in court to data analytic company that scrapes Linkedin’s public profiles info
Categories
Development

Scrape a JS Lazy load page by Python requests

The JS loading page is usually scraped by Selenium or another browser emulator. Yet, for a certain shopping website we’ve
found a way to perform a pure Python requests scrape.

Categories
Development

Bulk db prepared insert with rollback even if 1 record fails, PHP

Recently I needed to make a bulk insert into db with   prepared statement query. The task was to do it so that if one record failed one can rollback all records and return an error. That way no data is affected by faulty code and/or wrong data provided.

Categories
Development

Problem scraping javascript site – help needed

Problem

I am trying to scrape the page https://tienda.mercadona.es/categories/112 and I have installed the docker and followed all the required steps given in the post. Splash works well, but the spyder does not and I don’t know why. The IP of the splash_url is correct but I can’t see in the response object when I write scrapy shell “webpage” the complete page, ie, the page has not rendered correctly.

Categories
Development

Scrape text, parse it with BeautifulSoup and save it as Pandas data frame

We want to share with you how to scrape text and store it as Pandas data frame using BeautifulSoup (Python). The code below works to store html li items in the ‘engine, ‘trans’, ‘colour’ and ‘interior’ columns.

from bs4 import BeautifulSoup
import pandas as pd
import requests

main_url = "https://www.example.com/"

def getAndParseURL(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return(soup)

soup = getAndParseURL(main_url) 
ul = soup.select('ul[class="list-inline lot-breakdown-list"] li', recursive=True)
lis_e = []
for li in ul:
    lis = []
    lis.append(li.contents[1])
    lis_e.extend(lis)

engine.append(lis_e[0])
trans.append(lis_e[1])
colour.append(lis_e[2])
interior.append(lis_e[3])

scraped_data = pd.DataFrame({'engine': engine, 
'transmission': trans, 'colour': colour,
'interior': interior})
By default, Beautiful Soup searches through all of the child elements. So, setting recursive = False (line 13) will restrict the search to the first found element and its child only.

The code was provided by Ahmed Soliman.

Categories
Development

Download a file from a link in Python

I recently got a question and it looked like this : how to download a file from a link in Python?

“I need to go to every link which will open a website and that would have the download file “Export offers to XML”. This link is javascript enabled.”

Let us consider how to get a file from a JS-driven weblink using Python :
Categories
Development

Using DOMXPath for parsing page content in PHP

The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
After I’ve done a simple PHP/cURL scraper using Regex some have reasonably mentioned a request for a more efficient scrape with XPath. So, instead of parsing the content with Regex, I used DOMXPath class methods.

Categories
Development

Python LinkedIn downloader

We’ve done the Linkedin scraper that downloades the free study courses. They include text data, exercise files and 720HD videos. The code does not represent the pure Linkedin scraper, a business directory data extractor. Yet, you might grasp the main thoughts and useful techniques for your Linkedin scraper development.

Categories
Challenge Development

Scrape with Google App Script

In this post I want to let you how I’ve managed to complete the challenge of scraping a site with Google Apps Script (GAS).