Categories
Development

Problem scraping javascript site – help needed

Problem

I am trying to scrape the page https://tienda.mercadona.es/categories/112. I have installed Docker and followed all the required steps given in the post. Splash works well, but the spider does not, and I don’t know why. The IP in splash_url is correct, but when I run scrapy shell “webpage”, the response object does not contain the complete page, i.e., the page has not been rendered correctly.
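
One thing worth checking, offered as a hedged sketch rather than a confirmed diagnosis: scrapy shell issues a plain request for the URL it is given, so the page is not passed through Splash unless the URL itself points at Splash's HTTP API (the render.html endpoint). The helper below only builds such a URL. The function name splash_render_url, the host http://localhost:8050 and the wait/timeout values are assumptions for illustration; substitute your own splash_url.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, splash_url="http://localhost:8050",
                      wait=2.0, timeout=30):
    """Build a Splash render.html URL for page_url.

    splash_url, wait and timeout are illustrative defaults (assumptions);
    use the address of your own Splash container.
    """
    query = urlencode({"url": page_url, "wait": wait, "timeout": timeout})
    return f"{splash_url.rstrip('/')}/render.html?{query}"


render_url = splash_render_url("https://tienda.mercadona.es/categories/112")
print(render_url)
```

The printed URL can then be passed to scrapy shell in quotes, so that the response holds the HTML as rendered by Splash instead of the raw page source.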

Categories
Development

Scrape text, parse it with BeautifulSoup and save it as Pandas data frame

We want to share how to scrape text and store it as a Pandas data frame using BeautifulSoup (Python). The code below stores HTML li items in the ‘engine’, ‘transmission’, ‘colour’ and ‘interior’ columns.

from bs4 import BeautifulSoup
import pandas as pd
import requests

main_url = "https://www.example.com/"

def getAndParseURL(url):
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return(soup)

soup = getAndParseURL(main_url)
ul = soup.find('ul', class_='list-inline lot-breakdown-list').find_all('li', recursive=True)
# Collect the second child node of every li item
lis_e = []
for li in ul:
    lis_e.append(li.contents[1])

# One list per column; the first four values form one row
engine, trans, colour, interior = [], [], [], []
engine.append(lis_e[0])
trans.append(lis_e[1])
colour.append(lis_e[2])
interior.append(lis_e[3])

scraped_data = pd.DataFrame({'engine': engine,
                             'transmission': trans,
                             'colour': colour,
                             'interior': interior})

By default, Beautiful Soup's find_all() searches through all descendants of a tag (recursive=True). Setting recursive=False (line 13) restricts the search to the direct children of the tag only. Note that recursive is an argument of find() and find_all(), not of the CSS-based select(), which is why line 13 uses find_all().

The code was provided by Ahmed Soliman.

Categories
Development

Download a file from a link in Python

I recently got a question that looked like this: how to download a file from a link in Python?

“I need to go to every link, each of which opens a website that has the download file ‘Export offers to XML’. This link is JavaScript-enabled.”

Let us consider how to get a file from a JS-driven web link using Python:
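
As a minimal starting point, and not the method of the original post, the sketch below streams a file to disk with the standard library once the real file URL is known. With a JavaScript-driven link, that URL usually has to be discovered first (for example in the browser's network tab); that step is assumed to be done here. The function name download_file and the fallback file name download.bin are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def download_file(url, dest_dir="."):
    """Stream the resource at url into dest_dir and return the local path."""
    # Derive a file name from the URL path; fall back to a generic name.
    name = Path(unquote(urlparse(url).path)).name or "download.bin"
    target = Path(dest_dir) / name
    # Copy in chunks so large files are not loaded into memory at once.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

A call such as download_file(file_url, "exports") would save the file into an existing exports directory; file_url stands for whatever address the JavaScript link finally resolves to.
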
Categories
Development

Using DOMXPath for parsing page content in PHP

The DOMXPath class is a convenient and popular means of parsing HTML content with XPath.
After I built a simple PHP/cURL scraper using regex, some readers reasonably asked for a more efficient scraper based on XPath. So, instead of parsing the content with regex, I used the DOMXPath class methods.

Categories
Development

Python LinkedIn downloader

We’ve built a LinkedIn scraper that downloads the free study courses. They include text data, exercise files and 720HD videos. The code is not a pure LinkedIn scraper in the sense of a business directory data extractor. Yet you might pick up the main ideas and useful techniques for developing your own LinkedIn scraper.

Categories
Challenge Development

Scrape with Google Apps Script

In this post I want to tell you how I managed to complete the challenge of scraping a site with Google Apps Script (GAS).

Categories
Development Miscellaneous

Extracting sequential HTML elements with XPath and Regex

Often, we need to extract HTML elements that are ordered sequentially rather than hierarchically.
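
To make the idea concrete, here is a small hedged sketch in Python. It does not use XPath or regex as the post does; it simply walks sibling elements in document order with the standard library's xml.etree.ElementTree and attaches each one to the heading that precedes it. The sample markup, the h2/p tag names and the function name group_by_heading are all made up for illustration, and the input is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET

# Flat markup: values follow their heading as siblings, not as children.
SAMPLE = """<div>
<h2>Engine</h2><p>2.0 L</p><p>Petrol</p>
<h2>Colour</h2><p>Red</p>
</div>"""


def group_by_heading(root, heading="h2"):
    """Map each heading's text to the texts of the siblings that follow it."""
    groups = {}
    current = None
    for child in root:  # direct children, in document order
        text = (child.text or "").strip()
        if child.tag == heading:
            current = text
            groups[current] = []
        elif current is not None:
            groups[current].append(text)
    return groups


sections = group_by_heading(ET.fromstring(SAMPLE))
print(sections)
```

Here the grouping is driven purely by element order: every p element belongs to the nearest h2 before it, even though the markup does not nest them.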

Categories
Development

Python: submit authenticated form using cookie and session

Recently, I was challenged to do bulk submits through an authenticated form. The website required a login. While there are plenty of examples of how to use POST and GET in Python, I want to share with you how I handled the session along with a cookie and authenticity token (CSRF-like protection).

In the post, we are going to cover the crucial techniques needed when scripting a web scraper:

  • persistent session usage
  • cookie finding and storing [in session]
  • “auth token” finding, retrieving and submitting in a form
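
The three techniques above can be sketched with the standard library alone. This is an illustrative outline under assumptions, not the code from the post: the hidden field name authenticity_token, the helper names (extract_token, make_session, submit_form) and the URLs passed in are hypothetical and depend on the target site.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar


class TokenParser(HTMLParser):
    """Find the value of a named <input> field (e.g. a hidden auth token)."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == self.field_name:
            self.token = attrs.get("value")


def extract_token(html, field_name="authenticity_token"):
    """Return the token value found in html, or None if it is absent."""
    parser = TokenParser(field_name)
    parser.feed(html)
    return parser.token


def make_session():
    """Persistent session: an opener that stores and resends cookies."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def submit_form(opener, form_url, post_url, fields,
                token_field="authenticity_token"):
    """GET the form, read its token, then POST fields plus the token."""
    html = opener.open(form_url).read().decode("utf-8", errors="replace")
    data = dict(fields, **{token_field: extract_token(html, token_field)})
    body = urllib.parse.urlencode(data).encode()
    return opener.open(post_url, data=body)
```

Reusing the same opener for the login request and for every later submit is what keeps the session cookie alive between requests.
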
Categories
Development

Bypass Distil

Distil scrape protection is prominent among modern anti-scrape techniques. So, we now want to share with you some tips on how to bypass it. If you are interested, please send an inquiry to the following email: igor[dot]savinkin[at]gmail[dot]com


Categories
Development

Scraping JavaScript protected content

Here we come to a new milestone: scraping JavaScript-driven, or JS-rendered, websites.

Recently a friend of mine got stumped as he was trying to get the content of a website using the PHP simplehtmldom library. He kept failing and finally found out that the site was saturated with JavaScript code. The anti-scrape JavaScript insertions perform a tricky check to see whether the page is requested and processed by a real browser, and only if that is true do they render the rest of the page’s HTML code.