Business directory simple scraper (python) at pythonanywhere

My goal was to retrieve data from a web business directory.

Besides SERP scraping, scraping business directories is the most challenging scraping task, so there were some basic questions for me to answer first:

  1. Is there any scrape protection set at that site?
  2. How much data is in that web business directory?
  3. What kind of queries can I run to find all the directory’s items?

Scraping a Javascript-dependent website with puppeteer

Many business websites today rely on JavaScript to protect their content from web scraping and other unwanted bot visits. In this article we share the theory and practice of scraping JS-dependent/JS-protected websites.

Java library to scrape LinkedIn & its data affiliates

In this post we want to share with you a useful new Java library that helps to crawl and scrape LinkedIn companies. Get business directories scraped!

If you are concerned about the legal side of scraping LinkedIn data, please refer to the following post: LinkedIn lost in court to data analytic company that scrapes LinkedIn’s public profiles info

Scrape a JS Lazy load page by Python requests

Pages that load their content with JavaScript are usually scraped with Selenium or another browser emulator. Yet for a certain shopping website we’ve found a way to scrape it with pure Python requests.
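
As a rough illustration of the idea (not the method from the post itself): lazy-load pages often pull their items from a paginated JSON endpoint, which can be called directly. The sketch below assumes such an endpoint exists; the `api_url`, the `page` query parameter and the `"items"` key are all hypothetical and would have to be read from the browser's network tab for a real site.

```python
from typing import Callable, Iterator, List


def iter_lazy_items(fetch_page: Callable[[int], List], max_pages: int = 50) -> Iterator:
    """Yield items page by page until a page comes back empty."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page)
        if not items:
            break  # an empty page is taken to mean there is nothing left to load
        yield from items


def make_requests_fetcher(api_url: str) -> Callable[[int], List]:
    """Build a fetch_page function backed by the requests library."""
    import requests  # imported lazily so the paging logic runs without it

    session = requests.Session()

    def fetch_page(page: int) -> List:
        # "page" and "items" are assumed names, not taken from any real site
        resp = session.get(api_url, params={"page": page}, timeout=10)
        resp.raise_for_status()
        return resp.json()["items"]

    return fetch_page
```

A call such as `list(iter_lazy_items(make_requests_fetcher(api_url)))` would then collect every item without a browser emulator, provided the endpoint behaves as assumed.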

Bulk db prepared insert with rollback even if 1 record fails, PHP

Recently I needed to do a bulk insert into a database using a prepared statement. The task was to do it so that if a single record fails, all records are rolled back and an error is returned. That way no data is affected by faulty code and/or wrong input data.
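
The original post is about PHP; purely as an illustration of the same all-or-nothing idea, here is a minimal sketch in Python with the standard sqlite3 module. The `users` table and its columns are hypothetical.

```python
import sqlite3


def bulk_insert(conn, rows):
    """Insert every row or none of them. Returns (ok, error_message)."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if any statement raises.
        with conn:
            conn.executemany(
                "INSERT INTO users (name, email) VALUES (?, ?)",  # parameterized statement
                rows,
            )
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)


# Hypothetical schema, kept in memory for the example
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT NOT NULL, email TEXT NOT NULL UNIQUE)")
conn.commit()
```

If one row violates a constraint (for example a duplicate email), the rows inserted before it in the same call are rolled back as well, and the caller gets the error text back.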

Problem scraping javascript site – help needed

Problem

I am trying to scrape the page https://tienda.mercadona.es/categories/112. I have installed Docker and followed all the required steps given in the post. Splash works well, but the spider does not and I don’t know why. The IP in splash_url is correct, but when I run scrapy shell "webpage" the response object does not contain the complete page, i.e. the page has not been rendered correctly.

Scrape text, parse it with BeautifulSoup and save it as Pandas data frame

We want to share with you how to scrape text and store it as a Pandas data frame using BeautifulSoup (Python). The code below stores HTML li items in the 'engine', 'transmission', 'colour' and 'interior' columns.

from bs4 import BeautifulSoup
import pandas as pd
import requests

main_url = "https://www.example.com/"

def getAndParseURL(url):
    # Download the page and parse it with the built-in HTML parser
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')
    return soup

soup = getAndParseURL(main_url)
# select() takes a CSS selector and returns every matching li element
ul = soup.select('ul[class="list-inline lot-breakdown-list"] li')
lis_e = []
for li in ul:
    # contents[1] is the second child node of each li,
    # assumed here to hold the value we want
    lis_e.append(li.contents[1])

# One list per data frame column
engine, trans, colour, interior = [], [], [], []
engine.append(lis_e[0])
trans.append(lis_e[1])
colour.append(lis_e[2])
interior.append(lis_e[3])

scraped_data = pd.DataFrame({'engine': engine,
                             'transmission': trans,
                             'colour': colour,
                             'interior': interior})

By default, Beautiful Soup's find_all() and find() search through all descendants of a tag; passing recursive=False restricts the search to the tag's direct children only. The recursive argument belongs to those methods, not to select(); in a CSS selector the child combinator (>) gives the same restriction.
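
To make the effect of recursive concrete, here is a small sketch on an inline HTML snippet. The markup is made up for the example; only the class names mirror the selector used in the code.

```python
from bs4 import BeautifulSoup

# Made-up markup: the first li contains a nested list
html = """
<ul class="list-inline lot-breakdown-list">
  <li><b>Engine</b> 2.0 litre<ul><li>nested note</li></ul></li>
  <li><b>Colour</b> Red</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("ul", class_="lot-breakdown-list")

# Default: every li descendant, including the nested one
all_lis = outer.find_all("li")

# recursive=False: only li elements that are direct children of the outer ul
direct_lis = outer.find_all("li", recursive=False)

# CSS equivalent of the direct-children search, using the child combinator
css_direct_lis = soup.select("ul.lot-breakdown-list > li")
```

Here `all_lis` also picks up the nested note, while `direct_lis` and `css_direct_lis` contain only the two top-level items.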

The code was provided by Ahmed Soliman.

Download a file from a link in Python

I recently got this question: how do you download a file from a link in Python?

“I need to go to every link which will open a website and that would have the download file “Export offers to XML”. This link is javascript enabled.”

Let us consider how to get a file from a JS-driven web link using Python:
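
As one possible illustration, and not necessarily the approach of the full post: once the direct file URL behind the JS-driven link is known (for example, from the browser's network tab), the file can be streamed to disk with the standard library alone. The function name `download_file` and the file names are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path) -> Path:
    """Stream the resource at `url` into `dest` and return the destination path."""
    with urllib.request.urlopen(url, timeout=30) as response, open(dest, "wb") as out:
        # Copy in chunks so large exports are never held in memory at once
        shutil.copyfileobj(response, out)
    return dest
```

If the link only resolves after JavaScript has run, a browser emulator would still be needed to discover the final URL; this sketch covers only the download step.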

Using DOMXPath for parsing page content in PHP

The DOMXPath class is a convenient and popular means to parse HTML content with XPath.
After I published a simple PHP/cURL scraper that used regex, some readers reasonably asked for a more efficient scrape with XPath. So, instead of parsing the content with regex, I used the DOMXPath class methods.
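
DOMXPath itself is a PHP class; to show the same idea of querying by path rather than by regex, here is a hedged Python sketch using the standard library's ElementTree, which supports only a limited XPath subset and needs well-formed markup. The page snippet and class names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup standing in for a downloaded page
page = """<html><body>
<div class="listing"><a class="item" href="/a">Alpha</a></div>
<div class="listing"><a class="item" href="/b">Beta</a></div>
<a class="nav" href="/home">Home</a>
</body></html>"""

root = ET.fromstring(page)

# Select only the item links inside listing blocks; the nav link is skipped
links = [
    (a.text, a.get("href"))
    for a in root.findall(".//div[@class='listing']/a[@class='item']")
]
```

The path expression states which elements are wanted by their position and attributes, which tends to survive small markup changes better than a regex over the raw HTML.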

Python LinkedIn downloader

We’ve built a LinkedIn scraper that downloads the free study courses. They include text data, exercise files and 720p HD videos. The code is not a pure LinkedIn scraper, that is, a business directory data extractor. Yet you might pick up the main ideas and useful techniques for developing your own LinkedIn scraper.