We’ve built a LinkedIn scraper that downloads the free study courses, including text data, exercise files, and 720p HD videos. The code is not a general-purpose LinkedIn scraper or a business directory data extractor, but it illustrates the main ideas and useful techniques for developing your own LinkedIn scraper.
Important points about the LinkedIn scraper
1. LinkedIn allows only Premium account users to download courses.
2. LinkedIn does not allow downloading courses from a desktop; it requires a dedicated Android or iOS app.
3. Although LinkedIn's built-in JavaScript confirms user login, the code logs in without executing any JS, and this login has worked well.
4. Requests were made for videos, features, and pages.
5. Overall, a single course took about 120 requests.
6. LinkedIn often raised security alarms because proxies were not in use.
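Since no proxies were used and a single course takes about 120 requests, throttling the request rate can reduce the chance of triggering LinkedIn's security alarms. A minimal sketch of a randomized-delay wrapper (the helper name and delay range are assumptions, not part of the original code):

```python
import random
import time


def polite_get(session, url, min_delay=1.0, max_delay=3.0, **kwargs):
    """Sleep a random interval before each request to mimic human pacing."""
    time.sleep(random.uniform(min_delay, max_delay))
    return session.get(url, **kwargs)
```

Randomizing the delay (rather than sleeping a fixed interval) makes the traffic pattern look less mechanical.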
Tech stack
- Python 3.6
- requests
- lxml
Login process
Step 1. Using XPath, we extract the CSRF token and the other hidden form fields to use in the session data.
# Looking for the CSRF token and other hidden form fields
import lxml.html

# body is the HTML of the LinkedIn login page fetched earlier in the session
html = lxml.html.fromstring(body)
csrf = html.xpath("//input[@name='loginCsrfParam']/@value").pop()
sIdString = html.xpath("//input[@name='sIdString']/@value").pop()
parentPageKey = html.xpath("//input[@name='parentPageKey']/@value").pop()
pageInstance = html.xpath("//input[@name='pageInstance']/@value").pop()
loginCsrfParam = html.xpath("//input[@name='loginCsrfParam']/@value").pop()
fp_data = html.xpath("//input[@name='fp_data']/@value").pop()
d = html.xpath("//input[@name='_d']/@value").pop()
controlId = html.xpath("//input[@name='controlId']/@value").pop()
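The eight XPath lookups above all follow the same pattern, so they could be collapsed into a small helper. A sketch (the helper name is an assumption; it also raises a clear error when a field is missing instead of failing on an empty-list pop):

```python
import lxml.html


def hidden_input(html, name):
    """Return the value of a hidden <input> field, looked up by its name attribute."""
    values = html.xpath(f"//input[@name='{name}']/@value")
    if not values:
        raise ValueError(f"hidden input '{name}' not found on the page")
    return values[0]
```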
Data for session:
data = {
    "session_key": username,
    "session_password": password,
    "csrfToken": csrf,
    'ac': 0,
    'sIdString': sIdString,
    'parentPageKey': parentPageKey,
    'pageInstance': pageInstance,
    'trk': '',
    'authUUID': '',
    'session_redirect': '',
    'loginCsrfParam': loginCsrfParam,
    'fp_data': fp_data,
    '_d': d,
    'controlId': controlId,
}
After the script collects the session data, it uses it to log in:
URL = urljoin('https://www.linkedin.com', 'checkpoint/lg/login-submit')
session.post(URL, data=data, headers={'user-agent': 'Mozilla/5.0'})
Step 2. The script makes the actual login request, which returns a session with the authentication cookies set.
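Putting the two steps together, the whole login flow might look like the sketch below. The URLs and field names mirror the snippets above; the `build_login_data` helper and the fallback to an empty string for a missing field are assumptions, and `csrfToken` reuses the `loginCsrfParam` value just as the extraction snippet does.

```python
import lxml.html
import requests
from urllib.parse import urljoin

# Hidden form fields harvested from the login page, as in the snippet above.
TOKEN_FIELDS = ['sIdString', 'parentPageKey', 'pageInstance',
                'loginCsrfParam', 'fp_data', '_d', 'controlId']


def build_login_data(body, username, password):
    """Extract the hidden form fields from the login page and merge in credentials."""
    html = lxml.html.fromstring(body)
    data = {name: (html.xpath(f"//input[@name='{name}']/@value") or [''])[0]
            for name in TOKEN_FIELDS}
    data['csrfToken'] = data['loginCsrfParam']  # mirrors the extraction snippet
    data.update({'session_key': username, 'session_password': password,
                 'ac': 0, 'trk': '', 'authUUID': '', 'session_redirect': ''})
    return data


def login(username, password):
    session = requests.Session()
    headers = {'user-agent': 'Mozilla/5.0'}
    # Step 1: fetch the login page and collect the hidden form fields.
    body = session.get('https://www.linkedin.com/login', headers=headers).text
    data = build_login_data(body, username, password)
    # Step 2: submit the form; on success the session carries the auth cookies.
    url = urljoin('https://www.linkedin.com', 'checkpoint/lg/login-submit')
    session.post(url, data=data, headers=headers)
    return session
```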
Download process
The rest of the downloading process is straightforward. See the download procedure:
import logging
import os

def download_file(url, output):
    with session.get(url, headers=HEADERS, allow_redirects=True) as r:
        try:
            # Write the response body to the output file
            open(output, 'wb').write(r.content)
        except Exception as e:
            logging.exception(f"[!] Error while downloading: '{e}'")
            # Remove a partially written file
            if os.path.exists(output):
                os.remove(output)
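Since 720p course videos can run to hundreds of megabytes, a streamed variant of the download procedure that writes in chunks may be friendlier on memory than buffering the whole response in `r.content`. A sketch (the function name and 1 MiB chunk size are assumptions):

```python
import logging
import os


def download_file_streamed(session, url, output, chunk_size=1 << 20):
    """Download url to output in chunks instead of buffering the body in memory."""
    try:
        with session.get(url, stream=True, allow_redirects=True) as r:
            r.raise_for_status()
            with open(output, 'wb') as f:
                for chunk in r.iter_content(chunk_size=chunk_size):
                    f.write(chunk)
    except Exception as e:
        logging.exception(f"[!] Error while downloading: '{e}'")
        # Drop the partially written file
        if os.path.exists(output):
            os.remove(output)
```

With `stream=True`, requests defers the body download until `iter_content` is consumed, so at most one chunk sits in memory at a time.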
Acknowledgement
The code was provided by Ahmed Soliman; see the full project on GitHub.