Use Python to Scrape LinkedIn Profiles

Categories

  • web scraping

Tags

  • linkedin
  • linkedin profiles
  • linkedin links
  • linkedin list
  • google scrape

LinkedIn is a great place to find leads and engage with prospects. To engage with potential leads, you’ll need a list of users to contact. However, building that list can be difficult because LinkedIn actively blocks web scraping tools. That is why I made a script that searches Google for potential LinkedIn user and company profiles instead.

Tools Required

You’ll need Python 2.7+ (Python 3 works as well) and the requests package to get started; everything else used here is in the standard library. Once you have Python installed, run the following command to install the necessary package.

pip install requests

LinkedIn Scraper Script

First we import all the packages we need. The random and requests packages are used for randomizing the user agent and making the HTTP requests, and re is used to parse the LinkedIn profiles and links out of the HTML (argparse is imported in case you want to run the script from the command line).

import random
import argparse
import requests
import re

We create a LinkedinScraper class that tracks and holds the data for each of the requests. The class requires two parameters, keyword and limit. The keyword parameter specifies the search term, and the limit parameter specifies the maximum number of links to search for.

class LinkedinScraper(object):
  def __init__(self, keyword, limit):
      """
      :param keyword: a str of keyword(s) to search for
      :param limit: number of profiles to scrape
      """
      self.keyword = keyword.replace(' ', '%20')
      self.all_htmls = ""
      self.quantity = '100'
      self.limit = int(limit)
      self.counter = 0

The LinkedinScraper class has three main functions: search, parse_links, and parse_people.

The search function performs the requests based on the keyword. It first builds a Google-specific query URL from the keyword and the current result offset, then makes the request and appends all of the returned HTML to self.all_htmls.

def search(self):
    """
    perform the search
    :return: a list of htmls from Google Searches
    """
    
    # choose a random user agent
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1464.0 Safari/537.36',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0) chromeframe/10.0.648.205',
        'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
        'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.19 (KHTML, like Gecko) Ubuntu/11.10 Chromium/18.0.1025.142 Chrome/18.0.1025.142 Safari/535.19',
        'Mozilla/5.0 (Windows NT 5.1; U; de; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 Opera 11.00'
    ]
    while self.counter < self.limit:
        headers = {'User-Agent': random.choice(user_agents)}
        url = 'http://google.com/search?num=100&start=' + str(self.counter) + '&hl=en&meta=&q=site%3Alinkedin.com/in%20' + self.keyword
        resp = requests.get(url, headers=headers)
        # Google serves a captcha page once it detects automated traffic
        if "Our systems have detected unusual traffic from your computer network." in resp.text:
            print("Running into captchas")
            return

        self.all_htmls += resp.text
        self.counter += 100
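To make the query concrete, here is the URL the loop builds on its first iteration for a two-word keyword (note that __init__ has already encoded spaces as %20):

```python
keyword = "Tesla Motors".replace(' ', '%20')  # mirrors what __init__ does
counter = 0
url = 'http://google.com/search?num=100&start=' + str(counter) + '&hl=en&meta=&q=site%3Alinkedin.com/in%20' + keyword
print(url)
# http://google.com/search?num=100&start=0&hl=en&meta=&q=site%3Alinkedin.com/in%20Tesla%20Motors
```

The site%3Alinkedin.com/in%20 prefix is the URL-encoded form of "site:linkedin.com/in ", which restricts Google to LinkedIn profile pages.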

The parse_links function searches the saved HTML and uses a regex to extract all the LinkedIn links.

def parse_links(self):
    """
    parse the saved HTML for LinkedIn profile links using regex
    :return: a list of LinkedIn profile URLs
    """
    reg_links = re.compile(r"url=https:\/\/www\.linkedin\.com(.*?)&")
    self.temp = reg_links.findall(self.all_htmls)
    results = []
    for link in self.temp:
        results.append("https://www.linkedin.com" + link)
    return results
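To see what the regex actually captures, here is a quick check against a made-up fragment of a Google results page (the href below is invented for illustration):

```python
import re

# invented snippet resembling a Google results link
sample_html = '<a href="/url?q=x&url=https://www.linkedin.com/in/jane-doe&sa=U">'

reg_links = re.compile(r"url=https:\/\/www\.linkedin\.com(.*?)&")
matches = reg_links.findall(sample_html)
links = ["https://www.linkedin.com" + m for m in matches]
print(links)  # ['https://www.linkedin.com/in/jane-doe']
```

The capture group grabs everything between linkedin.com and the next &, which is the profile path.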

Similarly, the parse_people function searches the HTML for each person's name and title.

def parse_people(self):
    """
    parse the saved HTML for LinkedIn profile names using regex
    :return: a list of names and titles
    """
    reg_people = re.compile(r'">[a-zA-Z0-9._ -]* -|\| LinkedIn')
    self.temp = reg_people.findall(self.all_htmls)
    results = []
    for iteration in self.temp:
        # strip out the markup and LinkedIn boilerplate around each name
        person = iteration.replace(' | LinkedIn', '')
        person = person.replace(' - LinkedIn', '')
        person = person.replace(' profiles ', '')
        person = person.replace('LinkedIn', '')
        person = person.replace('"', '')
        person = person.replace('>', '')
        person = person.strip('- ')
        if person:
            results.append(person)
    return results
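The regex and the cleanup steps above can be exercised on an invented result title to see what survives:

```python
import re

# invented title text as it might appear in a Google results page
sample = '">Jane Doe - Senior Engineer at Tesla | LinkedIn'

reg_people = re.compile(r'">[a-zA-Z0-9._ -]* -|\| LinkedIn')
hits = reg_people.findall(sample)
print(hits)  # ['">Jane Doe -', '| LinkedIn']

# the same cleanup idea as in parse_people, applied to the first hit
name = hits[0].replace('"', '').replace('>', '').strip('- ')
print(name)  # Jane Doe
```

Note that the first alternative stops at the first " -" it can, so the person's name is captured but the job title after the dash is not.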

Here is an example of using the class to search for 500 profiles of people at Tesla.

ls = LinkedinScraper(keyword="Tesla", limit=500)
ls.search()
links = ls.parse_links()
profiles = ls.parse_people()

This is quite a simple script, but it should be a good starting point. It is missing error and captcha handling for when too many requests are made to Google. I recommend using a Google Search API such as https://goog.io or the RapidAPI Google Search API to perform unlimited searches.

You can find the full code at https://github.com/googio/linkedin_scraper.git

This code is fast, but making too many requests to Google will get your IP blocked. Please use proxies when running this script, or check out the goog.io API docs at https://goog.io/docs on performing searches without worrying about getting blocked.
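As a sketch of the proxy suggestion, requests accepts a proxies dict per request. The addresses below are placeholders you would replace with your own proxy pool:

```python
import random
import requests

# placeholder proxy endpoints -- substitute real ones from your provider
PROXY_POOL = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:1080'},
]

def fetch_through_proxy(url, headers=None):
    # rotate proxies so the requests are spread across several IPs
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, headers=headers, proxies=proxy, timeout=10)
```

You could swap this helper in for the requests.get call inside search so each Google query goes out through a different IP.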