Scraping Backlinks With Advanced Google Search Operators and Python

In Googio’s Useful Google Advanced Search Operators For SEO we discuss how to identify and find guest post backlinking opportunities. One tactic for quickly finding high Domain Authority sites with a history of linking to pages about your keyword is to use advanced Google search operators. The same method can be used to find your competitors’ backlinks. The manual process is covered in depth by the guides at Googio’s SEO Blog.

In this article, I’ll show how to automatically use Google search operators to find backlinks with Python. We will create a Python script that scrapes and exports a list of guest post opportunities you can use to build high-quality backlinks. This guide assumes you’ve already completed keyword research and identified competitors that rank well in the search results for those queries. The same method can also be used to gather competitors’ backlinks for analysis.

This guide builds on how to scrape Google with Python and reuses some of its code, especially for parsing Google search results. The requirements are Python 3+ and two Python libraries: requests and bs4.

To install the two libraries, run:

pip install requests bs4

We will call our script scrape.py. To use both libraries in our script, we need to import them, along with urllib.parse from the standard library, which we’ll use to encode the search query.

import urllib.parse

import requests
from bs4 import BeautifulSoup

When we want to find backlink opportunities for guest posting, we need to find websites that allow guest posting or user-created content. We will use a combination of search operators and footprints to find such websites. Some examples of footprints are:

"write for us"
"guest blogger"
"become a guest blogger"
inurl:guestbook.html

To find websites related to your topic, combine a footprint with your keywords, as shown below. The repo contains a complete list of Google footprints for guest posts and many other platforms.
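For example, pairing the "write for us" footprint with a keyword (the keyword below is just a placeholder) produces the query:

"write for us" vegan recipes

The quotes force Google to match the exact phrase, so every result is a page that mentions "write for us" and is relevant to your keyword.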

The next part of our script builds a Google-formatted query that combines a footprint with our keyword. The function takes two arguments, a footprint and a keyword. A large part of this function is simply parsing the Google results and returning them as a list.

def query(footprint: str, keyword: str):
    """
    Return Google search results for a footprint and keyword combination.
    :param footprint: a search footprint, e.g. "write for us"
    :param keyword: the topic keyword to combine with the footprint
    :return: a list of dicts containing each result's title and link
    """
    # desktop user-agent
    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
    # mobile user-agent (swap into HEADERS if you want mobile results)
    MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
    HEADERS = {"user-agent": USER_AGENT}

    results = []

    # fetch the top 300 results for this footprint and keyword,
    # 100 per request (Google limits the results to 300)
    for start in range(0, 300, 100):
        params = urllib.parse.urlencode({
            "q": "{} {}".format(footprint, keyword),
            "num": 100,
            "start": start,
        })
        url = f"https://google.com/search?{params}"

        resp = requests.get(url, headers=HEADERS)

        if resp.status_code == 200:
            # use BeautifulSoup to parse the HTML
            soup = BeautifulSoup(resp.content, "html.parser")
            # each organic result sits in a div with class "rc"
            gs = soup.find_all("div", class_="rc")
            # grab each result's title and link
            for g in gs:
                anchors = g.find_all("a")
                heading = g.find("h3")
                if anchors and heading:
                    link = anchors[0]["href"]
                    title = heading.text
                    results.append({"title": title, "link": link})
            # fewer than 10 results means we have reached the last page
            if len(gs) < 10:
                return results
        # check if we got a captcha
        elif resp.status_code == 429 or (
            "Our systems have detected unusual traffic from your computer network."
            in resp.text
        ):
            raise ValueError("Ran into captcha. Please use a proxy.")

    return results

That’s it. Simply call the function query to grab the top 300 results for any combination of footprint and keyword. The next step is to build a large list of footprints, loop through it, and collect thousands of backlink opportunities from guest posts, directories, bookmarks, .edu sites, pingbacks, microblogs, indexers, and more, as sketched below.
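Here is a minimal sketch of that loop. The footprints, keyword, and output filename below are placeholders, not values from the original guide; swap in your own list and keyword:

import csv

footprints = ['"write for us"', '"guest blogger"', 'inurl:guestbook.html']
keyword = "vegan recipes"  # placeholder: use your own keyword

opportunities = []
for footprint in footprints:
    try:
        opportunities.extend(query(footprint, keyword))
    except ValueError as err:
        # hit a CAPTCHA; stop here (or retry through a proxy)
        print(err)
        break

# write the title/link pairs to a CSV file
with open("guest_post_opportunities.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "link"])
    writer.writeheader()
    writer.writerows(opportunities)

Using csv.DictWriter keeps the export aligned with the {"title": ..., "link": ...} dicts that query returns.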

This advanced Google search operators list is powerful when used correctly, and the Python script above is especially useful for SEO. However, these advanced operators come at a cost: after a few queries you will be faced with CAPTCHAs, which get annoying quickly and sharply limit the number of backlinks you can scrape with this method. If you plan on making many advanced searches, I recommend using proxies or services like Google Search API Alternative and RapidAPI’s Google Search API for bulk searches. Googio and RapidAPI provide a service for unlimited Google Search without having to deal with annoying CAPTCHAs.
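If you do go the proxy route, requests accepts a proxies mapping. A minimal sketch, assuming you have your own proxy endpoint (the URL and credentials below are placeholders):

# placeholder proxy endpoint; substitute your provider's details
PROXIES = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

# route the search request through the proxy
resp = requests.get(url, headers=HEADERS, proxies=PROXIES)

Rotating through a pool of such proxies spreads your queries across IP addresses, which is what keeps the CAPTCHAs at bay.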