
How To Use IMPORTXML & Google Sheets to Scrape Sites

By Chris Berkley | December 31, 2017 (updated October 3rd, 2019) | SEO, Technical SEO

IMPORTXML is a very helpful Google Sheets function that can be used to effectively crawl and scrape website data in small quantities (it's especially useful for grabbing titles, meta descriptions, etc.). It can be faster and more convenient than using Screaming Frog or other tools, especially if you only need to pull data for a handful of URLs. This post will show you how to use IMPORTXML with XPath to crawl website data including: metadata, Open Graph markup, Twitter Cards, canonicals and more.

Skip Ahead: Get the free template.

Setting Up The IMPORTXML Formula

This is the IMPORTXML formula:

=IMPORTXML(url,xpath_query)

You can see there are two parts and they’re both quite simple:

The first half of the formula just indicates what URL is going to be crawled. This can be an actual URL – but it’s much easier to reference a cell in the spreadsheet and paste the URL there.

The second half of the formula is going to use XPath to tell the formula what data is going to be scraped. XPath is essentially a language that is used to identify specific parts of a document (like a webpage). Subsequent paragraphs will provide different XPath formulas for different pieces of information you might want to scrape.
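For example, assuming a URL has been pasted into cell A2 (the cell reference here is arbitrary), the formula to pull that page's title tag would look like this:

=IMPORTXML(A2,"//title/text()")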

Crawling Metadata with IMPORTXML

The following XPath formulas will scrape some of the most commonly desired SEO data, like metadata, canonical tags, and headings. Note that you can scrape any heading level by replacing the "h1" with whichever heading you want to scrape (h2, h3, etc.)

Title Tags: //title/text()
Meta Descriptions: //meta[@name='description']/@content
Canonical Tags: //link[@rel='canonical']/@href
H1 Heading(s): //h1/text()
H2 Heading(s): //h2/text()
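Note that these XPath formulas contain single quotes, so when dropping one into IMPORTXML, wrap the whole XPath in double quotes. For example, with a URL in cell A2 (again, the cell reference is arbitrary), the meta description formula would be:

=IMPORTXML(A2,"//meta[@name='description']/@content")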

Social Markup

While social markup has no immediate SEO benefit, it is very important for sites that have active audiences on social media, and implementation of social markup often falls under the umbrella of SEO because of its technical nature. The following XPath formulas will allow you to scrape Open Graph and Twitter Card markup.

Open Graph Markup

Open Graph is used by Facebook, LinkedIn and Pinterest, so all the more reason to make sure it’s implemented correctly.

OG Title: //meta[@property='og:title']/@content
OG Description: //meta[@property='og:description']/@content
OG Type: //meta[@property='og:type']/@content
OG URL: //meta[@property='og:url']/@content
OG Image: //meta[@property='og:image']/@content
OG Site Name: //meta[@property='og:site_name']/@content
OG Locale: //meta[@property='og:locale']/@content
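As with the metadata formulas, these drop into the second half of IMPORTXML. For example, pulling the OG image with a URL in cell A2 (arbitrary reference):

=IMPORTXML(A2,"//meta[@property='og:image']/@content")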

Twitter Card Data

Twitter Card markup is only for… Twitter. Still important though!

Twitter Title: //meta[@name='twitter:title']/@content
Twitter Description: //meta[@name='twitter:description']/@content
Twitter Image: //meta[@name='twitter:image']/@content
Twitter Card Type: //meta[@name='twitter:card']/@content
Twitter Site: //meta[@name='twitter:site']/@content
Twitter Creator: //meta[@name='twitter:creator']/@content
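For example, to check which Twitter Card type a page uses, with the URL in cell A2 (arbitrary reference):

=IMPORTXML(A2,"//meta[@name='twitter:card']/@content")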

Limitations

Unfortunately, IMPORTXML & Sheets cannot be used to scrape large quantities of data at scale, or it will stop functioning. For more than a handful of URLs, it’s recommended to use a more robust program like Screaming Frog (Screaming Frog does not have a URL limit when using it in list mode).

IMPORTXML Google Sheets Template

You can see how this works firsthand by making a copy of this Sheets Scraper Template and entering the URL of your choice in cell B6. To add additional URLs, copy & paste row 6, then enter a different URL.

Questions? Contact me here or reach out on Twitter!

Chris Berkley

Chris is a digital marketing consultant specializing in SEO and Analytics across industries including healthcare, education, finance and others.

6 Comments

  • I’ve found that ImportXML was unusable with too many formulas in one sheet so I decided to roll up my sleeves and make a much better function.

    My IMPORTFROMURL function does the same thing but, because it uses another engine and caches the data, you can easily make hundreds of calls in less than a minute. I’ve also added options to apply regex on the results.

    If you’re interested, I’ve made it available in the Chrome store

  • Vanessa says:

    Hi Chris! Wow, you really know your stuff! This is a great blog, love all the insight. I was wondering, would it be possible to pull multiple headings within a page? I’m trying to pull blog titles from a site but it looks like a PHP function is used to publish these blog posts on a category page, multiple post titles within their own div sections in the body of the webpage. I’d like to pull a list of these h2 titles but I can only pull the first h2 title. Have you ever faced any sort of similar challenge or would you have a workaround to recommend? Much appreciated, thank you! -V

    • Chris Berkley says:

      Yes, it’s possible. Just use =importxml(“https://domain.com/blog”,”//h2/text()”) and leave the cells underneath empty. It should pull all the H2s without issue.

  • jayesh says:

    Hi Chris,

    Thank you for the nice information. I want to scrape the product price from a given URL, can you please guide me how I can do that?
    Thanks

  • Winston says:

    Hi Chris,

    Great article! Really helpful for beginners (like me) to get started with IMPORT XML function.

    Just a quick question – right now, I’m pulling the meta description from certain Facebook posts, however it does not show the entirety of its contents and has “…”

    Is that a limitation of IMPORT XML or is there something I could fix on my end?

    Thanks!
    Winston

    • Chris Berkley says:

      Did you confirm the meta description is actually longer? It’s possible that Facebook truncates the actual description in the HTML tag itself.
