IMPORTXML is a very helpful function that can be used in Google Sheets to effectively crawl and scrape website data in small quantities (especially useful for grabbing titles and meta descriptions, etc.). It can be faster and more convenient that using Screaming Frog or other tools, especially if you only need to pull data for a handful of URLs. This post will show you how to use IMPORTXML with XPath to crawl website data including: metadata, Open Graph markup, Twitter Cards, canonicals and more.
Skip Ahead: Get the free template.
Setting Up The IMPORTXML Formula
This is the IMPORTXML formula:
You can see there are two parts and they’re both quite simple:
The first half of the formula just indicates what URL is going to be crawled. This can be an actual URL – but it’s much easier to reference a cell in the spreadsheet and paste the URL there.
The second half of the formula is going to use XPath to tell the formula what data is going to be scraped. XPath is essentially a language that is used to identify specific parts of a document (like a webpage). Subsequent paragraphs will provide different XPath formulas for different pieces of information you might want to scrape.
Crawling Metadata with IMPORTXML
The following XPath formulas will scrape some of the most commonly desired SEO data like metadata, canonical tags, and H headings. Note that you can scrape any level of H heading by replacing the “h1” with whichever heading you want to scrape (h2, h3, etc.)
- Title Tags: //title/text()
- Meta Descriptions: //meta[@name=’description’]/@content
- Canonical Tags: //link[@rel=’canonical’]/@href
- H1 Heading(s): //h1/text()
- H2 Heading(s): //h2/text()
While social markup has no immediate SEO benefit, it is very important for sites that have active audiences on social media, and implementation of social markup often falls under the umbrella of SEO because of its technical nature. The following XPath formulas will allow you to scrape Open Graph and Twitter Card markup.
Open Graph Markup
Open Graph is used by Facebook, LinkedIn and Pinterest, so all the more reason to make sure it’s implemented correctly.
- OG Title: //meta[@property=’og:title’]/@content
- OG Description: //meta[@property=’og:description’]/@content
- OG Type: //meta[@property=’og:type’]/@content
- OG URL: //meta[@property=’og:url’]/@content
- OG Image: //meta[@property=’og:image’]/@content
- OG Site Name: //meta[@property=’og:site_name’]/@content
- OG Locale: //meta[@property=’og:locale’]/@content
Twitter Card Data
Twitter Card markup is only for….Twitter. Still important though!
- Twitter Title: //meta[@name=’twitter:title’]/@content
- Twitter Description: //meta[@name=’twitter:description’]/@content
- Twitter Image: //meta[@name=’twitter:image’]/@content
- Twitter Card Type: //meta[@name=’twitter:card’]/@content
- Twitter Site: //meta[@name=’twitter:site’]/@content
Unfortunately, IMPORTXML & Sheets cannot be used to scrape large quantities of data at scale, or it will stop functioning. For more than a handful of URLs, it’s recommended to use a more robust program like Screaming Frog (Screaming Frog does not have a URL limit when using it in list mode).
IMPORTXML Google Sheets Template
You can see how this works firsthand by making a copy of this Sheets Scraper Template and entering the URL of your choice in cell B6. To add additional URLs, copy & paste row 6, then enter a different URL.
Questions? Contact me here or reach out on Twitter!