
SEO for PDF and other non-HTML files

Do PDF files have any SEO value?

PDF files often hold lots of great original content, but they’re hard to navigate in a web browser, and it’s hard for users to find the information they want.

Google does index them: it can read PDF files and the content in them. It can also follow the links within those files, so if it finds a new URL that it doesn’t already know about, it adds that URL to its crawl queue and crawls/indexes it later on.

But even this year, Google stated that links in PDF files don’t pass SEO value. They say this is because it’s impossible to add a “nofollow” attribute to PDF links to signify that you don’t give your trust/authority to those links, so they treat all of those links as “nofollow”. That said, back in 2015 Google actually said that links in PDF files do pass authority and PageRank. So they’ve pretty much done a 180 on what they said before.

What I would say about this is, treat them just like any other web page that has nofollowed links. If we take Wikipedia, for example, every external link on Wikipedia has a nofollow attribute, saying not to pass authority/trust to the linked website. But because Wikipedia is so well edited these days, and the spam links do not stay on there for very long, I think Google probably does count Wikipedia links as a trust signal. So I would say that nofollow is more of a suggestion to Google than a command not to pass on value. There could still be some value in these PDF files, and they can even rank in Google just as well as their HTML equivalents. Quite often, I see PDF files outranking web pages, especially on technical and academic searches, where someone might be looking for a hardware/electronics manual. But I wouldn’t say that they’re great for SEO.

If PDFs can rank in Google, why are they so bad for SEO?

They’re not entirely bad, but it’s very unlikely that the PDF file will pass any link authority on to other pages. So the PDF file is basically an SEO black hole: link authority goes in, but it can’t come out again.

It’s also harder to manage specific niche content topics when you’ve got a PDF file, because all of the information is within a single document. So if we take, for example, a TV manual that might be uploaded as a PDF, there could be some very specific details about how that TV works. The file could rank well for that kind of thing, but it is hard to navigate (and hard to convert visitors from). If the content was moved to HTML pages, it could be split into very niche topic pages, and the link authority going into those pages could then flow back out to your other pages.

So it’s more of a case of these PDF files not helping the rest of your site or converting, rather than being a huge SEO issue for the content itself.

How can you track PDF file views in Google Analytics?

Direct PDF views aren’t tracked in Google Analytics. The only way to track views of them in GA is to add “on click” tracking events to your internal links, but that doesn’t track any traffic that arrives at the PDF from other sites or directly. So if you have a PDF file ranking in Google and someone clicks through to that PDF, you wouldn’t really know about it in Google Analytics. GA uses JavaScript to track people and recognize that a page has been looked at, but PDF files can rarely use JavaScript. So there’s an issue there again: they become a black hole where traffic goes in, but you don’t know if it ever comes out again. And if it does come out again, you don’t know where it has been. I think that it would just be seen as a new session, a new user.

You can only really track PDF files with precision in your server log files, which not many people look at or have access to.
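As a rough illustration of what log-based tracking can look like, here is a minimal Python sketch that counts successful PDF requests. It assumes your server writes access logs in the common/combined log format; the sample lines, paths and IP addresses are made up for the example, and it makes no attempt to filter out bots.

```python
import re
from collections import Counter

# Matches the request and status fields of a common/combined log format line,
# e.g. ... "GET /manuals/drill.pdf HTTP/1.1" 200 ...
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def count_pdf_views(log_lines):
    """Count GET requests for .pdf URLs that returned 200, keyed by path."""
    views = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        # Drop any query string so /file.pdf?src=email counts as /file.pdf.
        path = match.group("path").split("?", 1)[0]
        # Simplification: only full 200 responses are counted as views.
        if path.lower().endswith(".pdf") and match.group("status") == "200":
            views[path] += 1
    return views

# Hypothetical sample lines; in practice, iterate over your real access log.
sample = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /manuals/drill.pdf HTTP/1.1" 200 48213 "https://www.google.com/" "Mozilla/5.0"',
    '203.0.113.9 - - [10/Oct/2023:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2023:13:57:12 +0000] "GET /manuals/drill.pdf?src=email HTTP/1.1" 200 48213 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [10/Oct/2023:13:58:40 +0000] "GET /manuals/saw.pdf HTTP/1.1" 404 312 "-" "Mozilla/5.0"',
]
print(count_pdf_views(sample))
```

The referrer field in each line is also where you would see whether a view came from a Google search, which is exactly the traffic that Google Analytics misses.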

Alternatively, a friend of mine, Carl Hendy, has a plugin for WordPress (and therefore WooCommerce) that can track PDF views. It works by using the Google Analytics API to record the fact that someone has viewed that PDF file. It works completely server-side and doesn’t rely on any JavaScript. So it’s a bit like server logs, but with the advantage of it being logged in Google Analytics itself.
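To give a feel for the general server-side idea, here is a minimal Python sketch. It is not the plugin’s code, and how the plugin is actually implemented isn’t covered here. The sketch assumes a GA4 property and its Measurement Protocol; the event name `pdf_view`, the parameter `file_path`, the measurement ID, the API secret and the client ID are all placeholders invented for the example. It only builds the request and leaves the sending commented out.

```python
import json
import urllib.parse
import urllib.request

GA_ENDPOINT = "https://www.google-analytics.com/mp/collect"

def build_pdf_view_request(measurement_id, api_secret, client_id, pdf_path):
    """Build (but do not send) a GA4 Measurement Protocol request for one PDF view."""
    query = urllib.parse.urlencode(
        {"measurement_id": measurement_id, "api_secret": api_secret}
    )
    payload = {
        "client_id": client_id,
        "events": [
            # "pdf_view" and "file_path" are made-up names for this sketch.
            {"name": "pdf_view", "params": {"file_path": pdf_path}}
        ],
    }
    return urllib.request.Request(
        f"{GA_ENDPOINT}?{query}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder credentials and IDs, for illustration only.
request = build_pdf_view_request(
    "G-XXXXXXX", "placeholder-secret", "555.123", "/manuals/drill.pdf"
)
print(request.full_url)
# To actually send it from the server that serves the PDF:
# urllib.request.urlopen(request)
```

The server would call something like this whenever it serves a PDF, so the view is recorded even though no JavaScript runs inside the file.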

What about other file formats, such as Microsoft Word?

Google can crawl/index Microsoft Word documents as well, if they’ve been uploaded to your website and linked to. It will also follow any links in the document and index the content within it.

But the rules for PDF files also apply to Microsoft Word files. The content can be indexed and it can rank in search results, but the links in it won’t necessarily pass any SEO value.

You also need to think about compatibility. Not everyone uses Microsoft Office these days and only a certain percentage of devices can open MS Word files. If you have an iPhone, for example, it’s very hard to view a Microsoft Word file on that phone. You may need to look at converting those files into a PDF or HTML format.

Excel spreadsheets can even be indexed and appear in Google search results. So can PowerPoint presentations, plain text files etc. But should you be using these file formats? If it’s something that somebody will only print out – then maybe yes. But in all other situations, absolutely not.

What should you do, if you already use PDF files?

I worked on a website called Screwfix a few years ago, which is a large DIY / Home Improvement supplier. They sell direct to consumer, as well as to trades (such as plumbers, builders and electricians).

They had thousands of PDF manuals for the drills, power saws and other equipment that they sold. And for some reason, these were hosted on a totally different website/domain. But that domain had really strong links going into it and decent traffic as well. I advised them to move all of the PDF files onto the main Screwfix.com website and 301 redirect all of the old URLs over to the new location. Plus, if possible, add a link to Screwfix within each PDF file (this can be automated). In some cases, they had the only online copy of these manuals: the manufacturers didn’t offer them online, so Screwfix was the only source. But it was on a completely white-labelled domain name, which didn’t really help them in any way.

I think it was too big of a task for them at the time and hard for us to estimate the ROI to justify development resource. It’s probably still in their queue of work to be done, but it definitely is something that I would recommend to other people as well, if you do use PDFs.
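The redirect part of that kind of migration can be quite mechanical. As a minimal sketch, assuming the file paths stay the same and only the host changes, a one-to-one mapping could look like this in Python. The host names and the `/manuals` prefix are hypothetical, since the real old domain isn’t named here.

```python
from urllib.parse import urlsplit

# Hypothetical host and base URL; the real domains are not given above.
OLD_HOST = "manuals.example.net"
NEW_BASE = "https://www.example.com/manuals"

def redirect_for(old_url):
    """Return (status, location) for an old PDF URL, or None if it should not be redirected."""
    parts = urlsplit(old_url)
    if parts.hostname != OLD_HOST or not parts.path.lower().endswith(".pdf"):
        return None
    # Keep the file path unchanged so every old URL maps to exactly one new URL.
    return 301, NEW_BASE + parts.path

print(redirect_for("http://manuals.example.net/drills/model-123.pdf"))
```

Redirecting each old URL to its own new URL, rather than sending everything to the homepage, is what lets the existing links keep pointing at the matching manual.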

Can you upload content as both a PDF file and a web page, on the same site?

That wouldn’t be recommended. If you copy the PDF content and publish it on a webpage, you’re going to have an issue with duplicate content. So you want to do one of two things:

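  1. “noindex” the PDF URL with an X-Robots-Tag HTTP header, telling Google not to index that copy. Don’t rely on your robots.txt file for this: a Disallow rule there stops crawling rather than indexing, and Google doesn’t support a noindex directive in robots.txt.
  2. Add a canonical Link HTTP header (rel="canonical") on the PDF URL, telling Google that the main version is the HTML page.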
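Because a PDF has no HTML head section to hold meta tags, both options have to be sent as HTTP response headers. Here is a minimal Python sketch of the two header values; the function name and the example URL are made up, and how you attach the headers depends on your web server or framework.

```python
def pdf_response_headers(mode, html_url=None):
    """Return extra HTTP response headers for a PDF that duplicates an HTML page.

    mode="noindex"   -> ask search engines not to index the PDF.
    mode="canonical" -> point search engines at the HTML version instead.
    """
    if mode == "noindex":
        return {"X-Robots-Tag": "noindex"}
    if mode == "canonical":
        if not html_url:
            raise ValueError("canonical mode needs the URL of the HTML page")
        return {"Link": f'<{html_url}>; rel="canonical"'}
    raise ValueError(f"unknown mode: {mode!r}")

# Hypothetical URL of the HTML version of the same content.
print(pdf_response_headers("noindex"))
print(pdf_response_headers("canonical", "https://www.example.com/manuals/drill"))
```

Pick one of the two for a given PDF rather than both: a noindex and a canonical on the same URL send Google conflicting signals.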

Is it simply better to have this content on a web page?

There’s definitely more value in having an HTML page. If you transfer all the content out of that PDF and into individual topic pages, you’ve got so many more options. Plus, if you transfer the images as well (let’s take a power drill manual, for example), all of those images can also rank in Google Images. So you’ve got a lot more options when it comes to HTML pages.

The old advantage of having a PDF file was that you could download it to read/print offline, in a consistent cross-device format. You can do this with HTML web pages now as well though. You can even have a dedicated print CSS stylesheet for your web pages, to format the print version nicely.


Please Note: The content above is a semi-automated transcription of the podcast episode. We recommend listening (and subscribing) to the podcast, in case any of the content above is unclear.
