2 Conceptual solution: a ‘canonical’ URL
As determined above, the fact that several URLs lead to the same content is a problem, but it can be solved. A human working at a publication will normally be able to tell you quite easily what the ‘correct’ URL for a certain article should be. The funny thing is, though, sometimes when you ask three people in the same company, they’ll give three different answers…
That’s a problem that needs solving in those cases, because in the end, there can be only one (URL). That ‘correct’ URL for a piece of content has been dubbed the Canonical URL by the search engines.
3 Identifying duplicate content issues
You might not know whether you have a duplicate content issue on your site or with your content. Let me give you some methods of finding out whether you do.
3.1 Google Webmaster Tools
Google Webmaster Tools is a great tool for identifying duplicate content. If you go into Google Webmaster Tools for your site and check under Search Appearance » HTML Improvements, you’ll see a report of pages with duplicate titles and duplicate meta descriptions.
If pages have duplicate titles or duplicate descriptions, that’s almost never a good thing. Clicking on an entry in the report will reveal the URLs that have duplicate titles or descriptions and will help you identify the problem. The catch is that if you have an article like the one about keyword X, and it shows up in two categories, the titles might be different. They might, for instance, be ‘Keyword X – Category X – Example Site’ and ‘Keyword X – Category Y – Example Site’. Google won’t pick those up as duplicate titles, but you can find them by searching.
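If you can export a list of URLs and titles from your own site, you can also spot these near-duplicates yourself. The following is a minimal Python sketch, not a finished tool. It assumes that the article name is the first part of the title and that the parts are separated by ‘ – ’; the function name and the example URLs are made up for illustration.

```python
from collections import defaultdict


def group_by_leading_title(pages, separator=" – "):
    """Group URLs by the first segment of their title (assumed to be the article name)."""
    groups = defaultdict(list)
    for url, title in pages.items():
        leading = title.split(separator)[0].strip().lower()
        groups[leading].append(url)
    # Only keep article names that show up on more than one URL.
    return {name: sorted(urls) for name, urls in groups.items() if len(urls) > 1}


# Hypothetical export of URL -> title pairs.
pages = {
    "http://example.com/category-x/keyword-x/": "Keyword X – Category X – Example Site",
    "http://example.com/category-y/keyword-x/": "Keyword X – Category Y – Example Site",
    "http://example.com/about/": "About – Example Site",
}

duplicates = group_by_leading_title(pages)
for name, urls in duplicates.items():
    print(name, urls)
```

Every article name that comes back with more than one URL is a candidate for a duplicate content issue that the duplicate titles report would miss.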
3.2 Searching for titles or snippets
There are several search operators that are very helpful for cases like these. If you want to find all the URLs on your site that contain your keyword X article, you’d type the following search phrase into Google:
site:example.com intitle:"Keyword X"
Google will then show you all pages on example.com that have that keyword in their title. The more specific you make that intitle part, the easier it is to weed out duplicate content. You can use the same method to identify duplicate content across the web. Let’s say the full title of your article was ‘Keyword X – why it is awesome’; you’d search for:
intitle:"Keyword X - why it is awesome"
And Google would give you all sites that match that title. Sometimes it’s even worth searching for one or two complete sentences from your article, as some scrapers might change the title. In some cases, when you do a search like that, Google might show a notice on the last page of results saying that it has omitted some entries very similar to the ones already displayed, with a link to repeat the search with the omitted results included.
This is a sign that Google is already ‘de-duping’ the results. It’s still not good, so it’s worth clicking the link and looking at all the other results to see whether you can fix some of those.
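If you have a lot of articles to check, typing these searches by hand gets old quickly. Here is a small Python sketch that builds the two kinds of search phrases shown above and turns them into a search URL you can open in your browser. The helper names are my own, and the sketch only builds the strings; it doesn’t fetch any results.

```python
from urllib.parse import quote_plus


def title_query(title, site=None):
    """Build an intitle: search phrase, optionally limited to one site."""
    query = f'intitle:"{title}"'
    if site:
        query = f"site:{site} {query}"
    return query


def google_search_url(query):
    """Turn a search phrase into a URL you can paste into a browser."""
    return "https://www.google.com/search?q=" + quote_plus(query)


on_site = title_query("Keyword X", site="example.com")
across_web = title_query("Keyword X - why it is awesome")

print(on_site)
print(across_web)
print(google_search_url(on_site))
```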
4 Practical solutions for duplicate content
Once you’ve decided which URL is the canonical URL for your piece of content, you have to start a process of canonicalization (yeah I know, try to say that three times out loud fast). This basically just means we have to let the search engine know about the canonical version of a page and let it find it ASAP. There are four methods of solving the problem, in order of preference:
- Not creating duplicate content
- Redirecting duplicate content to the canonical URL
- Adding a canonical link element to the duplicate page
- Adding an HTML link from the duplicate page to the canonical page
4.1 Avoiding duplicate content
Some of the above causes for duplicate content have very simple fixes to them:
- Session IDs in your URLs?
These can often just be disabled in your system’s settings.
- Have duplicate printer friendly pages?
These are completely unnecessary: you should just use a print style sheet.
- Using comment pagination in WordPress?
This feature should just be disabled (under settings » discussion) on 99% of sites.
- Parameters in a different order?
Tell your programmer to build a script that always puts parameters in the same order (this is often referred to as a URL factory).
- Tracking links issues?
In most cases, you can use hash-based campaign tracking instead of parameter-based campaign tracking.
- WWW vs non-WWW issues?
Pick one and stick with it by redirecting the one to the other. You can also set a preference in Google Webmaster Tools, but you’ll have to claim both versions of the domain name.
If your problem isn’t that easily fixed, it might still be worth it to put in the effort and to prevent the duplicate content from appearing altogether. It’s by far the best solution to the problem.
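To give your developers an idea of what a URL factory could look like, here is a rough Python sketch that covers three of the fixes above: it drops session IDs, sorts the remaining parameters, and moves campaign tracking from the query string to the hash. The parameter names (`sid`, `sessionid`, `utm_source` and so on) are assumptions, so swap in whatever your system actually uses. Hash-based tracking also assumes your analytics package is set up to read campaign data from the hash.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; adjust these to your own system.
SESSION_PARAMS = {"sid", "sessionid"}
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}


def normalize_url(url):
    """Return the URL with session IDs removed, parameters sorted,
    and campaign tracking moved from the query string to the hash."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)

    content = sorted(
        (key, value)
        for key, value in params
        if key not in SESSION_PARAMS and key not in TRACKING_PARAMS
    )
    tracking = sorted((key, value) for key, value in params if key in TRACKING_PARAMS)

    # Keep an existing hash unless there is tracking data to put there.
    fragment = urlencode(tracking) if tracking else parts.fragment
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(content), fragment)
    )


print(normalize_url("http://example.com/page/?b=2&a=1&sid=abc123&utm_source=newsletter"))
# prints http://example.com/page/?a=1&b=2#utm_source=newsletter
```

If every link on your site is generated through a function like this, `?a=1&b=2` and `?b=2&a=1` can no longer end up as two different URLs for the same page.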
4.2 301 Redirecting duplicate content
In some cases, it’s impossible to entirely prevent the system you’re using from creating wrong URLs for content, but sometimes it is possible to redirect them. If this isn’t logical to you (which I can understand) do keep it in mind while talking to your developers. Also, if you do get rid of some of the duplicate content issues altogether, make sure that you redirect all the old duplicate content URLs to the proper canonical URLs.
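To make that conversation with your developers a bit easier, here is what a 301 redirect boils down to, sketched as a small piece of Python WSGI middleware. The mapping of duplicate paths to canonical paths is made up for the example, and in practice you’d often do this in your web server’s configuration or your CMS instead.

```python
# Hypothetical mapping of duplicate paths to their canonical path.
REDIRECTS = {
    "/category-y/keyword-x/": "/keyword-x/",
    "/print/keyword-x/": "/keyword-x/",
}


def redirect_duplicates(app):
    """WSGI middleware: answer known duplicate paths with a 301 redirect."""

    def middleware(environ, start_response):
        target = REDIRECTS.get(environ.get("PATH_INFO", ""))
        if target is None:
            # Not a known duplicate, so let the real site handle it.
            return app(environ, start_response)
        start_response(
            "301 Moved Permanently",
            [("Location", target), ("Content-Length", "0")],
        )
        return [b""]

    return middleware


def site(environ, start_response):
    """Stand-in for the real site, only here to make the sketch runnable."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"article"]


application = redirect_duplicates(site)
```

The important bits are the `301 Moved Permanently` status and the `Location` header pointing at the canonical URL; everything else is plumbing.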
4.3 Using rel="canonical" links
Sometimes you don’t want to or can’t get rid of a duplicate version of an article, but you do know that it’s the wrong URL. For that specific issue, the search engines have introduced the canonical link element. It’s placed in the <head> section of your site and it looks like this:
<link rel="canonical" href="http://example.com/wordpress/seo-plugin/">
In the href attribute of the canonical link, you place the correct canonical URL for your article. When Google (or any other search engine that supports it) finds this link element, it does what is basically a soft 301 redirect: it transfers most of the link value gathered by that page to your canonical page.
This process is a bit slower than the 301 redirect though, so if you can do a 301 redirect that would be preferable, as mentioned by Google’s John Mueller.
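If you want to check which canonical URL a page actually declares, you don’t need much. Below is a minimal Python sketch that pulls the canonical link element out of a piece of HTML using only the standard library. The class and function names are my own, and the sketch simply returns the first canonical link it finds, without checking that it sits in the <head>.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        # rel can hold several space-separated values.
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")


def find_canonical(html):
    """Return the declared canonical URL, or None if the page has none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = (
    "<html><head><title>SEO plugin</title>"
    '<link rel="canonical" href="http://example.com/wordpress/seo-plugin/">'
    "</head><body></body></html>"
)
print(find_canonical(page))
# prints http://example.com/wordpress/seo-plugin/
```

Run something like this over the duplicate URLs you found earlier: each of them should report the same canonical URL.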
4.4 Linking back to the original content
If you can’t do any of the above, possibly because you don’t control the <head> section of the site your content appears on, adding a link back to the original article on top of or below the article is always a good idea. This might be something you want to do in your RSS feed: add a link back to the article in it. Some scrapers will filter that link out, but some others might leave it in. If Google encounters several links pointing to your article it will figure out soon enough that that’s the actual canonical version of the article.
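As an illustration of the RSS idea, here is a small Python sketch that appends a link back to the original article to the HTML of a feed item. The function name and the wording of the footer are just assumptions for the example; most publishing systems have a setting or plugin that does this for you.

```python
from html import escape


def add_link_back(item_html, title, url, site_name):
    """Append a link to the original article to the HTML of a feed item."""
    footer = '<p>The post <a href="{url}">{title}</a> appeared first on {site}.</p>'.format(
        url=escape(url, quote=True),
        title=escape(title),
        site=escape(site_name),
    )
    return item_html + footer


item = add_link_back(
    "<p>Keyword X is awesome because...</p>",
    title="Keyword X - why it is awesome",
    url="http://example.com/keyword-x/",
    site_name="Example Site",
)
print(item)
```

Scrapers that republish the feed as-is will then carry a link to your canonical URL along with the content.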
5 Conclusion: duplicate content is fixable, and should be fixed
Duplicate content happens everywhere. I have yet to encounter a site of more than 1,000 pages that hasn’t got at least a tiny duplicate content problem. It’s something you need to keep an eye on at all times. It is fixable though, and the rewards can be plentiful. Your quality content might soar in the rankings by just getting rid of duplicate content on your site. Of course, if you need help with identifying these issues, helping your developers come up with solutions to your duplicate content issues or even solving them for you, you can always order a Website Review.