Duplicate content: Causes and solutions

Search engines like Google have a problem – it’s called ‘duplicate content’. Duplicate content means that similar content appears at multiple locations (URLs) on the web, and as a result search engines don’t know which URL to show in the search results. This can hurt the ranking of a webpage, and the problem only gets worse when people start linking to the different versions of the same content. This article will help you to understand the various causes of duplicate content, and to find the solution to each of them.

What is duplicate content?

Duplicate content is content which is available on multiple URLs on the web. Because more than one URL shows the same content, search engines don’t know which URL to list higher in the search results. Therefore they might rank both URLs lower and give preference to other webpages.

In this article, we’ll mostly focus on the technical causes of duplicate content and their solutions. If you’d like to get a broader perspective on duplicate content and learn how it relates to copied or scraped content or even keyword cannibalization, we’d advise you to read this post: What is duplicate content.

Let’s illustrate this with an example

Duplicate content can be likened to being at a crossroads where road signs point in two different directions for the same destination: Which road should you take? To make matters worse, the final destination is different too, but only ever so slightly. As a reader, you don’t mind because you get the content you came for, but a search engine has to pick which page to show in the search results because, of course, it doesn’t want to show the same content twice.

Let’s say your article about ‘keyword x’ appears at http://www.example.com/keyword-x/ and the same content also appears at http://www.example.com/article-category/keyword-x/. This situation is not fictitious: it happens in lots of modern Content Management Systems. Then let’s say your article has been picked up by several bloggers and some of them link to the first URL, while others link to the second. This is when the search engine’s problem shows its true nature: it’s your problem. The duplicate content is your problem because those links both promote different URLs. If they were all linking to the same URL, your chances of ranking for ‘keyword x’ would be higher.

If you don’t know whether your rankings are suffering from duplicate content issues, these duplicate content discovery tools will help you find out!

Causes of duplicate content

There are dozens of reasons for duplicate content. Most of them are technical: it’s not very often that a human decides to put the same content in two different places without making clear which is the original. Unless you’ve cloned a post and published it by accident of course. But otherwise, it feels unnatural to most of us.

There are many technical reasons, though, and duplicate content mostly happens because developers don’t think like a browser or even a user, let alone a search engine spider – they think like a programmer. Take that article we mentioned earlier, which appears on http://www.example.com/keyword-x/ and http://www.example.com/article-category/keyword-x/. If you ask the developer, they will say it only exists once.

Misunderstanding the concept of a URL

No, that developer hasn’t gone mad, they are just speaking a different language. A CMS will probably power the website, and in that database there’s only one article, but the website’s software just allows for that same article in the database to be retrieved through several URLs. That’s because, in the eyes of the developer, the unique identifier for that article is the ID that article has in the database, not the URL. But for the search engine, the URL is the unique identifier for a piece of content. If you explain that to a developer, they will begin to get the problem. And after reading this article, you’ll even be able to provide them with a solution right away.
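To make the mismatch between database IDs and URLs concrete, here is a minimal sketch in Python. The routing table and the article IDs are hypothetical; a real CMS would derive this mapping from its own routing rules. The function groups URL paths by the article ID they serve, so any ID with more than one path is a duplicate content candidate.

```python
from collections import defaultdict

# Hypothetical routing table: URL path -> database ID of the article it serves.
ROUTES = {
    "/keyword-x/": 42,
    "/article-category/keyword-x/": 42,
    "/another-post/": 43,
}

def find_duplicates(routes):
    """Group paths by article ID and keep only IDs reachable via several paths."""
    by_id = defaultdict(list)
    for path, article_id in routes.items():
        by_id[article_id].append(path)
    return {aid: sorted(paths) for aid, paths in by_id.items() if len(paths) > 1}

duplicates = find_duplicates(ROUTES)
```

For the developer there is one article (ID 42); for the search engine there are two URLs.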

Session IDs

You often want to keep track of your visitors and allow them, for instance, to store items they want to buy in a shopping cart. In order to do that, you have to give them a ‘session.’ A session is a brief history of what the visitor did on your site and can contain things like the items in their shopping cart. To maintain that session as a visitor clicks from one page to another, the unique identifier for that session – called the Session ID – needs to be stored somewhere. The most common solution is to do that with cookies. However, search engines don’t usually store cookies.

At that point, some systems fall back to using Session IDs in the URL. This means that every internal link on the website gets that Session ID added to its URL, and because that Session ID is unique to that session, it creates a new URL, and therefore duplicate content.

URL parameters used for tracking and sorting

Another cause of duplicate content is using URL parameters that do not change the content of a page, for instance in tracking links. You see, to a search engine, http://www.example.com/keyword-x/ and http://www.example.com/keyword-x/?source=rss are not the same URL. The latter might allow you to track what source people came from, but it might also make it harder for you to rank well – very much an unwanted side effect!

This doesn’t just go for tracking parameters, of course. It goes for every parameter you can add to a URL that doesn’t change the vital piece of content, whether that parameter is for ‘changing the sorting on a set of products’ or for ‘showing another sidebar’: all of them cause duplicate content.
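As a rough illustration, the Python sketch below removes parameters that do not change the main content, so that all variants collapse into one URL. The set of ignored parameter names (including a session ID parameter called sessionid) is an assumption made for this example; which parameters are safe to drop depends entirely on your own site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed parameter names that never change the main content of a page.
IGNORED_PARAMS = {"source", "sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}

def strip_ignored_params(url):
    """Return the URL without tracking, sorting and session parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )
```

With this sketch, http://www.example.com/keyword-x/?source=rss and http://www.example.com/keyword-x/ come out as the same URL, while parameters that are not on the list are left alone.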

Scrapers and content syndication

Most of the reasons for duplicate content are either your own ‘fault’ or that of your website. Sometimes, however, other websites use your content, with or without your consent. They don’t always link to your original article, and therefore the search engine doesn’t ‘get’ it and has to deal with yet another version of the same article. The more popular your site becomes, the more scrapers you’ll get, making this problem bigger and bigger.

Order of parameters

Another common cause is that a CMS doesn’t use nice clean URLs, but rather URLs like /?id=1&cat=2, where ID refers to the article and cat refers to the category. The URL /?cat=2&id=1 will render the same results in most website systems, but they’re completely different for a search engine.
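One way to avoid this is to always emit parameters in a fixed order. The following Python sketch sorts the query parameters alphabetically; the function name is made up for this example, and alphabetical order is just one possible convention.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def sort_query_params(url):
    """Return the URL with its query parameters in alphabetical order."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(params), parts.fragment)
    )
```

Both /?id=1&cat=2 and /?cat=2&id=1 end up as /?cat=2&id=1, so only one version is ever linked to.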

Comment pagination

In my beloved WordPress, but also in some other systems, there is an option to paginate your comments. This leads to the content being duplicated across the article URL, and the article URL + /comment-page-1/, /comment-page-2/ etc.
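To show how those paginated URLs relate to the article URL, here is a small Python sketch that maps any comment page back to its article. It only handles the /comment-page-N/ pattern described above and assumes there is no query string after it.

```python
import re

# Matches a trailing /comment-page-N/ segment, with or without the final slash.
COMMENT_PAGE = re.compile(r"/comment-page-\d+/?$")

def article_url(url):
    """Return the article URL that a paginated comment URL belongs to."""
    return COMMENT_PAGE.sub("/", url)
```

A URL without a comment page segment is returned unchanged.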

Printer-friendly pages

If your content management system creates printer-friendly pages and you link to those from your article pages, Google will usually find them, unless you specifically block them. Now, ask yourself: Which version do you want Google to show? The one with your ads and peripheral content, or the one that only shows your article?

WWW vs. non-WWW

This is one of the oldest causes in the book, but sometimes search engines still get it wrong: WWW vs. non-WWW duplicate content, when both versions of your site are accessible. Another, less common situation, but one I’ve seen as well, is HTTP vs. HTTPS duplicate content, where the same content is served over both.
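The Python sketch below shows the idea of picking one scheme and one host and rewriting every other variant to it. The preferred scheme (https) and host (www.example.com) are assumptions for this example; the point is only that all four combinations map to a single URL.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed preferences for this example; pick whichever you like, then stick with it.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"
OWN_HOSTS = {"example.com", "www.example.com"}

def preferred_url(url):
    """Rewrite any scheme/host variant of our own site to the preferred one."""
    parts = urlsplit(url)
    if parts.netloc.lower() not in OWN_HOSTS:
        return url  # not our site, leave it alone
    return urlunsplit(
        (PREFERRED_SCHEME, PREFERRED_HOST, parts.path or "/", parts.query, parts.fragment)
    )
```

A server would typically answer requests for the non-preferred variants with a redirect to the URL this function returns.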

Conceptual solution: a ‘canonical’ URL

As we’ve already seen, the fact that several URLs lead to the same content is a problem, but it can be solved. One person who works at a publication will normally be able to tell you quite easily what the ‘correct’ URL for a certain article should be, but sometimes when you ask three people within the same company, you’ll get three different answers…

That’s a problem that needs addressing because, in the end, there can be only one (URL). That ‘correct’ URL for a piece of content is referred to as the Canonical URL by the search engines.

Ironic side note

Canonical is a term stemming from the Roman Catholic tradition, where a list of sacred books was created and accepted as genuine. They were known as the canonical Gospels of the New Testament. The irony is it took the Roman Catholic church about 300 years and numerous fights to come up with that canonical list, and they eventually chose four versions of the same story.

Identifying duplicate content issues

You might not know whether you have a duplicate content issue on your site or with your content. Using Google is one of the easiest ways to spot duplicate content.

There are several search operators that are very helpful in cases like these. If you want to find all the URLs on your site that contain your keyword X article, you’d type the following search phrase into Google:

site:example.com intitle:"Keyword X"

Google will then show you all pages on example.com that have that keyword in their title. The more specific you make that intitle part of the query, the easier it is to weed out duplicate content. You can use the same method to identify duplicate content across the web. If the full title of your article was ‘Keyword X – why it is awesome’, you’d search for:

intitle:"Keyword X - why it is awesome"

And Google would give you all sites that match that title. Sometimes it’s worth even searching for one or two complete sentences from your article, as some scrapers might change the title. In some cases, when you do a search like that, Google might show a notice on the last page of results saying that it has omitted some entries very similar to the ones already displayed, with a link to repeat the search with the omitted results included.

This is a sign that Google is already ‘de-duping’ the results. It’s still not good, so it’s worth clicking the link and looking at all the other results to see whether you can fix some of them.
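If you need to run these checks for many articles, the search phrases can be put together with a few lines of Python. This is only a sketch: the function names are made up for this example, and it builds the query text and a search URL without fetching any results.

```python
from urllib.parse import urlencode

def duplicate_check_query(title, site=None):
    """Build an intitle: query, optionally restricted to one site."""
    query = f'intitle:"{title}"'
    if site:
        query = f"site:{site} {query}"
    return query

def google_search_url(query):
    """Return a Google search URL for the given query text."""
    return "https://www.google.com/search?" + urlencode({"q": query})
```

Calling duplicate_check_query("Keyword X", site="example.com") gives the first search phrase shown above; leaving out the site argument gives the web-wide variant.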

Read more: DIY: duplicate content check »

Practical solutions for duplicate content

Once you’ve decided which URL is the canonical URL for your piece of content, you have to start a process of canonicalization (yeah I know, try saying that three times out loud fast). This means we have to tell search engines about the canonical version of a page and let them find it ASAP. There are four methods of solving the problem, in order of preference:

  1. Not creating duplicate content
  2. Redirecting duplicate content to the canonical URL
  3. Adding a canonical link element to the duplicate page
  4. Adding an HTML link from the duplicate page to the canonical page

Avoiding duplicate content

Some of the above causes for duplicate content have very simple fixes to them:

  • Are there Session IDs in your URLs?
    These can often just be disabled in your system’s settings.
  • Have you got duplicate printer friendly pages?
    These are completely unnecessary: you should just use a print style sheet.
  • Are you using comment pagination in WordPress?
    You should just disable this feature (under Settings » Discussion) on 99% of sites.
  • Are your parameters in a different order?
    Tell your programmer to build a script to always put parameters in the same order (this is often referred to as a URL factory).
  • Are there tracking links issues?
    In most cases, you can use hash-based campaign tracking (tracking data after a # in the URL) instead of parameter-based campaign tracking.
  • Have you got WWW vs. non-WWW issues?
    Pick one and stick with it by redirecting the one to the other. You can also set a preference in Google Webmaster Tools, but you’ll have to claim both versions of the domain name.

If your problem isn’t that easily fixed, it might still be worth putting in the effort. The goal should be to prevent duplicate content from appearing altogether, because it’s by far the best solution to the problem.

301 Redirecting duplicate content

In some cases, it’s impossible to entirely prevent the system you’re using from creating wrong URLs for content, but sometimes it is possible to redirect them. If this isn’t logical to you (which I can understand), do keep it in mind while talking to your developers. If you do get rid of some of the duplicate content issues, make sure that you redirect all the old duplicate content URLs to the proper canonical URLs.
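To give your developers something concrete to talk about, here is a minimal sketch of such a redirect as a Python WSGI application. The redirect map is hypothetical and uses the example URLs from earlier in this article; in practice the same thing is often configured in the web server or the CMS instead.

```python
# Hypothetical map of duplicate paths to their canonical paths.
REDIRECTS = {
    "/article-category/keyword-x/": "/keyword-x/",
}

def app(environ, start_response):
    """Answer duplicate URLs with a 301 to the canonical URL."""
    path = environ.get("PATH_INFO", "/")
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 tells browsers and search engines the move is permanent.
        start_response("301 Moved Permanently", [("Location", target)])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"article content\n"]
```

To try it locally, the app can be served with make_server from Python’s wsgiref.simple_server module.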

Using the canonical link element

Sometimes you don’t want to or can’t get rid of a duplicate version of an article, even when you know that it’s the wrong URL. To solve this particular issue, the search engines have introduced the canonical link element. It’s placed in the <head> section of your page, and it looks like this:

<link rel="canonical" href="http://example.com/wordpress/seo-plugin/" />

In the href attribute of the canonical link, you place the correct canonical URL for your article. When a search engine that supports canonical finds this link element, it performs a soft 301 redirect, transferring most of the link value gathered by that page to your canonical page.
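To check which canonical URL a page actually declares, you could read the link element out of its HTML. The Python sketch below uses only the standard library; the class and function names are made up for this example, and it simply returns the first canonical link it finds.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Remember the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attributes.get("href")

def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None if there is none."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical
```

Running this over every duplicate URL you know of is a quick way to verify that they all point at the same canonical page.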

This process is a bit slower than the 301 redirect though, so if you can just do a 301 redirect that would be preferable, as mentioned by Google’s John Mueller.

Keep reading: rel=canonical • What it is and how (not) to use it »

Linking back to the original content

If you can’t do any of the above, possibly because you don’t control the <head> section of the site your content appears on, adding a link back to the original article on top of or below the article is always a good idea. You might want to do this in your RSS feed by adding a link back to the article in it. Some scrapers will filter that link out, but others might leave it in. If Google encounters several links pointing to your original article, it will figure out soon enough that that’s the actual canonical version.
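As an illustration of the RSS idea, this Python sketch appends a link back to the original article to the HTML of a feed item. The wording of the added sentence and the function name are assumptions for this example; how you hook it into your feed depends on your system.

```python
from html import escape

def add_source_link(item_html, title, url):
    """Append a link back to the original article to a feed item's HTML."""
    link = f'<a href="{escape(url, quote=True)}">{escape(title)}</a>'
    return f"{item_html}\n<p>This article first appeared at {link}.</p>"
```

Scrapers that republish the feed unchanged will then carry a link to the original URL along with the content.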

Conclusion: duplicate content is fixable, and should be fixed

Duplicate content happens everywhere. I have yet to encounter a site of more than 1,000 pages that hasn’t got at least a tiny duplicate content problem. It’s something you need to constantly keep an eye on, but it is fixable, and the rewards can be plentiful. Your quality content could soar in the rankings, just by getting rid of duplicate content from your site!

Read on: Rel=canonical: The ultimate guide »


20 Responses to Duplicate content: Causes and solutions

  1. shanhaider shan
    shanhaider shan  • 11 months ago

    There shouldn’t be a problem if you use primary categories in Yoast SEO. Yoast SEO automatically sets a canonical tag to tell search engines which permalink is the primary one to use for indexing. Hope this helps!

  2. Caden O'Rourke
    Caden O'Rourke  • 11 months ago

    On my site it counts each one of my posts page–a page with posts–as a separate page. So it says that when a person scrolls down and clicks “More Posts” they are entering “Page/2” instead of staying on my homepage. The problem is I have 70 pages like this, and I cannot edit them. Hence, 70 pages with duplicate meta descriptions I cannot edit.

    I’ve contacted WordPress Support, Changed Themes, set it to “Load More Posts” when the reader scrolls down so that they don’t press “Load More”. Still. Nothing.

    What can I do?

  3. Mark Gotchall
    Mark Gotchall  • 11 months ago

    If “Make Primary” in Yoast Premium does not work to have one URL, for multiple categories, for a product in WooCommerce. Isn’t that causing duplicate content?

    • Willemien Hallebeek
      Willemien Hallebeek  • 11 months ago

      Hi Mark, There shouldn’t be a problem if you use primary categories in Yoast SEO. Yoast SEO automatically sets a canonical tag to tell search engines which permalink is the primary one to use for indexing. Hope this helps!

  4. Russ Michaels
    Russ Michaels  • 11 months ago

    So what about SEO optimised landing pages?
    These are copies of existing pages, but will have the keywords changed.

    e.g.
    page 1 will be optimised for “blah blah blah Manchester”
    page 2 will be optimised for “blah blah blah Wigan”
    etc

    So they are not identical, but probably close enough for the search engine to think so.

  5. Marja
    Marja  • 11 months ago

    Hello
    Does it also count for a website who has several countries? I mean have the same content on http://www.example.com/sg and on http://www.example.com/en and on http://www.example.com/au and translated on http://www.example.com/pl and on http://www.example.com/es and so on
    How to solve this? The content is always the same.

    • Willemien Hallebeek
      Willemien Hallebeek  • 11 months ago

      Hi Marja, The solution here is implementing hreflang: https://yoast.com/hreflang-ultimate-guide/ This will help search engines understand which region/language you’re targeting with your content. It won’t be easy, but it will be worth the effort. Good luck!

  6. GharPeShiksha
    GharPeShiksha  • 11 months ago

    I have some duplicate content issues on my website. When I audited my website on Moz, it showed some duplicate content problems, but I don’t know how to fix them.
    After reading your post, I got some ideas (not completely). I think it will help me.
    Can you tell me, should I redirect or put a canonical tag on each URL. I am very confused about both. Please, reply soon….

    • Willemien Hallebeek
      Willemien Hallebeek  • 11 months ago

      Hi! Whether you should use a redirect or canonical depends on the situation. If you, for some reason, need to keep the duplicate article, you can use a canonical link. If you can get rid of the duplicate article, it’s best to redirect. Hope this helps!

  7. Jenn Summers
    Jenn Summers  • 11 months ago

    This is really helpful although I admit I’m a bit confused on how to solve. I’ll try out the tools to make but I’m pretty sure I have many on my site. When I check my google search engine there are a lot of pages not valid I’m guessing that might be these. I also switched from http to https and then from non-www to www when switching hosts so I’m cringing right now haha. Thanks for all the great info I’ll be referencing back here while I try to sort this.

    • Willemien Hallebeek
      Willemien Hallebeek  • 11 months ago

      Great, Jenn. Good luck! Let us know if you have any further questions :-)

  8. Per Karlsson
    Per Karlsson  • 11 months ago

    From most of the examples you have given I would rather say that the issue is faulty logic in search engines rather than mistakes (or oversights) by the web designer / owner / publisher.

    It is entirely natural for a human to have two different entry points to one and the same article. Just think of library cataloguing systems. You can look things up in many different ways but it all leads to one single book.

    It is search engines that should learn to adapt to how humans think, not humans who should adapt to search engines’ faulty logic. Is that not the basic credo of what Google says about search engines and SEO? And you too. Don’t write for search engines, write for humans. That gives the best SEO. It should be the same for indexing and referring to content.

    But… in the meantime, I guess we poor humans need to take into consideration the deficiencies of search engines.

    • Willemien Hallebeek
      Willemien Hallebeek  • 11 months ago

      Hi Per, That’s a fair point. For now, it remains difficult for search engines to determine the original source, which is understandable in some cases. They’ll probably, and hopefully, get better at it over time! Until then, they’ll need some help from us, I suppose.

  9. marni
    marni  • 11 months ago

    What about duplicating an article that was written by an author on an author’s website that was also published online elsewhere? How best to share information on a business website that was also published in the press? The fear is that the offsite landing pages might disappear at some point in time and then the content would be lost. Keeping it on the business site seems to make a lot of sense, but then it is duplicate content? What is the best practice here? Inquiring minds want to know! :) Thanks!

    • Willemien Hallebeek
      Willemien Hallebeek  • 11 months ago

      Hey Marni, There are two options here, I guess. If you choose to use the exact same article, always add a canonical link to the original post. This works for external articles too. But this means that the copied article won’t rank, of course. If you’d like to rank on the topic of the article too, you should write your own content about it. Give your own twist to it and write something original that people would like to read!

  10. Kingsley Felix
    Kingsley Felix  • 11 months ago

    Does this mean we should noindex category pages?

  11. Iva Ursano
    Iva Ursano  • 11 months ago

    So how do you deal with the scrapers? I found one site that took 14 of my blog posts and I could not find their host to report a DMCA complaint. What do you do with those guys?