Duplicate Content
Duplicate content served from different URLs on your website can lead to poor placement in search results.
- Search engine crawlers follow links, capture content, and attempt to index every URL they discover on your site. When multiple URLs serve the same page, search engines waste resources fetching and processing identical content, and they may penalize the placement of your pages in search results. Common sources of duplicate pages are printable or text-only versions of a main page, and redirects to login pages intended for your site’s visitors that return the same “You must log in” page to crawlers for many different URLs.
Recommendations
- Eliminate as many of your site’s URLs with duplicate content as possible.
- Modify your robots.txt file to exclude duplicate pages from crawler access.
- If you are not familiar with ways to use a robots.txt file to your advantage, you can find information about this tool on Wikipedia or The Web Robots Pages.
- Place all duplicate versions of your pages (printable, text-only, etc.) in a sub-directory of your site, and restrict crawler access to that sub-directory. Blocking crawlers from http://yoursite.com/printable, for example, keeps them from reaching the duplicate pages stored there (see the robots.txt sketch below).
- When restricting crawlers’ access to duplicate pages, take care to keep your popular pages available to crawlers. Search engines use these popular pages for ranking purposes, so if the printable version is the most popular, keep it crawlable and consider removing or blocking a less popular duplicate instead.
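- A minimal robots.txt sketch, assuming the duplicate pages live under /printable and a hypothetical /text-only directory, could look like this:
# Keep all crawlers out of the directories that hold duplicate pages
User-agent: *
Disallow: /printable/
Disallow: /text-only/
- Crawlers that honor robots.txt will skip those directories while continuing to index the canonical pages elsewhere on your site.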
- Point crawlers to the ‘canonical’ version of the page using link tags.
- You can tell search engine crawlers that the current page is a duplicate version of another page. Just add a link tag in the head section of the HTML of the web page. The URL in the href should point to the canonical page:
<head>
<link rel="canonical" href="http://www.yoursite.com/canonicalurl" />
</head>
- Advanced: Configure your servers to alert search engines to duplicate pages.
- An elegant, yet technically complex, way of alerting crawlers to duplicate pages can be implemented at the server level. By including a Content-Location entity-header field in the HTTP response headers, you can tell a crawler that a given URL is a duplicate version of another page on your site. Your webmaster can find additional information about this header on the W3C website; a sketch of what such a response might look like follows.
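- As a rough illustration (not a definitive server configuration), the response for a duplicate URL could carry a Content-Location header naming the canonical page; the URLs here are placeholders:
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
Content-Location: http://www.yoursite.com/canonicalurl
- On an Apache server with mod_headers enabled, for example, a directive such as Header set Content-Location "http://www.yoursite.com/canonicalurl" placed in the configuration for the duplicate path would add this header; other servers offer equivalent mechanisms.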