Finding and Auditing Thin Content to Improve SEO

“Thin content” is one of the most misinterpreted on-site optimization buzzwords. The term supposedly refers to pages with word counts of 250 or lower, and having too many of them, the story goes, can get you dinged with Panda or manual action penalties from Google. The popular recourse is to find every low word count page on a site and beef it up with extra text.

While the notion makes some sense, it only represents a fraction of the truth about what thin content really is. In fact, low word count isn’t even at the heart of the thin content issue if we go by Google’s characterization of it. In the video below, Matt Cutts explains that thin content is more about user experience and unique value than about the amount of text on a given page.

The video explains why Google would index and rank news briefs, dictionary entries, online yellow pages, infographics and videos. These pages don’t need too many words by nature, but they can deliver as much value as any full-length article or wiki entry. Conversely, some text-heavy pages may not be indexed or ranked due to quality and duplication issues.

So What Exactly Is Thin Content?

Having said that, thin content can be found on any page that doesn’t have inherent and unique value. This type of content leaves the user dissatisfied and contributes nothing to the broader discussion of a given topic. Common examples include:

  • Doorway pages – These are webpages created for the sole purpose of ranking for a specific keyword and funneling the visits they receive to another page. They often have minimal unique content and rarely deliver on the promises their title tags and meta descriptions make to users. Doorway pages tend to frustrate the average visitor with the amount of clicking involved in getting to the content they actually want. Naturally, this is something Google disfavors and wants to crack down on as much as possible.

Some SEOs create multiple doorway pages to saturate the SERPs for their target keywords. If successful, they can use the traffic to make money off affiliate campaigns. This is one reason some folks falsely believe that Google hates affiliates, when the Big G really only hates the tactics that some affiliates use.

  • Spun articles – Spun articles are reworded content pieces created by regurgitating information from an external source. They’re not bad per se, but they tend to contribute little to nothing to the state of knowledge on a given topic. A spun article is a cheap knock-off of someone else’s work, and it invites the question: how does this page distinguish itself from all the other content I can find on Google?

Article spinning is usually done manually by less skilled “writers” working for content sweatshops. In the worst cases, software is used to swap out words in content scraped from other sites. With automated spinners, the output is often a grotesque clone of a source that someone else invested time and brainpower in.

  • Aggregated content – Sites that collect content snippets from other Web locations based on a common theme and distribute them in their own domains are called aggregators. Most of the time, they present themselves as content hubs where communities with common interests can form and interact. They’re not inherently bad or spammy (think Digg or Reddit), but when they offer no value of their own, Google sees that as a problem.
  • Indexed internal search results – In some cases, internal search pages can get indexed by Google. When this happens, Google sees the pages as low-quality, redundant and even spammy. On large sites with tens of thousands of users, this can become a big problem very quickly.
  • Filtered pages – Staying with large ecommerce sites, it can be difficult for users to find exactly what they’re looking for just by navigating category trees. To address this usability challenge, developers came up with ways for users to filter the pages they’re seeing. The filtered result pages have their own URLs, but all their content is pulled from other sections of the site. If technical SEO precautions are not applied, search engines can crawl and index these non-static pages in bulk, opening the door to possible penalties.
  • Sparse category and tag pages – The category and tag pages of a site are supposed to help visitors find what they’re looking for. They’re essentially mini-portals that users can use as springboards to pages covering narrower topics. If they’re indexed and Google doesn’t see any unique value in them, they can be viewed as thin pages that hurt your site’s performance in the SERPs.
  • Boilerplate product descriptions – Another issue that ecommerce sites face is the proliferation of pages with boilerplate content. When retailers decide to carry a product from a manufacturer, they often copy the standard descriptions and specs of each item. As a result, similar products end up having very similar copy. On a bigger scale, the manufacturer site and all its retail partners end up carrying the same content, creating a massive duplication mess that search engines are bound to take action against through filters and penalties.
  • Location-targeted pages – In the interest of being found using localized queries, webmasters sometimes create landing pages specifically for each of their service areas. While that’s not a bad thing in itself, using practically the same content and just switching out the names of the places creates thin content with no unique value. Going after locations won’t get you in trouble, but being lazy with it and blatantly recycling your stuff can certainly put your site in hot water.

Here’s an example: if I’m running a pest control business in Dallas, Texas, it would make sense for me to also offer my services to nearby Fort Worth.

The suburbs of Addison, Balch Springs, Carrollton, and Cedar Hill are also bound to have potential customers for me, and it would be good if my site ranked for keywords like “pest control Addison,” “pest control Balch Springs,” “pest control Carrollton,” and so on.

If I create pages to rank for all those keywords, I’d end up talking about the same company and the same services every single time. It also means I’ll be spinning my own content several times, killing its uniqueness. Google is smart enough to see through phrasing variations, synonyms and paraphrasing techniques. If I do this on a massive scale spanning dozens or hundreds of pages, it would only make sense for Google to take action against me.

  • And Yes, Low Word Count Pages – While low word count in itself doesn’t necessarily mean that content is thin or low-value, there are some content types that need to have a degree of length in order to be effective. Expository articles, blog posts, wiki entries and guides usually warrant more than 250 words to provide a satisfying experience for a reader.

If the text on a page is scant, it can indicate that the content is shallow and unrewarding. It can also signal that the author’s expertise on the topic isn’t deep, especially if similar sites are delivering content with more breadth.

Detecting Thin Content

Finding thin content on your site can be a tricky proposition. On one hand, a manual review of your pages based on the issues discussed in the previous section is more accurate, but it takes a lot of time and isn’t practical for sites with hundreds of thousands of pages. On the other hand, an automated check with a few tools is faster, but you lose the accuracy that a human reviewer brings to the table.

Personally, I like doing a combination of the two. I use the Screaming Frog SEO spider tool and Google Analytics to give me clues on where to start looking. When I get the list down to a manageable level, I perform a manual review for better accuracy.

Detecting Thin Pages with Screaming Frog

Screaming Frog is a desktop application that crawls a site’s pages and returns SEO-related data which can be exported to a CSV file. To see which pages in a site have low word counts, follow these simple steps:

1. Open the app and enter the home page URL of the site you want to check.

2. Hit Start and wait for the crawl to finish.

3. In the Filter dropdown menu, choose HTML so you only see rows for webpages.

4. Once the crawl is completed, click Export.

5. Hide columns B to W for now. In column X, you’ll see the Word Count data.

6. Sort the values from smallest to largest. Look for pages whose word counts seem small for the topics they’re supposed to cover.
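
If you’d rather not eyeball the spreadsheet, the same sorting and flagging can be scripted. Here’s a minimal sketch in Python; the file name internal_all.csv, the 250-word threshold and the “Address”/“Word Count” column labels are assumptions you should check against your own export (and delete any report-title line above the header row before running it):

```python
import csv

EXPORT_FILE = "internal_all.csv"   # path to your Screaming Frog export
WORD_COUNT_THRESHOLD = 250         # flag pages at or below this count; tune to taste

pages = []
with open(EXPORT_FILE, newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # "Address" and "Word Count" are the column names assumed here;
        # confirm them against the header row of your own file.
        try:
            word_count = int(row["Word Count"].replace(",", ""))
        except (KeyError, ValueError):
            continue  # skip rows without a usable word count
        pages.append((word_count, row.get("Address", "")))

pages.sort()  # smallest word counts first

for word_count, url in pages:
    if word_count <= WORD_COUNT_THRESHOLD:
        print(f"{word_count:>6}  {url}")
```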

The weakness of this technique is that it only accounts for the length of the content. It doesn’t consider the content’s value and type. This is good for finding thin blog, article, resource and wiki pages – all of which are usually text-intensive. It’s not quite applicable to pages where graphics, videos, audio or news updates are the object of user intent.

Finding Thin Content with Google Analytics

If thin content is defined by its low quality, your site’s engagement numbers won’t lie about it. Bounce rate and average time on page will show which pages are performing poorly, giving you strong clues about where to look for thin content. With a simple process involving Google Analytics, you can extract this data and use it to guide your audit.

Marcus Taylor wrote a nice post on how to do this. He suggests zeroing in on pages that fall into either of these two categories:

  • Bounce rates of 95%-99.99%
  • Average time on page of 0.1-5 seconds

To get the data, go to Google Analytics, access the data for the site you’re auditing, then follow these steps:

1. Go to Behavior>Site Content>All Pages.

2. Adjust the number of rows displayed according to the number of pages on your site. For small sites with 5,000 pages or fewer, you can set the number of rows to display using the control at the bottom right-hand corner of the page.

For big sites with more than 5,000 pages, you can get more rows by applying a little ninja trick from Annielytics. Click on the browser’s address bar and look at the URL of this section of Google Analytics. Near the end of it, you’ll see table.rowCount%3Dxx, where xx stands for the number of rows to display. Change that number to the row count you need (25000, say), hit Enter, and you should now see more than 5,000 rows.

3. Export the table to CSV and open it in MS Excel. You can then use Conditional Formatting or the Filter functions to isolate the pages whose bounce rates and average times on page fall within the ranges Marcus prescribed.
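
If you’d rather script this step, here’s a minimal sketch using pandas. The file name, the “Page”, “Bounce Rate” and “Avg. Time on Page” column labels, and the exported formats (a percentage string and an hh:mm:ss duration) are assumptions based on a typical GA CSV export, so check them against your own file:

```python
import pandas as pd

# GA exports often prepend "#"-commented header lines; comment="#" skips them.
# A summary block at the bottom of the file may still need to be deleted by hand.
df = pd.read_csv("ga_all_pages.csv", comment="#")

# Bounce Rate is assumed to be exported as a string like "97.50%".
bounce = df["Bounce Rate"].astype(str).str.rstrip("%").astype(float)

# Avg. Time on Page is assumed to be exported as "hh:mm:ss".
time_on_page = pd.to_timedelta(df["Avg. Time on Page"]).dt.total_seconds()

# Marcus Taylor's ranges: bounce rate 95%-99.99% or time on page 0.1-5 seconds.
# Use & instead of | if you only want pages that trip both signals.
suspects = df[bounce.between(95, 99.99) | time_on_page.between(0.1, 5)]

suspects[["Page", "Bounce Rate", "Avg. Time on Page"]].to_csv(
    "possible_thin_pages.csv", index=False
)
print(f"{len(suspects)} pages flagged for manual review")
```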

Once you’re done narrowing down the list of pages using Screaming Frog and GA, you and your team can begin a manual review effort to see if the pages might be doorways, duplicates, auto-generated by the CMS, etc.

Addressing Thin Content

Dealing with thin content is a case-by-case matter that needs to be approached depending on the kind of issues you’re facing. Based on Google’s explanation of what a quality site is, we can surmise that combating thin pages is all about putting user experience ahead of everything else. Here’s the list of possible thin content issues again, along with recommendations on how to address them:

  • Doorway pages – Doorways are bad because they don’t deliver on their promises to users who click search results. The best way to address them is to remove them from your site and find other ways to drive organic traffic to your money pages. It’s understandable why some marketers might hesitate to do this, but it really has to be done for the long term search visibility of a site.

If you really have to keep your doorway pages, you can rework each one to make good on the implicit promises it issues to searchers. If you have a page that’s optimized for the keyword “free yoga videos,” make sure the page really does have those videos. Granted, this might interfere with some affiliate marketing strategies, but you’ll have to find another way to drive traffic to affiliate sites or risk losing your search traffic altogether.

  • Spun articles – Get subject matter experts to write for your site. These people won’t need to lean on other people’s work to compose content that users will find unique and valuable. If your organization’s experts don’t have the time to create content, get a really good writer who can interview them and turn their thoughts into solid content pieces.

Try to avoid writers who have no background in your industry. They tend to read up on the work of others and then recycle the key points they pick up. Don’t even think about using automated article spinners because they’re just plain horrible.

For the record, I’m not saying that using information from other sites is wrong. As long as you properly represent the context and attribute it to the source, you’re in good shape. However, make it a point to add your own information and insights to your work. If you simply rewrite or report another author’s work, you’re giving human readers and search engines reasons to think that you’re a copycat and that they’re better off going straight to the source.

  • Aggregated content – Aggregated content is externally sourced by default. You’re presenting the work of other authors on your site based on a theme your audience is interested in. To avoid being mistaken for a copycat or spammer site, take the following precautions:
    • Add the nofollow attribute to the links pointing to aggregated content.
    • Use your own words in the descriptions of the pages that you’re linking to.
    • Encourage commentary on each piece of content so your community can create unique user-generated content for your site.
    • Don’t rely on aggregated content alone. Create your own blog to publish announcements, site news, your own features, etc.
  • Internal search results and filtered pages – These are technical SEO issues that can be resolved with some help from your web development team. In essence, you need to work out how the URLs of internal search and filtered pages are structured. You can then pinpoint the parameters they have in common and tell Google those URLs are off-limits using your robots.txt file.

This is a very important step in lowering thin content page counts in ecommerce sites. Matt Cutts posted some years back that Google wants to take a tougher stance on indexing these pages and that statement has been reinforced with the Panda update and succeeding waves of manual action cases. For more on limiting bot access to these page types, I suggest reading this post by Sean Carlos thoroughly.
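
To make that concrete, here’s a minimal sketch of how you could sanity-check candidate robots.txt rules before deploying them, using Python’s built-in urllib.robotparser. The Disallow patterns and the example URLs are hypothetical placeholders for whatever paths and parameters your own site generates. Note that the standard-library parser only does prefix matching, so wildcard rules like Disallow: /*?filter= (which Google does support) should be verified in Search Console instead:

```python
from urllib import robotparser

# Hypothetical rules blocking an internal search path and a search parameter.
# Swap in the paths and parameters your own site actually uses.
robots_txt = """\
User-agent: *
Disallow: /search/
Disallow: /?s=
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Hypothetical URLs for illustration only.
urls = [
    "https://example.com/widgets/",            # normal category page
    "https://example.com/search/red-widgets",  # internal search results
    "https://example.com/?s=red+widgets",      # parameterized search results
]

for url in urls:
    allowed = parser.can_fetch("Googlebot", url)
    print(f"{'ALLOW' if allowed else 'BLOCK'}  {url}")
```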

  • Sparse category and tag pages – Categories are good for navigation purposes, but they can really clog up the SERPs with unoriginal content pulled from other internal pages within a site. Yet it’s important to have them indexed (especially on ecommerce sites) because they can be used to go after fat-head keywords.

If you want search engines to index your categories without penalizing you, take the time to optimize them by creating unique content. In WordPress, you can simply go to Posts > Categories and edit each one. Write a vivid, non-boilerplate description for each category and give it an appropriate title tag and meta description. The key here is to generate content that gives users a better idea of what the category is about and what they can expect from the pages under it.

Tags are a different matter. They help users find mentions of relevant entities that represent topics discussed frequently on a site. They’re also used to aid internal search query generation. However, it’s hard to squeeze much value out of tag pages, so you’re better off keeping them out of the index altogether. Yoast’s WordPress SEO plugin lets you noindex tag archives; on other CMS platforms, you’ll have to work with your devs or use robots.txt to keep bots away from them.
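
If you want to confirm the setting actually took effect, a quick spot check of a few tag URLs works. This is a crude sketch rather than a complete audit: the URLs are hypothetical placeholders, and a thorough check would parse the HTML properly and also inspect the X-Robots-Tag response header:

```python
import urllib.request
import urllib.error

# Hypothetical tag-archive URLs -- replace with real tag pages from your site.
tag_urls = [
    "https://example.com/tag/pest-control/",
    "https://example.com/tag/dallas/",
]

for url in tag_urls:
    try:
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="ignore").lower()
    except urllib.error.URLError as err:
        print(f"could not fetch {url}: {err}")
        continue
    # Crude substring check for a robots noindex directive in the page source.
    status = "noindex found" if "noindex" in html else "no noindex directive"
    print(f"{status}: {url}")
```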

  • Boilerplate product descriptions – On ecommerce sites, there’s little you can really do about boilerplate content other than rewriting, enriching and reformatting generic product descriptions. Most manufacturers will provide brief product overviews and specifications. That means you can build the content up with more detailed descriptions: state the benefits, list the features, and present the specs in the most reader-friendly way possible.

This is often a massive task that requires heavy lifting from copywriters and site merchandisers. The writers need to have sufficient product training to write accurate copy and the merchandisers pushing new parts to the online store have to constantly coordinate with the content team to make sure all descriptions are customized. If you don’t have the resources to get this done, you may have to deindex your product pages. It’s not ideal, but it’s better than creating tons of duplicate pages that search engines may penalize you for.

  • Location-targeted pages – There’s a debate among SEO experts on whether location-targeted pages are still viable in today’s Google landscape. I’m of the opinion that they are as long as you put in the work to make each of these pages as unique as possible.

Making these pages unique means talking about each specific place and its landmarks, using images that were actually taken in those areas, and providing information that’s relevant to each location. For instance, if you’re offering pest control services in Texas and your service area includes towns A, B, C, D and so on, you can include statistics and anecdotes on how good or bad the pest situation in each place is. You can also talk about the place’s topography and what that means as far as pest-related issues are concerned.

If you have business addresses and phone numbers for each location you serve, make sure to state them in the copy of the corresponding pages. If you only have one business address and phone number, it may be better to show them in an image with no alt text or captions.

The part where you describe your company and your offerings can be a pain to write with different words dozens of times. A good way to keep writer fatigue low and originality high is to use several writers for this effort.

  • Low Word Count Pages – If you’re convinced that a page is thin from a word count perspective and there are other discussion points that you can raise to make it richer, go ahead and expand the content. Don’t resort to fluff just to fatten up the word count. Make sure the new information is succinct, relevant and written in a way that’s easy to absorb.

Most SEOs will agree that 400-500 words is the minimum range for a sufficient word count. Though studies have shown a correlation between higher word counts and better rankings, I don’t exactly subscribe to that school of thought. Personally, my simple rule for word count is this: make it as long as it has to be and as short as possible.

If good user experience is the end goal of content creation, then we should all give our readers content that’s simple, informative and easy to read. There’s no need for big words, empty phrases and complex sentence constructions. Getting the message across is the point of writing to begin with.

At the end of the day, dealing with thin content is all about making your pages worth indexing. Keep an eye out for possible instances of this issue, deal with them using the recommended actions and you should be able to improve your overall site quality and ultimately, your search engine visibility.
