robots.txt and Removing Content from Google’s Index

November 19th, 2008

As you might have read, I’m currently struggling with Google to get my current sites indexed and my outdated and deleted pages out of the index. Today I made some progress. First of all, my first content removal request was processed! Hooray! I had to wait two days (Google says it usually takes 3-5 business days, so I’m really asking myself what they are doing. Are they processing those requests manually?) and the result: well, the pages are still in the index. BUT it might just have been my fault, because I chose to remove the page /tag/ instead of the directory /tag/. The latter should include all subdirectories, so I posted another removal request today.

But I have also done something else - I updated my robots.txt. It now looks like this:

User-Agent: *
Allow: /
Disallow: /blog/category/
Disallow: /blog/tag/
Disallow: /*?
Disallow: /blog/2008/

Basically the first line says that whoever is reading this (usually some bot) had better take note of what follows. In principle every directory on the host may be crawled, BUT stay out of the categories, the tags and the 2008 archive! I have also excluded any URL containing a question mark, since those are usually search result pages. I’m doing all this to get rid of the outdated content that is still in the Google index and to minimize duplicate content, which is supposedly bad for search rankings.
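To double-check which URLs these rules actually block, here is a minimal Python sketch. It assumes the crawler behaves the way Google describes for Googlebot: the longest matching pattern wins, a tie goes to Allow, and `*` matches any run of characters (the wildcard is an extension, not part of the original standard). The function names and the sample paths are made up for illustration, and this is a simplified model, not Google's actual matcher.

```python
import re

# The rules from the robots.txt above, in file order.
RULES = [
    ("allow", "/"),
    ("disallow", "/blog/category/"),
    ("disallow", "/blog/tag/"),
    ("disallow", "/*?"),
    ("disallow", "/blog/2008/"),
]

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Everything else is matched literally, as a prefix of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules=RULES):
    """Return True if `path` may be crawled under longest-match precedence."""
    best = None  # (pattern length, is_allow) of the most specific match so far
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), kind == "allow")
            # Longer pattern wins; on equal length, allow (True) beats disallow.
            if best is None or candidate > best:
                best = candidate
    # No matching rule at all means the path is allowed.
    return True if best is None else best[1]

if __name__ == "__main__":
    # Hypothetical sample paths, not taken from the real site.
    for path in ["/blog/", "/blog/tag/seo/", "/blog/?s=robots",
                 "/blog/2008/11/post/", "/blog/2009/01/post/"]:
        print(path, "allowed" if is_allowed(path) else "blocked")
```

One thing this makes visible: the result depends on the precedence model. A parser that simply takes the first matching rule in file order would stop at `Allow: /` and never reach the Disallow lines, so for such crawlers the Allow line would have to go last or be dropped.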

I have also modified the look of the blog a little today. I tweaked the headings (on the main page the blog name is now <h1> and the entry titles are <h2>; on every single-entry view the blog title is <h3> and the entry title is <h1>) and added some keywords to some of the titles. We’ll see how all this works out in the coming days, I suppose!
