09/8/08
One of the most common mistakes I see web developers make is trying to hide webpages from search engines in a way that makes those same pages incredibly easy for people to discover. Your eyes may start to glaze over while reading this, but stay with me…this will save your booty at some point.
The first and most common mistake is misuse of the robots.txt file. Basically, this file tells spiders (search engine crawlers) which files and directories they should not scan. You can usually see whether a website has one by simply adding /robots.txt after its domain name. For example, you can see Google’s by going to http://www.google.com/robots.txt, which reveals numerous directories they don’t want search engines to scan…interesting stuff if you’re into that sort of thing.
So…how can this be a bad thing? Well, take a look at Google’s robots.txt again and notice how they disallow directories, which is the proper way of doing it. The mistake that some websites make is to disallow individual pages by name. Their robots.txt might say something like:
User-agent: *
Disallow: /hiddenpage.html
Now, the “User-agent: *” part means “Hey all search engines…this applies to all of you”. Then the “Disallow: /hiddenpage.html” line means exactly what it implies…don’t mess with hiddenpage.html. (The path starts with a slash because it is relative to the root of the site.) Now while this is all fine and dandy when it comes to search engines, robots.txt is itself a public file, so anyone who opens it can read the name of hiddenpage.html. The page is now exposed to the entire world! Even worse, if there are multiple pages like this in the robots.txt, then the website has essentially listed every single secret page in one organized location for anyone to see. (More info about setting up a robots.txt file)
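If you want to check your own robots.txt for this problem, here is a rough Python sketch. The function name find_exposed_pages and the sample file contents are made up for illustration, and the test it uses (a dot in the last part of the path means "this is a file") is only a rule of thumb, so treat the result as a list of things to look at rather than a verdict.

```python
def find_exposed_pages(robots_txt):
    """Return Disallow values that appear to name a single file."""
    exposed = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue
        path = value.strip()
        # Assumption: a dot in the last path segment means a file name.
        last_segment = path.rsplit("/", 1)[-1]
        if "." in last_segment:
            exposed.append(path)
    return exposed


# Hypothetical robots.txt contents, not taken from any real site.
sample = """\
User-agent: *
Disallow: /hiddenpage.html
Disallow: /downloads/ebook.pdf
Disallow: /images/
"""

print(find_exposed_pages(sample))
```

Anything this prints is a page name that any visitor can read straight out of your robots.txt.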
I can’t tell you how many internet marketers (and many other types of sites) I see making this mistake. Their landing page/sales letter/squeeze page or whatever you want to call it has little more than a form to submit your name and email address to get something for free…and some even require payment before you receive a “link” to a “secret download page”. Well…just type in their domain.com/robots.txt and voila…sometimes you get instant access to the very pages you would eventually end up at. You’re not “hacking” anything and there’s nothing illegal about this. It’s just a simple misuse of the robots.txt file on their part. Unethical and immoral? Perhaps.
Now what do you do if you’re one of these very people with exposed files in your robots.txt file? The best thing you can do is to move your “secret” pages into a directory and then disallow the directory. If you move them one directory deep to a folder called “secret”, it would look something like this:
User-agent: *
Disallow: /secret/
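Presto…search engines that respect robots.txt will not spider any of the files in that directory, and you’re not revealing the names of the actual pages. Go one step further and stick a blank file named index.html in that directory. On servers that have directory listings turned on, anyone who browses to that folder would otherwise get a list of every file in it, and the blank index.html is served instead.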
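To see how a well-behaved crawler would read a directory rule, here is a small sketch using Python’s standard urllib.robotparser module. The helper name allowed, the example.com domain, and the file names are placeholders. The rule is written here as /secret/ with a trailing slash so that it matches only what is inside the folder. Rules are matched as prefixes, so the shorter /secret would also catch an unrelated page such as /secret-sale.html.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt, url, agent="*"):
    """Return True if a compliant crawler may fetch url under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical rules: hide the folder without naming any page in it.
with_slash = """\
User-agent: *
Disallow: /secret/
"""

# Same rule without the trailing slash, for comparison.
without_slash = """\
User-agent: *
Disallow: /secret
"""

# Pages inside the folder are off limits to compliant crawlers.
print(allowed(with_slash, "http://example.com/secret/download.html"))
# The rest of the site is still crawlable.
print(allowed(with_slash, "http://example.com/index.html"))
# Prefix matching: the slash decides whether this unrelated page is caught.
print(allowed(with_slash, "http://example.com/secret-sale.html"))
print(allowed(without_slash, "http://example.com/secret-sale.html"))
```

Keep in mind this only models crawlers that choose to obey robots.txt. It says nothing about a person typing the address into a browser.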
The next mistake I see is the lack of noindex/nofollow meta tags on the “secret” pages themselves. I’ll cover that in another post, as this is plenty for you to chew on for now.
Shane Eubanks


