Ever had content crawled by Google that you didn’t intend to expose? Of course you have; you just don’t want to admit it.
I recently had a client whose entire staging environment got indexed by Google, all because someone wondered what the robots.txt file was and, instead of asking, decided to delete it. Suddenly hundreds of pages from the live environment had duplicates from the staging environment showing up in the search results, exposing the staging site to the entire world. Now, I’m all against exposing a staging server to the public web in the first place, but since it’s what the client wanted, it’s what they got.
So, what to do in this case?
Putting up IP restrictions
We decided the first thing to do was to put the staging environment behind IP restrictions so that at least no one could read any of the information. Even though most of it was the same as on the live site, there was still some content that wasn’t supposed to go public yet, along with references to old, outdated terms and conditions files, not to mention the “duplicate content penalty” you might receive from Google. Since we had no access to the firewall and the technician was nowhere to be found, restricting access in the firewall was not an option. We had to go with the next best thing: adding IP restrictions in IIS.
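In web.config terms, an IIS-level restriction like this boils down to an `ipSecurity` element, roughly as follows. This is a sketch, not our exact config: the IP address below is a documentation placeholder, and the IP and Domain Restrictions role feature must be installed for IIS to honor it.

```xml
<configuration>
  <system.webServer>
    <security>
      <!-- Deny everything that isn't explicitly listed -->
      <ipSecurity allowUnlisted="false">
        <!-- Hypothetical office IP; replace with your own -->
        <add ipAddress="203.0.113.10" allowed="true" />
      </ipSecurity>
    </security>
  </system.webServer>
</configuration>
```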
Restrict search engine access
We put back the robots.txt file in the root of the staging site and instructed IIS to allow all traffic to that file.
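The restored robots.txt contained nothing but the standard disallow-everything directives:

```
User-agent: *
Disallow: /
```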
These two robots.txt lines, `User-agent: *` and `Disallow: /`, tell search engines to ignore all site content from the site root and down.
Removing the Google index
Google Webmaster Tools allows you to remove certain pages or an entire site from the Google index. We didn’t have the staging site in GWT, so the first thing to do was verify the site in GWT, and since we had IP-restricted the entire site, we could not go with the meta tag option. We went with the HTML file option instead: we put the verification file in the site root and lifted the IP restrictions for that one file.
Note: IP restrictions on a per-file basis in IIS are not that straightforward. You need to switch to Content View, right-click the file, choose Switch to Features View, and then go into IP Address and Domain Restrictions.
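All that clicking ends up as a `<location>` element in web.config, which you can also write by hand. A sketch, assuming a made-up verification file name (yours will differ), lifting the restriction for that one file; the same approach works for robots.txt:

```xml
<configuration>
  <!-- Lift the IP restrictions for the GWT verification file only -->
  <location path="google1234567890abcdef.html">
    <system.webServer>
      <security>
        <ipSecurity allowUnlisted="true" />
      </security>
    </system.webServer>
  </location>
</configuration>
```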
Now that we could verify the staging site in GWT, we could proceed with removing the site from the index. In GWT, under Google Index, there is an option called Remove URLs.
To remove the entire site from the index, you click Create a new removal request and enter just a forward slash (/). Keep in mind that this will remove all content under the domain from the index. Click Continue and GWT will ask you to confirm your request. One more thing to keep in mind: if you want to keep the removed URL from being indexed again, the robots.txt file needs to stay in the site root at all times.
It took about three hours for the staging site to disappear from the search results. The lesson learned from this mistake: always IP-restrict your staging and test environments. This may not be the best solution in all cases, and sometimes the client wants to be able to access the site from several different IPs, but always communicate to the client what the implications of not restricting access might be.