SharePoint Blogs / SharePoint University

How not to set up Sharepoint Search

This is going to be a quick little post where I confess to doing something stupid.  I should have known better.

We are currently prototyping a MOSS 2007 Enterprise solution for a government organization.  This customer has a fairly large public website, and a similarly large intranet site.  While demonstrating how to configure search scopes, I created a scope pointing at the external site, and changed its settings to "unlimited" without really thinking about what that meant.

Well, what it means is, it's a good way to fill up your server's hard drive, and a good way to annoy other departments of the government.  It turns out that in "unlimited" mode, Sharepoint crawls every link, then keeps crawling links found on destination pages, branching out until, I assume, it has indexed the entire Internet.  I also learned that Sharepoint doesn't handle the ROBOTS.TXT files correctly, so sensitive information could be crawled along with everything else.  I couldn't believe this, until I found a post here that confirms it:  Sharepoint only looks at the root of the site for ROBOTS.TXT files, and ignores them wherever else they may be:

Observations
 
During our testing we discovered the following.
 
1. The robots.txt file is cached for 24 hours following its first request by the crawler. The implication of this is that changes to robots.txt require either a restart of the Office Search Service or a delay of up to 24 hours before they are respected by the gatherer.
 
2. Placing a robots.txt anywhere other than the root of a website is completely ineffective.
For example, http://www.website.com/folder/robots.txt will be ignored.
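
To make the two observations concrete, here is a minimal sketch using Python's standard urllib.robotparser module. It is not how the SharePoint gatherer is implemented; it only illustrates that a crawler derives a single robots.txt location from the root of the host, and how a 24-hour cache age check might look. The rules in EXAMPLE_RULES and the helper names root_robots_url and is_stale are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# 24 hours, matching the cache lifetime described in observation 1.
CACHE_SECONDS = 24 * 60 * 60

# Hypothetical rules, as they might appear in http://www.website.com/robots.txt
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /folder/",
]


def root_robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_stale(parser: RobotFileParser, now: float, max_age: float = CACHE_SECONDS) -> bool:
    """True when the parsed rules are older than max_age seconds."""
    return now - parser.mtime() > max_age


# Parse the example rules in memory; no network request is made.
parser = RobotFileParser()
parser.parse(EXAMPLE_RULES)

blocked = parser.can_fetch("*", "http://www.website.com/folder/page.html")
allowed = parser.can_fetch("*", "http://www.website.com/index.html")
```

Whatever page the crawler is looking at, root_robots_url points back to the same file at the top of the host, so a robots.txt placed in a subfolder is never consulted.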

So after the crawl had run for a weekend, I received a much-forwarded email the following Monday from irate webmasters in another city, who wondered what the hell my MOSS server was doing crawling every directory on their site, even places specifically flagged by the ROBOTS.TXT file (that upset them more than anything else).

I also saw that the Search database had grown by seven gigabytes over the weekend, which brought my Sharepoint server to its knees (the VM we were using had limited disk space). 

So, let my life serve as a warning to others.  Don't use the "Unlimited" scope setting unless you know where every link on your website goes. 
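
To show why an unlimited crawl spreads so far, here is a small sketch of a breadth-first crawl over an in-memory link graph. It is a simplified model, not SharePoint's crawler: the LINKS graph and its URLs are invented, "depth" here means links followed from the start page, and "hops" means host changes along the path. Passing None for a limit stands in for "unlimited".

```python
from collections import deque
from urllib.parse import urlsplit

START = "http://public.example.gov/"

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "http://public.example.gov/": [
        "http://public.example.gov/about",
        "http://other.example.org/",
    ],
    "http://public.example.gov/about": ["http://public.example.gov/contact"],
    "http://other.example.org/": [
        "http://other.example.org/private/report",
        "http://third.example.net/",
    ],
}


def crawl(links, start, max_depth=None, max_hops=None):
    """Return pages in visit order; a limit of None means unlimited."""
    seen = {start}
    queue = deque([(start, 0, 0)])  # (url, links followed, host changes)
    visited = []
    while queue:
        url, depth, hops = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target in seen:
                continue
            next_depth = depth + 1
            changed_host = urlsplit(target).netloc != urlsplit(url).netloc
            next_hops = hops + int(changed_host)
            if max_depth is not None and next_depth > max_depth:
                continue  # too many links away from the start page
            if max_hops is not None and next_hops > max_hops:
                continue  # would wander onto yet another server
            seen.add(target)
            queue.append((target, next_depth, next_hops))
    return visited


unlimited = crawl(LINKS, START)
same_host_only = crawl(LINKS, START, max_hops=0)
```

In this toy graph the unlimited crawl reaches all three hosts, including the other site's /private/ page, while max_hops=0 keeps it on the starting host.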

Edited to add:

I forgot to mention the resolution to this.  One of the customer's biggest problems with their Intranet and Public website is search integration.  For an external site, probably the best solution is to use a Google Site Search, which basically embeds a Google search field on your webpage and filters the results to your organization's URL.   

This is better, I think, than pointing Sharepoint at it, because first of all, Google handles the ROBOTS.TXT files correctly.  Second of all, especially if you have a large site, Google does all the heavy lifting and even archives versions of the web pages.  If you wanted to, you could drop this functionality into a Sharepoint Search Center page and have the Google field sit next to your Sharepoint field.
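
The "filter the results to your organization's URL" idea can be sketched with Google's site: search operator. This is only an illustration of the filtering, not the embed code that the hosted Site Search product gives you; the function name site_search_url and the example domain are made up.

```python
from urllib.parse import urlencode


def site_search_url(terms: str, site: str) -> str:
    """Build a Google search URL whose results are restricted to one site."""
    query = f"{terms} site:{site}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical organization domain.
url = site_search_url("annual report", "example.gov")
```

A search box on a Search Center page could send the user's terms to a URL built this way, next to the regular search field.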

For the customer's intranet, it is another matter entirely.  A couple of years ago, they were running both webs on a beleaguered NT4 server running IIS4.  I know, most of you are cringing.  This server had been hacked in the past, but it had been recovered and locked down and was limping dutifully along.  We migrated all content to IIS6 without much difficulty, but the one thing that couldn't be migrated was the search solution.  In IIS4 there was a built-in script to allow basic user searches.  Actually, it worked pretty well.  But in IIS6, Microsoft stripped it out, perhaps out of concern for security, and there was no similar functionality.  We explored and evaluated numerous open-source and third-party replacements, but they were too expensive, inaccurate, or simply didn't work as advertised.  The most functional solution was the FrontPage search solution that used a web bot to index content, but it turned out it never did this automatically; you'd have to re-run it every time you updated the website (which was often).

Now this customer has MOSS 2007, and I'm glad to say that this is going to solve their problem once and for all.  Now we have the ability to create search scopes, target types of information, and go far beyond any functionality they ever had before. 


Posted 02-03-2008 10:09 AM by moffitar

Comments

Rick wrote re: How not to set up Sharepoint Search
on 02-04-2008 2:28 PM

This is very helpful information. I saw the unlimited option and never knew what it meant. Thankfully, I never have selected it!

Adrian wrote re: How not to set up Sharepoint Search
on 11-17-2008 10:34 PM

"...Sharepoint crawls every link, then keeps crawling links found on destination pages, branching out until, I assume, it has indexed the entire Internet."

LOL!!!

I can't think of any practical situation where one would WANT this behaviour?

Todd Thompson wrote re: How not to set up Sharepoint Search
on 11-21-2008 9:57 AM

Just as an FYI everything I've been reading on robots.txt says that only the one in the root of the site is valid.  So for example microsoft.com/robots.txt should be paid attention to, while microsoft.com/.../robots.txt should be ignored.

Sources:

en.wikipedia.org/.../Robots.txt

www.robotstxt.org/robotstxt.html

www.w3.org/.../notes.html

Posts (c) their respective authors. Everything else (c) 2009 SharePoint Experts, Inc.