SearchAnalysis.Info - A comparison of search tools for finding resources

Issues Paper

A comparison of search tools for finding resources

Bobby Maisnam
Email: searchanalysis AT i-space.us

IS567 Issues Paper
Spring 2004

Index of Contents

1. Introduction
2. Search Tools

2.1 Classifications of Search Engines

2.1.1 Search Engines
2.1.2 Web Directories
2.1.3 Meta Search Engines

2.2 How search engines index/crawl pages
2.3 Ranking Methodologies

2.3.1 Introduction
2.3.2 Google's PageRank algorithm
2.3.3 Why different search engines produce different results

2.4 Features Comparison

3. Comparison Queries

3.1 Term - "Computers" (Technical term)
3.2 Term - "Gretchen Whitney" (Person's name)
3.3 Term - "Saturn" (Dual meaning - can be both a planet and a car)

4. Conclusions
5. Useful Links

1. Introduction

How do we find information on the Internet? Do we find the appropriate information easily, or do we have to search a lot for it? Which tools are popular these days, and which give accurate results? How do we compare the various search results?

These are some of the issues discussed in this paper. In particular, it focuses on comparing search tools for finding resources on the Internet.

2. Search Tools

2.1 Classifications of Search Engines

The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways. [1]

Search tools can be classified into 3 main categories:

Search Engines (crawler-based)
Web Directories (human-powered/edited)
Meta Search Engines

2.1.1 Search Engines

Wikipedia [2] defines a Search Engine as:
"A search engine is a program designed to help the user access files stored on a computer, for example a public server on the World Wide Web, by allowing to ask for documents meeting certain criteria (typically those containing a given word, a set of words, or a phrase) and retrieving files that match those criteria. Unlike an index document that organizes files in a predetermined way, a search engine looks for files only after the user has entered search criteria."

As mentioned before, search engines are mostly crawler-based.

Examples (alphabetically):

a9.com
Altavista
AllTheWeb
Excite
Google
HotBot
Lycos
MSN Search
NorthernLight
WebCrawler

2.1.2 Web Directories

Wikipedia defines a Web Directory [3] as:
"A web directory is a directory on the World Wide Web that specializes in linking to other web sites and categorizing those links. Web directories often allow site owners to submit their site for inclusion. Editors review submissions for fitness."

The Open Directory Project is the most famous of the directories. Google uses the Open Directory listings but displays them in order of its own page ranking.

The major advantage of web directories is that they are built and managed by humans. As a result, they tend to be of higher quality than computer-generated resources. But because they are managed by actual people, the amount of information they can cover is limited. One major difference between search engines and web directories (in terms of searching the content) is that in a web directory, a user can search only the titles, descriptions and subject categories of the entries in the directory, not the actual content of the web pages they point to.

Examples (alphabetically):

About.com
Looksmart
Open Directory Project
Yahoo! Directory
Zeal

Notes:

Web Directories are also known as Subject Directories.
The Open Directory Project (ODP) is also known as DMoz (for Directory.Mozilla). It started as Gnuhoo in 1998, became Newhoo in June 1998 and became ODP in October 1998 when it was bought by Netscape for $1 million [4].

2.1.3 Meta Search Engines

Meta Search Engines are defined as [5]:
"Utilities that search more than one search engine and/or subject directory at once and then compile the results in a sometimes convenient display, sometimes consolidating all the results into a uniform format and listing. Some offer added value features like the ability to refine searches, customize which search engines or directories are queried, the time spent in each, etc. Some you must download and install on your computer, whereas most run as server-side applications."

Meta Search engines query several other Web search engine databases in parallel. Unlike search engines, metacrawlers don't crawl the web themselves to build listings. Instead, they allow searches to be sent to several search engines all at once. The results are then blended together onto one page.

Meta search engines are useful when you expect to find few materials on your topic; their ability to search multiple databases concurrently will quickly identify which database has materials meeting your criteria. The disadvantages are less apparent: search depth is shallow, the ability to refine searches is quite limited, the results may overwhelm you with seemingly relevant materials, and, often, you have no choice in which search engines are used.
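
The fan-out-and-blend behaviour described above can be sketched in a few lines of Python. This is only an illustration: the two engine functions are hypothetical stand-ins that return fixed lists, and the reciprocal-rank blending rule is an assumption made for the example, not the method any particular meta search engine is known to use.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real engines: each maps a query to a ranked URL list.
def engine_a(query):
    return ["http://www.saturn.com/", "http://www.solarviews.com/eng/saturn.htm"]

def engine_b(query):
    return ["http://www.saturn.com/", "http://www.jpl.nasa.gov/cassini/"]

def meta_search(query, engines):
    """Send the query to every engine at once and blend the ranked lists."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))

    # Assumed blending rule: sum the reciprocal ranks, so a URL that
    # several engines rank highly rises to the top of the blended page.
    scores = {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))

blended = meta_search("saturn", [engine_a, engine_b])
```

Here the URL returned first by both stand-in engines ends up first in the blended list, and each URL appears only once even though two engines returned it.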

Examples (alphabetically):

Dogpile
Mamma
MetaCrawler
Profusion

2.2 How Search Engines index/crawl pages

A web crawler (also known as a web spider) is a program that browses the World Wide Web in a methodical, automated manner. A web crawler is one type of webbot. Web crawlers not only keep a copy of all the visited pages for later processing (for example, by a search engine) but also index these pages to make searching them faster and more precise. [6]

In general, the web crawler starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit. The process ends either manually or after a certain number of links have been followed.
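
The loop described above can be sketched as follows. To keep the example self-contained, it crawls a hypothetical in-memory "web" (TOY_WEB) instead of fetching real pages; the names crawl, fetch_links and max_pages are illustrative and do not come from any real crawler.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}

def crawl(seed_urls, fetch_links, max_pages=100):
    """Visit pages breadth-first from seed_urls, stopping after max_pages."""
    to_visit = deque(seed_urls)   # the list of URLs still to visit
    seen = set(seed_urls)         # URLs already queued, to avoid revisiting
    visited = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        # Add every hyperlink found on the page to the list of URLs to visit.
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited

pages = crawl(["http://example.com/"], lambda url: TOY_WEB.get(url, []))
```

The max_pages limit plays the role of "a certain number of links"; without it, a crawl of the real web would effectively never end.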

Sometimes it can take a while for new pages or changes that the spider finds to be added to the index. Thus, a web page may have been "spidered" but not yet "indexed." Until it is indexed -- added to the index -- it is not available to those searching with the search engine.

Web crawlers typically take great care to spread their visits to a particular site over a period of time, because they access many more pages than the normal (human) user and therefore can make the site appear slow to the other users if they access the same site repeatedly.[7]

For similar reasons, web crawlers are supposed to obey the robots.txt protocol, with which web site owners can indicate which pages should not be spidered.
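
As a small illustration of how a crawler might honour this protocol, the sketch below uses Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs and the "ExampleBot" user agent are all made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the lines of a robots.txt file, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

def may_crawl(url, user_agent="ExampleBot"):
    """Return True if the rules above allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

allowed = may_crawl("http://example.com/index.html")
blocked = may_crawl("http://example.com/private/data.html")
```

A well-behaved crawler would run a check like may_crawl before adding a URL to its list of pages to visit.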

2.3 Ranking Methodologies

2.3.1 Introduction

Different search engines use different ranking algorithms, but what they have in common is that they look for the query keywords in a page's text and meta tags. The location (whether the keyword appears in the title, a heading or a paragraph), frequency (number of occurrences of the keyword) and proximity (closeness of the words of a multi-word query to each other) also play a major role in determining the ranking of a page.
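
To make the location/frequency/proximity idea concrete, here is a toy scoring function. It is only a sketch: the weights (title_weight, proximity_bonus) and the way the three signals are combined are assumptions chosen for illustration, since real search engines do not publish their formulas.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_page(query, title, body, title_weight=3.0, proximity_bonus=2.0):
    """Toy relevance score; the weights are illustrative assumptions."""
    terms = tokenize(query)
    title_words = tokenize(title)
    body_words = tokenize(body)

    score = 0.0
    for term in terms:
        score += title_weight * title_words.count(term)  # location
        score += body_words.count(term)                   # frequency

    # Proximity: add a bonus each time the query terms occur in the body
    # as adjacent words, in the same order as in the query.
    if len(terms) > 1:
        for i in range(len(body_words) - len(terms) + 1):
            if body_words[i:i + len(terms)] == terms:
                score += proximity_bonus
    return score

example = score_page("saturn rings", "Saturn", "The rings of Saturn. Saturn rings are icy.")
```

In this example the keyword in the title counts three times as much as one in the body, and the adjacent phrase "Saturn rings" earns the proximity bonus once.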

2.3.2 Google's PageRank algorithm

The most famous and most discussed ranking algorithm is Google's PageRank algorithm. It works like this: each page is ranked by how many pages link to it, on the premise that good or desirable pages are linked to more often than others. The PageRank of the linking pages and the number of links on those pages both contribute to the PageRank of the linked page. This makes it possible for Google to present first the pages that are heavily linked to by quality websites. Another major factor in Google's success, and one that has spawned many imitators, is the simplicity of its user interface.

Google explains [8] the PageRank algorithm as follows:
"PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B. But, Google looks at more than the sheer volume of votes, or links a page receives; it also analyzes the page that casts the vote. Votes cast by pages that are themselves 'important' weigh more heavily and help to make other pages 'important'."
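
The "votes weighted by the importance of the voter" idea can be sketched with a short iterative computation. This is a simplified illustration and not Google's production algorithm: the damping factor of 0.85 is the commonly cited value, the normalised formulation used here is one common variant, and the four-page link graph is invented for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}   # start with equal ranks
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A -> B, B -> C, C -> A, D -> C.
ranks = pagerank({"A": ["B"], "B": ["C"], "C": ["A"], "D": ["C"]})
```

In this graph, page C receives links from both B and D and ends up with the highest rank, while D, which nobody links to, ends up with the lowest.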

The PageRank algorithm is so popular and so widely discussed that a search for "PageRank" on Google returned 918,000 results! (12:37 PM, 4/28/2004)

2.3.3 Why different search engines produce different results

All the major search engines follow the location/frequency method to some degree. But no two search engines do it in exactly the same way, which is one reason why the same search on different search engines produces different results. To begin with, some search engines index more web pages than others. Some search engines also index web pages more often than others. The result is that no two search engines have exactly the same collection of web pages to search through. That naturally produces differences when comparing their results [9].

The databases of search engines differ from each other. They are created by robots which have followed different paths across the web and used different rules for finding web sites for inclusion in their databases. The database information may have been created from information in any or all of the following areas of the pages: web page titles, URLs (web page addresses), web page content (text & images), and links on the web page.

Because the robots of the various search engines start in different places and work in different ways, different search engines will produce different results [10].

Search engines may also penalize pages or exclude them from the index, if they detect search engine "spamming." An example is when a word is repeated hundreds of times on a page, to increase the frequency and propel the page higher in the listings. Search engines watch for common spamming methods in a variety of ways, including following up on complaints from their users.

2.4 Features Comparison

Most of the information in this section is from these pages: [11], [12] and [13]

Table: Comparison of various features of different search engines.

Feature	a9.com	Altavista	AllTheWeb	Excite	Google	HotBot	Lycos	MSN Search	Teoma
Deep Crawl	Y	N	Y	N/A	Y	N/A	N/A	N/A	N
Frames Support	Y	Y	Y	Y	Y	Y	Y	Y	Y
robots.txt	Y	Y	Y	Y	Y	Y	Y	Y	Y
Paid Inclusion	N/A	Y	Y	Y	N	Y	Y	Y	Y
Full Body text	Y	Y	Y	Y	Y	Y	Y	Y	Y
Stop Words	Y	Y	N/A	N/A	Y	N/A	N/A	N/A	N/A
Meta keywords	Y	Y	Y	N/A	Y	N/A	N/A	N/A	N
Image Search	Y	Y	N	N	Y	N	N	N	N
URL shortcuts	Y	N	N	N	N	N	N	N	N

The features mentioned above are explained below:

Deep Crawl: All crawlers will find pages to add to their web page indexes, even if those pages have never been submitted to them. However, some crawlers are better than others. This section of the chart shows which search engines are likely to do a "deep crawl" and gather many pages from your web site, even if these pages were never submitted. In general, the larger a search engine's index is, the more likely it will list many pages per site.

Frames Support: This shows which search engines can follow frame links. Those that can't will probably miss listing much of your site. However, even for those that do, having individual frame links indexed can pose problems.

robots.txt: The robots.txt file is a means for webmasters to keep search engines out of all or part of their sites. The robots exclusion standard, or robots.txt protocol, is a convention that asks well-behaved web spiders and other web robots not to access all or part of a website. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the website.

Paid Inclusion: Shows whether a search engine offers a program where you can pay to be guaranteed that your pages will be included in its index. This is NOT the same as paid placement, which guarantees a particular position in relation to a particular search term.

Full Body Text: All of the major search engines say they index the full visible body text of a page, though some will not index stop words or exclude copy deemed to be spam. Google generally does not index past the first 101K of long HTML pages.

Stop Words: Some search engines either leave out words when they index a page or may not search for these words during a query. These stop words are excluded as a way to save storage space or to speed searches.

Meta Keywords: Shows which search engines support the meta keywords tags, as explained on the How HTML Meta Tags Work page.
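
The stop-word behaviour described above (dropping common words at indexing time and at query time) can be sketched with a tiny inverted index. The stop-word list and the two sample pages are invented for the example; each search engine uses its own list.

```python
# Hypothetical stop-word list; each real engine uses its own.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def build_index(pages):
    """Build an inverted index (word -> set of page ids), skipping stop words."""
    index = {}
    for page_id, text in pages.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index.setdefault(word, set()).add(page_id)
    return index

def search(index, query):
    """Return the ids of pages that contain every non-stop word of the query."""
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not words:
        return set()   # the query consisted only of stop words
    return set.intersection(*(index.get(w, set()) for w in words))

index = build_index({1: "the rings of saturn", 2: "saturn the car"})
hits = search(index, "the rings of saturn")
```

Because "the" and "of" are never stored, the index is smaller, and the query above is effectively a search for "rings saturn".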

3. Comparison Queries

Notes:

Comparison queries have been carried out for three search terms from different categories:
[1] Term - "Computers" (Technical term)
[2] Term - "Gretchen Whitney" (Person's name)
[3] Term - "Saturn" (Dual meaning - can be both a planet and a car)

These comparisons were carried out across search engines, directories and meta search engines.

3.1 Term - "Computers" (Technical term)

Term: Computers (top five results per search tool, in rank order)

a9.com
http://www.apple.com/
http://www.dell.com/
http://www.gateway.com/
http://www.gateway.com/meta_refresh/global/ftr_gtw.asp
http://www.compaq.com/

Altavista
http://www.dmoz.org/Computers/
http://directory.google.com/Top/Computers/
http://www.dell.com/
http://www.apple.com/
http://www.computers.com/

Open Directory
http://dmoz.org/Computers/
http://dmoz.org/Society/Gay,_Lesbian,_and_Bisexual/Computers_and_Internet/
http://dmoz.org/Society/Religion_and_Spirituality/Computers/
http://dmoz.org/Science/Instruments_and_Supplies/Laboratory_Computers_and_Software/
http://dmoz.org/Home/Consumer_Information/Computers_and_Internet/

Google
http://www.apple.com/
http://www.dell.com/
http://www.gateway.com/
http://www.gateway.com/meta_refresh/global/ftr_gtw.asp
http://www.compaq.com/

HotBot
http://www.dell.com/
http://www.dmoz.org/Computers/
http://www.apple.com/
http://directory.google.com/Top/Computers/
http://www.computers.com/

Meta Crawler
http://www.dell.com/
http://www.apple.com/
http://www.compaq.com/
http://www.hp.com/
http://www.microsoft.com/

Yahoo!
http://www.howstuffworks.com/pc.htm
http://www.cnet.com/
http://www.tomshardware.com/
http://www.pcworld.com/
http://reviews.cnet.com/

Summary of Findings for "Computers"

Note 1: MetaCrawler is really annoying. It intersperses sponsored ads among the search results (15 sponsored ads in the first 20 results).
Note 2: Found an interesting pattern in MetaCrawler's search results. For most queries on common commercial products (like computer, camera, fuji finepix), there are always 15 sponsored results in the first 20 results, and they are always at positions 1-5, 8-12 and 15-19. No wonder I use Google!

3.2 Term - "Gretchen Whitney" (Person's name)

Term: Gretchen Whitney (top five results per search tool, in rank order)

a9.com
http://web.utk.edu/~gwhitney/gwpage2.html
http://web.utk.edu/~gwhitney/
http://www.whitneyhs.net/
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/w/Whitney:Gretchen.html
http://www.schoolwisepress.com/pdf-vault/19/19-64212-1931880h.pdf

Altavista
http://web.utk.edu/~gwhitney/gwpage2.html
http://listserv.utk.edu/cgi-bin/wa?A1=ind9908&L=utlisnet
http://store.schoolwisepress.com/pdf-vault/19/19-64212-1931880h.pdf
http://web.utk.edu/~gwhitney/
http://www.whitneyhs.net/

Open Directory
No Results

Google
http://web.utk.edu/~gwhitney/gwpage2.html
http://web.utk.edu/~gwhitney/
http://www.whitneyhs.net/
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/w/Whitney:Gretchen.html
http://www.schoolwisepress.com/pdf-vault/19/19-64212-1931880h.pdf

HotBot
http://web.utk.edu/~gwhitney/gwpage2.html
http://web.utk.edu/~gwhitney
http://www.whitneyhs.net/
http://store.schoolwisepress.com/pdf-vault/19/19-64212-1931880h.pdf
http://www.greatschools.net/modperl/browse_school/ca/1430

Meta Crawler
http://web.utk.edu/~gwhitney/gwpage2.html
http://web.utk.edu/~gwhitney/
http://www.whitneyhs.net/
http://api.cde.ca.gov/api2002base/2002Base_sch.asp?SchCode=1931880
http://partypop.com/forums/Archive/0304.htm

Yahoo!
http://web.utk.edu/~gwhitney/gwpage2.html
http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/w/Whitney:Gretchen.html
http://listserv.utk.edu/cgi-bin/wa?A1=ind9908&L=utlisnet
http://store.schoolwisepress.com/pdf-vault/19/19-64212-1931880h.pdf
http://web.utk.edu/~gwhitney/

Summary of Findings for "Gretchen Whitney"

Notes:
a9.com also displayed 3 books by Dr. Whitney:
- Language Distribution in Databases: An Analysis and Evaluation (Aug, 1990)
- The Transfer of Scholarly, Scientific and Technical Information Between North and South America: Proceedings of a Conference (Oct, 1986)
- Digital Revolution (Dec, 1996)
MetaCrawler: No sponsored results this time! (0 sponsored ads in first 20 results)

3.3 Term - "Saturn" (Dual meaning - can be both a planet and a car)

Term: Saturn (top five results per search tool, in rank order)

a9.com
http://www.saturn.com/
http://www.saturn.de/
http://www.solarviews.com/eng/saturn.htm
http://www.jpl.nasa.gov/cassini/
http://www.saturn.org/

Altavista
http://www.saturn.com/
http://www.seds.org/nineplanets/nineplanets/saturn.html
http://www.solarviews.com/eng/saturn.htm
http://nssdc.gsfc.nasa.gov/photo_gallery/photogallery-saturn.html
http://www.jpl.nasa.gov/cassini/

Open Directory
http://dmoz.org/Science/Astronomy/Solar_System/Saturn/
http://dmoz.org/Recreation/Autos/Makes_and_Models/Saturn/
http://dmoz.org/Games/Video_Games/Emulation/Sega/Saturn/
http://dmoz.org/Games/Video_Games/Console_Platforms/Sega/Saturn/
http://dmoz.org/Sports/Soccer/UEFA/Russia/Clubs/Saturn-REN_TV/

Google
http://www.saturn.com/
http://www.saturn.de/
http://www.solarviews.com/eng/saturn.htm
http://www.jpl.nasa.gov/cassini/
http://www.saturn.org/

HotBot
http://www.saturn.com/
http://www.seds.org/nineplanets/nineplanets/saturn.html
http://www.jpl.nasa.gov/cassini
http://www.solarviews.com/eng/saturn.htm
http://www.saturn.org/

Meta Crawler
http://www.saturn.com/
http://www.solarviews.com/eng/saturn.htm
http://www.seds.org/nineplanets/nineplanets/saturn.html
http://nssdc.gsfc.nasa.gov/photo_gallery/photogallery-saturn.html
http://ringmaster.arc.nasa.gov/saturn/saturn.html

Yahoo!
http://www.saturn.com/
http://www.windows.ucar.edu/tour/link=/saturn/saturn.html
http://seds.lpl.arizona.edu/nineplanets/nineplanets/saturn.html
http://ringmaster.arc.nasa.gov/saturn/saturn.html
http://nssdc.gsfc.nasa.gov/photo_gallery/photogallery-saturn.html

Summary of Findings for "Saturn"

Notes:
Out of the 7 search tools, 6 of them (all except dmoz) gave www.saturn.com as the first result. Also, the other top 5 results seemed to be almost the same across most of the tools, except for their ranking. For example, www.solarviews.com/eng/saturn.htm is the third result in a9.com, Altavista and Google, but it is the second result in Meta Crawler and the fourth result in HotBot.
MetaCrawler is again really annoying. It intersperses sponsored ads among the search results (15 sponsored ads in the first 20 results).
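
The kind of comparison made in these notes can also be expressed as a number. The sketch below measures how much two top-five lists overlap, using the Google and Altavista results for "Saturn" listed above. The helper names (normalize, overlap) and the choice of Jaccard similarity as the measure are my own assumptions for illustration; they were not part of the original comparison.

```python
def normalize(url):
    """Treat URLs that differ only by a trailing slash as the same page."""
    return url.rstrip("/")

def overlap(results_a, results_b):
    """Return the shared URLs and the Jaccard similarity of two result lists."""
    a = {normalize(u) for u in results_a}
    b = {normalize(u) for u in results_b}
    shared = a & b
    return shared, len(shared) / len(a | b)

# Top five results for "Saturn", copied from the lists above.
google = [
    "http://www.saturn.com/",
    "http://www.saturn.de/",
    "http://www.solarviews.com/eng/saturn.htm",
    "http://www.jpl.nasa.gov/cassini/",
    "http://www.saturn.org/",
]
altavista = [
    "http://www.saturn.com/",
    "http://www.seds.org/nineplanets/nineplanets/saturn.html",
    "http://www.solarviews.com/eng/saturn.htm",
    "http://nssdc.gsfc.nasa.gov/photo_gallery/photogallery-saturn.html",
    "http://www.jpl.nasa.gov/cassini/",
]

shared, similarity = overlap(google, altavista)
```

For these two lists, 3 of the 7 distinct URLs appear in both, which gives a similarity of 3/7. This measure ignores the order of the results; it only counts which URLs the two tools have in common.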

4. Conclusions

Since a9.com uses Google's engine, the search results for the two were identical in all three queries.
Yahoo's results are very similar to Google's. An interesting tool to visually compare the results in both is here
MetaCrawler is annoying to use. It returns too many sponsored results (15 sponsored ads in the first 20 results!). I think that is too much, and I feel that many users must be getting annoyed by it.

5. Useful links

http://www.searchengineshowdown.com/ (Really good reviews)
http://searchenginewatch.com/ (Good site)
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/ToolsTables.html (Good classification of search tools)
http://www.ndu.edu/library/searchengines.html (Classification of search tools)
http://en.wikipedia.org/wiki/Search_engines#How_search_engines_work (How Search engines work)
http://searchenginewatch.com/webmasters/article.php/2167961 (How Search engines rank pages)
