Issues Paper
A comparison of search tools for finding resources
Bobby Maisnam
Email: searchanalysis AT i-space.us
IS567 Issues Paper, Spring 2004
1. Introduction

How do we find information on the Internet? Do we find the appropriate information easily, or do we have to search extensively for it? Which tools are popular these days, and which give accurate results? How do we compare the various search results? These are some of the issues discussed in this paper. In particular, it focuses on the comparison of search tools for finding resources on the Internet.

2. Search Tools

2.1 Classifications of Search Engines

The term "search engine" is often used generically to describe both crawler-based search engines and human-powered directories. These two types of search engines gather their listings in radically different ways. [1] Search tools can be classified into three main categories:
2.1.1 Search Engines

Wikipedia [2] defines a Search Engine as:

As mentioned before, search engines are mostly crawler-based. Examples (alphabetically):
2.1.2 Web Directories

Wikipedia defines a Web Directory [3] as:

The Open Directory Project is the most famous of the directories. Google uses the Open Directory results but displays the list according to its own page-ranking order. The major advantage of web directories is that the directory is built and managed by humans; as a result, entries tend to be of higher quality than computer-generated resources. But since it is managed by actual people, the amount of information it can cover is limited. One major difference between search engines and web directories (in terms of searching the content) is that in a web directory a user can search only the titles, descriptions and subject categories of the entries, not the actual content of the web pages they point to. Examples (alphabetically):
2.1.3 Meta Search Engines

Meta Search Engines are defined as [5]: Meta search engines query several other Web search engine databases in parallel. Unlike search engines, metacrawlers do not crawl the web themselves to build listings; instead, they send searches to several search engines at once and blend the results together onto one page. Meta search engines are useful when you expect to find little material on your topic: their ability to search multiple databases concurrently will quickly identify which databases have materials meeting your criteria. The disadvantages are that search depth is shallow, the ability to refine searches is quite limited, they may overwhelm you with seemingly relevant materials, and, often, you have no choice in which search engines are used. Examples (alphabetically):
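The parallel fan-out described above can be sketched in a few lines of Python. This is only an illustration: the engine names are invented, and the query function is a stub standing in for the real HTTP requests a metacrawler would send to each engine.

```python
# Minimal sketch of a meta search fan-out: one query is sent to several
# "engines" concurrently and the result lists are merged onto one page.
# query_engine is a stand-in for a real HTTP request (hypothetical engines).
from concurrent.futures import ThreadPoolExecutor

def query_engine(engine, term):
    # Simulated engine response: two hits per engine for the given term.
    return [f"{engine}: result {i} for '{term}'" for i in range(1, 3)]

def meta_search(term, engines):
    # Submit one query per engine in parallel, then blend all results.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = [pool.submit(query_engine, e, term) for e in engines]
        merged = []
        for f in futures:
            merged.extend(f.result())
    return merged

results = meta_search("saturn", ["EngineA", "EngineB", "EngineC"])
print(len(results))  # 6
```

A real metacrawler would also de-duplicate and re-rank the blended results, which is where much of the engineering effort lies.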
2.2 How Search Engines index/crawl pages

A web crawler (also known as a web spider) is a program which browses the World Wide Web in a methodical, automated manner. A web crawler is one type of webbot. Web crawlers not only keep a copy of all the visited pages for later processing - for example, by a search engine - but also index these pages so that they can be searched. [6]

In general, the web crawler starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit. The process is ended either manually or after a certain number of links have been followed. Sometimes it can take a while for new pages, or changes that the spider finds, to be added to the index. Thus, a web page may have been "spidered" but not yet "indexed"; until it is indexed - added to the index - it is not available to those searching with the search engine. Web crawlers typically take great care to spread their visits to a particular site over a period of time, because they access many more pages than a normal (human) user and can therefore make the site appear slow to other users if they access it repeatedly. [7] For similar reasons, web crawlers are supposed to obey the robots exclusion standard (robots.txt, discussed in section 2.4).

2.3 Ranking Methodologies

2.3.1 Introduction

Different search engines use different ranking algorithms, but they have in common that they search for keywords and meta tags. The location (whether the keyword appears in a title, a heading or a paragraph), frequency (number of occurrences of the keyword) and proximity (physical closeness of two words in a multiple-word query) also play a major role in determining the ranking of a page.

2.3.2 Google's PageRank algorithm

The most famous and most widely discussed ranking algorithm is Google's PageRank.
The algorithm works like this: each page is ranked by how many pages link to it, on the premise that good or desirable pages are linked to more than others. The PageRank of the linking pages and the number of links on those pages contribute to the PageRank of the linked page. This makes it possible for Google to present first the pages that are most heavily linked to by quality websites. Another great factor in Google's success - and one aspect which spawned many imitators - is the simplicity of its user interface. Google explains the PageRank algorithm in [8]. The algorithm is so popular (and so widely discussed) that a search for "PageRank" on Google generates 918,000 results! (12:37 PM 4/28/2004)

2.3.3 Why different search engines produce different results

All the major search engines follow the location/frequency method to some degree, but no two do it exactly the same way, which is one reason why the same search on different search engines produces different results. To begin with, some search engines index more web pages than others, and some index web pages more often than others. The result is that no two search engines have exactly the same collection of web pages to search through, which naturally produces differences when comparing their results. [9] The databases of search engines also differ from each other: they are created by robots which have followed different paths across the web and used different rules for deciding which web sites to include. The database information may have been created from any or all of the following areas of the pages: web page titles, URLs (web page addresses), web page content (text and images), and links on the web page. Because the robots of the various search engines start in different places and work in different ways, different search engines will produce different results. [10]
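The iterative computation behind PageRank (section 2.3.2) can be sketched in a few lines of Python. This is a simplified illustration on a made-up three-page link graph, assuming the classic formulation with a damping factor of 0.85; it is not Google's actual implementation.

```python
# Simplified PageRank: each page's rank is built from the ranks of the
# pages linking to it, divided by how many outgoing links those pages have.
def pagerank(links, d=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform distribution
    for _ in range(iterations):
        new = {}
        for p in pages:
            # Sum contributions from every page q that links to p.
            incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        rank = new
    return rank

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
# C receives the most link weight, so it ends up with the highest rank.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # prints: C
```

Note how B ranks lowest even though it has an incoming link: it receives only half of A's rank, because A splits its weight across two outgoing links. This is the "number of links on the linking page" effect described above.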
Search engines may also penalize pages, or exclude them from the index altogether, if they detect search engine "spamming" - for example, a word repeated hundreds of times on a page to increase its frequency and propel the page higher in the listings. Search engines watch for common spamming methods in a variety of ways, including following up on complaints from their users.

2.4 Features Comparison

Most of the information in this section is from these pages: [11], [12] and [13].

Table: Comparison of various features of different search engines.
The features mentioned above are explained below:

Deep Crawl: All crawlers will find pages to add to their web page indexes, even if those pages have never been submitted to them. However, some crawlers are better at this than others. This section of the chart shows which search engines are likely to do a "deep crawl" and gather many pages from your web site, even if those pages were never submitted. In general, the larger a search engine's index is, the more likely it is to list many pages per site.

Frames Support: This shows which search engines can follow frame links. Those that cannot will probably miss listing much of your site. However, even for those that can, having individual frame links indexed can pose problems.

robots.txt: The robots.txt file is a means for webmasters to keep search engines out of their sites. The robots exclusion standard, or robots.txt protocol, is a convention to prevent well-behaved web spiders and other web robots from accessing all or part of a website. The parts that should not be accessed are specified in a file called robots.txt in the top-level directory of the website.

Paid Inclusion: Shows whether a search engine offers a program where you can pay to be guaranteed that your pages will be included in its index. This is NOT the same as paid placement, which guarantees a particular position in relation to a particular search term.

Full Body Text: All of the major search engines say they index the full visible body text of a page, though some will not index stop words or will exclude copy deemed to be spam. Google generally does not index past the first 101 KB of long HTML pages.

Stop Words: Some search engines either leave out certain words when they index a page or do not search for those words during a query. These stop words are excluded as a way to save storage space or to speed up searches.

Meta Keywords: Shows which search engines support the meta keywords tag, as explained on the How HTML Meta Tags Work page.
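The robots.txt convention described above can be honoured from code with Python's standard-library parser. The file contents below are a made-up example, not taken from any real site.

```python
# Checking URLs against a robots.txt file with the standard library.
from urllib.robotparser import RobotFileParser

# Example robots.txt: all robots are asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A well-behaved crawler consults the parser before fetching each URL.
print(parser.can_fetch("MyCrawler", "http://example.com/private/page.html"))  # False
print(parser.can_fetch("MyCrawler", "http://example.com/public/page.html"))   # True
```

In practice a crawler would fetch each site's live robots.txt (RobotFileParser can also do this via set_url and read) rather than parse a hard-coded string.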
3. Comparison Queries

Notes: Comparison queries have been carried out for three search terms from different categories. These comparisons are done across search engines, directories and meta search engines.
3.1 Term - "Computers" (Technical term)
3.2 Term - "Gretchen Whitney" (Person's name)
3.3 Term - "Saturn" (Dual meaning - can be both a planet and a car)
4. Conclusions
5. Useful links
Copyright © 2004 · www.SearchAnalysis.Info