Paper presented at the INET'98 Conference of the ISOC

Internet Information Retrieval: The Further Development of Meta-Searchengine Technology

Wolfgang Sander-Beuermann <wsb@rrzn.uni-hannover.de>
Mario Schomburg <schomburg@rrzn.uni-hannover.de>
Computer Center of Lower Saxony (Regionales Rechenzentrum Niedersachsen, RRZN) and
Institute for Computer Networks and Distributed Systems (Lehrgebiet Rechnernetze und Verteilte Systeme, RVS), University of Hannover, Germany

Abstract

This paper first describes the state of the art of meta-search technology. It defines criteria for evaluating such applications and investigates existing meta-searchengines. Secondly, it outlines the approaches we have undertaken at Hannover University to solve some problems of Internet information retrieval. We have been running high-traffic meta-searchengines for nearly two years (http://mesa.rrzn.uni-hannover.de/ and http://meta.rrzn.uni-hannover.de/), and we describe our experiences and the developments we have made to achieve a higher degree of completeness and quality of Internet information retrieval.

One of the central ideas we are following is the combination of the overwhelming mass of Internet data with manually reviewed information sources and our own ranking algorithms. This combination leads towards high-quality results based on an Internet search that is as complete as possible.

Table of Contents

  1. Introduction
    1. The Problem of Internet Information Retrieval
    2. Could one Searchengine solve the Problem?

  2. State of the Art of Meta-Searchengine Technology
    1. Some Definitions and Examples
    2. The Existing Meta-Searchengines (Client-based, Server-based)

  3. Experiences with Meta-Searchengines at Univ. of Hannover, Germany
    1. History and Current State
    2. How it works
    3. The Problems
    4. Problems and Cooperation with the Searchengine Maintainers

  4. Current and Future Development of our Meta-Searchengines
    1. Combination of Local Databases and Meta-Searchengines (QuickTips)
    2. Automatic Generation of Special Purpose Searchengines (Level3)
    3. Could Java help to solve the Problems?
    4. The User-Interface and the Quality of Results


1. Introduction

The Internet is the richest source of information humanity has ever developed. Finding a specific piece of information on the Internet, however, is still a major problem - retrieving high-quality information is even more difficult.

1.1 The Problem of Internet Information Retrieval

The difficulties of Internet information retrieval can be summarized in at least two main challenges:

  1. An Internet search must be as complete as possible, and

  2. the results should be of high quality.

The "quality" of the results are judged by the user only: if the customer is satisfied with the results of the search, we will call it a "good quality". That means that the designer of any searchengine interface is forced to understand the way his users are thinking.

Like any database retrieval, an Internet search should be complete. Defining "complete" in an Internet sense is difficult, however - we can never really search "the whole Internet". What we can do is search parts of the Internet nearly completely: we will mainly consider the World Wide Web as the information source. If we confine ourselves to that part only, one might ask the following question:

1.2 Could one Searchengine solve the Problem?

To answer that question we must have an estimate of two factors: the total amount of data of the WWW, and the maximum amount one single searchengine can index. We start with an estimate of the total amount of data of the WorldWideWeb.

According to http://www.nw.com/zone/WWW-9707/firstnames.html, there were 754,716 Internet hosts with a name starting with www in July 1997. So we are on the safe side if we assume at least approximately 750,000 WWW servers on the Internet. If we now make a rough guess of the average amount of data on a web server, we know the order of magnitude of the complete web data. We analysed a large number of servers and found approximately 10 MByte of data per server. So we might have had about 7,500 GByte of web data in July 1997. Extrapolation to January 1998 (when this paper is written) leads us to an order of magnitude of about 10,000 GByte of web data. We consider this a conservative estimate, because the number of servers is certainly larger than the number of those whose name starts with www, and the average amount of data per web server is probably higher than our guess of 10 MByte.

On the other hand we have to consider how much data can be indexed by the most advanced searchengine technology. If we take Altavista (http://www.altavista.digital.com/) as an example of such technology, we might draw the following conclusions:

According to Altavista's own statements, their searchengine indexes 100 million web pages. The part of a web page that can be indexed nowadays is of course just the textual part. So we need an estimate of the average text content of a web page. Analysing our proxy-caches at Univ. of Hannover, we found that the average web page contains about 4 KByte of text. So Altavista is indexing something in the order of 400 GByte. Estimating the indexed portion of the total web text data then demands knowledge of the text to non-text (binary) relation of the average web page. That relation is probably the most difficult part to guess. Analysing our Hannover Univ. proxy-caches shows a text share between 10% and 50% per web page.

With all these estimates in mind, we can calculate the indexed portion from:

(number_of_pages_indexed * text_data_per_page) / (total_web_data * text_to_binary_relation)

Executing this calculation with the above values results in Altavista indexing between 8% and 40% of the total amount of text in the web. Although the above calculations are rough estimates only, this result fits well with what we see when comparing the answers of different searchengines to the same query.
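
The following short sketch in Python repeats this back-of-the-envelope calculation with the figures estimated above (these numbers are rough assumptions, not measurements):

    # Rough estimate of the portion of web text indexed by a single searchengine.
    pages_indexed  = 100e6      # pages indexed by Altavista (their own statement)
    text_per_page  = 4e3        # bytes of text per average page (proxy-cache analysis)
    total_web_data = 10000e9    # bytes of web data, extrapolated to Jan. 1998

    indexed_text = pages_indexed * text_per_page          # about 400 GByte
    for text_share in (0.5, 0.1):                         # text-to-binary relation 50% / 10%
        portion = indexed_text / (total_web_data * text_share)
        print("text share %.0f%% -> indexed portion %.0f%%" % (text_share * 100, portion * 100))
    # prints roughly 8% (at 50% text share) and 40% (at 10% text share)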

Meta-Searchengine Efficiency

We can calculate a meta-searchengine efficiency by comparing the results found by the meta-service with those found by the searchengine which delivers the most results (the "best" searchengine). If we define:
allHits = sum of all hits found by the meta-searchengine,
bestHits = number of hits found by the "best" searchengine,
duplicates = sum of duplicates found by the meta-searchengine,
an efficiency might be evaluated by:

eff = (allHits-duplicates)/bestHits     or

eff = 1 + (allHits-bestHits)/bestHits - duplicates/bestHits

If we look at the duplicates in detail, we encounter at least two difficulties: first, we have to set up a good algorithm for duplicate recognition, and second, duplicates might be produced by the searchengines themselves (because their duplicate recognition algorithms might not be as good as that of the meta-service). Looking at the rate of duplicates in practice, however, we find that they range in the order of just 10 to 30 percent. For a first estimate we are interested in orders of magnitude and factors only, so we may neglect the duplicates here.

No matter how we calculate the efficiency in detail - realistic searches lead to values in the range of 2 to 5, meaning that a meta-searchengine delivers 2 to 5 times more results than the best single searchengine.
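
A small numerical example (the hit counts are invented for illustration only) shows that the two formulations above are indeed equivalent and how the duplicates enter the value:

    # Meta-searchengine efficiency, using the two equivalent formulas from above.
    allHits    = 250.0    # sum of all hits returned by the meta-searchengine (invented)
    bestHits   = 80.0     # hits returned by the "best" single searchengine (invented)
    duplicates = 40.0     # duplicates detected among the meta-search results (invented)

    eff1 = (allHits - duplicates) / bestHits
    eff2 = 1 + (allHits - bestHits) / bestHits - duplicates / bestHits
    print(eff1, eff2)     # both print 2.625: the meta-search finds about 2.6 times more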

Therefore the question "could one searchengine solve the problem?" can nowadays be answered with a clear no. If even a search automat like Altavista is not able to index more than 40% of the web, it is quite certain that manually maintained databases like Yahoo (http://www.yahoo.com/) will never be able to cover significant parts. Even the problem of keeping the data up to date is not solvable: investigation of our proxy-cache data shows that within half a year about half of the web addresses become outdated.

If one searchengine can not solve the problem of Internet information retrieval, we obviously have to query several engines. If we do this in an automated way, we call the resulting automat a meta-searchengine.

2. State of the Art of Meta-Searchengine Technology

Before we discuss meta-searchengines we should have a common understanding of the terminology used.

2.1 Some Definitions and Examples

In the strict sense we call a service a searchengine if it gathers and indexes web pages automatically by a robot (like Altavista), whereas a catalog (or directory) like Yahoo is maintained manually; a meta-searchengine, finally, queries several such services in parallel. These definitions are not common usage yet: the catalogues (or directories) like Yahoo are often called searchengines, too. Although this is misleading, the usage of that terminology is already so widespread that we will not make an effort to reverse it. To make things even more confusing, there are search services which use both: a searchengine combined with a directory, like Lycos (http://www.lycos.com/).

We will not go further into this discussion; from now on we will focus on meta-searchengines only.

2.2 The Existing Meta-Searchengines

Client-based

At first we will have a look at the client-based meta-searchengines. These suffer from two shortcomings:

  1. the last-mile-problem,
  2. the update-problem.

The last-mile-problem addresses the fact that the "last mile" of the Internet connection, from the provider to the user, is the part with the lowest bandwidth. On the other hand, every meta-search creates high downstream dataflows from the searchengines. Of these dataflows, about 50% or more is simply thrown away by the meta-search postprocessing (due to removing multiple hits from different engines and removing "useless information" like advertisements and other data not related to the search itself).

The update-problem results from the fact that the searchengine maintainers tend to change their output formats rather often. With every change of format the postprocessing software of the meta-searchengine needs to be updated; from our experience this happens at least once per month. Because monthly updates are impracticable for the end-user, we feel that client-based meta-searchengines will play no major role in Internet information retrieval, and we will not consider them any further here.

Server-based

From the user's point of view the server-based meta-searchengine just looks like any other searchengine. Before we list the existing meta-searchengines we will discuss some criteria to distinguish and rank these.

  1. We will only look at those which do a parallel search (no all-in-one-forms).

  2. The results of the different search engines should be merged, i.e. the meta-search should do more than a simple one-after-the-other listing.

  3. Identical hits, found by different engines (duplicates) should be eliminated.

  4. The boolean operators AND and OR should be available (at least).

  5. The (short) description of the hits, if delivered by the searchengines, should be passed on to the user: the meta-search should not deliver less information than the searchengines do.

  6. Searchengine-hiding: The specifics of the underlying searchengines should be hidden from the user. The user should not need to know anything about the specifics of any searchengine.

  7. The meta-searchengine should allow a complete search, i.e. it should deliver hits as long as any of the underlying searchengines is still able to deliver hits. We consider this criterion the most important: one of the main advantages of meta-searching is the possibility of a complete search. This advantage would be given away if a meta-searchengine did not have this feature (and most do not have it).

We can now list the existing meta-searchengines and evaluate them by these 7 criteria:

The Existing Meta-Searchengines:

Meta-Searchengine                     parallel  merge   noDouble  AndOr  descr.  hide    complete
metasearch.com                        no        -       -         -      -       -       -
www.digiway.com/digisearch            yes       no      no        yes    yes     no      no
search.onramp.net                     yes       yes     yes       no     no      yes     no
www.designlab.ukans.edu/profusion     yes       yes     yes       yes    yes     no      no
search.cyber411.com                   yes       no      no        no     no      yes     no
search.metafind.com                   yes       yes     yes       yes    no      partly  no
www.inference.com/infind              yes       partly  yes       yes    no      no      no
www.dogpile.com                       yes       no      no        yes    no      yes     no
www.mamma.com                         yes       yes     no        yes    yes     yes     no
guaraldi.cs.colostate.edu:2000/form   yes       no      no        yes    yes     yes     no
www.metacrawler.com                   yes       yes     yes       yes    yes     yes     no
mesa.rrzn.uni-hannover.de             yes       yes     yes       yes    no      yes     no
meta.rrzn.uni-hannover.de             yes       yes     yes       yes    yes     yes     yes
www.highway61.com                     yes       yes     yes       yes    yes     yes     yes

Only those services which have a "yes" in each column fulfill the criteria of being a real meta-searchengine. As we can see, at the moment (Jan. 1998) there are just two of those: Highway61 and our MetaGer.

If we examine the boolean operators of each meta-searchengine precisely, we can see that many of them do not perform a consistent AND: they sometimes switch to OR (to be precise: some of the underlying searchengines do that if they cannot find anything matching the AND search, and the meta-service does not filter it out). This happens without any warning or notification to the user. Such behaviour might be acceptable for a searchengine, to give the user at least some results. But it is unacceptable for a meta-searchengine, because its results might then be mixed up with true AND hits. Even Highway61 shows this inadequate behaviour, so that presently just one meta-searchengine remains which fulfills all criteria in a strict manner. We describe MetaGer (http://meta.rrzn.uni-hannover.de/) in the following.
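
One possible way for a meta-service to guard against this - a simple heuristic sketched below in Python, not necessarily the method any of the listed services (including MetaGer) actually uses - is to re-check the delivered hits and drop those whose title and description do not mention every AND term:

    # One possible heuristic for enforcing a strict AND in the meta-service:
    # drop hits whose title and description do not mention every search term.
    # Hits whose snippets merely omit a term are dropped too, so this is a
    # conservative filter for illustration only.
    def strict_and_filter(hits, terms):
        kept = []
        for hit in hits:                # hit = {"url": ..., "title": ..., "descr": ...}
            text = (hit["title"] + " " + hit["descr"]).lower()
            if all(term.lower() in text for term in terms):
                kept.append(hit)
        return kept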


3. Experiences with Meta-Searchengines at Univ. of Hannover, Germany

3.1 History and Current State

The idea of building our own meta-searchengine was born one day over lunch at the CeBIT fair in 1996. We had the first prototype of our engine running some months later. It gathered results from Altavista, Infoseek, Lycos, Yahoo and others. When we were ready to present our engine to the outside world, we learnt that Erik Selberg and Oren Etzioni of the Computer Science Department at the University of Washington had already launched the MetaCrawler, a similar device, several months earlier. We felt there was no need to offer a similar service twice. At the same time, however, Internet searching became more and more a topic of interest to people in Germany. As a consequence, we concentrated our efforts on providing a meta-search for German services, which were not served by the MetaCrawler.

Another problem became apparent to us in spring 1997: finding people's e-mail addresses seemed to be a common difficulty. We solved that by taking the framework of MetaGer and implementing MESA, the MetaEmail SearchAgent, for international e-mail meta-search.

Contact with our users was very important to us right from the beginning: we tried to learn what their needs were, and we tried to incorporate their ideas into our searchengine. A link-checker was implemented, and we added the option for an international search by querying Highway61. After several months of operation we carefully analysed our users' interests and demands and responded by inventing the so-called "QuickTips" (see 4.1).

The software primarily runs on a Unix machine (ReliantUNIX Version 5.43) sponsored by Siemens-Nixdorf: an RM600 with two R4400 CPUs, 512 MB RAM and 100 GB of disk space on a 34-Mbps ATM network interface. When implementing the software, we tried to use the programming language best suited for each task: we are using C, perl, awk, sed, Tcl/Tk and Bourne shell. At the time of writing this paper, our RM600 machine runs at its maximum capacity, and we are presently implementing a load distribution which automatically transfers user queries during high-load periods to background Unix machines (SunUltra, Solaris 2.5.1).

3.2 How it works

The principle of a meta-searchengine can be described in 7 steps (a minimal sketch follows the list):

  1. Accept a user query,

  2. convert the query into the correct syntax for every underlying searchengine,

  3. launch the multiple queries,

  4. wait for the results, and in parallel do some searching on a local database (QuickTips),

  5. analyse the results, eliminate duplicates, do a ranking,

  6. merge the results,

  7. deliver the postprocessed results to the user's client.
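
The following minimal sketch in Python illustrates these steps. The engine names, URL patterns and the trivial parsing are invented placeholders, not the actual MetaGer implementation (which, as mentioned in 3.1, is written in C, perl and other tools).

    # Minimal sketch of the meta-search pipeline (steps 1-7 above).
    # Engine names, URL patterns and the parsing are invented placeholders.
    import re, threading, urllib.parse, urllib.request

    ENGINES = {                                   # hypothetical query URL patterns
        "engineA": "http://engine-a.example/search?q=%s",
        "engineB": "http://engine-b.example/find?query=%s",
    }
    TIMEOUT = 40                                  # default maximum search time (seconds)

    def parse_hits(html):
        # placeholder for the engine-specific postprocessing: (url, title) pairs
        return re.findall(r'<a href="(http[^"]+)"[^>]*>([^<]+)</a>', html)

    def query_engine(name, pattern, query, results):
        url = pattern % urllib.parse.quote(query) # step 2: engine-specific query syntax
        try:
            html = urllib.request.urlopen(url, timeout=TIMEOUT).read().decode("latin-1")
            results[name] = parse_hits(html)
        except Exception:
            results[name] = []                    # a failing engine must not block the rest

    def meta_search(query):                       # step 1: accept the user query
        results, threads = {}, []
        for name, pattern in ENGINES.items():     # step 3: launch the queries in parallel
            t = threading.Thread(target=query_engine, args=(name, pattern, query, results))
            t.start()
            threads.append(t)
        for t in threads:                         # step 4: wait (QuickTips would run here too)
            t.join(TIMEOUT)
        merged = {}                               # steps 5/6: eliminate duplicates, merge
        for url, title in sum(results.values(), []):
            merged.setdefault(url, title)
        return sorted(merged.items())             # step 7: deliver (ranking omitted here)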

Some of the most common problems of the above steps are described in the following.

3.3 The Problems

One of the main technical problems of running a meta-searchengine is that the searchengine maintainers keep changing their output formats. If the format of that data changes, the postprocessing software has to be adapted. When we realized how often this happens, we developed an administration tool just for this purpose. Additionally, the postprocessing software has to be robust: even if the format of the results has changed - and that may happen at any time - the meta-search output must still be presented well-formatted.
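
One way to make the postprocessing robust is sketched below: if a result page no longer matches the expected pattern, the meta-search delivers nothing from that engine and warns the maintainer, instead of passing broken fragments to the user. The hit pattern is a hypothetical example, not the format of any real engine, and this is not our actual administration tool.

    # Defensive extraction of hits: if an engine has changed its output format,
    # deliver nothing from that engine (and warn the maintainer) rather than garbage.
    import re, sys

    HIT_PATTERN = re.compile(r'<a href="(http[^"]+)">([^<]+)</a>')   # hypothetical format

    def extract_hits(engine, html):
        hits = HIT_PATTERN.findall(html)
        if not hits:
            # most likely the engine changed its output format
            sys.stderr.write("warning: no hits parsed from %s - check its format\n" % engine)
        return hits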

Converting the query into the correct syntax for every underlying search service reveals another problem. Each search service uses a different query language and, even more important, each service offers different options. If a meta-searchengine wants to be transparent (i.e. to do true searchengine hiding), it can only offer options that every one of the underlying services offers (e.g. not every service gives us the possibility to perform a string search). Furthermore, it is sometimes difficult to collect the HTML form parameters that are necessary to get the results (e.g. hidden parameters with undetermined values). Even these efforts are sometimes not successful, because the service maintainer expects exactly the HTTP request that e.g. the Netscape Navigator sends. The tool webtee (http://www-cache.dfn.de/Cache/Software_webtee.html) is well suited for analysing such situations.
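
As an illustration of the conversion step, the small sketch below builds one query URL per service from a single user query. The parameter names and templates are invented; each real service needs its own template, including any hidden form parameters it expects.

    # Converting one user query into the different syntaxes of the underlying
    # services. Parameter names and URL templates are invented for illustration.
    from urllib.parse import urlencode

    def build_queries(terms, operator="AND"):
        return {
            # hypothetical engine that marks AND terms with a leading "+"
            "engineA": "http://engine-a.example/search?" + urlencode(
                {"q": " ".join(("+" + t) if operator == "AND" else t for t in terms)}),
            # hypothetical engine with an explicit boolean parameter and a hidden field
            "engineB": "http://engine-b.example/find?" + urlencode(
                {"query": " ".join(terms), "bool": operator.lower(), "src": "meta"}),
        }

    # build_queries(["internet", "retrieval"], "AND") yields one query URL per engine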

Waiting for the results from the underlying services is another topic worth looking at: how long can we keep a user waiting? For MetaGer, we specified a default maximum search time of 40 seconds. During this time we regularly tell the user how much time remains and why he has to wait. Users are more willing to wait if they know how long and why.

3.4 Problems and Cooperation with the Searchengine Maintainers

The most serious problem for any meta-searchengine, however, is an economic one: all results presented are drawn from the resources of the searchengine maintainers. These companies earn money by renting space for advertisements on the searchengine webpages. The meta-searchengines cut these ads out and give the pure information to the user. So it is understandable that a searchengine maintainer might not be too pleased with being queried by a meta-searchengine.

After launching MetaGer, however, we experienced the opposite reaction: maintainers asked us to add their service to MetaGer. This might be because the German search services were pretty new at that time, and they expected some advertising effect from showing up at an independent university organization like us. A few services (e.g. the e-mail search service Four11 and the German catalog web.de) solve the advertisement problem in their own way (which is good for them, not for us): they do not include the original Internet addresses in the result pages but offer a link to another address which then reveals the correct address. This of course makes it impossible for a meta-searchengine to combine these results, but it enables the service maintainer to show an advertisement under any circumstances.

Another problem are the counting procedures for web sites. Some of the services queried by MetaGer use the counting service of the German IVW, which is a member of the IFABC (International Federation of Audit Bureaux of Circulations). The IVW is an independent organization which provides measured numbers of PageImpressions (PageViews) and Visits. It gives the customer who places his advertisement on a webserver a certain guarantee that it is seen by the measured number of clients. The IVW counting relies on the download of a small image, so we agreed to have this counting image downloaded together with our result pages.

This procedure benefits both sides: the advertisement on the searchengines is seen by all users of the meta-searchengine too, and this view is counted by the IVW measurement. We feel that this is a well-balanced compromise.

4. Current and Future Development of our Meta-Searchengines

Any searchengine will become outdated if it is not continuously improved and developed, just as the Internet as a whole is continuously developing.

4.1. Combination of Local Databases and Meta-Searchengines (QuickTips)

One of the central ideas we are following is the combination of the overwhelming mass of Internet data with manually reviewed information sources. We decided to rely on two sources of such information:
  • a locally built database, maintained purely by ourselves,
  • a widely distributed database which already exists on the Internet: the Domain Name System (DNS).

If someone submits a query to our meta-searchengine, we check these two sources. We decided to incorporate the DNS after analysing our logfiles: a lot of inexperienced users search for terms which can easily be found as part of the DNS. This holds especially for queries for companies: most companies have a webserver named www.Company.com etc. For queries of two and more words we look for combinations of the search words, like www.word1-word2.com etc. To increase the speed of response, these DNS lookups run in parallel to the meta-search. From our users' feedback we can conclude that about 75% of them are really happy with this so-called "QuickTip search". If a DNS lookup leads to a useless entry (e.g. someone has reserved a name without using it), we exclude these flops by means of a manually maintained stoplist file.
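
A simplified sketch of such a DNS-based QuickTip lookup is given below. The set of hostname patterns and the stoplist handling are reduced to the bare idea; the real service checks more variants and runs the lookups in parallel to the meta-search.

    # Simplified DNS QuickTips: try to resolve hostnames built from the search
    # words and drop candidates listed in a manually maintained stoplist.
    import socket

    STOPLIST = set()          # manually maintained hostnames known to be useless

    def quicktips(words):
        w = [x.lower() for x in words]
        candidates = ["www.%s.com" % "".join(w), "www.%s.com" % "-".join(w),
                      "www.%s.de"  % "".join(w), "www.%s.de"  % "-".join(w)]
        tips = []
        for host in candidates:
            if host in STOPLIST:
                continue
            try:
                socket.gethostbyname(host)        # does the name exist in the DNS?
                tips.append("http://%s/" % host)
            except socket.error:
                pass                              # name does not resolve - no QuickTip
        return tips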

Our own local database, however, relies on a different strategy. We know that we do not have the manpower to maintain a catalog like Yahoo. So we decided to put only those entries into our database which are searched for with a certain frequency. On the other hand, we found from our logfiles that even frequently searched words have a very low share of all queries (about 0.5%). What we can do, however, is react to current events. These events show up as queries in our service, like the landing of Pathfinder on Mars or heavy snowfall in Germany. When we notice queries related to such events, we put entries into our database which lead to corresponding webpages.

The QuickTips are mainly a help for the inexperienced user. On the other hand, we have users who pose really sophisticated queries. After checking our logfiles we estimate that the portion of such users is just in the range of a few percent. Even so, these "power users" are our multipliers: if they spread the word that our service is "good", their word counts and brings us many new users. For the experienced user we proceed as described in the following section.

4.2. Automatic Generation of Special Purpose Searchengines (Level3)

Experienced users often search in their fields of speciality only. They know the terminology in these fields - much better than we do. Therefore, it seems adequate to give these users the means to build their own special-purpose services, dedicated solely to them and their working group. For example, let us consider a working group doing research on VRML and related techniques. This group is well experienced in Internet searches. Their problem with information retrieval is that they are usually overwhelmed with the data found and then face the cumbersome job of extracting those pieces of information they are really interested in. If this group had a special-purpose searchengine, looking for VRML and related topics only, they would have a valuable tool for their work.

This is exactly the project we are working on: we are in the process of building a tool which automatically generates special-purpose searchengines from keywords given by the users. We are aware that such a capability may result in really heavy network load. Therefore, this tool will never be open to the general public: every single user of this technique needs a validation from us, and we will be very cautious with such validations. The project is sponsored by the Verein zur Förderung eines Deutschen Forschungsnetzes e.V. - DFN-Verein within the DFN-Expo project. If we call a meta-searchengine a second-order engine (the normal searchengines being first-order engines in that terminology), we might call this type of searchengine a third-order engine. We therefore named it "Level3".

4.3. Could Java help to solve the Problems?

Some new searchengines are now based on the download of a Java applet, like
  • http://lorca.compapp.dcu.ie/fusion/
  • http://www.allesklar.de/ .

The first engine is supposed to be an international meta-searchengine, but we found that service to be out of order most of the time, so we stopped considering it for our work. The latter is a German service, querying German sources only. That service started with a pure Java interface, but after realizing that many users do not have a Java-capable browser, or have switched Java off, the Java applet is now offered as an option only. Does the usage of Java offer any advantages for the searchengines?

We answer this question with a clear no. The idea of downloading a Java applet is similar to the idea of the client-based meta-searchengines: the local system should carry the workload. But this does not help here either: the searchengines have to do most of their work by extracting data from their databases, and that must be done on the server. The meta-searchengines have to do most of their work by extracting data from the searchengines over the net, and the Java security model forbids a client applet to connect to any server other than the one it originated from. Only the postprocessing part (duplicate filtering, ranking etc.) could be done by the client. We would, in fact, avoid the manual update-problem (discussed in 2.2 for client-based meta-searchers), because every download of the Java applet would load the latest version. But first, the postprocessing is the smallest part of the total workload; second, we have the same last-mile-problem which we saw with the client-based meta-searchers; and third, the download of large Java applets is time-consuming.

4.4. The User-Interface and the Quality of Results

Although these two topics seem to be pretty far apart at first glance, they are the most important ones from the user's point of view.

During the operation of our meta-searchengines we have learnt that the design of user interfaces is a continuous process which never ends. Both sides (the user and the maintainer) learn over time. A year ago, about 75% of our users queried with a single search word. This is often not adequate, because in many cases a single word cannot sufficiently describe the problem. Now just 50% ask with a single word; the other half uses two and more words to describe their search.

Another experience is that about 95% of the users do not change any of the default options. We did not expect this. In the beginning we therefore created numerous options, relying upon the users to choose their optimum environment. When we saw that users do not do this, we had to change our defaults so that they fit most of the queries. What the user really wants to know can most often be recognized by an experienced person just by looking at the user's query. A really good user interface should react to the user's question in terms of natural language. An optimum interface should lead "by itself" (i.e. by a dialog with the user) to the necessary quality of the results. Presently we are negotiating an offer to incorporate such software.

The quality of the results we deliver is currently achieved by two means: first, the QuickTips described above, which are under our direct control. Secondly, we do not rely upon the ranking of the underlying searchengines only, but combine it with our own ranking. Our ranking is based on word counts within the title, URL and description of the hits. We combine these numbers with our ranking algorithms and present the results in the order of their ranking numbers within five categories, marked by different colors. Especially the usage of colors to indicate the quality (the more red, the "hotter"/better the hit) was well accepted by our users.
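
The sketch below shows the idea of such a word-count ranking with a mapping to five color categories. The weights and thresholds are invented example values, not our production parameters.

    # Word-count ranking over title, URL and description of a hit, mapped onto
    # five color categories ("red" marking the best hits). Weights are examples.
    def rank_hit(hit, terms):
        score = 0
        for term in (t.lower() for t in terms):
            score += 3 * hit["title"].lower().count(term)   # title counts most
            score += 2 * hit["url"].lower().count(term)
            score += 1 * hit["descr"].lower().count(term)
        return score

    def color_category(score, max_score):
        colors = ["red", "orange", "yellow", "green", "blue"]
        if max_score == 0:
            return colors[-1]
        return colors[int((1.0 - float(score) / max_score) * (len(colors) - 1))]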

All this leads us to the statement: only the combination of the two factors, completeness and quality, will deliver the Internet information the user is really looking for.

Acknowledgment

The authors would like to thank: