One of the main ideas we are following is the combination of the overwhelming mass of Internet data with manually reviewed information sources and our own ranking algorithms. Combining these leads to high-quality results based on an Internet search that is as complete as possible.
Like any database retrieval, an Internet search should be complete. Defining "complete" in an Internet sense is difficult, however: we can never really search "the whole Internet". What we can do is search parts of the Internet almost completely. We will mainly consider the World Wide Web as an information source. If we confine ourselves to that part only, one might ask the following question:
According to http://www.nw.com/zone/WWW-9707/firstnames.html, there were 754,716 Internet hosts with a name starting with www in July 1997. So we are on the safe side in assuming at least approximately 750,000 WWW servers on the Internet. If we now make a rough guess of the average amount of data on a web server, we know the order of magnitude of the complete web data. We analysed a large number of servers and found approximately 10 MByte of data per server. So there may have been about 7,500 GByte of web data in July 1997. Extrapolation to January 1998 (the time of writing) leads to an order of magnitude of about 10,000 GByte of web data. We consider this a conservative estimate, because the number of servers is definitely larger than the number of those whose name starts with www, and the average amount of data on a web server is probably higher than our guess of 10 MByte.
On the other hand we have to consider how much data can be indexed by the most advanced searchengine technology. If we take Altavista (http://www.altavista.digital.com/) as an example of such technology, we might draw the following conclusions:
According to Altavista's own statements, their searchengine indexes 100 million web pages. The indexable part of a web page that is considered nowadays is of course just the textual part, so we need an estimate of the average text content of a web page. Analysing our proxy caches at the University of Hannover, we found that the average web page contains about 4 KByte of text. So Altavista indexes something in the order of 400 GByte. Estimating which portion of the total web text this represents then requires knowledge of the text to non-text (binary) relation of the average web page. That relation is probably the most difficult part to guess. Our Hannover proxy caches show a text share of between 10% and 50% per web page.
With all these estimates in mind, we can calculate the indexed portion from:
(number_of_pages_indexed * text_data_per_page) / (total_web_data * text_to_binary_relation)
Carrying out this calculation with the above values shows that Altavista indexes between 8% and 40% of the total amount of text on the web. Although these calculations are rough estimates only, the result fits well with what we observe when we compare the answers from querying different searchengines.
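The arithmetic behind this range can be reproduced with a short sketch. It uses only the estimates given above (100 million pages, 4 KByte of text per page, 10,000 GByte of web data, a text share of 10% to 50%). The function name and the decimal unit conversion (1 GByte = 1,000,000 KByte) are assumptions made for illustration.

```python
def indexed_portion(pages_indexed, text_kbyte_per_page, total_web_gbyte, text_share):
    """Rough fraction of all web text covered by one search index."""
    # Assumption: decimal units, 1 GByte = 1,000,000 KByte.
    indexed_text_gbyte = pages_indexed * text_kbyte_per_page / 1_000_000
    total_text_gbyte = total_web_gbyte * text_share
    return indexed_text_gbyte / total_text_gbyte


# A high text share means more text on the web, hence a smaller indexed portion.
low = indexed_portion(100_000_000, 4, 10_000, text_share=0.50)
high = indexed_portion(100_000_000, 4, 10_000, text_share=0.10)
print(f"indexed portion: {low:.0%} to {high:.0%}")
```

The wide range comes almost entirely from the uncertainty in the text share, which is why that relation is the most critical guess in the estimate.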
Meta-Searchengine Efficiency
We can calculate a meta-searchengine efficiency by comparing the results found by the meta-service to those found by the searchengine which delivers the most results (the "best" searchengine). If we define:
allHits = sum of all hits found by the meta-searchengine,
bestHits = number of hits found by the "best" searchengine,
duplicates = sum of duplicates found by the meta-searchengine,
an efficiency might be evaluated by:
eff = (allHits-duplicates)/bestHits or
eff = 1 + (allHits-bestHits)/bestHits - duplicates/bestHits
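As a minimal sketch, the two forms of the efficiency can be written as functions and checked against each other. The function names are chosen here for illustration, and the hit counts in the example are hypothetical numbers, not measurements.

```python
def efficiency(all_hits, best_hits, duplicates=0):
    """eff = (allHits - duplicates) / bestHits"""
    if best_hits <= 0:
        raise ValueError("best_hits must be positive")
    return (all_hits - duplicates) / best_hits


def efficiency_expanded(all_hits, best_hits, duplicates=0):
    """eff = 1 + (allHits - bestHits)/bestHits - duplicates/bestHits"""
    if best_hits <= 0:
        raise ValueError("best_hits must be positive")
    return 1 + (all_hits - best_hits) / best_hits - duplicates / best_hits


# Hypothetical example: 300 hits in total, 100 from the best engine, 60 duplicates.
print(efficiency(300, 100, 60))
print(efficiency_expanded(300, 100, 60))
```

The expanded form makes the two contributions visible: the gain from the additional engines and the loss caused by duplicates.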
If we look at the duplicates in detail, we encounter at least two difficulties: first, we have to set up a good algorithm for duplicate recognition, and second, duplicates might be produced by the searchengines themselves (because their duplicate recognition algorithm might not be as good as that of the meta-service). If we look at the rate of duplicates in practice, however, we find that it is in the order of just 10 to 30 percent. For a first estimate we are only interested in orders of magnitude and factors, so we may neglect the duplicates.
No matter how we calculate the efficiency in detail, realistic searches lead to values in the range of 2 to 5, meaning that a meta-searchengine will deliver 2 to 5 times more results than the best single searchengine.
Therefore the question "could one searchengine solve the problem?" can nowadays be answered with a clear no. If even a search automaton like Altavista is not able to index more than 40% of the web, it is certain that manually maintained databases like Yahoo (http://www.yahoo.com/) can never cover significant parts of it. Even the problem of keeping the data up to date is not solvable: investigation of our proxy-cache data shows that within half a year about half of the web addresses are outdated.
If one searchengine cannot solve the problem of Internet information retrieval, we obviously have to query several engines. If we do this in an automated way, we call the resulting automaton a meta-searchengine.
We will not go further into this discussion; from here on we focus on meta-searchengines only.
First we will have a look at the client-based meta-searchengines. These suffer from two shortcomings:
The update-problem results from the fact that the searchengine maintainers tend to change their output format rather often. With every change of that format the postprocessing software of the meta-searchengine needs to be updated. In our experience this happens at least once per month, so an update has to be made every month. Because this is impracticable for the end-user, we feel that client-based meta-searchengines will play no major role in Internet information retrieval, and we will not consider them here any further.
Server-based
From the user's point of view the server-based meta-searchengine just looks like any other searchengine. Before we list the existing meta-searchengines we will discuss some criteria to distinguish and rank these.
| Meta-Searchengine | parallel | merge | noDouble | AndOr | descr. | hide | complete |
|---|---|---|---|---|---|---|---|
| metasearch.com | no | - | - | - | - | - | - |
| www.digiway.com/digisearch | yes | no | no | yes | yes | no | no |
| search.onramp.net | yes | yes | yes | no | no | yes | no |
| www.designlab.ukans.edu/profusion | yes | yes | yes | yes | yes | no | no |
| search.cyber411.com | yes | no | no | no | no | yes | no |
| search.metafind.com | yes | yes | yes | yes | no | partly | no |
| www.inference.com/infind | yes | partly | yes | yes | no | no | no |
| www.dogpile.com | yes | no | no | yes | no | yes | no |
| www.mamma.com | yes | yes | no | yes | yes | yes | no |
| guaraldi.cs.colostate.edu:2000/form | yes | no | no | yes | yes | yes | no |
| www.metacrawler.com | yes | yes | yes | yes | yes | yes | no |
| mesa.rrzn.uni-hannover.de | yes | yes | yes | yes | no | yes | no |
| meta.rrzn.uni-hannover.de | yes | yes | yes | yes | yes | yes | yes |
| www.highway61.com | yes | yes | yes | yes | yes | yes | yes |
Only those services which have a "yes" in each column fulfill the criteria of being a real meta-searchengine. As we can see, at the moment (Jan. 1998) there are just two of those: Highway61 and our MetaGer.
If we examine the boolean operators of each meta-searchengine precisely, we can see that many of them do not perform a consistent AND: they sometimes switch to OR (to be precise, some of the underlying searchengines do that if they cannot find anything matching the AND search, and the meta-service does not filter that out). This happens without any warning or notification to the user. Such behaviour might be acceptable for a searchengine, in order to give the user at least some results. But it is unacceptable for a meta-searchengine, because these results might then be mixed up with true AND hits. Even Highway61 shows this inadequate behaviour, so that presently just one meta-searchengine remains which fulfills all criteria in a strict manner. We will describe MetaGer (http://meta.rrzn.uni-hannover.de/) in the following.
Another problem became apparent to us in spring 1997: finding people's e-mail addresses seemed to be a common difficulty. We solved it by taking the framework of MetaGer and implementing MESA, the MetaEmail SearchAgent for international meta-search.
Contact with our users was very important to us right from the beginning: we tried to learn what their needs were, and we tried to incorporate their ideas into our searchengine. A link-checker was implemented, and we added the option of an international search by querying Highway61. After several months of usage we carefully analysed our users' interests and demands and responded by inventing the so-called "QuickTips" (see 4.1).
The software primarily runs on a Unix machine (ReliantUNIX Version 5.43) sponsored by Siemens-Nixdorf, an RM600 with 2 R4400 CPUs, 512 MB RAM and 100 GB of disk space on a 34-Mbps ATM network interface. When implementing the software, we tried to use the programming language best suited to each task: we are using C, perl, awk, sed, Tcl/Tk and Bourne-Shell. At the time of writing this paper, our RM600 machine runs at its maximum capacity, and we are presently implementing a load distribution which automatically transfers user queries to background Unix machines (SunUltra, Solaris 2.5.1) during high-load periods.
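One conceivable way for a meta-service to filter out silent OR fallbacks is to check every returned hit against all query terms. The following is a minimal sketch of that idea only; the source does not say how MetaGer enforces a strict AND. The hit structure (a dictionary with `title`, `url` and `description`) and the function names are assumptions.

```python
def matches_all_terms(hit, terms):
    """True if every query term occurs in the text the meta-service can see."""
    # Assumption: a hit is a dict with optional "title", "url", "description".
    visible = " ".join(
        hit.get(field, "") for field in ("title", "url", "description")
    ).lower()
    return all(term.lower() in visible for term in terms)


def strict_and(hits, terms):
    """Keep only hits that satisfy the AND query on their visible text."""
    return [hit for hit in hits if matches_all_terms(hit, terms)]
```

This check is conservative: a page that contains all terms only in its body text, which the meta-service never sees, would be dropped as well. Plain substring matching is also cruder than real word matching.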
Some of the most common problems of the above steps are described in the following.
Converting the query into the correct syntax for every underlying search service reveals another problem. Each search service uses a different query language and, even more importantly, each service offers different options. If a meta-searchengine wants to be transparent (i.e. truly hide the underlying searchengines), it can only offer options that every one of the underlying services offers (e.g. not every service gives us the possibility to perform a string search). Furthermore, it is sometimes difficult to collect the HTML form parameters that are necessary to get the results (e.g. hidden parameters with undetermined values). But even these efforts are sometimes not successful, because the service maintainer expects the HTTP request for the results to contain exactly the information that e.g. Netscape Navigator sends. The tool webtee (http://www-cache.dfn.de/Cache/Software_webtee.html) is well suited for analysing such situations.
Waiting for the results from the underlying services is another topic worth looking at. How can we keep a user waiting? For MetaGer, we specified a default maximum search time of 40 seconds. During this time we regularly report how much time remains and why the user has to wait. Users are more willing to wait if they know how long and why.
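The waiting strategy can be sketched as a parallel query with a fixed deadline and periodic progress reports. This is an illustration under stated assumptions, not the MetaGer implementation (which uses C, perl and shell tools): `engines` is a hypothetical mapping from a service name to a function that returns its hits, and `report` is a hypothetical callback that shows a message to the user.

```python
import concurrent.futures
import time


def query_all(engines, query, max_seconds=40.0, tick=1.0, report=print):
    """Query all engines in parallel and stop waiting at the deadline."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(engines)))
    futures = {pool.submit(fetch, query): name for name, fetch in engines.items()}
    deadline = time.monotonic() + max_seconds
    pending = set(futures)
    while pending:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        # Wait a short interval, then tell the user what is still outstanding.
        _, pending = concurrent.futures.wait(pending, timeout=min(tick, remaining))
        if pending:
            names = ", ".join(sorted(futures[f] for f in pending))
            left = max(0.0, deadline - time.monotonic())
            report(f"waiting for {names}; at most {left:.0f}s left")
    results, timed_out = {}, []
    for future, name in futures.items():
        if not future.done():
            timed_out.append(name)          # missed the deadline
        elif future.exception() is None:
            results[name] = future.result()  # failed engines are silently skipped
    pool.shutdown(wait=False)  # do not block on engines that are still running
    return results, sorted(timed_out)
```

The default of 40 seconds mirrors the maximum search time stated above; the one-second reporting interval is an arbitrary choice.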
After launching MetaGer, however, our experience was the opposite: maintainers asked us to add their service to MetaGer. This might be because the German search services were fairly new at that time and expected some advertising effect from appearing in a service run by an independent university organization. A few services (e.g. the e-mail search service Four11 and the German catalog web.de) solve the advertisement problem in their own way (which is good for them, not for us): they do not include the original Internet addresses in the result pages but offer a link to another address, which then reveals the correct address. This of course makes it impossible for a meta-searchengine to combine these results, but it enables the service maintainer to show an advertisement under any circumstances.
Another problem is the counting procedures for web sites. Some of the services queried by MetaGer use the counting service of the German IVW, which is a member of IFABC (International Federation of Audit Bureaux of Circulations). IVW is an independent organization which provides measured numbers of PageImpressions (PageViews) and Visits. It gives the customer who places an advertisement on a webserver a certain guarantee that it is seen by the measured number of clients. The IVW counting relies on the download of a small image. So we agreed to download
If someone does a query at our meta-searchengine, we check these two sources. We decided to incorporate the DNS after analysing our logfiles: many inexperienced users search for terms which can easily be found as part of the DNS. This holds especially for queries for companies: most companies have a webserver named www.Company.com etc. For queries with two or more words we look for combinations of the searchwords, like www.word1-word2.com etc. To increase the speed of response, these DNS lookups run in parallel to the meta-search. From our users' feedback we can conclude that about 75% of them are really happy with the so-called "QuickTip-search". If a DNS lookup leads to a useless entry (e.g. someone has reserved a name without using it), we can exclude these flops with a manually maintained stoplist file.
Our own local database, however, relies on a different strategy. We know that we do not have the manpower to maintain a catalog like Yahoo. So we decided to put only those entries into our database which have been searched for with a certain frequency. On the other hand, we found from our logfiles that even frequently searched words have a very low share (about 0.5%). What we can do, however, is react to current events. These events show up as queries in our service, like the landing of Pathfinder on Mars or heavy snowfall in Germany. When we notice queries related to such events, we put entries into our database which lead to corresponding webpages.
The QuickTips are mainly a help for the inexperienced user. On the other hand, we have users who do really sophisticated queries. After checking our log files we estimate that the portion of such users is just in the range of a few percent. Even so, these "power users" are our multipliers: if they spread the word that our service is "good", their word counts and brings us many new users. For the experienced user we proceed as described in the following section.
This is exactly the project we are working on: we are in the process of building a tool which automatically generates special-purpose searchengines from input keywords given by the users. We are aware that such a capability may result in really heavy network load. Therefore, this tool will never be open to the general public. Every single user of this technique must be authorized by us, and we will be very cautious in granting such authorizations. The project is sponsored by the Verein zur Förderung eines Deutschen Forschungsnetzes e.V. - DFN-Verein within the DFN-Expo Project. If we call a meta-searchengine a second-order engine (the normal searchengines being first-order engines in that terminology), we might call this type of searchengine a third-order engine. We therefore named it "Level3".
The first engine is supposed to be an international meta-searchengine, but we found that service to be out of order most of the time, so we stopped considering it for our work. The latter is a German service, querying German sources only. That service started with a pure Java interface. But after its maintainers realized that many users do not have a Java-capable browser, or have switched Java off, the Java applet is now offered only as an option. Does the usage of Java offer any advantages for the searchengines?
We answer this question with a clear no. The idea of downloading a Java applet is similar to the idea of the client-based meta-searchengines: the local system should carry the workload. But this again does not help here: the searchengines have to do most of their work by extracting data from their database, which must be done on the server. The meta-searchengines have to do most of their work by extracting data from the searchengines over the net. The Java philosophy does not allow the client applet to connect to any server other than the one it originated from. Only the postprocessing part (duplicate filtering, ranking etc.) could be done by the client. We would, in fact, avoid the manual update-problem (discussed in 2.2 for client-based meta-searchers), because every download of the Java applet would load the latest version. But first, the postprocessing is the smallest part of the total workload; second, we have the same last-mile-problem which we experienced with the client-based meta-searchers; and third, the download of long Java applets is time-consuming.
During the operation of our meta-searchengines we have learnt that the design of user interfaces is a continuous process which never ends. Both sides (the user and the maintainer) learn over time. A year ago, about 75% of our users queried with a single searchword. This is often not suitable, because in many cases a single word cannot sufficiently describe the problem. Now just 50% ask with a single word; the other half uses two or more words to describe their search.
Another experience we have made is that about 95% of the users do not change any of the default options. We did not expect this. In the beginning we had therefore created numerous options, relying on users to choose their optimum environment. When we saw that users did not do this, we had to change our defaults so that they fit most of the queries. What the user really wants to know can most often be recognized by an experienced person just by looking at the user's query. A really good user interface should react to the user's question in natural language. An optimum interface should lead "by itself" (i.e. by a dialog with the user) to the necessary quality of the results. Presently we are in the process of negotiating an offer to incorporate such software.
The quality of the results we currently deliver comes from two sources. The first is the QuickTips described above, which are under our direct control. Second, we do not rely upon the ranking of the underlying searchengines only, but in addition combine it with our own ranking. Our ranking is based on word counts within the title, URL and description of the hits. We combine these numbers using our ranking algorithms and present the results in the order of their ranking numbers within five categories, marked by different colors. Especially the use of colors to distinguish the quality (the more red, the "hotter"/better the quality) was well accepted by our users.
All this leads us to the statement: only the combination of the two factors completeness and quality will result in the Internet information the user is really searching for.
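The DNS part of the QuickTips can be sketched as follows. The host name patterns www.Company.com and www.word1-word2.com come from the text; everything else is an assumption made for illustration: the list of top-level domains, the second joiner (words written together), the restriction to pairs of words, and all function names. The sketch runs the lookups sequentially, whereas the service described above runs them in parallel to the meta-search.

```python
import socket
from itertools import permutations


def candidate_hosts(words, tlds=("com", "de"), joiners=("-", "")):
    """Build plausible web server names from the search words."""
    words = [w.lower() for w in words if w]
    if len(words) == 1:
        labels = list(words)
    else:
        # Assumption: only pairs of search words are combined.
        labels = [a + j + b for a, b in permutations(words, 2) for j in joiners]
    hosts = []
    for label in labels:
        for tld in tlds:
            host = f"www.{label}.{tld}"
            if host not in hosts:
                hosts.append(host)
    return hosts


def resolves(host):
    """True if the name has an address record in the DNS."""
    try:
        socket.gethostbyname(host)
        return True
    except OSError:  # includes socket.gaierror for unknown names
        return False


def quicktips(words, stoplist=frozenset(), lookup=resolves):
    """Return existing candidate hosts that are not on the manual stoplist."""
    return [h for h in candidate_hosts(words) if h not in stoplist and lookup(h)]
```

Passing the lookup function as a parameter keeps the candidate generation testable without network access. Search words containing characters that are not valid in host names would need additional filtering, which is omitted here.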
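A ranking of this kind can be sketched as follows. Only the inputs (word counts in title, URL and description) and the five categories come from the text. The field weights, the mapping from score to category, and the hit structure are assumptions, and the actual MetaGer ranking algorithms are not described in this paper.

```python
# Assumed weights: a match in the title counts more than one in the description.
WEIGHTS = {"title": 3, "url": 2, "description": 1}


def score(hit, terms):
    """Weighted count of query term occurrences in title, URL and description."""
    total = 0
    for field, weight in WEIGHTS.items():
        text = hit.get(field, "").lower()
        total += weight * sum(text.count(term.lower()) for term in terms)
    return total


def rank(hits, terms, categories=5):
    """Sort hits by score and assign a category from 1 (best) to `categories`."""
    scored = sorted(((score(h, terms), h) for h in hits),
                    key=lambda pair: pair[0], reverse=True)
    if not scored:
        return []
    top = scored[0][0]
    ranked = []
    for points, hit in scored:
        if top == 0:
            category = categories  # nothing matched at all
        else:
            # Integer arithmetic: the further below the top score, the lower the category.
            category = min(categories, 1 + (top - points) * categories // top)
        ranked.append((category, hit))
    return ranked
```

Each category would then be mapped to one display color, with category 1 as the "hottest". Counting substrings is a simplification; a real implementation would need proper word matching.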