The State of the Art of Searching the Web

7th German-American Frontiers of Engineering Symposium, Washington, D.C., 28 April - 1 May 2004, Wolfgang Sander-Beuermann

Presentation

Abstract:

The state of the art of searching the web is defined largely by the capabilities and shortcomings of the available search engines. Consequently, this presentation starts with an overview of search engine technology. Although this technology is currently nearly monopolized by one company, the overview will include the different approaches of all globally active players in this field. Furthermore, several technological niches will be highlighted, along with their potential impact on the search engine market. We will share experiences and directions gained during the development of our own search engines built at SearchEngineLab (http://metager.de/suma-eng.html) of the University of Hannover, Germany.

Currently, the discussion of the future of searching the web is dominated by the term "semantic web", introduced by the "inventor" of the web himself, Tim Berners-Lee. The concept is based on annotated metadata (XML/RDF). However, since the metadata depend strongly on the contributions of web authors, the concept seems fragile. Additionally, it shows inherent limitations as a general approach: defining an ontology for the whole world of being has not succeeded since the philosophers of antiquity began addressing this problem. Therefore, the "semantic web" concept will probably offer solutions for well-defined niches only.

For a general approach to introducing semantics into web search, new methods still have to be found. One possible method is an advanced analysis of the pure text of web documents, based on associative relations between terms (we call it the "associated web"). First results of this new technology will be shown.
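To make the idea of associative relations between terms more concrete, the following is a minimal sketch, not the method used at SearchEngineLab. It assumes that association can be approximated by document-level co-occurrence, scored with the Dice coefficient; the function names, the toy documents, and the choice of the Dice coefficient are all illustrative assumptions.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Return the set of lowercase alphabetic terms in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def dice_associations(docs):
    """Score each term pair by how often the two terms share a document.

    Assumption: association strength is the Dice coefficient
    2 * df(a, b) / (df(a) + df(b)), where df counts documents.
    """
    term_df = Counter()
    pair_df = Counter()
    for doc in docs:
        terms = tokenize(doc)
        term_df.update(terms)
        pair_df.update(combinations(sorted(terms), 2))
    return {
        (a, b): 2 * both / (term_df[a] + term_df[b])
        for (a, b), both in pair_df.items()
    }


def related(term, scores, top=3):
    """Return the terms most strongly associated with `term`."""
    hits = []
    for (a, b), score in scores.items():
        if term == a:
            hits.append((b, score))
        elif term == b:
            hits.append((a, score))
    # Highest score first; ties are broken alphabetically.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))[:top]


# Hypothetical toy corpus, standing in for the plain text of web documents.
DOCS = [
    "jaguar car engine speed",
    "jaguar cat jungle",
    "car engine repair",
    "cat jungle animal",
]

scores = dice_associations(DOCS)
print(related("car", scores))
```

In this toy corpus "car" and "engine" always appear together, so they receive the highest possible score, while the ambiguous term "jaguar" is linked more weakly to both the vehicle and the animal vocabulary. A search engine could use such links to suggest or weight related terms without relying on author-supplied metadata.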

Furthermore, it seems questionable whether the introduction of semantics is actually the most important challenge in today's search engine research: the largest part of the web is the "deep" or "invisible" web, created dynamically out of databases. It is currently nearly unknown to all search engines, and nobody knows its size or its importance. In general, knowledge held in databases is rated as more important because it is reviewed and well-structured information. Developments at SearchEngineLab are therefore specifically targeted at the "invisible web", which is already being gathered by our search engine for the German scientific web, http://researchportal.net

Another challenge that has remained open through decades of information retrieval is an intelligent interactive interface. It should act like a human informant who, through some kind of dialog, is able to understand what the user is really looking for. During this dialog the user's query is carefully worded, and only when that process is finished is it sent to the search engine. Current approaches and developments in this field will be outlined, although a general solution to this outstanding challenge will surely take several more decades.
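The dialog described above can be illustrated with a deliberately simple sketch. It assumes a hypothetical table of ambiguous terms and their possible senses, and a callback that stands in for the user's answer; neither is taken from an existing system, and a real interface would need far more than keyword lookup.

```python
def refine_query(query, senses, ask):
    """Refine a query through clarifying questions before it is submitted.

    senses: hypothetical mapping from an ambiguous term to candidate senses.
    ask:    callable(question, options) returning the chosen option or None.
    """
    refined = []
    for word in query.split():
        refined.append(word)
        options = senses.get(word.lower())
        if options:
            choice = ask(f"Which sense of '{word}' do you mean?", options)
            if choice is not None:
                # Add the chosen sense as an extra, disambiguating term.
                refined.append(choice)
    # Only the finished wording is handed on to the search engine.
    return " ".join(refined)


# Hypothetical sense table; a scripted answer replaces interactive input.
SENSES = {"jaguar": ["car", "animal"]}


def always_first(question, options):
    """Stand-in for the user: always pick the first offered sense."""
    return options[0]


print(refine_query("jaguar speed", SENSES, always_first))
```

The point of the sketch is the ordering: the query is reworded during the exchange, and nothing is sent to the search engine until the dialog has finished.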

The presentation will conclude with a summary of the important current developments in search engine technology.