The World Wide Web had humble beginnings: It was intended as a simple way to share fairly simple information (notes, room schedules, phone directories) within an organization (CERN) and perhaps between like-minded organizations. It quickly outgrew those beginnings to encompass the sharing of all sorts of information between all sorts of people, and much more besides. Today, people do many things that feel quite different from reading and publishing documents: They chat, play games, shop, or use programs such as word processors or spreadsheets.
As amazing as the Web is, it remains, at heart, a Web primarily for end-user consumption. HTML allows authors to describe the structure of a page in limited, mostly document-oriented ways. HTML, of course, allows for hyperlinks, and we need look no further than Google to see how this human-generated, human-intended information can be exploited to the benefit of human readers: By interpreting each link to a Web page as a vote for that page’s importance, the search engine can prioritize its results far better than any prior attempt.
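The “links as votes” idea can be made concrete with a small sketch. The code below is a simplified, PageRank-style iteration written for illustration only; it is not a description of how any real search engine is implemented. The page names, the `rank_pages` function, and the damping and iteration parameters are all assumptions introduced here.

```python
def rank_pages(links, damping=0.85, iterations=50):
    """Score pages by treating every hyperlink as a vote for its target.

    `links` maps a page name to the list of pages it links to.
    This is an illustrative simplification, not a production algorithm.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score regardless of votes.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web in which three pages link to "c".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = rank_pages(links)
best = max(ranks, key=ranks.get)
print(best)
```

In this toy graph the page with the most incoming links ends up with the highest score, and the page it links to inherits much of that importance.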
The idea of the Web as a collection of documents has served us very well. But there are places where that idea creaks. Not everything published on the Web is published for humans first, last, and always, and some things published for humans first also have a useful life as pure data. For example, while online maps are typically provided by one Web site (e.g., Google Maps), people want to mix them with data from another. Whether one is plotting locations from Craigslist’s real estate ads or finding where a photo from one’s last vacation was taken, the ability to get at the data of a Web page in a form amenable to programmatic manipulation is critical to making such “mash-ups” robust, reliable, and easy to produce. The alternative, to wit, “scraping” data out of human-oriented HTML, is rather difficult and fraught with pitfalls. Essentially, the consuming program has to filter out the irrelevant parts of the page (ads or narrative text) and interpret the remaining HTML as the sort of data in question, typically by reverse engineering the output of the generating program. Since both the irrelevant parts of a page and the presentational structure of the data on the page tend to change a lot between pages and on the same page over time, the consuming program is faced with a Sisyphean task. Fortunately, there are several popular, well-supported formats for data exchange, and Web site publishers are increasingly accustomed to providing “data” views of their Web sites.
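The contrast between scraping and consuming a “data” view can be sketched as follows. Both snippets of content, the class names (`addr`, `lat`, `lon`), and the helper names `scrape` and `load` are invented for this illustration; JSON is used as one example of a data exchange format, and real sites will differ.

```python
import json
from html.parser import HTMLParser

# Hypothetical listing published two ways: for people (HTML) and for programs (JSON).
HTML_VIEW = """
<div class="ad">Great deals on sofas!</div>
<table>
  <tr><td class="addr">12 Oak St</td><td class="lat">53.47</td><td class="lon">-2.23</td></tr>
</table>
"""
DATA_VIEW = '[{"address": "12 Oak St", "lat": 53.47, "lon": -2.23}]'


class ListingScraper(HTMLParser):
    """Recovers listings from the HTML view by relying on its current class names."""

    FIELDS = {"addr": "address", "lat": "lat", "lon": "lon"}

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field name of the table cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append({})
        elif tag == "td":
            # Cells with an unknown class are ignored, as is everything outside cells.
            self._field = self.FIELDS.get(dict(attrs).get("class"))

    def handle_data(self, data):
        if self._field and self.rows:
            value = data.strip()
            self.rows[-1][self._field] = value if self._field == "address" else float(value)

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None


def scrape(html_text):
    """Scraping route: filter out the noise and reinterpret presentation as data."""
    scraper = ListingScraper()
    scraper.feed(html_text)
    return scraper.rows


def load(data_text):
    """Data-view route: the publisher has already done the interpretation."""
    return json.loads(data_text)


print(scrape(HTML_VIEW) == load(DATA_VIEW))

# A purely cosmetic change to the page silently breaks the scraper.
restyled = HTML_VIEW.replace('class="addr"', 'class="street"')
print(scrape(restyled))
```

The scraper needs a page-specific class and still loses the address as soon as the publisher renames one CSS class, while the data view is read in a single call.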
Thus, we now have data “on” the Web, but this shift does not fundamentally move us away from the Web as a collection of documents. Each piece of data is like a little document, and thus suffers the same problems as data in HTML on the Web: The data aren’t hyperlinked (so the data are on the Web, not “of” the Web), and a programmer has to interpret the data in order to write a program that uses them sensibly.
There is an alternative conceptualization of the Web that aims to overcome these problems; that is, to make a true Web for programs that is on a par with the Web for people. This conceptualization is known as the Semantic Web. Like the Web, the Semantic Web has enabling technologies addressing the key goals of linking and meaning: The Resource Description Framework (RDF) is a “Web native” data model that deeply incorporates hyperlinking. The Web Ontology Language (OWL) is a “Web native” ontology language that extends RDF with the ability to write logic-based descriptions of things, so that an automated reasoning tool can draw conclusions about data incorporating these descriptions.
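To give a feel for what a hyperlinked data model means, here is a minimal sketch that represents RDF-style statements as plain (subject, predicate, object) tuples. It uses no RDF library and glosses over details such as the distinction between URIs and literals; every `http://example.org/` identifier and the helpers `objects` and `follow` are hypothetical names chosen for the example.

```python
EX = "http://example.org/"  # placeholder namespace, assumed for illustration

# Statements published by one (hypothetical) site about a real estate listing.
listings = {
    (EX + "listing/42", EX + "vocab/locatedIn", EX + "place/Manchester"),
    (EX + "listing/42", EX + "vocab/price", "250000"),
}

# Statements published by another (hypothetical) site about places.
places = {
    (EX + "place/Manchester", EX + "vocab/partOf", EX + "place/England"),
}


def objects(graph, subject, predicate):
    """Return every object of the statements matching subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def follow(graph, start, *predicates):
    """Follow a chain of predicates from a starting node, link by link."""
    current = {start}
    for predicate in predicates:
        current = {o for node in current for o in objects(graph, node, predicate)}
    return current


# Because both sources name Manchester with the same URI, merging is set union.
merged = listings | places
region = follow(merged, EX + "listing/42", EX + "vocab/locatedIn", EX + "vocab/partOf")
print(region)
```

The point of the sketch is that the shared identifier does the work of a hyperlink: no custom glue code is needed to connect the listing data with the place data.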
OWL is based on a family of logics, the so-called “description logics”, which have played a prominent role in the field of knowledge representation, especially in bio-medical informatics, for over 30 years. The University of Manchester has played a key role in the development of these logics, from theory through implementation to application, and in their standardization in the form of OWL and OWL 2. Prof. Uli Sattler, in collaboration with Prof. Ian Horrocks (then at the University of Manchester, now at Oxford University), designed the logic and reasoning techniques for the description logics underlying both OWL (the logic “SHOIQ”) and OWL 2 (the logic “SROIQ”). The difficulty in logic engineering for ontologies is allowing sufficient expressivity to be useful (e.g., so that modellers can say that a foot has 5 toes which are part of it and that it, in turn, is part of a leg) while keeping the reasoning procedure computationally reasonable (so that we can write reasoners that can figure out that each toe is also part of a leg, and do so before the heat death of the universe). FaCT++, an OWL 2 reasoner developed at Manchester, is the practical realization of their design and can handle such enormous and complex ontologies as the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), a key component of many national clinical information management systems.
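The toe, foot, and leg inference can be illustrated with a toy sketch. The code below simply computes the transitive closure of a “part of” relation; it is in no way how FaCT++ or any description logic reasoner works, and it ignores the cardinality statement (“has 5 toes”) entirely. The names `transitive_closure`, `asserted`, and `inferred` are assumptions made for the example.

```python
def transitive_closure(pairs):
    """Return all (part, whole) pairs implied by treating the relation as transitive."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # If a is part of b and b is part of d, then a is part of d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# What the modeller states explicitly: five toes are part of a foot, a foot of a leg.
asserted = {(f"toe{i}", "foot") for i in range(1, 6)} | {("foot", "leg")}

# What follows without being stated: each toe is part of the leg.
inferred = transitive_closure(asserted) - asserted
print(sorted(inferred))
```

Even this naive loop gets slow on large inputs, which hints at why the trade-off between expressivity and tractable reasoning matters for ontologies with hundreds of thousands of terms.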
Trying to build something on the scale and nature of the Web, especially when the Web already exists, is a grand, perhaps grandiose, project. The Web, of course, grew more than it was built, and perforce so will the Semantic Web. Like an economy, the Web is the result of millions of people performing millions of interactions focused on their specific interests and needs. Similarly, languages like OWL 2 will succeed if they meet specific needs and do so well. As the bio-medical community standardizes on OWL 2 and pushes its boundaries, we can see in that microcosm what the Semantic Web might one day be like. And, unlike utopian fairy tales, that microcosm can make a valuable direct contribution to human welfare.