Friday, June 29, 2007

re-post: The Wilkinson Lab Semantic Web Declaration of Independence

This was the most controversial of all the posts I put up on my last blog. I screwed up and lost the comments that were made on it (Sorry Chris! This really wasn't on purpose!). Chris Mungall objected vehemently to this post. His arguments were (I am paraphrasing, but Chris, please re-iterate your arguments here if I get them wrong) that (a) the GO *is* an ontology, and to say it isn't an ontology is complete crap, and (b) that an ontology should be as big as it needs to be, and that to put artificial limitations on the size of an ontology is absurd and narrow-minded. I responded that we were using different definitions of "ontology" - that I was talking about the narrow definition from an OWL-DL perspective, where all classes are precisely defined as restrictions on their properties - but I concede that he is absolutely correct that this is, as he points out, a *very* narrow definition.

Having said all that, I have re-read this post over and over again, and I still believe most/all of what I said, so... here it is in all its glory:

As most of you readers will know, my lab is somewhat obsessed with ontologies. It has become apparent over the past year, however, that we have views that are not shared by a large portion of the Semantic Web in Life Science community. I've made several of my more contentious viewpoints clear in presentations and papers, and these have been variously referred to as "inflammatory", "simplistic", or even "showing a lack of understanding about what the Semantic Web is".

Well, I'm in the mood to "put a stake in the ground" today and make some additional statements that I've been pussyfooting around for the past year. In part, I want to say these things because I honestly believe them and I hope that they might be interesting perspectives for others to think about; in part, it's because I think some of these ideas are quite novel, but not sufficiently well-supported for me to put into a publication or a position paper; and in part, it's because I simply enjoy rocking the boat from time to time :-)

So... here goes!

The Wilkinson Lab Declaration of Semantic Web Independence

We hold these truths to be self-evident

1. Ontologies are a path to a goal, not the goal itself.

Though I do think that some ontologies (e.g. upper ontologies) should be well-engineered and static, I think that the majority of ontologies that we are building today are simply too "heavy". Just follow the is-a hierarchy below...

  • Ontologies are World Views

Clay Shirky said it best. Ontologies embody the bias of the belief of the moment, and moreover try to predict the future according to that view. Views change. The world changes. Legacy is a huge problem! (The fact that we tend to use these transient world views to annotate our ~permanent data stores makes me nervous, but the solution to that belongs in another post).

  • World Views are Hypotheses

...with apologies to the "Ontology of Biolocal Reality" ;-)

  • Hypotheses are Queries

This comes as no surprise, since the idea that a database query represents a hypothesis has been around for years! (e.g. query-based data mining; Ben's comment below points to another example). However, about a year ago I took the transitive closure of the above three statements in a presentation to the National Heart Lung and Blood Institute in Bethesda - "Ontologies are queries" - and was very nearly laughed off the podium by the mega-ontology audience. Now, I'm not saying that all ontologies are queries, but I think we need to start perceiving them increasingly as such. Interestingly, this sweeping statement necessarily excludes ontology-like hierarchical vocabularies such as the Gene Ontology, since its "classes" are not defined; strictly speaking, the GO is not an ontology, and was only intended to be a "dynamic, controlled vocabulary that can be applied to all eukaryotes". As such, I think my "rule" still holds - that formal ontologies are, by and large, just queries. Interestingly, Luciano and Stevens recently made the same assertion in their paper on the semantics of biological pathways, though I don't think they quite made enough song and dance about it. It's an idea that I think needs to be emphasized more than it is...

  • Queries are Disposable

I can count on one hand the number of queries I have ever saved and re-used, other than those that are embedded in my code.

  • Therefore Ontologies are (should be) Disposable!

We really need to move to a point where the ontology is simply a transient tool that is used to discover appropriate data somewhere in the universe, rather than the ontology being the end-point in itself. Ontologies have got to be cheap, lightweight, and disposable. Let's put a number on it... say... $10K. I'll stick my neck out and say that if an ontology costs more than $10K to produce, then it has cost too much, since this is about how much it costs to do a simple biological pilot study and there's no reason that a computational hypothesis should cost more to develop than a biological one.
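To make the "ontologies are queries" step of the chain above a bit more concrete, here is a minimal Python sketch. It assumes the narrow, OWL-DL-style view in which a class is nothing more than a set of restrictions on properties. The data, the property names, and the `classify` helper are all made up for illustration; this is not how any real reasoner is implemented.

```python
# Hypothetical data: each individual is a plain dict of property -> value.
individuals = [
    {"id": "p1", "organism": "human", "systolic_bp": 165},
    {"id": "p2", "organism": "human", "systolic_bp": 118},
    {"id": "m1", "organism": "mouse", "systolic_bp": 170},
]

# A "class" defined purely as restrictions on properties (assumed names).
# Each restriction is a test that a property value must pass.
hypertensive_human = {
    "organism": lambda v: v == "human",
    "systolic_bp": lambda v: v is not None and v >= 140,
}


def classify(class_definition, data):
    """Return the ids of individuals that satisfy every restriction.

    Seen this way, the class definition is just a filter, i.e. a query.
    """
    return [
        ind["id"]
        for ind in data
        if all(test(ind.get(prop)) for prop, test in class_definition.items())
    ]


print(classify(hypertensive_human, individuals))
```

Throwing away `hypertensive_human` and writing a different definition costs a few lines, which is the sense in which such a "class" is as disposable as a query.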

2. Reasoning is your problem, not mine!

To the mega-ontology crowd I'd like to say "take that!" and hand them one of my instances. If you're going to build an ontology with 50,000 classes and 10,000 relationships, please don't expect me to download it and reason over it. It's just not practical, and I cannot see the semantic web functioning that way in the long run. It seems to me that, as the provider of an ontology, it could/should be your responsibility to provide a reasoning service that consumes my individuals and adds the rdf:type tag to them. Not only would the semantic web work better (IMO) this way, but it would make people think twice about building mega-ontologies ;-)
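As a rough sketch of what such a provider-side reasoning service might look like: the client hands over triples about its individuals, and the provider hands back the same triples plus inferred rdf:type statements. Everything here is an assumption for illustration (the `example.org` URIs, the `ONTOLOGY` table, and the `reasoning_service` function), and the "reasoning" is only a subset test, nowhere near real OWL-DL inference.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/onto/"  # hypothetical namespace

# Provider side: class definitions stay on the provider's server.
# Each class is defined by the (predicate, object) pairs an individual must have.
ONTOLOGY = {
    EX + "Kinase": {(EX + "catalyses", EX + "Phosphorylation")},
    EX + "MembraneKinase": {
        (EX + "catalyses", EX + "Phosphorylation"),
        (EX + "locatedIn", EX + "Membrane"),
    },
}


def reasoning_service(triples):
    """Consume a client's triples and return them plus inferred rdf:type triples."""
    facts_by_subject = {}
    for s, p, o in triples:
        facts_by_subject.setdefault(s, set()).add((p, o))

    inferred = {
        (s, RDF_TYPE, cls)
        for s, facts in facts_by_subject.items()
        for cls, required in ONTOLOGY.items()
        if required <= facts  # every restriction is satisfied
    }
    return set(triples) | inferred


# Client side: I only send my instance; I never download the ontology.
my_instance = [("urn:my:protein1", EX + "catalyses", EX + "Phosphorylation")]
for triple in sorted(reasoning_service(my_instance)):
    print(triple)
```

The point of the design is where the cost lands: the client's code never touches `ONTOLOGY`, so the size of the ontology is the provider's problem.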

3. LSIDs provide a great way of solving the ontology-segmentation problem

In the meantime, while we still have mega-ontologies and still need to download and reason over them, ontology segmentation keeps coming into my mind as a possible solution. But how does it fit into the semantic web vision? Well, we'd need a way of naming individual nodes in an ontology without using #document_fragments (since those are interpreted client-side). One possibility is to use path-based URIs; however, I'm a bit concerned that we will be tempted to put the is-a hierarchy into the path and then use it in "casual reasoning", which would be nasty. What I really like is the idea of naming nodes by LSID, and having the LSID metadata resolution return only the portion of the ontology that is relevant to the interpretation of that node. I should probably write an entire post on this issue, since there are all sorts of additional reasons that I have come to this conclusion...
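Here is a small sketch of the segmentation idea, under some loud assumptions: the ontology is reduced to a toy is-a table, "relevant" is taken to mean "the node plus its ancestors", and `segment_for` is a hypothetical stand-in for whatever the metadata resolution would actually return. It does not use any real LSID resolver library. Only the general `urn:lsid:authority:namespace:object[:revision]` shape of an LSID is taken from the LSID specification.

```python
# Toy ontology: each node (named by a made-up LSID) maps to its is-a parents.
PARENTS = {
    "urn:lsid:example.org:onto:kinase": ["urn:lsid:example.org:onto:enzyme"],
    "urn:lsid:example.org:onto:enzyme": ["urn:lsid:example.org:onto:protein"],
    "urn:lsid:example.org:onto:protein": [],
    "urn:lsid:example.org:onto:lipid": [],
}


def parse_lsid(lsid):
    """Split an LSID into its parts, raising ValueError if it is malformed."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }


def segment_for(lsid):
    """Return only the is-a edges needed to interpret one node (hypothetical)."""
    parse_lsid(lsid)  # validate the name before walking the hierarchy
    edges, stack, seen = [], [lsid], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for parent in PARENTS.get(node, []):
            edges.append((node, "is_a", parent))
            stack.append(parent)
    return edges


for edge in segment_for("urn:lsid:example.org:onto:kinase"):
    print(edge)
```

Note that the hierarchy lives in the returned metadata, not in the name itself, so there is nothing in the identifier to tempt anyone into "casual reasoning".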

4. Ontological predicates can be thought of as, and mapped to, Web Services.

This is the core of the CardioSHARE architecture. It stemmed from the observation that, if we "surf" through BioMoby services and generate an RDF document of the data that goes into and out of a service, the predicates that join the input and output have a relationship to the Web Service that generated them... so why not just turn this on its head and say that the Subject of the SPO triple represents the input, the Predicate of the SPO triple represents a Web Service, and the Object of the SPO triple represents the output? In that case, OWL individuals of the Subject class can be fed into a Web Service identified by the OWL predicate, and the output of that Web Service should be individuals of the Object OWL class. Moreover, knowing the type of input and the type of service (the subject and predicate) is precisely a BioMoby registry query, so the process of discovering and transacting that service could be automated. A little wrapper around DIG and we're off to the races, having merged the disparate worlds of the Semantic Web and Web Services!! Hopefully I'll be able to post a link to the prototype of this architecture within the next couple of months...
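The subject/predicate/object mapping can be sketched in a few lines of Python. To be clear about the assumptions: the `REGISTRY` dict is a stand-in for a registry lookup keyed on (input class, predicate), `encodes_service` is a fake local function standing in for a remote Web Service, and the identifiers are invented. None of this is the actual BioMoby or CardioSHARE API.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


def encodes_service(gene_id: str) -> List[str]:
    """Fake Web Service: takes a Gene individual, returns Protein individuals."""
    fake_db = {"gene:G1": ["protein:P1", "protein:P2"]}  # invented data
    return fake_db.get(gene_id, [])


# Stand-in registry: (subject class, predicate) -> (service, object class).
REGISTRY: Dict[Tuple[str, str], Tuple[Callable[[str], List[str]], str]] = {
    ("Gene", "encodes"): (encodes_service, "Protein"),
}


def expand(subject: str, subject_class: str, predicate: str) -> List[Triple]:
    """Discover the service for a predicate, call it, and emit the new triples."""
    service, output_class = REGISTRY[(subject_class, predicate)]
    triples: List[Triple] = []
    for obj in service(subject):
        triples.append((subject, predicate, obj))        # S - P - O
        triples.append((obj, "rdf:type", output_class))  # output is typed
    return triples


for triple in expand("gene:G1", "Gene", "encodes"):
    print(triple)
```

The lookup key is exactly the pair the text describes (type of input plus type of service), which is why discovery and invocation can be driven by the triple pattern alone.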

Right, that's enough spouting-off for one day. Let the flames begin!!!
