Friday, June 29, 2007

The Dodo, The Turanian Tiger, and The Browser

One day I hope to see the Web Browser included in the web page of extinct animals. I want to re-iterate my opinion that The Web Browser should be an extinct technology in the near future.

No... let me correct that. The existence of a text-entry box where you type a URL will (should) become extinct. I keep going back to the conversation that Cartik and I had in the pub a few nights ago, that he captured in his blog entry. We were talking about AOL keywords, but also about bookmarks. The reason that we have bookmarks is not only so that we can re-find a resource that we want, but also so that we don't have to remember, or type-in, its URI. The Browser interface design has already shown us that people really don't want to have to deal with URIs, nor should they have to.

One of the things we are "promising" from the Semantic Web is that it will be more "human friendly" - able to locate information in a more intuitive and "human" way. Well, clearly, the first step to that end is that we do not expect our "humans" to type-in URIs.

So, I say again, and explicitly, that we should not be designing Semantic Web architectures with the constraint that typing-in a URI should cause information to be displayed in a browser. That's like designing a cell-phone system specifically to support a morse-code tapper! Old technologies should not be dictating the behaviours/architecture of new technologies.

Please, everyone... let's move on! Embedding Semantic Web inside of task-specific, non-browser applications is surely the future... or?

The argument for LSIDs

I posted this as a comment to Ben's blog post about LSIDs, but I want to re-post it here because it sounds like the working group is planning to contact me and Carole to discuss our use of LSIDs, so I might as well make my arguments more visible and explicit. Cartik Kothari, a Post-Doc in my lab, has also waded into the fray.

Here was my response to Ben's assertion that LSIDs should be abandoned:

I agree with only a part of what you [Ben] say, but think you aren't being ambitious enough. What we should be pushing for is that the LSID spec (or something very very similar to it) is re-branded and ADOPTED BY THE W3C!!

What worries me about NOT adopting a new identifier system as we move into the Semantic Web is that we start to hack and kludge our way to full functionality by adding novel behaviours on top of URLs, or start putting the "intelligence" of where to find data/metadata into redirects, purl URLs, or other nasty, centralized, and IMO unsustainable architectures.

LSIDs solve a very distinct set of problems - separation of identity from location; separation of data from metadata; and multiple end-points/protocols for both data and metadata retrieval. As far as I can tell, NONE of the solutions that have been proposed in the discussions within the HCLS community have come close to addressing these three issues in anywhere near as elegant a way as the LSID spec does, and some of the proposals have been a bit worrisome (e.g. "just add a ? to the end of your URL if you want metadata"... where is THAT in the HTTP spec??). Even more odd, to me, is that all of this contorting and hand-wringing is only because people want to be able to stick a URI in their browser and see something at the end of it. Frankly, I just don't see the point of designing architectures around browsers! (I quite liked Cartik's argument that, in the hey-day of AOL, you simply typed a keyword into your browser! **NOBODY** wants to type URLs (URIs) into their browser! Good Lord! The sooner we move the end-user away from the "guts" of the Web architecture the better!)

One of the keynote talks at the WWW2007 meeting was from a Microsoft fellow (can't remember his name) who reminded us that, within the next 10 years, the interfaces into the Web will become ubiquitous in our lives. "The Browser" is going the way of the Dodo! Why are we so concerned about designing next-generation architectures around last-generation interfaces?

In the BioMoby project we use LSIDs extensively (and by the way, I have almost never found the need to plug one of them into my browser...). Here's one of the uses we have for them:

A Web Service is identified by an LSID. The Moby Central registry knows certain things about that service (its inputs, its outputs, its semantic type, its authorship), and through an hourly "ping" it knows if that service is visible/available or not. This information is available as getMetadata from Moby Central. In addition, however, the service provider knows things about their own service. They know what example inputs and outputs might be, they know system maintenance schedules, etc. All of these things can be provided as getMetadata from the service provider. As a consumer, I want to know about a service, so I go to the LSID authority and say "where can I get information about this service?", the authority says "you can go here (Moby) and here (provider)", I do so, and I can combine the knowledge both resources have about that service. THIS IS ALL PART OF THE LSID SPEC! No hacks, no kludges, no new consensus was required within the community.
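The lookup described above can be sketched roughly as follows. This is a toy, in-memory illustration and not the LSID resolution protocol itself: the example LSID, the `example.org` authority, the two metadata functions, and every metadata field are made up for the sketch. Only the `urn:lsid:authority:namespace:object[:revision]` layout is taken from the LSID spec.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class LSID:
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    is_lsid = len(parts) in (5, 6) and parts[0].lower() == "urn" and parts[1].lower() == "lsid"
    if not is_lsid:
        raise ValueError(f"not an LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical metadata sources: each returns what one party knows about the service.
MetadataSource = Callable[[str], Dict[str, str]]


def registry_metadata(lsid: str) -> Dict[str, str]:
    # Stand-in for what a central registry might know (made-up values).
    return {"input": "DNASequence", "output": "BlastReport", "alive": "true"}


def provider_metadata(lsid: str) -> Dict[str, str]:
    # Stand-in for what the service provider might know (made-up values).
    return {"example_input": "ACGT", "maintenance": "Sundays 02:00 UTC"}


# The "authority" only answers one question: where can I ask about this identifier?
AUTHORITY: Dict[str, List[MetadataSource]] = {
    "example.org": [registry_metadata, provider_metadata],
}


def get_metadata(lsid_text: str) -> Dict[str, str]:
    """Ask every source the authority lists, then merge their answers."""
    lsid = parse_lsid(lsid_text)
    combined: Dict[str, str] = {}
    for source in AUTHORITY[lsid.authority]:
        combined.update(source(lsid_text))
    return combined


service = "urn:lsid:example.org:serviceinstance:runBlast"
print(get_metadata(service))
```

The point of the sketch is only that the identifier itself says nothing about location; the consumer learns the list of metadata sources at resolution time and combines whatever they return.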

I don't know about you, but as for me and my family, we are going to continue using LSIDs until someone comes up with a BETTER alternative!

The semantic web in Haiku...

I stumbled over this site today while looking at the new interface to (which is really quite appealing, but does not seem to have translated into additional market share...)

Here is the semantic web explained in Haiku

It made me chuckle :-)

re-post: The Wilkinson Lab Semantic Web Declaration of Independence

This was the most controversial of all the posts I put up on my last blog. I screwed-up and I lost the comments that were made to it (Sorry Chris! This really wasn't on purpose!). Chris Mungall objected vehemently to this post. His arguments were (I am paraphrasing, but Chris, please re-iterate your arguments here if I get them wrong) that (a) the GO *is* an ontology, and to say it isn't an ontology is complete crap, and (b) that an ontology should be as big as it needs to be, and that to put artificial limitations on the size of an ontology is absurd and narrow-minded. I responded that we were using different definitions of "ontology" - that I was talking about the narrow definition of an OWL-DL perspective where all classes are precisely defined as restrictions on their properties, but I concede that he is absolutely correct that this is, as he points out, a *very* narrow definition.

Having said all that, I have re-read this post over and over again, and I still believe most/all of what I said, so... here it is in all its glory:

As most of you readers will know, my lab is somewhat obsessed with ontologies. It has become apparent over the past year, however, that we have views that are not shared by a large portion of the Semantic Web in Life Science community. I've made several of my more contentious viewpoints clear in presentations and papers, and these have been variously referred to as "inflammatory", "simplistic", or even "showing a lack of understanding about what the Semantic Web is".

Well, I'm in the mood to "put a stake in the ground" today and make some additional statements that I've been pussyfooting around for the past year. In part, I want to say these things because I honestly believe them and I hope that they might be interesting perspectives for others to think about; in part, it's because I think some of these ideas are quite novel, but not sufficiently well-supported for me to put into a publication or a position paper; and in part because I simply enjoy rocking the boat from time to time :-)

So... here goes!

The Wilkinson Lab Declaration of Semantic Web Independence

We hold these truths to be self-evident

1. Ontologies are a path to a goal, not the goal itself.

Though I do think that some ontologies (e.g. upper ontologies) should be well-engineered and static, I think that the majority of ontologies that we are building today are simply too "heavy". Just follow the is-a hierarchy below...

  • Ontologies are World Views

Clay Shirky said it best. Ontologies embody the bias of the belief of the moment, and moreover try to predict the future according to that view. Views change. The world changes. Legacy is a huge problem! (The fact that we tend to use these transient world views to annotate our ~permanent data stores makes me nervous, but the solution to that belongs in another post).

  • World Views are Hypotheses

...with apologies to the "Ontology of Biolocal Reality" ;-)

  • Hypotheses are Queries

This comes as no surprise, since the idea that a database query represents a hypothesis has been around for years! (e.g. query based data mining and Ben's comment below points here as another example). However, about a year ago I made the transitive closure of the above three statements in a presentation to the National Heart Lung and Blood Institute in Bethesda - "Ontologies are queries" - and was very nearly laughed off the podium by the mega-ontology audience. Now, I'm not saying that all ontologies are queries, but I think we need to start perceiving them increasingly as such. Interestingly, this sweeping statement necessarily excludes ontology-like hierarchical vocabularies such as the Gene Ontology since its "classes" are not defined; strictly speaking, the GO is not an ontology, and was only intended to be a "dynamic, controlled vocabulary that can be applied to all eukaryotes". As such, I think my "rule" still holds - that formal ontologies are, by and large, just queries. Interestingly, Luciano and Stevens recently made the same assertion in their paper on the semantics of biological pathways, though I don't think they quite made enough song and dance about it. It's an idea that I think needs to be emphasized more than it is...

  • Queries are Disposable

I can count on one hand the number of queries I have ever saved and re-used, other than those that are embedded in my code

  • Therefore Ontologies are (should be) Disposable!

We really need to move to a point where the ontology is simply a transient tool that is used to discover appropriate data somewhere in the universe, rather than the ontology being the end-point in itself. Ontologies have got to be cheap, lightweight, and disposable. Let's put a number on it... say... $10K. I'll stick my neck out and say that if an ontology costs more than $10K to produce, then it has cost too much, since this is about how much it costs to do a simple biological pilot study and there's no reason that a computational hypothesis should cost more to develop than a biological one.
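One way to read the "ontologies are queries" chain above is sketched here: a class that is defined purely as restrictions on properties behaves like a filter you run over individuals, and once it has found the data you wanted you can throw it away. The individuals, the property names, and the `high_risk_patient` definition are all invented for illustration; this is not how any particular reasoner works.

```python
from typing import Callable, Dict, List

Individual = Dict[str, object]

# Made-up individuals, described only by their properties.
individuals: List[Individual] = [
    {"id": "p1", "systolic_bp": 165, "smoker": True},
    {"id": "p2", "systolic_bp": 118, "smoker": False},
    {"id": "p3", "systolic_bp": 150, "smoker": False},
]


def high_risk_patient(ind: Individual) -> bool:
    """A hypothetical 'class': nothing more than restrictions on properties."""
    return ind["systolic_bp"] >= 140 and ind["smoker"] is True


def members(class_definition: Callable[[Individual], bool],
            data: List[Individual]) -> List[str]:
    """Asking 'which individuals belong to this class?' is running a query."""
    return [str(ind["id"]) for ind in data if class_definition(ind)]


print(members(high_risk_patient, individuals))  # ['p1']
```

Changing the hypothesis means writing a different definition and running it again, which is about as cheap and disposable as it gets.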

2. Reasoning is your problem, not mine!

To the mega-ontology crowd I'd like to say "take that!" and hand them one of my instances. If you're going to build an ontology with 50,000 classes and 10,000 relationships please don't expect me to download it and reason over it. It's just not practical, and I cannot see the semantic web functioning that way in the long-run. It seems to me that, as the provider of an ontology, it could/should be your responsibility to provide a reasoning service that consumes my individuals and adds the rdf:type tag to them. Not only would the semantic web work better (IMO) this way, but it would make people think twice about building mega-ontologies ;-)
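The kind of service I have in mind might look something like the following sketch, where a plain function stands in for a remote reasoning service. The class names, predicates, and the two class definitions are hypothetical, and a real provider would use a proper reasoner rather than hand-written tests; the only thing the sketch is meant to show is the contract: my triples go in, my triples plus rdf:type triples come out.

```python
from typing import Callable, Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"

# Properties of one individual: predicate -> set of objects.
Properties = Dict[str, Set[str]]

# Hypothetical class definitions held by the ontology provider, not by me.
CLASS_DEFINITIONS: Dict[str, Callable[[Properties], bool]] = {
    "ex:Enzyme": lambda props: "ex:catalyzes" in props,
    "ex:Kinase": lambda props: "ex:Phosphorylation" in props.get("ex:catalyzes", set()),
}


def classify(triples: List[Triple]) -> List[Triple]:
    """Return the input triples plus one rdf:type triple per satisfied definition."""
    by_subject: Dict[str, Properties] = {}
    for s, p, o in triples:
        by_subject.setdefault(s, {}).setdefault(p, set()).add(o)
    inferred = [
        (s, RDF_TYPE, cls)
        for s, props in sorted(by_subject.items())
        for cls, test in CLASS_DEFINITIONS.items()
        if test(props)
    ]
    return list(triples) + inferred


my_instances: List[Triple] = [
    ("ex:protA", "ex:catalyzes", "ex:Phosphorylation"),
    ("ex:protB", "ex:bindsTo", "ex:protA"),
]
for triple in classify(my_instances):
    print(triple)
```

Note that the consumer never downloads the class definitions; they stay on the provider's side of the call.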

3. LSIDs provide a great way of solving the ontology-segmentation problem

In the meantime, while we still have mega-ontologies and still need to download and reason over them, ontology segmentation keeps coming into my mind as a possible solution. But how does it fit into the semantic web vision? Well, we'd need a way of naming individual nodes in an ontology without using #document_fragments (since those are interpreted client-side). One possibility is to however I'm a bit concerned that we will be tempted to put the isa hierarchy into the path and then use it in "casual reasoning", which would be nasty. What I really like is the idea of naming nodes by LSID, and having the LSID metadata resolution return only the portion of the ontology that is relevant to the interpretation of that node. I should probably write an entire post on this issue, since there are all sorts of additional reasons that I have come to this conclusion...
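A minimal sketch of that last idea, assuming that "the portion of the ontology relevant to a node" means the node plus its is-a ancestors (that is my simplification for the example; a real segmentation would likely follow other relationships too). The LSIDs and the tiny hierarchy are made up, and `segment_for` is a stand-in for what an LSID metadata call might return.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]

# Toy is-a hierarchy keyed by made-up LSIDs (child -> parents).
IS_A: Dict[str, List[str]] = {
    "urn:lsid:example.org:onto:kinase": ["urn:lsid:example.org:onto:enzyme"],
    "urn:lsid:example.org:onto:enzyme": ["urn:lsid:example.org:onto:protein"],
    "urn:lsid:example.org:onto:receptor": ["urn:lsid:example.org:onto:protein"],
    "urn:lsid:example.org:onto:protein": [],
}


def segment_for(node: str) -> Set[Edge]:
    """Collect only the is-a edges on the paths from this node up to the root."""
    edges: Set[Edge] = set()
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for parent in IS_A.get(current, []):
            edges.add((current, "is_a", parent))
            stack.append(parent)
    return edges


for edge in sorted(segment_for("urn:lsid:example.org:onto:kinase")):
    print(edge)
```

Resolving the kinase node returns two edges and never mentions the receptor branch, so the client reasons over a fragment rather than the whole ontology.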

4. Ontological predicates can be thought of as, and mapped to, Web Services.

This is the core of the CardioSHARE architecture. It stemmed from the observation that, if we "surf" through BioMoby services and generate an RDF document of the data that goes into and out of a service, the predicates that join the input and output have a relationship to the Web Service that generated them... so why not just turn this on its head and say that the Subject of the SPO triple represents the input, the Predicate of the SPO triple represents a Web Service, and the Object of the SPO triple represents the output. In that case, OWL individuals of the Subject class can be fed into a Web Service identified by the OWL predicate and the output of that Web Service should be individuals of the Object OWL class. Moreover, knowing the type of input, and the type of service (the subject and predicate) is precisely a BioMoby registry query, so the process of discovering and transacting that service could be automated. A little wrapper around DIG and we're off to the races, having merged the disparate worlds of the Semantic Web and Web Services!! Hopefully I'll be able to post a link to the prototype of this architecture within the next couple of months...
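To make the subject/predicate/object mapping concrete, here is a rough sketch. It is not the CardioSHARE code and it does not talk to BioMoby or DIG: the registry is a plain dictionary, the "Web Service" is a local function, and the class, predicate, and gene names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Service = Callable[[str], List[str]]


def fake_homolog_service(gene: str) -> List[str]:
    """Local stand-in for a remote Web Service (made-up data)."""
    known = {"ex:geneA": ["ex:geneA_mouse", "ex:geneA_rat"]}
    return known.get(gene, [])


# Hypothetical registry: (class of the subject, predicate) -> service to call.
REGISTRY: Dict[Tuple[str, str], Service] = {
    ("ex:Gene", "ex:hasHomolog"): fake_homolog_service,
}


def resolve(subject: str, subject_class: str, predicate: str) -> List[Triple]:
    """Discover a service from the subject's class and the predicate, call it
    with the subject as input, and return its outputs as the objects of triples."""
    service = REGISTRY.get((subject_class, predicate))
    if service is None:
        return []
    return [(subject, predicate, output) for output in service(subject)]


for triple in resolve("ex:geneA", "ex:Gene", "ex:hasHomolog"):
    print(triple)
```

The registry lookup plays the role of the discovery query (input type plus service type), and the triples that come back are exactly the ones a "surfing" session would have generated by hand.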

Right, that's enough spouting-off for one day. Let the flames begin!!!

re-post: The Semantic 404

A re-post of the same text that was on my previous blog... unfortunately, the links are scraped out of it... if I get ambitious I will edit this post and put them back in:

Hello from the Banff WWW2007 post-mortem.

I'm befuddled! A couple of days ago a significant member of the SWHCLS community indicated that it didn't matter if URIs resolved or not, and that we could build the SW without resolution.

...My brain hurts!

What confuses me is why so many people perceive the Semantic Web to be such a different animal from the World Wide Web. Is there anyone on earth who would have said "who cares if your hyperlinks don't resolve, build the Web anyway!" Of course not! Ben and I have been harping on about this for ages, but even those who cite our Creeps paper still don't seem to "get it"... or at least, don't seem to care.

The Semantic Web is, first and foremost, a Web based technology. The problem is that the community leaders seem to be focusing on "Semantic" rather than on "Web", and that (IMO) spells death for the SWHCLS initiative. Sure, we can get Semantic Web-like behaviours by building local semantically-enabled data warehouses, but that isn't the Semantic Web. Would the Web ever have come into being if none of the URLs resolved? Of course not. And neither will the Semantic Web.

Until then, all we have created is the "Semantic 404", and that's not much good to anyone is it, be honest :-)

Apparently there's going to be a special meeting of the SWHCLS working group at noon today... fingers crossed that something great comes out of this meeting - perhaps the SWHCLS will begin today!!

re-post: Relational databases on the Semantic Web

This is a re-post of a rant from my old blog:

Greetings from WWW 2007!

I've been "hit" several times in the past couple of weeks with a recurring idea that seems to be gaining momentum with a wide variety of groups - the idea of exposing "traditional" relational databases through an OWL-mapping layer. To name just a few, we have:

Bio2RDF, DartGrid, ComparaGrid, and an offering from SMI

Now, don't get me wrong! I am not criticizing any of these projects in any way, and am in fact extremely excited about their successes! But it does make me wonder...

The Semantic Web, IMO, is something more than just the exposure of relational databases on the Web (even with the hidden semantics of their relational model fully explicit and exposed). I would argue that, because we have never had the ability to express the kinds of semantics that we can express with OWL, we have never captured the kinds of semantically rich data that we are going to want when the Semantic Web is finally established. In my own experience leading the SIRS DB project (a component of the CardioSHARE project, where we are attempting to build an RDF/OWL datastore that truly behaves in the way we envision the Semantic Web could behave), I have noticed that we are collecting far more data in this semantic database than we would ever have attempted to store in a more traditional RDB... simply because the pain of building a relational model to hold this extra data is somewhat higher than sticking a few extra triples into a triple store.

I understand that, from the perspective of the W3C, OWL isn't a necessary part of the Semantic Web (and I'm not entirely convinced that OWL will survive in the long-term either!); however I do think that, if the SW is going to live-up to its promises... or more importantly, not disappoint the funding agencies so badly that they cut their investments after we have built-up their expectations... we are going to have to do more than just expose our databases on the Web in RDF.
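The difference in "pain" can be illustrated with a small sketch. The table, the column names, and the triples are invented for the example (they are not the SIRS DB schema), and a Python set stands in for a real triple store.

```python
import sqlite3

# Relational side: a new kind of fact means changing the model first.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patient (id TEXT PRIMARY KEY, age INTEGER)")
conn.execute("INSERT INTO patient VALUES ('p1', 54)")
conn.execute("ALTER TABLE patient ADD COLUMN referring_clinic TEXT")
conn.execute("UPDATE patient SET referring_clinic = 'North' WHERE id = 'p1'")
columns = [row[1] for row in conn.execute("PRAGMA table_info(patient)")]

# Triple side: a new kind of fact is just another triple with a new predicate.
triples = {("ex:p1", "ex:age", "54")}
triples.add(("ex:p1", "ex:referringClinic", "ex:North"))
triples.add(("ex:p1", "ex:enrolledIn", "ex:StudyX"))

print(columns)
print(sorted(triples))
```

One `ALTER TABLE` is hardly a hardship, of course; the friction shows up when every new observation type needs a modelling decision before anyone can record it.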

As Eric Neumann argued when I asked this question of the HCLS Workshop panel yesterday, this is a necessary first-step, and I agree with him that it might succeed in bootstrapping a somewhat lackluster (IMO) start to the entire SWHCLS enterprise... but I hope that we aren't thinking of it as anything more than the low-hanging fruit. I fear that, if we don't go the next step and start focusing on data/metadata capture and modelling in a "true" SW manner, and encouraging others to do so by example, we may unnecessarily delay our achievement of the high expectations that we, and our funding agencies, have for the Semantic Web in Health Care and Life Sciences.

Wednesday, June 27, 2007

Scientific Web Communities - a "missed opportunity"

"The lack of scientific web communities represents a significant missed opportunity."

Man, you can say that again!

I had the distinct pleasure this morning of reading a scientific paper that made me feel good!  ...actually, that's an under-statement... it literally made my heart soar!

In their paper "Alzforum and SWAN: the present and future of scientific web communities", Tim Clark and June Kinoshita describe their success in building a functional community of health researchers in the domain of Alzheimer's research; a group of scientists interacting, debating, sharing knowledge and ideas via the SWAN Semantically-enabled infrastructure.

Now, I have to admit my personal bias. I'm a big fan of Tim Clark from the get-go. I like the way he thinks, and have for many years! For those who can remember back that far, he was one of the original authors of the LSID specification... and anyone who knows me knows what a big fan I am of that! When he first told me about his SWAN project a couple of years ago, I laughed at how similar our "visions" were for how the Semantic Web should (must!) work, and what it might look like for a community of biologist-end-users.

In my "second life" as the IT/Data manager for a large (~300-person) health sciences research institute I have been fighting what seems at times to be an uphill battle. No, I'm not trying to cure cancer or find a miracle drug to prevent heart attacks (at least, not in my "second life" ;-) ). All I'm trying to do is get the resident researchers to share their data with one another.

Granted, there are ethics issues involved in clinical data, but that's really not where the problem lies. There are also issues around "simplicity" - perhaps we're not making it quite easy enough for them to contribute their data, but I don't really think that's the issue either. Based on the first question that comes up whenever I give a presentation on the institutional database, I think it's "siloism". The first question, invariably, is "Can others see my data??".

As Carole Goble said, "Scientists would rather share their toothbrush than their data!". Researchers within a single institute, sharing a common purpose, even sharing common equipment, are nevertheless loath to let their neighbours casually browse their results, or integrate their data into a common database for fear of somehow losing control or giving away hidden gems of knowledge.

"It is possible for a scientist to develop [and contribute to.  MW] a valuable community resource without sacrificing professional advancement." - these words from Tim Clark were music to my ears! He goes on to describe the keys to creating a successful scientific community as "... neutrality, inclusiveness, trust (emphasis added), high quality, timeliness, proactive solicitation of community participation and value."

It was heartwarming... THRILLING!!... to see his vision (and mine!) become a reality.  I am going to keep that manuscript close-at-hand and read it every time I get depressed about interoperability in health care research :-)

thanks Tim!

New blog host... hopefully less spam!

For anyone who was reading my blog in its former location, I apologize for unceremoniously shutting-down that site; however spam-management was getting out of control!  I made copies of the most interesting posts/responses and will re-post them in their entirety here over the next few days.

I don't want anyone to think it was a matter of "censorship" or anything like that...

So, welcome to my new blog!