November 20, 2008

Development and Verification of Rule Based Systems - a Survey of Developers

A few weeks ago I finally presented the results from the developer survey about Rule Base Development - Method, Tools, and Problems at the RuleML symposium in Florida. Embedded below are the slides summarizing the results; you can also download an extended version of the RuleML paper here.

BTW: The camera that was offered as a prize to participants of the survey went to Dr Flower in Australia, who develops rule bases for Westpac Banking Corporation for use in trading applications.

October 10, 2008

Collective Intelligence and Enterprise 2.0 - Comparing 'Web Scale' to 'Organization Scale'

When discussing the use of Collective Intelligence for the 'Enterprise 2.0', one almost inevitable argument is that experiences from the web cannot be transferred, because web scale is just sooo much larger. But I wondered: exactly how much larger, in particular when we look at really, really large organizations, even the biggest of them all?

To make it quick: 600 times; there are 600 times more web users than there are employees in the largest organization. Current estimates put the number of web users at 1.5 billion, while (according to Wikipedia) the largest organization has 2.5 million employees. There are more people in the Facebook group "AGAINST THE NEW FACEBOOK LAYOUT" than there are people in the largest organization on earth! By the way: this organization is the state-owned Indian Railways (hence the nice picture by Akuppa); other organizations of similar size are the People's Liberation Army (estimated at 2.25 million active duty personnel) and the largest private organization, Wal-Mart (2.1 million). The largest German organization is Deutsche Post (DHL) with 0.5 million; there are 3000 times more web users than employees in Germany's largest company.

So yes, very clearly, experience from the open web cannot be easily generalized to organizations; not even to the biggest ones and not even if we disregard all issues beyond pure size (e.g. motivational issues).
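
For what it's worth, the ratios quoted above are easy to re-check. The following is a minimal Python sketch that uses only the 2008 estimates mentioned in this post; the names WEB_USERS, EMPLOYEES and web_to_org_ratio are made up for illustration.

```python
# Figures as quoted in the post (2008 estimates); not independently verified.
WEB_USERS = 1_500_000_000

EMPLOYEES = {
    "Indian Railways": 2_500_000,
    "People's Liberation Army": 2_250_000,
    "Wal-Mart": 2_100_000,
    "Deutsche Post (DHL)": 500_000,
}


def web_to_org_ratio(web_users: int, employees: int) -> float:
    """Return how many web users there are per member of one organization."""
    return web_users / employees


# Ratio of web scale to organization scale for each organization named above.
ratios = {name: web_to_org_ratio(WEB_USERS, n) for name, n in EMPLOYEES.items()}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio:,.0f} times more web users than members")
```

Swapping in newer estimates only requires changing the constants at the top.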

October 8, 2008

Large Scale Uses of RDF

In a recent post ReadWriteWeb laments how little RDF is used in commercial applications. While the general point is valid, they miss quite a few large-scale uses of RDF that I wanted to share with you:

1) The largest use of RDF in a real web setting: FOAF, and in particular its support by Google's Social Graph API.

2) XMP, the format used by Adobe to embed metadata in PDF (and other) files. It is most commonly stored in a subset of RDF. With all the Adobe tools, this is deployed on more than a hundred million computers.

3) The use of RDF in Firefox, e.g. for the description and management of extensions. Just take a look at your profile directory and you'll see.


October 2, 2008

Tackling the Curse Of Prepayment - Collaborative Knowledge Formalization Beyond Lightweight

We finally got around to writing up our ideas on how to overcome the motivation and incentive problems for collaborative heavyweight knowledge formalization:

This paper argues for collaborative incremental augmentation of text retrieval as an approach that can be used to immediately show the benefits of relatively heavyweight knowledge formalization in the context of Web 2.0 style collaborative knowledge formalization. Such an approach helps to overcome the "Curse of Prepayment"; i.e. the hitherto necessary very large initial investment in formalization tasks before any benefit of Semantic Web technologies is visible. Some initial ideas about the architecture of such a system are presented and it is placed within the overall emerging trend of "people powered search".

You can read the entire paper here. I will present it at the INSEMTIVE workshop at this year's ISWC; if you're in Karlsruhe, it would be great to see you there!


August 21, 2008

License Madness

Today I wanted to take a few moments to contribute to the commons by improving a few pages on Wikitravel. The first page lacked any images and I thought I would just take a few nice ones from Wikipedia. But not so fast; as it turns out, Wikipedia is GFDL licensed, while Wikitravel is CC-BY-SA-1.0 (Creative Commons Attribution-Share Alike 1.0 Generic) licensed - and these licenses are incompatible. I cannot copy something from Wikipedia unless I have authored it myself or have asked all authors. Exactly the problem CC licenses try to prevent.

Now, the images on Wikipedia can have different licenses, and so some of these can be used. But even images that are licensed under CC-BY-SA are often licensed under a different version. In fact there seem to be 5 major versions (1.0, 2.0, 2.1, 2.5, 3.0; 2.5 seems to be the most frequently used, followed by 3.0 and 2.0) and many more localized versions. That's really problematic: how on earth should I know whether I can upload an image licensed under CC-BY-SA-2.5-DK (the Danish localization of the 2.5 version of the Creative Commons Attribution-Share Alike license) to a site using, for example, CC-BY-SA-2.0-KR (the Korean localization of the 2.0 version ...)? Not to mention the question of the compatibility of Wikitravel's CC-BY-SA-1.0 with theoretically less restrictive licenses such as CC-BY-2.1-JP (the Japanese localization of the Creative Commons Attribution license version 2.1, still used for 1100 Wikimedia Commons images).

Luckily Wikitravel also allows images to have a license different from the CC-BY-SA-1.0 used for the text, and so I can use pictures licensed under CC-BY-SA-2.1-JP. Still, I cannot use CC-BY-SA-2.5-JP (a newer version of this Japanese license) or CC-BY-2.1-ES (the Spanish version of the same license), and I still don't know about the less restrictive CC-BY-2.1-JP mentioned above. Why this selection of licenses? I don't know, but it says so here.

Turning to Flickr (another great source for images), we see that they support 7 different Creative Commons licenses, but not the CC-BY-SA-1.0 license of Wikitravel. However, 2 of these licenses are at least accepted for images on Wikitravel. Alas, of the almost 2.8 billion pictures* on Flickr, only 77 million are CC licensed (2.7%), and only 12 million (15% of the CC licensed, 0.4% overall) are licensed in a way that is compatible with Wikitravel.

I don't see a simple answer to this problem - it will probably be a while until an agreement on the best open license emerges. But there is one thing we can all do: if you want to contribute, try to release your stuff into the public domain. Only then can you be sure that it can be reused with whatever collaboration platform may be devised in the future. That's not because a restriction, e.g. to get acknowledged, is unreasonable - it's only because the restrictions can be worded in so many different ways that they will inevitably lead to incompatibilities. I know, there are some things that you need to continue to control, but there are many more where it's really not going to hurt you.

I, for one, have freed the few images I've uploaded to Wikipedia and put them into the public domain - please do the same for yours! And yes, I actually would like people using these pictures to include an attribution, but that's a matter of good behavior, not a question that lawyers should be involved in.

*: In fact 2785244632 uploaded pictures at the time of writing - Flickr numbers its pictures consecutively, so you can see the absolute number of pictures uploaded by looking at the URLs of the most recently uploaded pictures. Note that this count also includes some movies as well as pictures that have since been deleted.
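
The Flickr percentages quoted above can be re-derived from the numbers in this post. This is just a small Python sketch under that assumption; the constant and function names are invented for illustration, and the 77 million and 12 million figures are the rounded values from the text.

```python
# Numbers as given in the post and its footnote (August 2008).
TOTAL_UPLOADS = 2_785_244_632        # read off Flickr's consecutive picture IDs
CC_LICENSED = 77_000_000             # rounded figure from the post
WIKITRAVEL_COMPATIBLE = 12_000_000   # rounded figure from the post


def share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100 * part / whole


cc_share = share(CC_LICENSED, TOTAL_UPLOADS)
compatible_of_cc = share(WIKITRAVEL_COMPATIBLE, CC_LICENSED)
compatible_overall = share(WIKITRAVEL_COMPATIBLE, TOTAL_UPLOADS)

print(f"CC licensed:                   {cc_share:.2f}% of all uploads")
print(f"Wikitravel compatible:         {compatible_of_cc:.2f}% of CC licensed")
print(f"Wikitravel compatible overall: {compatible_overall:.2f}% of all uploads")
```

Because the inputs are rounded, the results only agree with the percentages in the text up to rounding.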

August 3, 2008

1st Workshop on Incentives for the Semantic Web

At this year's ISWC there is the first workshop on incentives for the semantic web, about the very important question of how people can be motivated to create semantic data. You can still submit papers until the 8th of August*.

The program and organizing committees include a lot of cool people, e.g. Katharina Siorpaes (the creator of OntoGame and MyOntology), Denny Vrandecic (one of the people behind SemanticMediaWiki and the project leader of the Active IP), Andreas Schmidt (project leader of the Mature IP) and me :)

The picture to the left shows 'Schloss Gottesaue' in Karlsruhe, the location of this year's ISWC and hence of this workshop.

*: Sorry for writing this so late, but I'm rather busy trying to finally finish my PhD thesis

 


August 2, 2008

Sterile vs Generative - A Talk on the Future of the Internet

A great (non-technical) one-hour presentation by Professor Jonathan Zittrain about the content of his forthcoming book "The Future of the Internet - And How To Stop It". The major theme of his talk is the dichotomy of sterile systems (e.g. iPods, systems that can only do what their manufacturers intended) and generative systems (e.g. the PC or the Internet, systems that can be adapted by anyone). The starting point is that he sees the security problems on the Internet pushing the pendulum from generative to sterile. His talk is followed by a shorter 20-minute talk by Lawrence Lessig (without his famous slides, though) about the privacy-security trade-off.

The talk by Zittrain in particular is really a joy to watch, insightful and also quite funny (he manages to sneak in "Cats that look like Hitler" and his creative definition of best effort routing: "also known as send it and pray or every packet an adventure").

The video is embedded below; you can also see it at YouTube.


May 12, 2008

Score One For Explicit Semantics

Powerset, the most hyped 'Semantic Search' engine of recent times (e.g. here, here and here), can now finally be tried out by us mere mortals here: http://www.powerset.com/ (only searches Wikipedia and Freebase so far).

The interesting thing is that while Powerset has always focused on their know-how in natural language processing and entity recognition (that 'other kind of semantic'), the top results for almost all queries I tried (e.g. 'China size', 'Rudi Studer', 'Germany Population') are sourced from Freebase - score one for (collaboratively evolved) explicit semantics, I'd say ;)


April 30, 2008

Morgan Stanley's Internet Trends report

Techcrunch has highlighted an interesting presentation by Morgan Stanley about current Internet trends. The entire presentation is embedded below or here at Slideshare. The tidbits I found particularly interesting:

  • YouTube and Facebook together have more Page Views than google.com or yahoo.com
  • 16% of online time is spent with 'social connections'
  • More than 50% of Facebook users log in daily, 95% of Facebook users have used at least one third party application; 14 million photos are uploaded to Facebook every day
  • In the US the money spent on direct telephone ads is still five times more than that spent on Internet advertising; money spent on newspaper ads is still more than double that spent on Internet advertising.
  • Paid search accounts for 16% ($3 billion) of the revenue generated on the Mobile Internet
  • The majority of visitors to the US's main sites come from outside the US (except for Fox Interactive :) )
  • There are more than twice as many mobile phone users as Internet users; the number of users in Asia+Africa has overtaken Europe+Americas in both cases.
  • 6 of the top 10 Internet sites are social sites (YouTube, live.com, Facebook, hi5, Wikipedia, Orkut)

 


April 15, 2008

Participate in Survey on Rule Base Development - Method, Tools and Problems (and Maybe Win a Camera)

 

If you have been involved in the development of any rule based system in the past five years, it would be great if you could find 15 minutes and participate in my survey on Rule Base development - methods, tools and problems. You can even win a nice Canon Ixus 80IS.

Thanks!

March 5, 2008

How Much Is That Ontology?

Ontologies are expensive to build. By now that's known to everyone, and we have lots of people thinking about how they can justify the cost of building an ontology for their enterprise. Entirely wrong question - most companies don't need an ontology at all; they should go and bug the data warehousing companies. Another misconception is that when people think of 'expensive ontologies' they think it's the formalization that makes it costly - nah, for all meaningful ontologies it's creating the shared model of the domain; writing it down doesn't add that much and might even help.

And I just realized that the machine in 'shared machine understandable model of a domain' can mean so much more than just being able to use reasoners with it - just have a look at the project to create the International Barcode of Life (here at Wikipedia, or watch the TechTalk embedded below).


March 2, 2008

Rules as a Simple Way to Model Knowledge - Closing the Gap between Promise and Reality

There is a considerable gap between the potential of rule bases to be a simpler way to formulate high level knowledge and the reality of tiresome and error prone rule base creation processes.
Based on the experience from three rule base creation projects, this paper identifies reasons for this gap between promise and reality and proposes steps that can be taken to close it. An architecture for a complete support of rule base development is presented.

A publication of mine accepted for this year's ICEIS conference; you can read the whole thing here.


Collaborative Knowledge Formalization Beyond Lightweight - Tackling the Curse of Prepayment; Part II

This is the second in a series of three posts - you may wish to start with the first.

'Knowledge' Does Not Equal 'Knowledge'

When the collaborative knowledge formalization community talks about 'knowledge', they mean something quite different from what most of the Uppercase Semantic Web community or knowledge based systems community think. The collaborative knowledge formalization community thinks of taxonomies, thesauri, SKOS or of structured data; the other communities are thinking of Logic Programs, Description Logics, OWL or First Order Logic. Current collaborative knowledge formalization approaches just don't support the formalisms that are commonly associated with knowledge formalization.
Now you might argue that this must be this way - that highly formal representations are just not well suited to be edited in the web2.0 style collaboration that is the topic of the collaborative knowledge formalization community. Indeed this may be the case, but it's surely worth trying. There is no definite argument proving that highly formal representations cannot be edited in this way, and I believe that trying to bring knowledge formalization with more powerful and more complex formalisms to the crowd will at the very least bring advances in robust reasoning and usable knowledge formalization interfaces.

The Challenges Of Using More Heavyweight Formalisms

There are, however, many challenges entailed in moving to more heavyweight formalisms. Challenges such as:

  • Usability / Debuggability: Formalisms such as OWL or First Order Logic are harder to understand; in particular, errors are much harder to find.
  • Robustness: A single faulty statement added to a knowledge base with a million axioms may break everything. Unless this problem is tackled, open collaborative knowledge formalization is impossible.
  • Performance and the Language Expressivity / Performance tradeoff: Current reasoners for representation languages such as OWL or FOL could not dream of supporting a continuously updated knowledge base of even a fraction of the size of Wikipedia; hence something would have to give: there would have to be restrictions on language expressivity, reasoning algorithms that do not achieve soundness and/or completeness, or languages that are not purely declarative would have to be used.
  • Mixed Formality: The kind of collaborative knowledge formalization approaches discussed here rely on incremental and partial formalization; hence the data store is never fully formalized and contains data at different levels of formality. Current reasoning approaches are not well suited to tackle this.
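
To illustrate the robustness point: in classical logics such as First Order Logic, a knowledge base that contains a contradiction entails every statement (ex falso quodlibet). The Lean snippet below is only a small sketch of that principle, assuming the faulty statement contradicts something already in the knowledge base; P and Q are placeholder propositions, not taken from any real knowledge base.

```lean
-- If a knowledge base proves both P and ¬P, then any statement Q follows,
-- no matter how unrelated Q is to P.
example (P Q : Prop) (hP : P) (hNotP : ¬P) : Q :=
  absurd hP hNotP
```

This is why a reasoner that is to survive open collaborative editing needs some way of containing or tolerating such inconsistencies.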

The Curse of Prepayment - Again

All of the problems in the previous section are real and important - but there is one that trumps them all: the question of what the immediate benefit of formalizing even small parts of a data store is. What do I get from spending time and/or money on bringing a part of my data store to a more formal level? Answering this question then allows me to decide on the tradeoffs needed to address the challenges described in the previous section.

Here the collaborative knowledge formalization community has the same problem as the wider Semantic Web community: "what exactly do I get in extra benefit from using OWL? And is this worth the effort?". I believe there is an answer to that question - but I'll describe it in the next installment of this series*.

* The first ever cliffhanger on this blog ;)


February 20, 2008

The CKC Challenge: Exploring Tools for Collaborative Knowledge Construction

The 'challenge' in which some tools for the collaborative creation of structured knowledge were compared and in which SOBOLEO participated has now been described in an IEEE Intelligent Systems publication (that is freely available here; you can also find a longer technical report version here).

The publication is light on real conclusions, but is a decent overview of tools for the collaborative creation of structured knowledge. In the conclusions you can also read that we are working on integrating SOBOLEO and BibSonomy - true, but somewhat surprising for me to see it announced publicly by people other than the BibSonomy guys or us.


Collaborative Knowledge Formalization Beyond Lightweight - Tackling the Curse of Prepayment; Part I

The Curse Of Prepayment
The Curse of Prepayment is also often referred to as the Chicken-Egg problem of Semantic Technologies: Semantic Technologies promise great functionality once a great amount of knowledge is formalized. And because knowledge formalization is difficult, often not well supported and cumbersome, you need to make a great up-front investment before you see any functionality. Now this insight is not new at all; there are already numerous approaches that try to address it. Of particular interest here are approaches that try to harness web2.0 ideas for this task. These web2.0 approaches to knowledge formalization can be roughly separated into two groups:

  1. The first group is based on the observation that lots of people are successfully creating structured data with tagging applications. These approaches then try to extend these systems with a bit more structure, a bit more formality. Our own SOBOLEO system, GroupMe, Int.ere.st, BibSonomy and gnizr are examples of these kinds of systems.
  2. The second group of systems starts from the observation that people are spending large amounts of time creating semi-structured data in wikis. These systems then try to give people the tools and the support such that they can create data with more structure, more formality. The Semantic Media Wiki, Freebase, IkeWiki and MyOntology are examples of these kinds of systems.

Making Every Penny Count, Immediately
What makes these systems interesting, and what gives them a chance to tackle the Curse of Prepayment, are five closely related properties:

  • Simple: Formalization is simple, can be done with little training, little effort and not only by logic experts.
  • Collaborative: Formalization can be done jointly in a group - in this way the cost is spread over multiple persons; the prepayment needed from every person is reduced. 
  • Incremental: Not everything needs to be formalized at once, formalization can be done incrementally.
  • Partial: The tools can work with data stores that are only partly formalized, that contain data at different levels of formality.
  • Immediate: Formalized data can be used immediately, immediately brings some benefit to the user.

Together these five properties can be summed up as: "Making Every Penny Count, Immediately". There is an immediate benefit from formalizing even small parts; and because these systems are simple and collaborative, formalizing these small parts is relatively cheap.

The exact nature of this 'immediate benefit' differs between the systems mentioned above, for example it is:

  • Tables and less redundant data: The unique selling point of the Semantic Media Wiki: as soon as just a few attribute values have been specified, these can be used to create tables and overview pages that before had to be maintained manually.
  • Hierarchical Organization: In systems like SOBOLEO or BibSonomy tags can be organized hierarchically; this allows for more effective maintenance of the tag repository as well as for more effective navigation and retrieval. This works after adding just one such relation.
  • Advanced Search: For example, in the SOBOLEO system adding just one synonym for a tag/concept will already improve the search experience: searching for this synonym will then also return the documents annotated with the topic.
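
The last two benefits can be made concrete with a toy example. The Python sketch below is purely illustrative and is not how SOBOLEO or BibSonomy are actually implemented; the TagStore class, its method names and the sample documents are all hypothetical. It shows how a single synonym or a single broader/narrower relation already changes what a search returns.

```python
from collections import defaultdict


class TagStore:
    """Toy tag repository with synonyms and broader/narrower relations.

    Hypothetical illustration only; not the API of any real system.
    """

    def __init__(self):
        self.documents = defaultdict(set)  # concept -> documents annotated with it
        self.synonyms = {}                 # alternative label -> concept
        self.narrower = defaultdict(set)   # concept -> directly narrower concepts

    def annotate(self, document, concept):
        self.documents[concept].add(document)

    def add_synonym(self, label, concept):
        self.synonyms[label] = concept

    def add_narrower(self, broader, narrower):
        self.narrower[broader].add(narrower)

    def search(self, term):
        """Return documents for the term, its concept and all narrower concepts."""
        concept = self.synonyms.get(term, term)  # resolve a synonym if there is one
        results, seen, stack = set(), set(), [concept]
        while stack:
            current = stack.pop()
            if current in seen:  # guard against cycles in the hierarchy
                continue
            seen.add(current)
            results |= self.documents.get(current, set())
            stack.extend(self.narrower.get(current, ()))
        return results


store = TagStore()
store.annotate("paper-on-owl.pdf", "ontology")
store.annotate("skos-primer.html", "thesaurus")

# One synonym: searching for the plural now finds the annotated document.
store.add_synonym("ontologies", "ontology")
print(store.search("ontologies"))

# One hierarchy level: searching the broader concept finds both documents.
store.add_narrower("knowledge model", "ontology")
store.add_narrower("knowledge model", "thesaurus")
print(store.search("knowledge model"))
```

Each added statement pays off on the very next query, which is the 'immediate' property described above.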

This post is the first in a series of three posts; the next will focus on the challenges for collaborative knowledge formalization we encounter when moving beyond the very lightweight formalisms currently employed in the tools mentioned above.
