A good CMS will respect you in the morning
My post today ends up with an example of how I think semantic tagging will work, using an example typically found in local government.
The seed of the idea has been stuck in my head for some time, but bear with me and I will explain how I got to spilling it out onto this page.
I invariably find really interesting things over at Tony Hirst's OUseful.info blog. Today I had just followed a tweet from him about a term that was new to me, Content Transclusion, and I read how it features in Eprints. On first reading I assumed that this was to do with creating what Marti Hearst terms "document surrogates": the extract from a document which is included in pages of search results, as seen with Google et al. (read my previous search patterns if you want to know more). Go to Tony's blog and read about Transclusion, wiki links, videos: "Content Transclusion: One step closer". In his blog Tony mentions identifying content at the paragraph level:
"One of the things we’ve started exploring the JISCPress project is the ability to publish each separate paragraph in a document (each with its own URI), in a variety of formats – txt, JSON, HTML, XML."
Paragraph level publishing - I nod in agreement as I read this.
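To make that concrete for myself, here is a minimal sketch in Python of what paragraph level publishing might look like. It is not how JISCPress does it; the base URI, the "para-N" fragment naming and the function names are all my own assumptions for illustration.

```python
import json


def publish_paragraphs(base_uri, text):
    """Split text on blank lines and give each paragraph its own fragment URI.

    The '#para-N' naming scheme is a made-up convention for this sketch.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        {"uri": f"{base_uri}#para-{i}", "text": p}
        for i, p in enumerate(paragraphs, start=1)
    ]


def as_txt(record):
    """Plain text rendering of one paragraph."""
    return record["text"]


def as_json(record):
    """JSON rendering of one paragraph, including its URI."""
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    doc = "First paragraph.\n\nSecond paragraph."
    for record in publish_paragraphs("http://example.org/doc.html", doc):
        print(as_json(record))
```

HTML and XML renderings would follow the same pattern: one small function per format, all working from the same paragraph record.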
Clarification added 16 Aug 09: My comments look misleading. IKS are not responsible for building the stack, but trying to orchestrate efforts, measure effectiveness and share information and 'making useful techniques available'.
One of the ideas some members are working on is to create a sort of "semantic rich text editor", which is shorthand for saying "adding semantic markup tools must be as simple as adding a WYSIWYG editor to a CMS".
I suspect that for many, me included, it is also a metaphor for how the end product must appear and behave on-screen in order for the idea of "ubiquitous semantic tagging" to gain any traction.
Certainly, to move away from thinking about "tagging the document" to thinking about "tagging the paragraph" is a really simple but helpful step-change, in my particular case. I can both vouch for this and describe what I mean by drawing from my experience in dealing with committee documents such as simple Agendas and Minutes in local government. A complete Agenda will consist of many "items" and cover a vast range of subjects, as described in IPSV (Integrated Public Sector Vocabulary, ~3000 terms) or even the LGSL (Local Government Service List, ~900 items). Each "item" on the Agenda in turn may have a single broad subject such as "Grants". Each "Grant" in this agenda item then refers to different local organisations.
Trying to add semantic meaning to this document (the entire agenda) will be very difficult; even adding meaning to the "item" would possibly return too many results. For example, take this single item from an Agenda (I removed real names, and actually it's the final Minutes, but it'll do ...). I've colored the text blocks for clarity: each color is a different subject within the item "Application for Grant Aid".
Minute 000567-09 APPLICATION FOR GRANT AID
At this stage I admit to extending the term 'paragraph' that Tony uses to loosely mean 'a block of text all about a single subject', and you will see that the first block about the Community Transport grant would clearly invoke different IPSV tags than the others:
A quick perusal of IPSV tells me the first block would probably need to be described as being the subject of at least:
- Public transport
- public transport investment
- Community transport
- community buses
- dial-a-ride
- Transport
- transport for disabled people
- Rural Communities
- Voluntary services
- Charities
- grants for charities
- Public funding
... as well as the usual Dublin Core stuff:
- spatial records
- town xxxxxxxx
- postcode area [ABC]
- temporal records
- creator, owner etc
... and perhaps :
- a sum of money
- £ sterling
- your currency
All of that information should then be stored in an RDF store and opened up to the world for querying.
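As a rough illustration of what "stored in an RDF store" could mean for this one block, here is a small Python sketch that turns a few of the tags above into triples and prints them in N-Triples form. The Dublin Core terms namespace is the real one, but the example.org addresses, the IPSV namespace, the slug scheme and the "2009-08" date are placeholders I have assumed; a real system would use the published IPSV identifiers and a proper RDF library.

```python
import json

DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace
IPSV = "http://example.org/ipsv/"      # placeholder, NOT the real IPSV namespace
BLOCK = "http://example.org/minutes/aug2009-community.html#minute000567-09"


class Literal(str):
    """A plain text value, as opposed to a URI."""


def slug(label):
    """Turn a vocabulary label into a URI-friendly form (assumed scheme)."""
    return label.lower().replace(" ", "-")


def describe_block(block_uri, subjects, spatial, temporal):
    """Build (subject, predicate, object) triples for one block of text."""
    triples = [(block_uri, DCTERMS + "subject", IPSV + slug(s)) for s in subjects]
    triples.append((block_uri, DCTERMS + "spatial", Literal(spatial)))
    triples.append((block_uri, DCTERMS + "temporal", Literal(temporal)))
    return triples


def to_ntriples(triples):
    """Serialise triples, one per line, quoting literals and bracketing URIs."""
    lines = []
    for s, p, o in triples:
        obj = json.dumps(str(o)) if isinstance(o, Literal) else f"<{o}>"
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


if __name__ == "__main__":
    triples = describe_block(
        BLOCK,
        ["Community transport", "Dial-a-ride", "Grants for charities"],
        spatial="town xxxxxxxx",
        temporal="2009-08",  # assumed from the 'aug2009' in the page name
    )
    print(to_ntriples(triples))
```

Each line of output is one statement about the block, which is exactly the kind of thing an RDF store can load and expose for querying.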
It also means this block would need its own URI too, or would a fragment still do? /minutes/aug2009-community.html#minute000567-09
I realise that at this point some might say, well, why not make a discrete Agenda Item for each grant? Well, that is not the way the real world works. In practice "Agendas" remain somewhat woolly in order that last minute items can be added, and in any case it breaks my motto: "software should fit people - not the other way round".
Perhaps this idea of "microdocs", paragraph (or block) level publishing, will not be right for every type of web document at the moment, but at first glance I find it a good fit for structured documents. In many ways the committee documents in UK local government provide an ideal testing ground for tools which generate tags because:
- they are already structured documents
- they are usually managed by a small group of people - committee secretaries
- the controlled vocabulary is already in place
- broad subjects can be prompted by users from smaller vocabularies, via mapping thus 'preloading' IPSV subjects
- they contain public information (any sensitive data is hidden in any case)
- done correctly they can contain lots of links and references to more information
- the battle for making useful html is already well lost to PDF **
** The depressingly wide adoption of largely inaccessible, unstructured blobs of text as 'PDF legacy docs' is a lost cause in any case. The proliferation of PDFs provides an opportunity, not a threat.
This is what I understand to be the nature of the task that faces us in local government in the UK, where IPSV (as far as I know) is the accepted controlled and shared vocabulary for semantic markup.
So, (half) close your eyes now and imagine the scenario:
Dave thinks, WTF? It's Friday, I always pick 'Community Agendas' on a Friday, what's wrong with this CMS?
Dave types furiously for 10 minutes (let's imagine it's all that stuff about the Trippa bus grant).
"OK Dave, these are the tags I think you need to describe that item, do you agree?"
Dave mutters: if I see "Coaches - sport coaches" in with public transport suggestions just one more time, I'm gonna, well - I'm gonna tell someone.
Dave deletes a couple more tags that he disagrees with. Save. Save to the web. Save to the semantic web.
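The suggest-then-prune step in Dave's scenario can be sketched in a few lines of Python. This is only a toy under my own assumptions: the keyword lists are invented for the example, and a real tool would match against the full IPSV vocabulary with something smarter than substring matching (which is exactly how "Coaches - sport coaches" sneaks in here).

```python
# Toy vocabulary: tag label -> keywords that trigger it.
# The labels come from the lists above; the keywords are invented.
VOCAB = {
    "Community transport": ["community transport", "dial-a-ride", "minibus"],
    "Grants for charities": ["grant", "charity"],
    "Coaches - sport coaches": ["coach"],
}


def suggest_tags(text, vocab=VOCAB):
    """Return every tag whose keywords appear anywhere in the text."""
    lowered = text.lower()
    return [
        tag for tag, keywords in vocab.items()
        if any(keyword in lowered for keyword in keywords)
    ]


def review(suggested, rejected):
    """Drop the tags the author disagrees with, keeping the original order."""
    return [tag for tag in suggested if tag not in rejected]


if __name__ == "__main__":
    item = "Grant towards the Trippa dial-a-ride coach service"
    suggested = suggest_tags(item)
    print("Suggested:", suggested)
    print("Saved:", review(suggested, {"Coaches - sport coaches"}))
```

The point of the design is that the machine proposes and the person disposes: Dave never has to browse ~3000 terms, he only has to delete the ones that are wrong.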
I don't know exactly how far away from that we are but it's on the horizon. Semi-automated maybe, but semantic tagging, done properly, will mean things like:
- some piece of software on a machine somewhere will be able to fetch all the information about the Trippa Bus Service on its own, and
- collate all grants for 2009
- or maybe return total grants given
- a person will be able to search for information about the dial-a-ride service in Xxxxxxxxx without knowing the local name is "Trippa"
- your phone will tell you the number of the dial-a-ride concept provider as you pass through this area
- <your guess>
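A couple of those are easy to sketch once the blocks carry tags. The Python below is a stand-in for a real query against an RDF store: the records, the second organisation, the URIs and the grant amounts are all made up for illustration (the real minute had its details removed).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedBlock:
    uri: str
    label: str
    subjects: frozenset  # lower-cased subject tags
    year: int
    grant_gbp: int       # invented amounts, for illustration only


BLOCKS = [
    TaggedBlock(
        uri="http://example.org/minutes/aug2009-community.html#block-1",
        label="Trippa Bus Service",
        subjects=frozenset({"community transport", "dial-a-ride"}),
        year=2009,
        grant_gbp=500,
    ),
    TaggedBlock(
        uri="http://example.org/minutes/aug2009-community.html#block-2",
        label="Village Hall Committee",  # hypothetical second grant
        subjects=frozenset({"grants for charities"}),
        year=2009,
        grant_gbp=250,
    ),
]


def total_grants(blocks, year):
    """Add up every grant recorded for the given year."""
    return sum(b.grant_gbp for b in blocks if b.year == year)


def find_by_subject(blocks, subject):
    """Find blocks by subject tag, with no need to know any local names."""
    wanted = subject.lower()
    return [b for b in blocks if wanted in b.subjects]


if __name__ == "__main__":
    print("Total grants 2009:", total_grants(BLOCKS, 2009))
    for block in find_by_subject(BLOCKS, "Dial-a-ride"):
        print("Dial-a-ride provider:", block.label)
```

Searching for "Dial-a-ride" finds the Trippa block even though the word "Trippa" never appears in the query, which is the whole argument for tagging at block level.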
If I am wrong then, please, someone put me right, will ya?
<your guess> add a comment below, cheers.
