A good CMS will respect you in the morning

My post today ends with an example of how I think semantic tagging will work, using a case typically found in local government.

The seed of the idea has been stuck in my head for some time, but bear with me and I will explain how I got to spilling it out onto this page.

I invariably find really interesting things over at Tony Hirst's OUseful.info blog.

Today I had just followed a tweet from him about a term that was new to me, Content Transclusion, and I read how it features in Eprints.  On first reading I assumed that this was to do with creating what Marti Hearst terms "document surrogates": the extract from a document which is included in pages of search results, as seen with Google et al (read my previous search patterns if you want to know more).

Go to Tony's blog and read about transclusion, wiki links and videos in "Content Transclusion: One step closer".

In his blog Tony mentions identifying content at the paragraph level:

One of the things we’ve started exploring the JISCPress project is the ability to publish each separate paragraph in a document (each with its own URI), in a variety of formats – txt, JSON, HTML, XML.


Paragraph level publishing - I nod in agreement as I read this.

That is the conclusion I am quickly approaching as I grapple with how to add semantic meaning to webpages in general and committee documents in particular.

I have not read the spec of JISCPress so maybe this idea is already implemented, but in my mind, to go the whole distance and impregnate the semantic web with this information as linked data then there is a need for semantic-meta-tagging at the paragraph level too.

So I believe there is another delivery mechanism - or perhaps more accurately a descriptor - generating the RDF tags describing the contents and meaning of the paragraph.

The degree to which this can be automated, or at least turned into 'meaning which is stealthily gathered and finally presented to the paragraph author for their confirmation', will eventually govern the size and effectiveness of the semantic web.

The Interactive Knowledge Stack attempts to take ideas from Java and elsewhere to provide a toolset that semantically enables LAMP-based CMSs to pull off such a trick.

Clarification added 16 Aug 09: My comments look misleading. IKS is not responsible for building the stack, but is trying to orchestrate efforts, measure effectiveness, share information and go about 'making useful techniques available'.

One of the ideas some members are working on is to create a sort of "semantic rich text editor", which is shorthand for saying "adding semantic markup tools must be as simple as adding a WYSIWYG editor to a CMS".

I suspect that for many, me included, it is also a metaphor for how the end product must appear and behave on-screen in order for the idea of "ubiquitous semantic tagging" to gain any traction.

Certainly, to move away from thinking about "tagging the document" to thinking about "tagging the paragraph" is a really simple but helpful step-change, in my particular case.

I can both vouch for this and describe what I mean by drawing from my experience in dealing with committee documents such as simple Agendas and Minutes in local government.

A complete Agenda will consist of many "items" and cover a vast range of subjects, as described in IPSV (Integrated Public Sector Vocabulary, ~3000 terms) or even the LGSL (Local Government Service List, ~900 items).

Each "item" on the Agenda in turn may have a single broad subject such as "Grants".

Each "Grant" in this agenda item then refers to different local organisations.

Trying to add semantic meaning to this document (the entire agenda) will be very difficult; even adding meaning to the "item" would possibly return too many results.

For example, take this single item from an Agenda (I removed real names, and actually it's from the final Minutes, but it'll do ... ). I've colored the text blocks for clarity - each color is a different subject within the item "Application for Grant Aid".

Minute 000567-09 APPLICATION FOR GRANT AID

Trippa Community Transport

£5,000 was agreed to assist with providing a door-to-door non-emergency patient transport service for the residents of xxxxxxxxxx and surrounding villages to the County Hospital and other health facilities in the area.

The following grant conditions were agreed:

That the grant be conditional upon the regular (quarterly) supply, to the Town Clerk, of management information from District Community Transport about the origin of Hospital Trippa trips - such that it is possible to identify trips that originate from a [ABC] postcode as a proportion of all trips.

That the grant be conditional upon District Community Transport re-examining its charging policy (within a timescale to be specified) so that xxxxxxxxxx residents pay a lower charge for trips to the County Hospital than other residents.

That the grant be conditional upon District Community Transport supplying evidence which demonstrates that it has been pro-active in seeking funds from other appropriate bodies, particularly other parish councils.

The Xxxxxxx Chamber of Commerce

£500 was agreed as a contribution towards the costs of running the Xxxxxxxxxxx Food Festival as an initiative to encourage local residents and others into the town centre.

The Xxxxxxxxx Trust

£400 was agreed as a contribution towards the Trust's costs in running the Sport Days in September 2009.

============= end snip ============

At this stage I admit to extending the term 'paragraph' that Tony uses to loosely mean 'a block of text all about a single subject', and you will see that the first block about the Community Transport grant would clearly invoke different IPSV tags than the others:

A quick perusal of IPSV tells me the first block would probably need to be tagged with at least these subjects:

  • Public transport
    • public transport investment
  • Community transport
    • community buses
    • dial-a-ride
  • Transport
    • transport for disabled people
  • Rural Communities
  • Voluntary services
  • Charities
    • grants for charities
  • Public funding

... as well as the usual Dublin Core stuff:

  • spatial records
    • town xxxxxxxx
    • postcode area [ABC]
  • temporal records
  • creator, owner etc

... and perhaps :

  • a sum of money
    • £ sterling
    • your currency

All of that information should then be stored in an RDF store and opened up to the world for querying.
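To make that a little more concrete, here is a minimal sketch of how the tags on the Trippa block might be turned into RDF triples and written out as N-Triples, ready to load into a store. It is only an illustration under assumptions: the `example.org` URIs, the `amountGBP` property and the helper names `describe_block` and `to_ntriples` are all made up for this post, and the IPSV subject URIs are placeholders rather than the real identifiers. Only the Dublin Core terms namespace is the real one.

```python
# Sketch only: hand-rolled N-Triples output using nothing but the standard library.
DCTERMS = "http://purl.org/dc/terms/"   # real Dublin Core terms namespace
IPSV = "http://example.org/ipsv/"       # hypothetical, NOT the real IPSV URIs
EX = "http://example.org/terms/"        # hypothetical vocabulary for the grant amount
BLOCK = "http://example.org/minutes/aug2009-community.html#minute000567-09"


def slug(label):
    """Turn a subject label into a URI-friendly slug."""
    return label.strip().lower().replace(" ", "-")


def describe_block(block_uri, subjects, postcode_area, amount_gbp):
    """Build (subject, predicate, object) triples for one block of text."""
    triples = [
        (block_uri, DCTERMS + "subject", ("uri", IPSV + slug(s)))
        for s in subjects
    ]
    triples.append((block_uri, DCTERMS + "spatial", ("literal", postcode_area)))
    triples.append((block_uri, EX + "amountGBP", ("literal", str(amount_gbp))))
    return triples


def to_ntriples(triples):
    """Serialise the triples, one N-Triples statement per line."""
    lines = []
    for subject, predicate, (kind, value) in triples:
        if kind == "uri":
            obj = "<" + value + ">"
        else:
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            obj = '"' + escaped + '"'
        lines.append("<%s> <%s> %s ." % (subject, predicate, obj))
    return "\n".join(lines)


triples = describe_block(
    BLOCK,
    ["Community transport", "Dial-a-ride", "Rural Communities"],
    "[ABC]",   # anonymised postcode area, as in the minute
    5000,
)
print(to_ntriples(triples))
```

A real implementation would more likely lean on an RDF library and the published vocabularies, but the shape of the data is the point here: one block of text, one identifier, many small statements about it.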

It also means this block would need its own URI, or would a fragment still do?

/minutes/aug2009-community.html#minute000567-09
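One thing to weigh when choosing: the fragment is resolved by the client and is not sent to the server in the HTTP request, so a machine asking for the block above would still be handed the whole page. A small sketch with Python's standard `urllib.parse` shows the split. The `example.org` host and the `block_uri` helper are assumptions for illustration only.

```python
from urllib.parse import urldefrag, urljoin

BASE = "http://example.org/"  # hypothetical host, stands in for the council site


def block_uri(page_path, block_id):
    """Build a fragment-style identifier for one block within a page."""
    return urljoin(BASE, page_path.lstrip("/")) + "#" + block_id


uri = block_uri("/minutes/aug2009-community.html", "minute000567-09")

# urldefrag separates what the server is asked for from the fragment,
# which only the client ever sees.
document_url, fragment = urldefrag(uri)
print(document_url)
print(fragment)
```

So a fragment works as a name for the block, but serving the block on its own (as JISCPress does per paragraph) would need a separate URI for it.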

I realise that at this point some might say, well, why not make a discrete Agenda Item for each grant?  Well, that is not the way the real world works.  In practice "Agendas" remain somewhat woolly in order that last minute items can be added, and in any case it breaks my motto: "software should fit people - not the other way round".

Perhaps this idea of "microdocs" - paragraph (or block) level publishing - will not be right for every type of web document at the moment, but on first glance I find it a good fit for structured documents.

In many ways the committee documents in UK local government provide an ideal testing ground for tools which generate tags because:

  1. they are already structured documents
  2. they are usually managed by a small group of people - committee secretaries
  3. the controlled vocabulary is already in place
  4. broad subjects can be prompted by users from smaller vocabularies, via mapping thus 'preloading' IPSV subjects
  5. they contain public information (any sensitive data is hidden in any case)
  6. done correctly they can contain lots of links and references to more information
  7. the battle for making useful html is already well lost to PDF **

** The depressingly wide adoption of largely inaccessible, unstructured blobs of text as 'PDF legacy docs' is a lost cause in any case. The proliferation of PDFs provides an opportunity, not a threat.

This is what I understand to be the nature of the task that faces us in local government in the UK, where IPSV (as far as I know) is the accepted controlled and shared vocabulary for semantic markup.

So, (half) close your eyes now and imagine the scenario:

Dave starts an edit session.

"Which broad subject area are you writing about today, Dave?".

Dave picks one from the list of subjects he always works on; they come hardwired from his job description. Hmmph, what's so intelligent about that?
Dave thinks: WTF? It's Friday, I always pick 'Community Agendas' on a Friday, what's wrong with this CMS?

Dave types furiously for 10 minutes (let's imagine it's all that stuff about the Trippa bus grant).

"OK Dave, these are the tags I think you need to describe that item, do you agree?". 

Dave mutters, if I see "Coaches - sport coaches" in with public transport suggestions just one more time, I'm gonna, well - I'm gonna tell someone.

Dave deletes a couple more tags that he disagrees with.

Save.  Save to the web. Save to the semantic web.
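The suggest-then-confirm step in that scenario could be sketched very roughly as below. This is a toy, not how IKS or any real CMS does it: the keyword-to-subject table is invented for the example (a real tool would draw on the whole IPSV vocabulary and something cleverer than keyword matching), and `suggest_tags` and `confirm_tags` are hypothetical names.

```python
import re

# Hypothetical keyword -> subject mapping, made up for this illustration.
KEYWORD_SUBJECTS = {
    "transport": ["Public transport", "Community transport"],
    "door-to-door": ["Dial-a-ride"],
    "grant": ["Grants for charities", "Public funding"],
    "villages": ["Rural communities"],
}


def suggest_tags(text):
    """Offer subjects whose keyword appears as a whole word in the text."""
    words = set(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    suggestions = []
    for keyword, subjects in KEYWORD_SUBJECTS.items():
        if keyword in words:
            for subject in subjects:
                if subject not in suggestions:
                    suggestions.append(subject)
    return suggestions


def confirm_tags(suggestions, rejected):
    """Keep only the suggestions the author did not delete."""
    return [tag for tag in suggestions if tag not in rejected]


text = ("£5,000 was agreed to assist with providing a door-to-door "
        "non-emergency patient transport service for the residents of "
        "xxxxxxxxxx and surrounding villages to the County Hospital")

suggested = suggest_tags(text)
# Dave disagrees with one suggestion and deletes it before saving.
final_tags = confirm_tags(suggested, rejected={"Public transport"})
print(final_tags)
```

The important part is the last step: the software proposes, the author disposes, and only the confirmed tags get saved.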


I don't know exactly how far away from that we are, but it's on the horizon. Semi-automated maybe, but semantic tagging, done properly, will mean things like:

  1. some piece of software on a machine somewhere will be able to fetch all the information about the Trippa Bus Service on its own, and
    • collate all grants for 2009
    • or maybe return total grants given
  2. a person will be able to search for information about the dial-a-ride service in Xxxxxxxxx without knowing the local name is "Trippa"
  3. your phone will tell you the number of the dial-a-ride concept provider as you pass through this area
  4. <your guess>
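As a rough sketch of the first two items in that list, here is what "collate the grants" and "find the dial-a-ride service without knowing it is called Trippa" might look like once the blocks are tagged. The amounts and recipients come from the minute above; the year 2009 for all three, the subject tags on the second and third grants, and the function names are my assumptions for the example. A real system would query the RDF store rather than a Python list.

```python
# Tagged blocks from the minute, held as plain records for illustration.
GRANTS = [
    {"recipient": "Trippa Community Transport", "amount_gbp": 5000, "year": 2009,
     "subjects": {"Community transport", "Dial-a-ride"}},
    {"recipient": "The Xxxxxxx Chamber of Commerce", "amount_gbp": 500, "year": 2009,
     "subjects": {"Food festivals"}},   # assumed tag
    {"recipient": "The Xxxxxxxxx Trust", "amount_gbp": 400, "year": 2009,
     "subjects": {"Sports events"}},    # assumed tag
]


def grants_in_year(grants, year):
    """Collate all grants recorded for a given year."""
    return [g for g in grants if g["year"] == year]


def grants_about(grants, subject):
    """Find grants by subject, without needing the local service name."""
    wanted = subject.lower()
    return [g for g in grants if wanted in {s.lower() for s in g["subjects"]}]


def total_gbp(grants):
    """Return the total of the grants given, in pounds sterling."""
    return sum(g["amount_gbp"] for g in grants)


total_2009 = total_gbp(grants_in_year(GRANTS, 2009))
dial_a_ride = [g["recipient"] for g in grants_about(GRANTS, "dial-a-ride")]
print(total_2009)
print(dial_a_ride)
```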

If I am wrong then, please, someone put me right, will ya?

<your guess> add a comment below, cheers.