Adding tags to DC metadata
September 18th, 2008
In libraries we usually deal with what we call 'authoritative' metadata, but of course the wisdom of the crowds movement means there is a new source for metadata - the user community. The question I have is how to integrate the two (yet still be able to separate them if necessary).
In particular I've been looking at how to put tags into Dublin Core metadata. I had a look at how people have been defining tagging models, and I think it shows up some deficiencies in DC.
There are a number of efforts looking at modelling tagging, primarily so tags can interoperate between websites. The models all seem to revolve around tagging being an event, along the lines of: an agent asserts at a given time that a particular string is related to a particular resource within a given context.
The models include:
- Tag Ontology Design - Richard Newman (2005) - Establishes that a tag is separate from the described resource. Defines 'tags:taggedWithTag', and defines tags as RDF resources, so they can be identified using URIs and described in their own right (eg. name, equivalentTag). Unfortunately it says taggedWithTag is a sub-property of skos:subject, which is itself a sub-property of dc:subject.
- TagOntology - Thomas Gruber (2005) - Clarifies discussions about ontologies - that just formally specifying a conceptualisation (in an ontology) is different to more specific taxonomic classifications or controlled vocabularies, so Clay Shirky's claim that "ontology is overrated" isn't anti-ontology, but rather anti-"top-down-categorisation". We can now use distributed human intelligence to achieve categorisation. Proposes a conceptual model - Tagging(object, tag, tagger, source, polarity).
- SCOT (No longer up - Ed) (Social Semantic Cloud of Tags) Ontology (2007) - Extends Newman's model to include the collaborativeness of tagging that leads to folksonomies, so includes features like TagCloud and co-occurring frequency. Proposes an RDF ontology, using (amongst others) scot:taggingActivity, tag:taggedItem, tag:associatedTag (where the tag: namespace is Newman's ontology)
- TagCommons (No longer up - Ed) (2007) - Run by Tom Gruber. Still in development, aiming for common semantics. The TagCommons Wiki has useful use-cases for sharing tags and detailed definitions of terms.
- SIOC (Semantically-Interlinked Online Communities) Initiative (2004) - Aims to model entire community sites so they can interoperate, in particular the site-forum-post-tag, space-container-item, and role-user-usergroup streams. Defines 'sioc:topic', but unfortunately this is a sub-property of dc:subject.
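The shape these models share, tagging as an event, can be sketched as a small data structure. This is only an illustration and not the schema of any of the ontologies listed: the class and field names are my own, the example URIs are made up, and the `positive` flag is an assumed reading of the "polarity" slot in Gruber's Tagging(object, tag, tagger, source, polarity).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class TaggingEvent:
    """One tagging act: who said what about which resource, when and where."""
    resource: str                  # URI of the thing being tagged
    tag: str                       # the tag string itself
    agent: str                     # who asserted it (id scheme is illustrative)
    asserted_at: datetime          # when the assertion was made
    context: Optional[str] = None  # site or application it was made in
    positive: bool = True          # stands in for "polarity" (assumed meaning)


def tags_for(events: Iterable[TaggingEvent], resource: str) -> List[str]:
    """Flatten events into a bare, sorted tag list, dropping all provenance."""
    return sorted({e.tag for e in events if e.resource == resource and e.positive})


# Hypothetical events; the resource URIs and context name are invented.
events = [
    TaggingEvent("http://example.org/home", "CoolWebsite", "user:77766",
                 datetime(2008, 9, 18, tzinfo=timezone.utc), "example-site"),
    TaggingEvent("http://example.org/home", "BlackDog", "user:39398",
                 datetime(2008, 9, 17, tzinfo=timezone.utc), "example-site"),
    TaggingEvent("http://example.org/other", "YouTube", "user:39398",
                 datetime(2008, 9, 17, tzinfo=timezone.utc)),
]

print(tags_for(events, "http://example.org/home"))
```

Note that `tags_for` throws away the agent, time and context: what is left is just a list of strings attached to a resource.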
Other relevant work:
- Semantics Through the Tag (No longer up - Ed) - Dave Beckett (2006) - Paper arguing that interoperability of tags is hindered because the semantics/meaning behind each tag is unknown, so tags are difficult to match on. Suggests backing tags with wiki entries, so the meaning behind each tag can be easily captured.
- TAGora project (2006) - This is a 3-year research project in Semiotic Dynamics ("a new field that studies how semiotic relations can originate, spread, and evolve over time in populations, by combining recent advances in linguistics and cognitive science with methodological and theoretical tools from complex systems and computer science").
- Annotea (2001) - A protocol for attaching external annotations to individual web pages - the Annotea model seems to mirror the tagging model thinking.
- RSS Taxonomy module (2001) - A fairly early attempt that adds 'taxo:topics' to RSS items.
- Tag Triples - Phil Dawes (2005) - A proposal for a simplified version of RDF, includes a 'tag' property.
Tags are metadata too!
Now, it seems to me that tagging is essentially the act of creating metadata - information about a resource. Conventionally we think of tags as user-generated and consisting of a couple of words, but generally there are no actual restrictions like that, especially in the above models, so theoretically a tag could be what we conventionally think of as a subject term ("dog"), a title ("The Red Report"), a creator ("ByDCMI"), a description ("Yellow with green polka dots"), etc.
Which raises the question - isn't traditional descriptive metadata effectively a sub-class of tagging? Both 'metadata' and 'tagging' are about associating information with a resource; it's just that metadata is usually considered to be doing it in a more structured way.
Yet in one respect the above tagging models are actually richer than DC. DC metadata is silent on who asserted (and when) that this resource is about 'Dog breeds--History'. All DC metadata does is record the result of the description process, without capturing details of the description event itself.
There have been attempts to add this to DC metadata, in particular Administrative Components (aka "Admin Core"), which includes ac:source (tagger/agent?), ac:scope (context?), and ac:dateRange (taggedDate?). However, these apply only to the entire record, not to individual properties as is possible with the above tagging models.
If I append tagging metadata complying with the above models to my 'authoritative' metadata record, it now looks uneven - the tags all come with provenance details but the main metadata doesn't - it's anyone's guess who made those "authoritative" bits up!
Tag metadata requirements
I've already described two use cases I have, plus the fact that we do know the meaning of some community-generated metadata. But to re-state the requirements, I want to be able to:
- Indicate which metadata values within a DescriptionSet (metadata record) are 'authoritative' and which are 'community-generated'. I'm not going to get into the argument of which source is better, but I need to be able to separate them if necessary at this higher level. Maybe it's kind of like the "non-preferred" concept in controlled vocabularies - it's not a judgement call on which term is better, but more "let's just standardise on this version for interoperability"?
- At a more granular level, I want to be able to indicate the exact source for each value in the description (if known). The reason is obvious for community tags/comments, but even authoritative metadata may come from multiple sources (e.g. geographic location may be added separately, sourced from the local Land Information Ministry).
- The above two apply to all properties, not just a 'tags:taggedWithTag' (or equivalent) property. If I know the user-generated tag "black dog" is a subject term (e.g. because the input screen has a different field for different types of tags, or the user enters the tag as "subject:blackdog"), then it makes most sense to add it to the dc:subject property. But to satisfy the first requirement above, I need some way to indicate this dc:subject wasn't entered by the original cataloguer/indexer/describer.
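To make these requirements a little more concrete, here is a rough sketch of routing a user-entered tag such as "subject:blackdog" into a typed property while keeping track of who supplied it. Everything in it is an assumption for illustration: the prefix-to-property mapping, the `Value` record, the function names, and the use of tags:taggedWithTag as the placeholder property for un-typed tags.

```python
from typing import List, NamedTuple, Tuple

# Hypothetical mapping from a prefix typed by the user to a DC property.
PREFIX_TO_PROPERTY = {
    "subject": "dc:subject",
    "title": "dc:title",
    "creator": "dc:creator",
}
UNTYPED_PROPERTY = "tags:taggedWithTag"  # placeholder for un-typed tags


class Value(NamedTuple):
    prop: str            # property the value is filed under
    value: str           # the value itself
    source: str          # who supplied it
    authoritative: bool  # True for cataloguer-supplied values


def from_user_tag(raw: str, user: str) -> Value:
    """File a community tag under a typed property only if its type is known."""
    prefix, sep, rest = raw.partition(":")
    prop = PREFIX_TO_PROPERTY.get(prefix.lower())
    if sep and rest and prop is not None:
        return Value(prop, rest, user, False)
    return Value(UNTYPED_PROPERTY, raw, user, False)


def split_by_authority(values: List[Value]) -> Tuple[List[Value], List[Value]]:
    """Separate authoritative values from community-generated ones."""
    authoritative = [v for v in values if v.authoritative]
    community = [v for v in values if not v.authoritative]
    return authoritative, community


record = [
    Value("dc:subject", "Dog breeds--History", "National Library of NZ", True),
    from_user_tag("subject:blackdog", "user:39398"),
    from_user_tag("CoolWebsite", "user:77766"),
]
authoritative, community = split_by_authority(record)
print([v.prop for v in community])
```

The point of the sketch is that both dc:subject values sit under the same property, yet the record can still be split by origin because each value carries its own source.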
What might a solution look like?
I don't have the answer, but am willing to discuss...
The question is whether every property needs provenance added, or if it can be applied as a default to the DescriptionSet with the possibility of overriding it in individual properties. Would something like this work?
DescriptionSet(
  AdminSource( "National Library of NZ", RDFType("Authoritative") )
  Description(
    Statement( dc:title, "Homepage" )
    Statement( dc:subject, "Dog breeds - History", Vocabulary("LCSH") )
    Statement( dc:subject, "BlackDog", AdminSource("user:39398") )
    Statement( dc:spatial, "georss:point:45.256,-71.92", AdminSource("Land Information NZ") )
    Statement( tags:taggedWithTag, "CoolWebsite", AdminSource("user:77766") )
  )
)
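The default-with-override idea can also be sketched in code, to check that it behaves as hoped. This is a minimal sketch, not an implementation of the DCMI Abstract Model: the `Statement` and `DescriptionSet` classes and the `effective_source` and `by_source` methods are hypothetical names of my own, and the data simply mirrors the example record.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Statement:
    prop: str
    value: str
    source: Optional[str] = None      # per-statement AdminSource override
    vocabulary: Optional[str] = None  # e.g. "LCSH"


@dataclass
class DescriptionSet:
    default_source: str               # record-level AdminSource
    statements: List[Statement] = field(default_factory=list)

    def effective_source(self, statement: Statement) -> str:
        """Use the statement's own source if given, else the record default."""
        if statement.source is not None:
            return statement.source
        return self.default_source

    def by_source(self) -> Dict[str, List[Statement]]:
        """Group statements by the source that is effectively behind them."""
        groups: Dict[str, List[Statement]] = {}
        for st in self.statements:
            groups.setdefault(self.effective_source(st), []).append(st)
        return groups


record = DescriptionSet("National Library of NZ", [
    Statement("dc:title", "Homepage"),
    Statement("dc:subject", "Dog breeds - History", vocabulary="LCSH"),
    Statement("dc:subject", "BlackDog", source="user:39398"),
    Statement("dc:spatial", "georss:point:45.256,-71.92",
              source="Land Information NZ"),
    Statement("tags:taggedWithTag", "CoolWebsite", source="user:77766"),
])

for src, sts in record.by_source().items():
    print(src, [s.prop for s in sts])
```

With this shape the title and the LCSH subject inherit the library as their source, while the other three statements keep their own. One thing the sketch does not settle is whether an overriding source such as Land Information NZ should also count as 'authoritative'; that would need its own flag or type on the source.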
There's also the question of what property is best to use for un-typed tags. Should we use one of the above proposed properties (tags:taggedWithTag, tag:associatedTag, sioc:topic, taxo:topics) or look for a new one? Pete Johnston has pointed out that it is not appropriate to place tags in dc:subject (or any sub-property of dc:subject) unless it is known they are definitely subjects - tags are often not "about-ness", eg. the tag "YouTube" is more about the publisher/distributor/host than the topic of the resource/video. This would rule out all of these properties (except tag:associatedTag).
If we do create a new property, it seems dc:relation is the most appropriate property for it to be a sub-property of; the alternative is to create a top-level property.