Last modified on 07/23/12 05:12:33

Vocabulary

The Vocabulary module manages vocabularies and thesauri. This involves importing from VDEX, SKOS or XSD; identity resolution (unambiguously identifying the vocabulary that a term comes from, and finding a computer-readable representation of the whole vocabulary); updating automatically from the source; transparently converting from one format to another; replacing a vocabulary after it has been edited; publishing vocabularies automatically; and providing user interface elements that other modules can reuse, such as efficient vocabulary term choosers.

List of critical functionality

  • The Vocabulary module MUST be capable of either dynamically updating a vocabulary when a term isn't found, or regularly checking if a vocabulary has been updated.
  • Provide helper functions for vocabulary correspondences.
  • Provide a resolver service to find computer-readable representations of vocabularies.
  • See the Eureka configuration file for a practical example (configGetVocInfoArray() in config.php).
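The first requirement above (update when a term isn't found, or check regularly) can be illustrated with a minimal sketch. It is not taken from Eureka or COMETE: the class names, the `fetch` callable and the one-day maximum age are all assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class CachedVocabulary:
    """Local copy of a vocabulary plus the time it was last fetched."""
    terms: Dict[str, str]  # term id -> label
    fetched_at: float


@dataclass
class VocabularyCache:
    # Hypothetical callable that downloads and parses a vocabulary source.
    fetch: Callable[[str], Dict[str, str]]
    max_age_seconds: float = 24 * 3600  # assumed refresh interval
    clock: Callable[[], float] = time.time
    _store: Dict[str, CachedVocabulary] = field(default_factory=dict)

    def _refresh(self, vocab_id: str) -> CachedVocabulary:
        cached = CachedVocabulary(self.fetch(vocab_id), self.clock())
        self._store[vocab_id] = cached
        return cached

    def lookup(self, vocab_id: str, term_id: str) -> Optional[str]:
        cached = self._store.get(vocab_id)
        refreshed = False
        # Regular check, done lazily: refetch when the copy is missing or stale.
        if cached is None or self.clock() - cached.fetched_at > self.max_age_seconds:
            cached = self._refresh(vocab_id)
            refreshed = True
        # Dynamic update: refetch once when the term isn't found.
        if term_id not in cached.terms and not refreshed:
            cached = self._refresh(vocab_id)
        return cached.terms.get(term_id)
```

A real implementation would probably also limit how often unknown terms can trigger a refetch, so that a record with a misspelled term does not cause a download on every lookup.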

List of desirable functionality

  • Provide the widgets (or at least the JSON representations necessary to fill them) to select vocabulary entries (See #109). It is NOT realistic to use the MetadataEditor (calling Orbeon and the associated plumbing) to implement the general-purpose vocabulary pickers (See the respective modules).
  • There are two cases:
      • Hierarchical vocabularies, in which the user drills down from one select box to another.
  • Provide a vocabulary correspondence editor (See #10). This is much more important than a general purpose vocabulary editor.
  • Provide a vocabulary editor (See #44).

Implementation choices/alternatives

We will use SKOS as the internal representation, since:

  • The system is based on a graph database, so we need an RDF format from the very beginning.
  • SKOS is based on RDF, RDFS and OWL and is specifically designed to handle vocabularies.
  • Editors are available for SKOS ontologies.

For classifications (hierarchical vocabulary), the internal representation may be implemented as 2 graphs stored directly in Mulgara (outside Fedora). One graph would contain the SKOS relationships, the other would contain the computed inheritance relationships (or transitive closure relationships). The computed relationships should be discarded and recomputed automatically whenever a change in the classification is detected.
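The two-graph idea can be sketched in memory as follows. This is only an illustration under stated assumptions: the real graphs would live in Mulgara, and the `Classification` class, its method names and the sample terms are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def transitive_closure(broader_pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each term to all of its ancestors, given (narrower, broader) pairs."""
    parents: Dict[str, Set[str]] = defaultdict(set)
    for narrower, broader in broader_pairs:
        parents[narrower].add(broader)

    closure: Dict[str, Set[str]] = {}
    for term in list(parents):
        seen: Set[str] = set()
        stack = list(parents[term])
        while stack:
            current = stack.pop()
            if current in seen:  # also protects against cycles in bad data
                continue
            seen.add(current)
            stack.extend(parents.get(current, ()))
        closure[term] = seen
    return closure


class Classification:
    def __init__(self) -> None:
        # First graph: the asserted SKOS-style relationships.
        self.broader: Set[Tuple[str, str]] = set()
        # Second graph: the computed relationships (None means discarded).
        self._closure: Optional[Dict[str, Set[str]]] = None

    def add_broader(self, narrower: str, broader: str) -> None:
        self.broader.add((narrower, broader))
        self._closure = None  # any change discards the computed graph

    def ancestors(self, term: str) -> Set[str]:
        if self._closure is None:  # recompute on demand
            self._closure = transitive_closure(self.broader)
        return self._closure.get(term, set())
```

Recomputing the whole closure on demand keeps the sketch simple; an incremental update would be an optimization that the text above does not require.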

When a term of a classification is deleted, it will probably be safer to mark it as deleted instead of actually removing it from the classification. Likewise, when terms are deprecated, the deprecated value should be kept if possible. This makes it possible to apply smart corrections when we harvest records containing old data.
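One possible shape for such "smart corrections" is sketched below. The `replaced_by` pointer and the `TermRegistry` class are assumptions made for the example; the text above only requires that deleted and deprecated terms remain available.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Term:
    term_id: str
    label: str
    deleted: bool = False
    replaced_by: Optional[str] = None  # hypothetical pointer to a successor term


class TermRegistry:
    def __init__(self) -> None:
        self._terms: Dict[str, Term] = {}

    def add(self, term_id: str, label: str) -> None:
        self._terms[term_id] = Term(term_id, label)

    def delete(self, term_id: str, replaced_by: Optional[str] = None) -> None:
        """Mark the term as deleted instead of removing it."""
        term = self._terms[term_id]
        term.deleted = True
        term.replaced_by = replaced_by

    def correct(self, term_id: str) -> Optional[str]:
        """Return the live term a harvested value maps to, or None if there is none."""
        seen = set()
        current: Optional[str] = term_id
        while current in self._terms and current not in seen:
            seen.add(current)
            term = self._terms[current]
            if not term.deleted:
                return current
            current = term.replaced_by
        return None
```

Because deleted terms stay in the registry, a harvested record that still uses an old identifier can be mapped to its successor instead of failing to resolve.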

However, we will still need excellent support for VDEX externally, as many current modules depend on it, and there is no reason to stop providing it: VDEX is a proven and well understood format, and many real-world vocabularies in our field are only available in VDEX format.

Potential libraries to handle OWL in Java

Potential libraries to handle SKOS in Java

De-referencing vocabulary terms

We want to de-reference vocabulary terms for the following reasons:

First, we want to exploit the relationships that may exist between classification systems to help the user find relevant resources.

Second, we want to allow searching inside descriptions and labels. For example, the following LOM fragment

<learningResourceType>
   <source>http://www.normetic.org/vdex/typeressourcev1_2.xml</source>
   <value>démonstration</value>
</learningResourceType> 
<learningResourceType>
   <source>LOMv1.0</source>
   <value>lecture</value>
</learningResourceType>

would be searched as if it were something like:

<learningResourceType>
   <source>http://www.normetic.org/vdex/typeressourcev1_2.xml</source>
   <value>démonstration
     <caption>
       <langstring language="en">demonstration</langstring>
       <langstring language="fr">démonstration</langstring>
     </caption>
     <description>
       <langstring language="fr">Ressource dont l'usage prévu consiste à présenter de l'information explicative.</langstring>
       <langstring language="en">Resource</langstring>
     </description>
    </value>
</learningResourceType>
<learningResourceType>
   <source>LOMv1.0</source>
   <value>lecture
     <caption>
       <langstring language="fr">exposé</langstring>
       <langstring language="en">lecture</langstring>
     </caption>
     <description>
       <langstring language="en">A lecture is "a discourse given before an audience upon a given subject, usually for the purpose of instruction" (OED).</langstring>
       <langstring language="fr">Discours devant un auditoire sur un sujet donné, habituellement à des fins d’instruction *(OED)</langstring>
     </description>
   </value>
</learningResourceType>

Third, we want to avoid having to update the resources' records and search indexes every time the vocabulary is updated.
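The second and third reasons can be combined in a small sketch: the record keeps only the (source, value) pair, and the searchable text is looked up in the local vocabulary copy when needed. The `VOCABULARIES` dictionary layout and the `searchable_text` function are assumptions made for the example; the captions and description are copied from the fragment above.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory form of the local vocabulary copies:
# (source, value) -> {"caption": {language: text}, "description": {language: text}}
VOCABULARIES: Dict[Tuple[str, str], Dict[str, Dict[str, str]]] = {
    ("http://www.normetic.org/vdex/typeressourcev1_2.xml", "démonstration"): {
        "caption": {"en": "demonstration", "fr": "démonstration"},
        "description": {
            "fr": "Ressource dont l'usage prévu consiste à présenter "
                  "de l'information explicative.",
        },
    },
    ("LOMv1.0", "lecture"): {
        "caption": {"fr": "exposé", "en": "lecture"},
        "description": {},
    },
}


def searchable_text(source: str, value: str, language: str) -> List[str]:
    """Return the raw value plus its caption and description in one language."""
    texts = [value]
    entry = VOCABULARIES.get((source, value))
    if entry is None:
        return texts  # unknown term: only the raw value can be searched
    for part in ("caption", "description"):
        text = entry.get(part, {}).get(language)
        if text and text not in texts:
            texts.append(text)
    return texts
```

Since nothing from the vocabulary is copied into the record, updating a caption only changes what `searchable_text` returns. Note that this sketch keys on the source alone; an ambiguous source such as LOMv1.0 also needs a context, as discussed under "Resolving identifiers".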

Resolving identifiers

De-referencing vocabulary term references in learning resources is problematic at best under current learning resource standards, even when best practices are followed. The reason is that standards like LOM do not guarantee that:

  1. The source identifier of the term is unique.
  2. The source identifier, or the term identifier, is a resolvable URL.

To dereference vocabulary terms successfully, we must:

  • store local copies of vocabularies, if only for performance reasons (it follows that we must keep them up to date regularly and automatically, otherwise we'll get dereferencing failures as the vocabularies get updated)
  • generate unambiguous references to those local vocabularies

This requires a lookup table containing the following pieces of information:

  • The vocabulary id, as expected in the learning resources. Note that there may be more than one id resolving to the same vocabulary, as usage isn't always consistent in the wild.
  • The context, if the id is likely not to be unique. Ex: In LOM, if the source is LOMv1.0, we need the name of the element in which the term was received to know which vocabulary to use.
  • The external URL at which we can retrieve a computer-readable version of the vocabulary (in VDEX, SKOS, XSD, etc.)
  • The internal URL or id where we can find the vocabulary if it's already in the system.

In Eureka this was implemented in configGetVocInfoArray(), although in a very LOM-specific structure.
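A format-neutral version of such a lookup table could look like the sketch below. It does not reproduce configGetVocInfoArray(); the class names, the alias `typeressourcev1_2` and the internal ids are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class VocabularyInfo:
    external_url: Optional[str]  # where a computer-readable version can be retrieved
    internal_id: Optional[str]   # where the vocabulary lives once it is in the system


class VocabularyLookup:
    def __init__(self) -> None:
        # Key: (vocabulary id as found in records, optional context).
        self._table: Dict[Tuple[str, Optional[str]], VocabularyInfo] = {}

    def register(self, source_id: str, info: VocabularyInfo,
                 context: Optional[str] = None) -> None:
        self._table[(source_id, context)] = info

    def find(self, source_id: str, context: Optional[str] = None) -> Optional[VocabularyInfo]:
        # Prefer a context-specific entry, then fall back to a context-free one.
        info = self._table.get((source_id, context))
        if info is None and context is not None:
            info = self._table.get((source_id, None))
        return info


lookup = VocabularyLookup()
normetic = VocabularyInfo(
    external_url="http://www.normetic.org/vdex/typeressourcev1_2.xml",
    internal_id="voc-normetic-typeressource",  # hypothetical internal id
)
# Two ids found in the wild can point at the same vocabulary.
lookup.register("http://www.normetic.org/vdex/typeressourcev1_2.xml", normetic)
lookup.register("typeressourcev1_2", normetic)  # hypothetical alias
# LOMv1.0 is ambiguous on its own, so the element name is the context.
lookup.register(
    "LOMv1.0",
    VocabularyInfo(external_url=None, internal_id="voc-lom-learningresourcetype"),
    context="learningResourceType",
)
```

With this layout, `lookup.find("LOMv1.0")` without a context returns `None`, which makes the ambiguity explicit instead of silently picking a vocabulary.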

Assuming the lookup table generates a cache miss (the vocabulary isn't in the system yet), there are several possible scenarios:

  1. The vocabulary id is a URL, and there is a computer-readable vocabulary there. If the vocabulary isn't yet available locally, the system merely downloads it and stores it locally.
  2. The id isn't resolvable, but the entry in the lookup table has a URL where we can find the vocabulary.
  3. The syntax of the id indicates that it can be resolved through some kind of external resolution service (example: DOI).
  4. The id isn't resolvable and we don't have a URL, but we still have the vocabulary available locally (uploaded manually).
  5. The whole process fails.
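The five scenarios can be read as a fallback chain. The sketch below tries them in the order listed, which is an assumption; the `download` callable, the prefix-based `resolvers` mapping and the `uploaded` dictionary are hypothetical stand-ins for the real services.

```python
from typing import Callable, Dict, Optional
from urllib.parse import urlparse


def looks_like_url(identifier: str) -> bool:
    parsed = urlparse(identifier)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def resolve_vocabulary(
    vocab_id: str,
    table_url: Optional[str],
    download: Callable[[str], Optional[str]],                # returns content or None
    resolvers: Dict[str, Callable[[str], Optional[str]]],    # id prefix -> URL resolver
    uploaded: Dict[str, str],                                # manually uploaded copies
) -> str:
    # 1. The id itself is a URL with a computer-readable vocabulary behind it.
    if looks_like_url(vocab_id):
        content = download(vocab_id)
        if content is not None:
            return content
    # 2. The lookup table entry provides a URL.
    if table_url is not None:
        content = download(table_url)
        if content is not None:
            return content
    # 3. The id syntax points to an external resolution service (e.g. a "doi:" prefix).
    for prefix, resolver in resolvers.items():
        if vocab_id.startswith(prefix):
            url = resolver(vocab_id)
            content = download(url) if url is not None else None
            if content is not None:
                return content
    # 4. A manually uploaded local copy exists.
    if vocab_id in uploaded:
        return uploaded[vocab_id]
    # 5. The whole process fails.
    raise LookupError(f"cannot resolve vocabulary {vocab_id!r}")
```

Raising an exception for the last scenario lets the caller decide whether to index the raw term value anyway or to flag the record for manual review.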

In practice, different organizations can use different classification systems. Real-world examples are: Discipline collégiale, Dewey, RESPEL.

Legend:

  • Dashed arrows: a relationship between two terms in different hierarchical vocabularies. Be mindful of the direction. The label denotes the relationship between the terms (taken from the ISO 2788 VDEX vocabulary):
      • RT (Related Term): The relationship can be navigated in both directions.
      • BT (Broader Term): The relationship is navigated as if the target of the relationship were the term's parent.
      • NT (Narrower Term): The relationship is navigated as if the target of the relationship were a child of the term.
      • TT (Top Term): The relationship is NOT navigated at all.
  • Solid arrows: a parent-child relationship between two terms in the same hierarchical vocabulary, where the term pointed to is the child.
  • Dotted line: a relationship between a learning resource and a vocabulary term, where the term is directly part of the classification of the resource.

[GraphViz image]

With the above relationships, asking the QueryEngine for all resources classified in:

  • rVoc1T1 should return exactly r1, and no other resource
  • rVoc1T2 should return exactly:
      • with one-level navigation: r2, r4, r5, r13 and no other resource
      • with two-level recursive navigation: r2, r4, r5, r7, r8, r9, r12, r13 and no other resource
      • with unlimited recursion: the above and (unverified) r5, r3
  • pivotVocT1 should return exactly r3 and no other resource
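The navigation rules of the legend can be sketched as a depth-limited graph traversal. This is not the Eureka implementation and does not reproduce the graph from the image: the function names are hypothetical, and counting one level per edge of any kind is an assumption that may differ from how the expected results above were obtained.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Optional, Set, Tuple


def build_edges(
    parent_child: Iterable[Tuple[str, str]],
    relationships: Iterable[Tuple[str, str, str]],  # (source, kind, target)
) -> Dict[str, Set[str]]:
    """Build 'navigate downward' edges from both kinds of arrows."""
    edges: Dict[str, Set[str]] = defaultdict(set)
    for parent, child in parent_child:       # solid arrows
        edges[parent].add(child)
    for source, kind, target in relationships:  # dashed arrows
        if kind == "NT":                     # target acts as a child of source
            edges[source].add(target)
        elif kind == "BT":                   # target acts as the parent of source
            edges[target].add(source)
        elif kind == "RT":                   # navigable in both directions
            edges[source].add(target)
            edges[target].add(source)
        # "TT" (and any unknown kind) is not navigated at all
    return edges


def resources_classified_in(
    term: str,
    edges: Dict[str, Set[str]],
    classified: Iterable[Tuple[str, str]],   # (resource, term) dotted lines
    max_depth: Optional[int] = None,         # None means unlimited recursion
) -> Set[str]:
    seen = {term}
    queue = deque([(term, 0)])
    while queue:
        current, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue
        for nxt in edges.get(current, ()):
            if nxt not in seen:              # the seen set also stops RT loops
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return {resource for resource, t in classified if t in seen}
```

For example, with a toy graph where `a1` is the parent of `a2` and `a2` has an NT relationship to `b1`, querying `a1` with `max_depth=1` reaches `a2` but not `b1`, while an unlimited query reaches both.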

Note that all the vocabularies above (and their relationships) are available as VDEX. See the Attachments link at the bottom of the page.

There is a PHPUnit test available to verify the proper processing of the relationships above. It should be easy to port to JUnit. Note that the implementation above is based on VDEX (that is how the current vocabularies and relationships are available). Those relationships will have to be preserved and properly processed in whatever mechanism COMETE actually uses to store and represent them (Fedora and SKOS (OWL?)).

It is likely that we will want to navigate relationships other than those of ISO 2788 (such as fedora:hasEquivalent). To implement a generic system we need to store the following pieces of information:

In Eureka, this is stored in configGetNavigableRelationshipsIdArray(), but only VDEX is supported.

Note that if a general system is implemented, it would be trivial to build it in such a way that it could be exposed over the Internet. This would allow COMETE repositories to ask a central resolution service when they don't have the information in their local lookup table, and to add the information returned by the remote service. It would also greatly simplify bootstrapping a new COMETE repository.

Publishing vocabularies automatically

It would be most desirable to have COMETE do the following:

  1. Publish any vocabulary whose identifier begins with http://adress_of_comete_repository/vocabularies/* at that address automatically
  2. Automatically generate such ids for all newly created vocabularies by default

Publishing all new vocabularies by default would go a long way toward improving interoperability between systems.
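The two rules above amount to a prefix check and an id generator. In the minimal sketch below, the function names, the `slug` parameter and the use of a random UUID for unnamed vocabularies are assumptions; only the `/vocabularies/` path comes from the text.

```python
import uuid
from typing import Optional
from urllib.parse import quote


def vocabulary_base(repository_url: str) -> str:
    """Address under which the repository publishes its vocabularies."""
    return repository_url.rstrip("/") + "/vocabularies/"


def should_publish(vocab_id: str, repository_url: str) -> bool:
    """Rule 1: publish any vocabulary whose id starts with the repository base."""
    return vocab_id.startswith(vocabulary_base(repository_url))


def new_vocabulary_id(repository_url: str, slug: Optional[str] = None) -> str:
    """Rule 2: new vocabularies get a publishable id by default."""
    name = quote(slug, safe="") if slug else uuid.uuid4().hex
    return vocabulary_base(repository_url) + name
```

Because every generated id passes `should_publish`, a newly created vocabulary is resolvable at its own identifier without any extra configuration.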

Attachments
