- List of critical functionality
- List of desirable functionality
- Implementation choices/alternatives
The metadata module provides a uniform API to a number of translators/transformers that convert data from one format to another, extract metadata, or represent a group of resources in a single document (ex: RSS, OpenDocument off-print). Some transformations are trivial (such as presenting search results in HTML); others are more involved (such as converting learning resource metadata from native formats to the Comete Metamodel).
Translators and transformers are essential to the functionality of most modules dealing with metadata (specifically QueryEngine, Vocabulary and Identity). In practice, this module is called every time a representation of a resource, a vocabulary, a vocabulary term, an identity, or a group of those is required.
Specific examples include providing a representation of the common data in a group of resources' metadata for the group editor module, generating user-viewable HTML representations of a resource's metadata (the resource's web page) or of a group of resources (ex: an Atom feed for a Collection, the list of search results in HTML), and de-referencing linked resources (such as vocabulary terms) inside other metadata. Much of the custom code in Comete will live in the Metadata module.
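The uniform API described above could look roughly like the following registry of transformers. This is a hypothetical sketch; the names (MetadataTransformer, Format, TransformerRegistry) are illustrative, not the actual Comete API:

```java
// Hypothetical sketch of a uniform transformer API: each converter declares
// its source and target format, and a registry dispatches by format pair.
import java.util.HashMap;
import java.util.Map;

public class TransformerRegistry {
    public enum Format { LOM, METAMODEL, DC, HTML, ATOM, OPENDOCUMENT }

    public interface MetadataTransformer {
        Format source();
        Format target();
        // Serialized document in, serialized document out.
        String transform(String document);
    }

    private final Map<String, MetadataTransformer> byPair = new HashMap<>();

    public void register(MetadataTransformer t) {
        byPair.put(t.source() + "->" + t.target(), t);
    }

    public String convert(Format from, Format to, String document) {
        MetadataTransformer t = byPair.get(from + "->" + to);
        if (t == null)
            throw new IllegalArgumentException("No transformer for " + from + " -> " + to);
        return t.transform(document);
    }
}
```

A caller would never care whether the underlying technique is XSLT, template filling or hand-written code; it only names the source and target formats.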
List of critical functionality
- LOM -> Metamodel conversion
- Necessary for most Comete functionality
- Metamodel -> DC conversion
- Required to store anything in Fedora
- Metamodel -> HTML
- For display to the end user. This includes collecting information from every vocabulary term used (for hover help), getting the display format of the Identities, etc.
- Metamodel -> HTML for a specific end user language
- Search result list -> HTML results
- For the search UI: takes the raw search results and adds things like logos and other data from the actual resource's Metamodel.
- Metamodel -> Atom
- For Atom/RSS feeds
- Metamodel -> OpenDocument
- For Off-print (#54)
- The Eureka source code includes a directory with a template document that can be reused directly. Code to fill content.xml must be re-implemented in the data transformer. The code also includes the necessary RELAX NG schemas. Eureka used the odpack.pl script to pack the OpenDocument file; there should be a Java equivalent.
List of desirable functionality
- Group diff summary to make the group editor's job easier
- MLR -> Metamodel
It is expected that significant caching will be required as part of this module.
It must be representable as Fedora SDef and SDep objects. Several different techniques will need to be implemented: XSLT is suitable for complete conversion of a single document from one XML format to another, but other formats are not XML-based (vCard, for example) or need to be built piece by piece (ex: RSS feeds).
Format conversions between different Metadata formats
Consider two formats A and B. For each field, the conversion A => B falls into one of the following categories.
Lossy unidirectional: semantic meaning is lost in the conversion; furthermore, converting A => B => A loses information. A typical case is inexact vocabulary translation between non-hierarchical vocabularies. For example, in LOM element 5.2 (learning resource type), a resource has the value "démonstration" in NORMETIC. So we have:

<learningResourceType>
  <source>http://www.normetic.org/vdex/typeressourcev1_2.xml</source>
  <value>démonstration</value>
</learningResourceType>
Now, according to the LOM standard, we are also supposed to send along the closest term from LOMv1.0, so we pick "lecture" (see the actual vocabulary at http://eureka.ntic.org/vdex/LOMv1.0_element_5_2_learning_resource_type_voc.xml). It is not exactly the same, but close enough to be of more use than no information at all. So we would send:
<learningResourceType>
  <source>http://www.normetic.org/vdex/typeressourcev1_2.xml</source>
  <value>démonstration</value>
</learningResourceType>
<learningResourceType>
  <source>LOMv1.0</source>
  <value>lecture</value>
</learningResourceType>
In the real world we would leave it at that: the translation was lossy, in one direction. But suppose for some reason we did not send along the NORMETIC entry "démonstration" and we wanted to translate back. We have "lecture", and we want to find the closest term in NORMETIC; we would probably pick "lecture/présentation". Oops. We not only lost some semantics, the translation is not even symmetrical.
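The asymmetry above can be made concrete as two lookup tables that are not inverses of each other. This is a minimal sketch; the class name is ours, and only the two sample entries from the example are included:

```java
// Sketch of the asymmetric NORMETIC <-> LOMv1.0 term mapping discussed above.
import java.util.HashMap;
import java.util.Map;

public class VocabularyMapping {
    private static final Map<String, String> NORMETIC_TO_LOM = new HashMap<>();
    private static final Map<String, String> LOM_TO_NORMETIC = new HashMap<>();
    static {
        // "démonstration" has no exact LOMv1.0 equivalent; "lecture" is the closest term.
        NORMETIC_TO_LOM.put("démonstration", "lecture");
        // Going back, "lecture" maps best to "lecture/présentation" -- not where we started.
        LOM_TO_NORMETIC.put("lecture", "lecture/présentation");
    }

    public static String toLom(String normeticTerm) {
        return NORMETIC_TO_LOM.get(normeticTerm);
    }

    public static String toNormetic(String lomTerm) {
        return LOM_TO_NORMETIC.get(lomTerm);
    }
}
```

A round trip through these tables does not return to the original term, which is exactly what makes this conversion lossy unidirectional.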
Conditional lossless: a semantic conversion can be attempted. If it succeeds, no information is lost; if it fails, information may or may not be lost. Examples:
- Fields allowing both structured and unstructured metadata, such as LOM 5.7 (typical age range). A value like "12-17" expresses an age range, but a value like "Teenagers" is also allowed. In the first case we can interpret the value semantically (a range is two values separated by a dash), so the conversion can be lossless. In the second case we cannot interpret the value without human intervention and would have a hard time preserving its semantics after the conversion.
- Fields that have a size limit (ex: one format has an unlimited "description", while the other only has a "summary" limited to 250 characters). The conversion is lossless if the original description is 250 characters or less; otherwise it is lossy.
- Converting from a structured element to a less structured one, such as converting a description from Dublin Core Qualified to LOM. If the record did not have an abstract in Dublin Core Qualified, the conversion is lossless. If it did, and we prepend the abstract to the description, the conversion is "lossy" in that we lost some structure and context, but no information is actually lost. The opposite conversion (LOM -> DC Qualified) would be lossless.
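The typical age range example above can be sketched as a parser that either recovers the structure or gives up. Class and method names are ours:

```java
// Sketch of a conditional-lossless conversion attempt for LOM 5.7
// (typical age range): structured values parse, free text does not.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AgeRange {
    private static final Pattern RANGE = Pattern.compile("\\s*(\\d+)\\s*-\\s*(\\d+)\\s*");

    /** Returns {min, max} when the value is structured ("12-17"), or null when
     *  it is free text ("Teenagers") and the conversion needs human review. */
    public static int[] parse(String value) {
        Matcher m = RANGE.matcher(value);
        if (!m.matches()) return null; // unstructured: potentially lossy
        return new int[] { Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)) };
    }
}
```

A null result is the "conversion failed" branch of the conditional-lossless category: the value can still be carried over as text, but its semantics cannot be guaranteed.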
Lossy bidirectional: semantic meaning is lost in the conversion, but in a round-trip conversion A => B => A no information is lost. There are few practical examples. Mostly it happens when the value spaces of both elements are vocabularies that are not directly equivalent but whose mapping is 1 to 1. A convoluted example would be one quality classification with the values "good, bad, very bad" and another with "excellent, acceptable, unacceptable": the meanings are clearly not the same, but there is only one possible conversion, and it is symmetrical. A less convoluted example is a system that uses some internal convention to recover context when it has to translate back. In the DC Qualified -> LOM conversion of an abstract above, if the system not only prepends the abstract to the description but surrounds it with the strings "Abstract: " and "\n---\n", and is able to parse it back, the conversion becomes lossy bidirectional.
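The delimiter trick just described amounts to a pair of functions forming a round trip. A minimal sketch, using the markers from the example (the class and method names are ours):

```java
// Sketch of the lossy-bidirectional abstract/description round trip:
// merge() folds an abstract into a LOM description with markers,
// split() recovers the two fields from the markers.
public class AbstractLens {
    private static final String PREFIX = "Abstract: ";
    private static final String SEP = "\n---\n";

    /** DC Qualified -> LOM: prepend the abstract to the description. */
    public static String merge(String abstractText, String description) {
        if (abstractText == null || abstractText.isEmpty()) return description;
        return PREFIX + abstractText + SEP + description;
    }

    /** LOM -> DC Qualified: recover {abstract, description}; abstract is null if absent. */
    public static String[] split(String lomDescription) {
        if (lomDescription.startsWith(PREFIX)) {
            int sep = lomDescription.indexOf(SEP);
            if (sep >= 0) {
                return new String[] {
                    lomDescription.substring(PREFIX.length(), sep),
                    lomDescription.substring(sep + SEP.length())
                };
            }
        }
        return new String[] { null, lomDescription };
    }
}
```

The forward direction is lossy (structure is flattened into one string), but the round trip restores both fields, which is what moves this conversion into the lossy bidirectional category.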
Lossless: for all intents and purposes, no information is lost. Examples:
- Converting a description from LOM to a Dublin Core Qualified description (which can qualify an abstract separately from the description) loses nothing. The opposite conversion would be conditional lossless.
This translation matrix is extremely useful for editing metadata without a native editor: fields whose translation is lossy bidirectional, conditional lossless (with a successful conversion) or lossless are editable; the others are not.
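The editability rule above could be expressed as a simple predicate over the four categories. A hypothetical sketch (the enum and method names are ours):

```java
// Sketch of the editability rule derived from the translation matrix:
// a field edited through a non-native editor must survive the round trip.
public class Editability {
    public enum ConversionKind {
        LOSSLESS, LOSSY_BIDIRECTIONAL, CONDITIONAL_LOSSLESS, LOSSY_UNIDIRECTIONAL
    }

    public static boolean isEditable(ConversionKind kind, boolean conversionSucceeded) {
        switch (kind) {
            case LOSSLESS:
            case LOSSY_BIDIRECTIONAL:
                return true;
            case CONDITIONAL_LOSSLESS:
                // Editable only when the semantic conversion actually succeeded.
                return conversionSucceeded;
            default:
                return false;
        }
    }
}
```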
Java libraries for accessing the various formats to be translated
- Saxon is the best XSLT processor for Java and, as far as we know, the only open-source solution that supports XSLT 2.0.
- LexEv could be useful for handling CDATA blocks, entities, comments, etc.
- XProc is a XML Pipeline language useful to create a declarative workflow of XML transformations (including XSLT, inclusions, renaming, validation, etc.)
- Tika, a multi-format parser. Of specific interest to us is its support for Dublin Core, OpenDocument, Atom and various compressed archive formats.
- Boomerang (bidirectional data filters)
- Boomerang is a programming language for writing lenses—well-behaved bidirectional transformations—that operate on ad-hoc, textual data formats. Every lens program, when read from left to right, describes a function that maps an input to an output; when read from right to left, the very same program describes a "backwards" function that maps a modified output, together with the original input, back to a modified input.
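As a minimal sketch of the XSLT technique mentioned above, the JDK's built-in JAXP API can drive a transformation directly; the bundled processor only supports XSLT 1.0, but Saxon can be plugged in as the TransformerFactory implementation to get XSLT 2.0. The class name here is ours:

```java
// Minimal JAXP-based XSLT driver: applies a stylesheet to a document,
// both supplied as strings. The JDK ships an XSLT 1.0 processor;
// registering Saxon's TransformerFactory upgrades this to XSLT 2.0.
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltDemo {
    public static String transform(String xml, String xslt) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

This covers the whole-document conversions; the non-XML and piece-by-piece cases (vCard, feeds) would be built in plain Java instead.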
Metadata extraction from resources
It is desired (see #71) to generate initial metadata automatically after uploading the original resource file. Such automatically generated data would of course be revised by a human before publication.
Several open-source software projects provide relevant building blocks for this extraction.
- Ariadne SMmgI
- The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page, making it possible to extract the title, description, etc.
- ExifTool, a very active open source project for extracting/writing structured metadata from a very large number of different formats
- JSOUP, for HTML content scraping
- [FB:] In the "Lossy unidirectional" example, in what case would you have to do this kind of translation? You mention that, according to the LOM standard, we are supposed to take a term from LOMv1.0. When must we do that? Does it mean that every time we create (or import) a NORMETIC LOM we must check whether it already has a vocabulary value for 5.2 with source LOMv1.0? And that if there is none, we must add one? Should this extra value be kept only in our internal model, or should it be considered a "correction" and be visible when the record is re-exposed?
- B: It is an example of a translation between two elements, not a specification of behaviour! But to answer your question, we could apply this transformation when the record is saved, as part of user story #108.