The Electronic Publishing Specialist Group (EPSG)

 
KIDMM: Knowledge, information,
data & metadata management




Documents: Comment from Martin Bryan

The following comment is from Martin Bryan in reaction to an earlier BCS_KIMtec posting from Berin Gowan. Martin Bryan is involved with XML UK, represents the UK at ISO meetings about SGML/XML and related standards, and is a member of the EC’s Open Information Interchange initiative.


 

The need to impose ‘views’ over ‘data collections’

Berin Gowan wrote...

Catalogues, directories, encyclopaedias, dictionaries, timetables etc have drawn on both disciplines [information management and publishing – Ed.] for many years. The challenge has been to devise editorial environments for sustaining such information to a high standard of integrity, to configure databases to host that material and to build middleware both to service the editorial process and to drive the compilation processes that deliver multiple information products and services.

and...

This opens up the issue of different information types, repositories and supporting structures such as indexes and cross-references. If you look, for example, at the Early Day Motions on edmi/parliament.uk/edmi you will see the narrative texts of the motions supported by a thesaurus, MPs as signatories listed by name, party, constituency and even a league table of the number of motions they have signed. There are a host of supporting facilities for filtering and sorting, and for handling such specific features as motions that propose amendments to previous motions. As you can imagine, this interface is rather different to the one for accepting new motions from MPs, and for MPs to add their signature to an existing motion.

The last sentence of each of these paragraphs identifies the ‘real’ problem: information has multiple uses, many of which cannot be foreseen when it is originally captured. The key to solving this problem is the ability to impose ‘views’ over ‘data collections’.

We must not presume that all the data that a particular view needs will come from a single source. In many cases different parts of the view required by a particular user will have been collected at different times. When an early day motion is accepted, only part of its history is available. When an MP adds their signature to it, another piece of the history falls in place. Until the motion can be compared with its contemporaries, or other similarly targeted motions that may be issued decades later, the recorded information may still only be partial. We must not second-guess who might want to compile our data in the future, and must avoid making it difficult for others to reference our data.

How can we ensure that our data is reusable? Firstly, make each reusable component uniquely referenceable. Databases do this by having keys to rows of information with fixed fields. But for unstructured texts this is harder to do. We do not want to have to assign unique identifiers (i.e. keys) to each paragraph. But we should assign keys to all headed sections, all tables and all figures so that they can be referred to easily from elsewhere. We should also be able to count things like paragraphs or bulleted list items.
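As a minimal sketch of this keying policy, the following Python assigns an identifier to every section, table and figure that lacks one, and leaves paragraphs to be counted by position. The element names, the sample document and the `section-1` style of key are assumptions made for illustration only; a real collection would need a scheme that guarantees keys stay stable and do not collide with existing ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative.
doc = ET.fromstring(
    "<book>"
    "<section><title>Intro</title><para>One.</para><para>Two.</para><table/></section>"
    "<section><figure/></section>"
    "</book>"
)

# Element types that should carry a unique key (assumed policy).
KEYED = {"section", "table", "figure"}

def assign_keys(root):
    """Give every keyed element without an id a key such as 'section-1'."""
    counters = {}
    for el in root.iter():  # iter() walks the tree in document order
        if el.tag in KEYED and "id" not in el.attrib:
            counters[el.tag] = counters.get(el.tag, 0) + 1
            el.set("id", f"{el.tag}-{counters[el.tag]}")

assign_keys(doc)

# Keyed items can now be referenced directly from elsewhere...
ids = [el.get("id") for el in doc.iter() if el.get("id")]
table = doc.find(".//*[@id='table-1']")

# ...while paragraphs are simply counted within their keyed section.
para_count = len(doc.findall(".//section[@id='section-1']/para"))
```

Paragraphs carry no keys of their own here, yet each remains addressable as "the nth paragraph of a keyed section".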

XML Paths are a great help here. We can ask for the second paragraph in the third section of the fourth chapter using XML Paths. Unfortunately there is, as yet, no standard way to refer to the second sentence in that paragraph (though you can refer to the part between the first and second full stops, which may or may not be the second sentence!).
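The request described above can be sketched with the limited XPath support in Python's standard `xml.etree.ElementTree` module. The `book`, `chapter`, `section` and `para` element names and the generated text are assumptions for illustration. The helper `between_full_stops` is hypothetical and shows why the "between full stops" workaround is only an approximation of a sentence.

```python
import xml.etree.ElementTree as ET

def build_book():
    """Build a hypothetical book: 4 chapters x 3 sections x 3 paragraphs."""
    book = ET.Element("book")
    for c in range(1, 5):
        chapter = ET.SubElement(book, "chapter")
        for s in range(1, 4):
            section = ET.SubElement(chapter, "section")
            for p in range(1, 4):
                para = ET.SubElement(section, "para")
                para.text = f"Chapter {c}, section {s}, paragraph {p}. A second sentence."
    return book

book = build_book()

# Second paragraph in the third section of the fourth chapter.
target = book.find("chapter[4]/section[3]/para[2]")

def between_full_stops(text, n):
    """Return the text between the (n-1)th and nth full stop (1-based)."""
    return text.split(".")[n - 1].strip()

# Works when every full stop ends a sentence...
second = between_full_stops(target.text, 2)

# ...but an abbreviation shifts the count, so the result is not the second sentence.
not_a_sentence = between_full_stops("Dr. Smith arrived. He left.", 2)
```

In the last line the full stop after "Dr" is counted as a sentence boundary, so the fragment returned is "Smith arrived" rather than "He left".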

It is far easier to refer to things if they are named logically. For example, the same point in a hierarchical tree might be referred to as
book/text/chapter[1]/section[2]/para[3]
or as html/body/h1[1]/h2[2]/p[3]
or as a/b/c[1]/d[2]/e[3].

A computer will provide the same result from all three representations, but a human is more likely to get the first representation correct than either of the more concise representations.
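A small sketch of that claim, again using Python's standard `xml.etree.ElementTree`: three trees with identical shape but different element names are queried with the three paths given above, and each path resolves to the same position. The `holder` wrapper element and the generated item text are assumptions added so that each path can start at its root name. The element names are treated purely as labels, with no claim that the second tree is valid HTML.

```python
import xml.etree.ElementTree as ET

def build(root_name, body_name, level1, level2, leaf):
    """Build holder/root/body with 3 x 3 x 3 nested items under the given names."""
    holder = ET.Element("holder")  # hypothetical wrapper, not part of the paths
    body = ET.SubElement(ET.SubElement(holder, root_name), body_name)
    for i in range(1, 4):
        first = ET.SubElement(body, level1)
        for j in range(1, 4):
            second = ET.SubElement(first, level2)
            for k in range(1, 4):
                ET.SubElement(second, leaf).text = f"item {i}.{j}.{k}"
    return holder

# Same structure, three naming schemes, paired with the matching path.
cases = [
    (build("book", "text", "chapter", "section", "para"),
     "book/text/chapter[1]/section[2]/para[3]"),
    (build("html", "body", "h1", "h2", "p"),
     "html/body/h1[1]/h2[2]/p[3]"),
    (build("a", "b", "c", "d", "e"),
     "a/b/c[1]/d[2]/e[3]"),
]

results = [tree.find(path).text for tree, path in cases]
```

All three lookups land on the same item; only the first path tells a human reader what that item is.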

When creating databases the same rules apply. Field names such as field1, field2 and field3 mean the same to a computer as Snm, Fnm and Tle, or Surname, Forename and Title. But as far as long-term reusability is concerned, the fully-named field is much easier to maintain and use.

In our increasingly globalized world, however, we need to bear in mind that many of our collaborators may not be able to speak our language, and we certainly do not speak all their languages. Therefore it is vital that we have some way of mapping the names we give things to the names others give them. What we need is the ability to ask for things in terms we understand, and have a computer translate this into names that are relevant for querying data sources captured and maintained by people who use alternate terms for the same information.

For this to be possible we need registries that record the equivalence between names as used within different data sources. This need is only just beginning to be realized, and it will be a decade or so before we have such registries generally available.
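To make the idea concrete, here is a toy sketch of such a registry in Python. It is not a description of any existing registry standard: the source names (`source_a`, `source_b`, `source_c`), the sample records and the functions `translate` and `fetch` are all hypothetical, and the field names are borrowed from the database example earlier in this comment.

```python
# Hypothetical registry: one term we understand, mapped to the name
# each data source uses for the same information.
REGISTRY = {
    "surname":  {"source_a": "Surname",  "source_b": "Snm", "source_c": "field1"},
    "forename": {"source_a": "Forename", "source_b": "Fnm", "source_c": "field2"},
    "title":    {"source_a": "Title",    "source_b": "Tle", "source_c": "field3"},
}

def translate(term, source):
    """Translate a term we understand into the name used by one source."""
    return REGISTRY[term][source]

def fetch(record, source, term):
    """Read a value from a source's record using our own term."""
    return record[translate(term, source)]

# Illustrative records, one per source, each using its own field names.
records = {
    "source_a": {"Surname": "Bryan", "Forename": "Martin", "Title": "Mr"},
    "source_b": {"Snm": "Bryan", "Fnm": "Martin", "Tle": "Mr"},
    "source_c": {"field1": "Bryan", "field2": "Martin", "field3": "Mr"},
}

# The same question, asked in our terms, answered by every source.
surnames = [fetch(rec, src, "surname") for src, rec in records.items()]
```

The person asking never needs to know that one source calls the field `Snm` and another `field1`; the registry carries that knowledge.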

For the time being we need to ‘logically label’ and ‘uniquely identify’ reusable information items so that they can easily be referred to by anyone seeking to use the data we captured to create a view of a data collection at some future date.

Martin Bryan

EPSG is a Specialist Group of the British Computer Society