The need to impose views over data collections
Berin Gowan wrote...
Catalogues, directories, encyclopaedias, dictionaries,
timetables etc have drawn on both disciplines [information
management and publishing Ed.] for many
years. The challenge has been to devise editorial
environments for sustaining such information to a high
standard of integrity, to configure databases to host
that material and to build middleware both to service
the editorial process and to drive the compilation
processes that deliver multiple information products
and services.
and...
This opens up the issue of different information types,
repositories and supporting structures such as indexes
and cross-references. If you look, for example, at the
Early Day Motions on edmi/parliament.uk/edmi you will
see the narrative texts of the motions supported by a
thesaurus, MPs as signatories listed by name, party,
constituency and even a league table of the number of
motions they have signed. There are a host of
supporting facilities for filtering and sorting, and
for handling such specific features as motions that
propose amendments to previous motions. As you can
imagine, this interface is rather different to the one
for accepting new motions from MPs, and for MPs to
add their signature to an existing motion.
The last sentence of each of these paragraphs identifies
the real problem: information has multiple uses, many of
which cannot be foreseen when it is originally captured.
The key to solving this problem is the ability to impose
views over data collections.
We must not presume that all the data that a particular
view needs will come from a single source. In many cases
different parts of the view required by a particular
user will have been collected at different times. When
an early day motion is accepted, only part of its
history is available. When an MP adds their signature
to it, another piece of the history falls in place.
Until the motion can be compared with its
contemporaries, or other similarly targeted motions
that may be issued decades later, the recorded
information may still only be partial. We must not
second-guess who might want to compile our data in the
future, and must avoid making it difficult for others
to reference our data.
How can we ensure that our data is reusable? Firstly,
make each reusable component uniquely referenceable.
Databases do this by having keys to rows of information
with fixed fields. But for unstructured texts this is
harder to do. We do not want to have to assign unique
identifiers (i.e. keys) to each paragraph. But we
should assign keys to all headed sections, all tables
and all figures so that they can be referred to easily
from elsewhere. We should also be able to count things
like paragraphs or bulleted list items.
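As a minimal sketch of this idea, the following Python
fragment gives every chapter, section, table and figure
that lacks an identifier a unique one, and leaves
paragraphs to be found by counting. The element names,
the id attribute and the key format (tag name plus a
running number) are all assumptions made for
illustration, not part of any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are illustrative only.
SAMPLE = """<book>
  <chapter><title>One</title>
    <section><title>A</title><para>Text.</para><table/></section>
    <section><title>B</title><figure/><para>More text.</para></section>
  </chapter>
</book>"""

# Element types that should carry a unique key (an assumed policy).
KEYED = ("chapter", "section", "table", "figure")


def assign_keys(root):
    """Give each keyed element lacking an id a unique one.

    Returns a dictionary mapping every key to its element.
    """
    used = {e.get("id") for e in root.iter() if e.get("id")}
    counters = {}
    for elem in root.iter():
        if elem.tag in KEYED and not elem.get("id"):
            n = counters.get(elem.tag, 0)
            while True:
                n += 1
                key = f"{elem.tag}-{n}"
                if key not in used:  # never reuse an existing id
                    break
            counters[elem.tag] = n
            used.add(key)
            elem.set("id", key)
    return {e.get("id"): e for e in root.iter() if e.tag in KEYED}


root = ET.fromstring(SAMPLE)
index = assign_keys(root)

# Keyed items are looked up by name; paragraphs are simply counted.
section = index["section-2"]
print(section.get("id"), "contains", len(section.findall("para")), "paragraph(s)")
```

Keys assigned this way stay valid only if they are stored
with the document, so that later edits do not renumber
them.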
XML Paths are a great help here. We can ask for the
second paragraph in the third section of the fourth
chapter using XML Paths. Unfortunately there is, as
yet, no standard way to refer to the second sentence
in that paragraph (though you can refer to the part
between the first and second full stops, which may or
may not be the second sentence!).
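The request above can be sketched with the limited XML
Path support in Python's standard library. The book,
chapter, section and para element names and the generated
text are assumptions made for illustration, and the
between_full_stops helper is a hypothetical, deliberately
crude way of addressing text within a paragraph.

```python
import xml.etree.ElementTree as ET

# Build a hypothetical book: 4 chapters, 3 sections each, 2 paragraphs each.
book = ET.Element("book")
for c in range(1, 5):
    chapter = ET.SubElement(book, "chapter")
    for s in range(1, 4):
        section = ET.SubElement(chapter, "section")
        for p in range(1, 3):
            para = ET.SubElement(section, "para")
            para.text = f"Chapter {c}, section {s}, paragraph {p}. A second sentence."

# Second paragraph in the third section of the fourth chapter.
target = book.find("chapter[4]/section[3]/para[2]")
print(target.text)


def between_full_stops(text, n):
    """Return the n-th run of text delimited by full stops (1-based)."""
    return text.split(".")[n - 1].strip()


# Here the run between the first and second full stops is the second sentence...
print(between_full_stops(target.text, 2))

# ...but an abbreviation is enough to make the same request go wrong.
print(between_full_stops("Dr. Smith signed. So did others.", 2))
```

The last call returns part of the first sentence, which is
why counting full stops is no substitute for a proper way
of addressing sentences.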
It is far easier to refer to things if they are named
logically. For example, the same point in a hierarchical
tree might be referred to as
book/text/chapter[1]/section[2]/para[3]
or as html/body/h1[1]/h2[2]/p[3]
or as a/b/c[1]/d[2]/e[3].
A computer will provide the same result from all three
representations, but a human is more likely to get the
first representation correct than either of the more
concise representations.
When creating databases the same rules apply. Field
names such as field1, field2 and field3 mean the same to
a computer as Snm, Fnm and Tle or Surname, Forename and
Title. But as far as long-term reusability is concerned,
fully named fields are much easier to maintain and use.
In our increasingly globalized world, however, we need
to bear in mind that many of our collaborators may not
be able to speak our language, and we certainly do not
speak all their languages. Therefore it is vital that
we have some way of mapping the names we give
things to the names others give them. What we need is
the ability to ask for things in terms we understand,
and have a computer translate this into names that are
relevant for querying data sources captured and
maintained by people who use alternative terms for the
same information.
For this to be possible we need registries that record
the equivalence between names as used within different
data sources. This need is only just beginning to be
realized, and it will be a decade or so before we have
such registries generally available.
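To make the idea concrete, here is a minimal sketch of
such a registry as a simple lookup table. The source
names (source_a, source_b), the German field names and the
translate and fetch helpers are all hypothetical; a real
registry would need to be shared, versioned and far
richer than this.

```python
# Hypothetical registry: the name we use -> the name each data source uses.
REGISTRY = {
    "Surname": {"source_a": "Snm", "source_b": "Nachname"},
    "Forename": {"source_a": "Fnm", "source_b": "Vorname"},
    "Title": {"source_a": "Tle", "source_b": "Titel"},
}


def translate(names, source):
    """Translate our names into the field names used by one data source."""
    return [REGISTRY[name][source] for name in names]


def fetch(record, names, source):
    """Read values from a source's record while asking in our own terms."""
    return {name: record[REGISTRY[name][source]] for name in names}


# A record as one source might store it (invented sample data).
record_a = {"Snm": "Smith", "Fnm": "Jane", "Tle": "Dr"}

print(translate(["Surname", "Title"], "source_b"))
print(fetch(record_a, ["Forename", "Surname"], "source_a"))
```

The caller only ever uses Surname, Forename and Title; the
registry decides which local name each source needs.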
For the time being we need to logically label and
uniquely identify reusable information items so that
they can easily be referred to by anyone seeking to use
the data we captured to create a view of a data
collection at some future date.
Martin Bryan