Comput*:
or what comes between grep and google?
1. Introduction
This essay follows the KIDMM meeting in London and takes as its
starting point the discussion document prepared [by Conrad Taylor]
for that meeting. K means knowledge, I means information,
D means data, though I will also want to hear it for document, and
M means metadata, though I also want to hear it for method,
for the methods we will need. Missing is any S, and part
of my argument will be about what is systematisable: how are systems
to be built, rather than things?
The KIDMM meeting followed in a series which started with the British
Association for the Advancement of Science Creating Sparks festival,
proceeded to one on information literacy,
for which the report exists,
one on metadata, for which the report
does not exist, and one on taxonomy, which failed for want of participation.
That KIDMM itself has succeeded will need explanation.
We might ask, and it has often been asked, whether data, information,
knowledge, record, content and document management are all the same
thing, or are different things, and if they are different, whether
what divides them is more than what they share. Are the differences
varieties, or matters of species?
We might want to deal with it by suggesting possible lines of
discourse or explanation. One explanation might be that these
are supported by different departments in organisations, and
the differences are not in matter but approach. So the finance
department, human resources, research, marketing, corporate
intelligence, have different views of what are essentially
the same objects?
Another explanation might be that it is market push from vendors of technology;
they come from different histories and have different views of how to
develop their niche? Consultants are a special brand of this object
and make up stories simply to sell. Academics might be seen to be
part of this latter group, and develop ideas or theories simply to
support research, teaching and writing. It might be that these arise from
different academic disciplines, that histories of ideologies produce
these? There may be others, but those might be enough?
We need also to benchmark the domain of all discourse in general
against the particular discourse on a domain such as health,
transport or education: the subject in general versus the
definite subject, the metasubject of all subjects versus the
particularity of a community of practice.
Finally, in this throat-clearing introduction, we need to note that this
is a matter for the British Computer Society, which has an obligation,
as a consequence of its charter, to organise knowledge of computing
for the public good. We are therefore concerned not only with the
matter in general, but the detail of how we organise to do what.
2. Information
I am going to start at KIDMM point 2 [in Conrad's paper],
definition, where reference to Shannon and Weaver is made,
in a sense I think correct. Thereupon I am going to point to the
Government Ministry of Information, to public information campaigns,
The English Book of Common Prayer, the Oxford English Dictionary, the
Encyclopedia Britannica, then to information science, information
systems, product placement, human resources, finance, all of which use
the information word, agree that it might also mean anything to do
with computers, anything digital, e-very-thing, and thereafter ban the
use of the word as having become without use.
There is undoubtedly a history to all this. I became involved in
teaching information systems design in 1983, having written about
it for more than ten years before that, and done it, or built them.
Shortly afterwards the BCS decided to call itself the Society for
Information Systems Engineering, the CCTA called itself Government
Information Systems, the Computer Board called itself the Joint
Information Systems Committee (JISC), the directors of computing in
higher education called themselves, and still do, UCISA, and so on
and so forth. This might want explanation.
What an information system is, and how one makes one, remains a
question. We might use what we build to attempt an answer?
We may run a simple test. Every time you find yourself using the I
word, stop. Ask yourself what other word would do? Or none? Write
down ten cases.
Listen to other people's conversations. Repeat.
Note the first ten times in the course of the day you experience the I
word in a document. Put these into a folder.
What property do these occasions share?
Now we have a use case.
3. Mind the gap
Let us deal with a matter which seems to me to need to appear
before section 4. I am prepared to accept that the design and layout
of a document contributes to its matter. But that seems to me
secondary to considering the document in the context of KIDMM.
Let us regard KIDMM as a process. I am writing this paper at this
moment. It is the consequence of all the previous papers I have read,
and all the previous meetings I have attended.
But as I press each character on the keyboard I am doing so in linear
time, and serially.
At various points I introduce a paragraph, and sections, which I have
clumsily decided to number. At some point there will be an end, and
then it will go somewhere. I have deliberately left out the scholarly
apparatus as that is part of what I want to examine.
It will then go into a blog or a wiki, it might also be a message on a
list server, and an email message to a variety of lists. It might go
into an electronic journal, or even a snail paper based one, though
probably not if it insists on having no apparatus. It might be
regarded as having been peer reviewed, it might be treated with scorn,
much more likely it will be completely ignored and disappear without
trace.
Historically the process of reviewing meant that a paper, a book,
entered the scholarly apparatus as well as being a report upon it.
There was a process of indexing, in the case of monographs, of
publishing, and then of reviewing. The reviews enter the scholarly
apparatus too and form part of the reception. In some disciplines
there would be annual reviews of everything published in that field,
as known to the authors, or editors. Part of the process of becoming
an expert or a scholar in that field was the knowing of the
literature. This enabled you to spot gaps, connections, but became
more and more detailed, refined, esoteric, repetitive, as little more
remained to be said.
These serials, series, monographs, journals, books, were bought by
libraries in a process of stock selection and procurement, partly
budget-driven and partly by discipline or the eccentricities of
particular scholars. The libraries also subscribed to the scholarly
apparatus. To visit a good library which has a good command of the
literature is an absolute pleasure, for everything is there, you can
move from reference to reference and check ideas.
A special condition applies to archives, which hold the original, not
the secondary, material. These however have their own approach, methods
and values, and need to be divided perhaps into the national or official archives of
institutions or organisations, and the personal or private archives of
individuals. This provides the raw materials in some disciplines,
perhaps excessively so. There is a separation between these primary
sources and the secondary which is unfortunate and to which we will
return. In the context of KIDMM we must be pleased that we have to
rework our ideas of KIDMM in the context of archives, and we must be
pleased too with the presence of an archivist, or an archive
information manager.
A different type of KIDMM exists in museums, or galleries. They
organise things and things contain different KIDMM. These things have
a Spectrum which has a concept of units of information; and we could do
a long thing on this, but there was no museum representative at the
meeting, so I shall simply have to point to this area and develop it
elsewhere.
Pieces of software are themselves KIDMM objects and it is interesting
we spent little time dealing with them, though representatives of the
Artificial Intelligence and Information Retrieval worlds were present.
We also spent little time considering what we know about their approaches,
how they enhance, support or complicate the KIDMM outline I gave earlier.
Each of them might, and we hope will, write something which
enhances our outline.
Then we may make a jump and suggest we need to elaborate our
discipline point from section 1? That health, transport, education,
government, commerce, have different KIDMM simply as a result of their
history and tradition? Even that might be the wrong level of
abstraction and granularity, for pharmaceuticals will be different
from surgery? Information systems design was once about taking the context,
the content, the containers, and organising them. Is that a world we
have lost, rather than the knowledge?
4. Metadata.
Documents, considered generically, had records ascribed to them. The
documents are organised into collections. The records are organised
into databases, or catalogues. The collections themselves are
catalogued and so too are the catalogues, into bibliographies of
bibliographies. These record structures are what we would consider to
be metadata. The content of the record is what we would consider to
be data. The association between a query and a response is what we
would consider to be information. The incorporation of the results or
the consequences we would consider knowledge. Whether knowledge turns
out to be right or wrong according to new or different information, or
whether the information is right or wrong, according to the knowledge,
is the process of learning.
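The distinctions in this paragraph can be made concrete in a few lines. What follows is only a sketch under stated assumptions: the Record class, its four field names and the two sample entries are invented for illustration and are not taken from any real catalogue or standard. The field names stand for the metadata, the values for the data, and the pairing of a query with its response for the information.

```python
from dataclasses import dataclass, fields


@dataclass
class Record:
    """A hypothetical catalogue record; the field names are assumptions."""
    title: str
    creator: str
    date: int
    subject: str


# Illustrative entries only, not drawn from any actual catalogue.
catalogue = [
    Record("Novum Organum", "Bacon, Francis", 1620, "Philosophy"),
    Record("Colon Classification", "Ranganathan, S. R.", 1933, "Classification"),
]

# Metadata: the structure of the record, independent of any one document.
metadata = [f.name for f in fields(Record)]

# Data: the content of one record.
data = catalogue[0]


def query(records, field_name, value):
    """Return the records whose named field contains the value, ignoring case."""
    needle = value.casefold()
    return [r for r in records if needle in str(getattr(r, field_name)).casefold()]


# Information: the association between a query and its response.
question = ("subject", "classification")
response = query(catalogue, *question)

print(metadata)
print([r.title for r in response])
```

Knowledge and learning, in the sense of the paragraph above, are what happens after the response comes back, and the sketch makes no attempt to model them.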
The metadata structures developed differently in different communities of
practice. But there was a process of international harmonisation to
some extent, in which UNESCO in particular played a role.
This process of standardisation may be traced and may have a history.
This history may be charted, but we may bring it to a single point of
agitation, the subject field in Dublin Core.
5. Let's hear it for the subject.
The subject is a puzzling word and a puzzling concept.
We are subjects of our monarch. The monarch is the object of that
sentence. We are its subject. The subject of this paper is the
subject. The subject I called earlier discipline. This is the nature
of complex words.
The subject has an elaborate institution in the same way that theory
has. It is remarkable how little there is on the epidemiology of the
subject, or of theory. But we want to hear it for the organisation of
subjects.
Subjects have names. Names are words and the names of subjects are
words like any others. But subjects are concepts, concepts have names,
and the names of concepts are not words like any others.
The birds and the bees have a taxonomy, as do the elements, and the
parts of an aircraft. But the taxonomy of subjects is a special type
of issue.
Historically we have schemes from at least Francis Bacon through to
Ranganathan on how we may do this. The Dewey decimal classification
scheme is probably the most commonly available and at a variety of
times has been widely taught in primary and secondary schools, and in
higher education institutions. The Library of Congress classification
scheme and the subject headings demonstrate the variety of dictionary
and classificatory approaches while the primitives and protocols of
the concepts are widely available and known.
My argument is these may easily now be mapped onto one another, and
built into visualisations.
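A minimal sketch of what such a mapping might look like follows. The three pairings between Dewey classes and Library of Congress classes are coarse and hand-picked for illustration; they are an assumption of this sketch, not an authoritative concordance, and the names DEWEY_TO_LCC and invert are hypothetical. Each class maps to a list, so that one class may point to several classes in the other scheme.

```python
from collections import defaultdict

# Coarse, hand-made pairings for illustration only (an assumption,
# not a published concordance between the two schemes).
DEWEY_LABELS = {
    "004": "Computer science",
    "020": "Library and information sciences",
    "610": "Medicine and health",
}
DEWEY_TO_LCC = {
    "004": ["QA76"],
    "020": ["Z"],
    "610": ["R"],
}


def invert(crosswalk):
    """Turn a source-to-targets mapping into a targets-to-sources mapping."""
    reverse = defaultdict(list)
    for source, targets in crosswalk.items():
        for target in targets:
            reverse[target].append(source)
    return dict(reverse)


LCC_TO_DEWEY = invert(DEWEY_TO_LCC)

# Walk from a Library of Congress class back to the Dewey label.
for dewey in LCC_TO_DEWEY["QA76"]:
    print("QA76", "->", dewey, DEWEY_LABELS[dewey])
```

The same table, once built, is also the input to a visualisation: the pairs are the edges of a graph between the two schemes.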
This does however raise the issue of general or universal schemes, and
the particular detail which is needed within the subject discipline,
where these general schemes cease to be useful, or have to be
augmented.
It also raises the issue of how many general categories may be assumed
within a polyhierarchical scheme.
The subject as citizen, client, consumer, comrade or victim has to do
the work of synthesising all categories in the management of private
life, and in the balancing of public and private, and rights and
obligations.
While the collection is tangibly organised within a building, you go
to that building, knowing where you are going, and you walk around the
shelves, knowing tangibly whether you have found what you are looking
for or not; knowing whether you have been informed.
First the catalogue becomes dematerialised, rendered intangible, a
strange world, virtualised, when it has lost virtue. Now we have only
grep.
Then some parts of some collections become dematerialised, JSTOR.
What does it contain? What is there? Other parts become Illumina,
Ingenta, Emerald.
Some become electronic journals, some become electronic books.
Finally, the BCS.
We have the chance to run a very rich experiment with the resources of
the BCS. The web site and the charter.
UNESCO and the decade of education for sustainable development.
The issue is that Comput* now engages every part of the life of the
citizen, everywhere. When it comes to the state we do not have the
option of walking away. Actually most of us cannot walk away from the
whole of banking, electricity, gas, water, retailing, music,
transport, education, health, we are the subjects of our subjects.
6. What comes between grep and google?
Actually, nothing. Google is grep.
Grep is character and string matching.
J.S. is not Js.
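The point can be shown in a few lines. This is a sketch, not a description of how grep or google are implemented: the sample strings are invented, and the normalise rule (fold case, drop everything except letters and digits) is an assumption chosen only to show what has to be added before "J.S." and "Js" are treated as the same thing.

```python
import re
import unicodedata


def grep(pattern, lines):
    """Return the lines containing pattern as a literal, case-sensitive string."""
    return [line for line in lines if pattern in line]


def normalise(text):
    """Fold case and keep only letters a-z and digits (an assumed rule)."""
    folded = unicodedata.normalize("NFKD", text).casefold()
    return re.sub(r"[^a-z0-9]", "", folded)


def loose_grep(pattern, lines):
    """Match after normalising both the pattern and each line."""
    key = normalise(pattern)
    return [line for line in lines if key in normalise(line)]


# Invented sample strings.
lines = ["J.S. Bach", "Js Bach", "JS Bach"]

print(grep("J.S.", lines))        # literal matching finds only the exact string
print(loose_grep("J.S.", lines))  # normalised matching finds all three forms
```

The looser match is still only string matching. It will also find "js" inside any longer run of characters, which is the price of not knowing what kind of thing the string names.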
We have built a suite of cases in the hyphen-society of the smallest
units of matter.
We have the historic character, phonetic and graphic, which takes
quite a lot of explaining. We have the morpheme, grapheme, sememe,
which we usually call the syllable. This is the smallest unit of
meaning? Then we have words, dictionaries. Sentences, paragraphs,
graphs, (paragraph is not like parachute), images, sections,
hyperlinks, documents, collections, records, titles, pages, indexes,
foot/endnotes, references, bibliographies, citations, colophon. ISBN.
ISSN. ISRC. (That's enough).
Historically, before google, we had the catalogue, the abstracting and
indexing service, the bibliography of bibliography.
Now we have grep.
Actually we still have the others but we also have a new infopolecon.
We had potentially the X.500 but it went away.
We have Z39.19 but it runs to 188 pages.
We have Dublin Core but at what level, and what is the subject?
6.1 Person
The Oxford Dictionary of National Biography shows us one thing we can
do.
There is a person field on which you search.
This produces a list of persons who match that string.
This is more than grep.
It means they have built a person field and populated it with the
authoritative form of the names the editors have decided upon. This
contains the form of the surname, family name, the form and order, and
the dates. With this standard you may then retrieve.
This could be built into metafind?
Getty has done something similar.
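What a person field adds to grep may be sketched as follows. This is not the data model of the Oxford Dictionary of National Biography or of Getty, which I do not know in detail; the Person class, its fields and the three entries are assumptions made for illustration. The search runs against the surname and any variant names, and what comes back is the authoritative heading with its dates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """A hypothetical authority record for one person."""
    surname: str
    forenames: str
    born: int
    died: int
    variants: tuple = ()

    def heading(self):
        """Authoritative form: surname, forenames, then the dates."""
        return f"{self.surname}, {self.forenames} ({self.born}-{self.died})"


# Illustrative authority file, not copied from any real one.
AUTHORITY = [
    Person("Bacon", "Francis", 1909, 1992),
    Person("Bacon", "Francis", 1561, 1626,
           variants=("Baron Verulam", "Viscount St Alban")),
    Person("Ranganathan", "Shiyali Ramamrita", 1892, 1972),
]


def search_person(query, people=AUTHORITY):
    """Search the person field only, returning headings in order of birth."""
    key = query.casefold()
    hits = [
        p for p in people
        if key in p.surname.casefold()
        or any(key in v.casefold() for v in p.variants)
    ]
    return [p.heading() for p in sorted(hits, key=lambda p: p.born)]


print(search_person("bacon"))
print(search_person("verulam"))
```

The dates do the work that the string cannot: the two Francis Bacons share every character of their names, and only the authority record tells them apart.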
6.2 Encyclopedia Britannica
They weren't at the meeting, but I suppose they could have been.
They have built a rudimentary navigation tool based on their Propaedia; it
is in implementation really rudimentary, but at least shows what could
be done. Developments of the idea, or something similar are available
at the London Business School, but again, rudimentary. Whether these
concepts genuinely aren't scalable is something we can only find out.
6.3 Ulrich's
I take this as an example of a tool of which there are others, but it
is possibly the most important for the matter in hand?
I will for the moment presume that all readers know what it is and
know what I am referring to in terms of its potential significance. We
will want to map its category structure as a reverse index into Dewey
(which is already there).
This is a little test, for if you don't know what Ulrich's is, then we
have a case of an absence of a common sense. We could do the same
thing with the ASLIB directory but it is much weaker. I am not sure
how many general things like this there are. We do not want specific
things as that is a subject issue.
6.4 Googlemaps
The scale in googlemaps lets you move across spatial orders of
magnitude and then choose areas; a simple map then brings in the place
names relevant to that combination.
What we can immediately imagine is that we can harvest those strings
and run them against any combination of databases.
This cracks something I tried to crack in the 1980s with the Times
Gazetteer and failed. But no one else seems to have succeeded.
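The harvesting step might be sketched like this. Nothing here uses a real mapping service: the gazetteer, its approximate coordinates, the two databases and their titles are all invented for illustration, and the function names are hypothetical. A rectangular view selects the place names, and those strings are then run against each database in turn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    name: str
    lat: float
    lon: float


# Hypothetical gazetteer with approximate coordinates.
GAZETTEER = [
    Place("Tivoli", 41.96, 12.80),
    Place("Rome", 41.90, 12.50),
    Place("London", 51.51, -0.13),
]

# Hypothetical databases; the titles are invented.
DATABASES = {
    "catalogue": [
        "Statues from Hadrian's Villa at Tivoli",
        "A history of London libraries",
    ],
    "journal index": ["Rome and the Grand Tour"],
}


def names_in_view(places, south, west, north, east):
    """Harvest the names of places inside a simple rectangular view."""
    return [
        p.name for p in places
        if south <= p.lat <= north and west <= p.lon <= east
    ]


def search_all(names, databases):
    """Run every harvested name against every database, ignoring case."""
    return {
        db: [t for t in titles if any(n.casefold() in t.casefold() for n in names)]
        for db, titles in databases.items()
    }


view = names_in_view(GAZETTEER, south=41.5, west=12.0, north=42.5, east=13.5)
print(view)
print(search_all(view, DATABASES))
```

Changing the scale is then only a matter of changing the four bounds. The sketch ignores views that cross the 180 degree meridian, and the matching is, once again, grep.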
6.5 RDN
I'm working hard here to point to things which actually work
and make a difference.
You can search across the topic fields, but I think actually it is
still only doing grep? But it does come between grep and google.
Whether it is worth the editorial work is open to question?
6.6 Metafind
Though I fear this is only a grand grep, at least there are the
following exercises in discrimination.
Metafind has been implemented by the ULRLS (try googling on that one)
and we may best see it as having three classes of objects. The first
is the library catalogues which constitute the ULRLS. This makes a
difference.
The second is they have grouped the abstracting and indexing services
to which they subscribe. This is already in danger of becoming too
long a list in which you have to know too much. You click their buttons
and they grep.
This will work only within Senate House, unless you have special
privileges, but you have to have special privileges to work in Senate
House anyhow. This is an infopolecon matter.
Already we have the collapsing of categories, for the abstract is an
element of the electronic text and we might have the full text of the
document rather than simply the abstract?
Then we have what are called electronic journals or electronic books.
These have aggregators. This turns back to the previous paragraph,
for we might have had a paper based document which has been digitised,
indeed this might be the whole corpus, or something like Project
Gutenberg which has now been around for thirty years, or we might have
things which are designed for the new media.
6.7 ACM
Now let's bring this nearer to home, and consider the digital library
of the ACM in terms of what we have, or let us consider INSPEC?
We need to test these with cases which are matter to hand, so consider
the Semantic Web? Now try near-field communication. Or metadata. In
what sense may these be said to work?
6.8 BCS
We have had the experiment of building a taxonomy.
So this will have
to be run and then we will learn what the subsequent steps need to be.
7. Wrapping
It seems to me indisputable that very small, very tightly defined, and
very simple things can be done. The ISBN mapped into the ANA and that
works.
But simple things have to be made interoperable with other simple
things if we are going to KIDMM.
Hadrian's bottom is a case.
All I want to do is work out the statues from Hadrian's Villa near
Tivoli which are known, who took them, where they have ended up, and
perhaps, who had them in between, plus perhaps where they have been
written about? I know about LIMC. I know what I have written in the
previous six sections. I know about the British Museum Enlightenment
Gallery, indeed that is partly why I have developed this case. I know
about the V&A, Getty, and Antinous. This is KIDMM.
John Lindsay, University of Kingston