The Electronic Publishing Specialist Group

 
KIDMM: Knowledge, information,
data & metadata management




An essay by John Lindsay

This essay was presented on the KIDMM list on 9 May 2006. John explained that it is not quite complete and he will return to it, but would like it posted anyway. In part it is a response to the paper by Conrad Taylor that was circulated in advance of the KIDMM discussion meeting on 6 March 2006.


 

“Comput*”:
  — or what comes between grep and google?

1. Introduction

This essay follows the KIDMM meeting in London and takes as its starting point the discussion document prepared [by Conrad Taylor] for that meeting. K means knowledge, I means information, D meant data, but I will want to hear it for document, M meant metadata, but I also want to hear it for method – for methods we will need – and missing is any S, for part of my argument will have been what is systematisable, how are systems to be built, rather than things?

The KIDMM meeting followed in a series which started with the British Association for the Advancement of Science Creating Sparks festival, proceeded to one on information literacy, for which the report exists, one on metadata, for which the report does not exist, and one on taxonomy, which failed for want of participation. That KIDMM has succeeded itself will need explanation.

We might ask, and it has often been asked, whether data, information, knowledge, record, content and document management are all the same thing or are different things; and, if they are different, whether what distinguishes them matters more than what they share. Are the differences varieties, or matters of species? (Actually I am not sure the question has been asked, but as the scholarly apparatus is part of what is at issue, I am not going to deal with it.)

We might want to deal with it by suggesting possible lines of discourse or explanation. One explanation might be that these are supported by different departments in organisations, and the differences are not in matter but approach. So the finance department, human resources, research, marketing, corporate intelligence, have different views of what are essentially the same objects?

Another explanation might be that it is market push from vendors of technology; they come from different histories and have different views of how to develop their niche? Consultants are a special brand of this object and make up stories simply to sell. Academics might be seen to be part of this latter group, and develop ideas or theories simply to support research, teaching and writing. It might be that these arise from different academic disciplines, that histories of ideologies produce these? There may be others, but those might be enough?

We need also to benchmark the domain of all discourse in general, against the particular discourse on a domain such as health, transport, education — of the subject in general versus the definite subject, the metasubject of all subject versus the particularity of a community in practice.

Finally, in this throat-clearing introduction, we need to note that this is a matter of the British Computer Society — which has an obligation, as a consequence of its charter, to organise knowledge of computing for the public good. We are therefore concerned not only with the matter in general, but the detail of how we organise to do what.

2. Information

I am going to start at KIDMM point 2 [in Conrad’s paper], definition, where reference to Shannon and Weaver is made, in a sense I think correct. Thereupon I am going to point to the Government Ministry of Information, to public information campaigns, The English Book of Common Prayer, the Oxford English Dictionary, the Encyclopedia Britannica, then to information science, information systems, product placement, human resources, finance, all of which use the information word, agree that it might also mean anything to do with computers, anything digital, e-very-thing, and thereafter ban the use of the word as having become without use.

There is undoubtedly history to all this. I became involved in teaching information systems design in 1983, having written about it for more than ten years before that, and done it, or built them. Shortly afterwards the BCS decided to call itself the Society for Information Systems Engineering, the CCTA called itself Government Information Systems, the Computer Board called itself the Joint Information Systems Committee (JISC), the directors of computing in higher education called themselves, and still do, the UCISA, and so on and so forth. This might want explanation.

What an information system is, and how one makes one, remains a question. We might use what we build to attempt an answer?

We may run a simple test. Every time you find yourself using the I word, stop. Ask yourself what other word would do? Or none? Write down ten cases.

Listen to other people’s conversation. Repeat.

Note the first ten times in the course of the day you experience the I word in a document. Put these into a folder.

What property do these occasions share?

Now we have a use case.

3. Mind the gap

Let us deal with a matter which seems to me to need to appear before section 4. I am prepared to accept that the design and layout of a document contributes to its matter. But that seems to me secondary to considering the document in the context of KIDMM.

Let us regard KIDMM as a process. I am writing this paper at this moment. It is the consequence of all the previous papers I have read, and all the previous meetings I have attended.

But as I press each character on the keyboard I am doing so in linear time, and serially.

At various points I introduce a paragraph, and sections, which I have clumsily decided to number. At some point there will be an end, and then it will go somewhere. I have deliberately left out the scholarly apparatus as that is part of what I want to examine.

It will then go into a blog or a wiki, it might also be a message on a list server, and an email message to a variety of lists. It might go into an electronic journal, or even a snail paper based one, though probably not if it insists on having no apparatus. It might be regarded as having been peer reviewed, it might be treated with scorn, much more likely it will be completely ignored and disappear without trace.

Historically the process of reviewing meant that a paper, a book, entered the scholarly apparatus as well as being a report upon it. There was a process of indexing, in the case of monographs, of publishing, and then of reviewing. The reviews enter the scholarly apparatus too and form part of the reception. In some disciplines there would be annual reviews of everything published in that field, as known to the authors, or editors. Part of the process of becoming an expert or a scholar in that field was the knowing of the literature. This enabled you to spot gaps, connections, but became more and more detailed, refined, esoteric, repetitive, as little more remained to be said.

These serials, series, monographs, journals, books, were bought by libraries in a process of stock selection and procurement, partly budget-driven and partly by discipline or the eccentricities of particular scholars. The libraries also subscribed to the scholarly apparatus. To visit a good library which has a good command of the literature is an absolute pleasure, for everything is there, you can move from reference to reference and check ideas.

A special condition applies to archives which hold the original, not the secondary material. This has, however, its own approach, methods and values, and perhaps needs to be divided into the national or official archives of institutions or organisations, and the personal or private archives of individuals. This provides the raw materials in some disciplines, perhaps excessively so. There is a separation between these primary sources and the secondary which is unfortunate and to which we will return. In the context of KIDMM we must be pleased that we have to rework our ideas of KIDMM in the context of archives, and we must be pleased too with the presence of an archivist, or an archive information manager.

A different type of KIDMM exists in museums, or galleries. They organise things and things contain different KIDMM. These things have a Spectrum which has a concept of units of information; and we could do a long thing on this, but there was no museum representative at the meeting, so I shall simply have to point to this area and develop it elsewhere.

Pieces of software are themselves KIDMM objects and it is interesting we spent little time dealing with them, though representatives of the Artificial Intelligence and Information Retrieval worlds were present. We also spent little time considering what we know about their approaches, how they enhance, support, complicate my KIDMM outline earlier. Each of them might – and we hope will – write something which enhances our outline.

Then we may make a jump and suggest we need to elaborate our discipline point from section 1? That health, transport, education, government, commerce, have different KIDMM simply as a result of their history and tradition? Even that might be the wrong level of abstraction and granularity, for pharmaceuticals will be different from surgery? Information systems design was once about taking the context, the content, the containers, and organising them. Is that a world we have lost, rather than the knowledge?

4. Metadata

Documents, considered generically, had records ascribed to them. The documents are organised into collections. The records are organised into databases, or catalogues. The collections themselves are catalogued and so too are the catalogues, into bibliographies of bibliographies. These record structures are what we would consider to be metadata. The content of the record is what we would consider to be data. The association between a query and a response is what we would consider to be information. The incorporation of the results or the consequences we would consider knowledge. Whether knowledge turns out to be right or wrong according to new or different information, or whether the information is right or wrong, according to the knowledge, is the process of learning.
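The four definitions in this paragraph can be made concrete in a few lines. What follows is only a minimal sketch in Python, under stated assumptions: the field names, the two sample records and the `query` function are invented for illustration and are not drawn from any real catalogue.

```python
# Metadata: the record structure, that is, which fields a record has.
# (Field names are hypothetical.)
RECORD_FIELDS = ("title", "author", "subject")

# Data: the content of records that conform to that structure.
# (Both records are invented placeholders.)
catalogue = [
    {"title": "Title A", "author": "Author One", "subject": "metadata"},
    {"title": "Title B", "author": "Author Two", "subject": "archives"},
]


def conforms(record):
    """True if the record has exactly the fields the structure names."""
    return set(record) == set(RECORD_FIELDS)


def query(field, value):
    """Information: the association between a query and its response."""
    if field not in RECORD_FIELDS:
        raise KeyError(field)
    return [r for r in catalogue if r[field] == value]


response = query("subject", "metadata")
```

What the reader then does with `response`, and whether later responses confirm or overturn it, lies outside the code; that is where the paragraph places knowledge and learning.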

The metadata structures developed differently in different communities of practice. But there was a process of international harmonisation to some extent, in which UNESCO in particular played a role.

This process of standardisation may be traced and may have an history. This history may be charted, but we may bring it to a single point of agitation, the subject field in Dublin Core.

5. Let’s hear it for the subject.

The subject is a puzzling word and a puzzling concept.

We are subjects of our monarch. The monarch is the object of that sentence. We are its subject. The subject of this paper is the subject. The subject I called earlier discipline. This is the nature of complex words.

The subject has an elaborate institution in the same way that theory has. It is remarkable how little there is on the epidemiology of the subject, or of theory. But we want to hear it for the organisation of subjects.

Subjects have names. Names are words and the names of subjects are words like any others. But subjects are concepts, concepts have names, and the names of concepts are not words like any others.

The birds and the bees have a taxonomy, as do the elements, and the parts of an aircraft. But the taxonomy of subjects is a special type of issue.

Historically we have schemes from at least Francis Bacon through to Ranganathan on how we may do this. The Dewey decimal classification scheme is probably the most commonly available and at a variety of times has been widely taught in primary and secondary schools, and in higher education institutions. The Library of Congress classification scheme and the subject headings demonstrate the variety of dictionary and classificatory approaches while the primitives and protocols of the concepts are widely available and known.

My argument is these may easily now be mapped onto one another, and built into visualisations.
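A minimal sketch of what such a mapping might look like, written in Python. The two rows are illustrative pairings of top-level Dewey and Library of Congress notations, not an authoritative concordance, and the `translate` helper is hypothetical.

```python
# Illustrative crosswalk rows: one concept, its notation in each scheme.
# These pairings are assumptions for the sketch, not an official mapping.
CROSSWALK = [
    {"concept": "computer science", "dewey": "004", "lcc": "QA76"},
    {"concept": "library and information science", "dewey": "020", "lcc": "Z"},
]


def translate(notation, source, target):
    """Return the target scheme's notation for a source notation, or None."""
    for row in CROSSWALK:
        if row[source] == notation:
            return row[target]
    return None
```

For example, `translate("004", "dewey", "lcc")` returns `"QA76"`. The hard part is not the lookup but the editorial decision behind each row, and what to do when one scheme divides a subject that the other does not.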

This does however raise the issue of general or universal schemes, and the particular detail which is needed within the subject discipline, where these general schemes cease to be useful, or have to be augmented.

It also raises the issue of how many general categories may be assumed within a polyhierarchical scheme.

The subject as citizen, client, consumer, comrade or victim has to do the work of synthesising all categories in the management of private life, and in the balancing of public and private, and rights and obligations.

While the collection is tangibly organised within a building, you go to that building, knowing where you are going, and you walk around the shelves, knowing tangibly whether you have found what you are looking for or not; knowing whether you have been informed.

First the catalogue becomes dematerialised, rendered intangible, a strange world, virtualised, when it has lost virtue. Now we have only grep.

Then some parts of some collections become dematerialised: JSTOR. What does it contain? What is there? Other parts become Illumina, Ingenta, Emerald.

Some become electronic journals, some become electronic books.

Finally, the BCS

We have the chance to run a very rich experiment with the resources of the BCS. The web site and the charter.

UNESCO and the decade of education for sustainable development.

The issue is that Comput* now engages every part of the life of the citizen, everywhere. When it comes to the state we do not have the option of walking away. Actually most of us cannot walk away from the whole of banking, electricity, gas, water, retailing, music, transport, education, health, we are the subjects of our subjects.

6. What comes between grep and google?

Actually, nothing. Google is grep.

Grep is character and string matching.

J.S. is not Js.
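A minimal sketch of the point, in Python rather than grep itself. Literal string matching treats "J.S." and "Js" as different strings; they agree only once someone has chosen and written down a normalisation rule. The sample lines and the `normalise` rule are assumptions made for illustration.

```python
import re

# Invented sample lines: three ways of writing the same initials.
lines = [
    "Smith, J.S.",
    "Smith, Js",
    "Smith, J. S.",
]


def literal_match(pattern, text):
    """Fixed-string matching: is the exact character sequence present?"""
    return pattern in text


def normalise(text):
    """Hypothetical editorial rule: drop full stops and spaces, fold case."""
    return re.sub(r"[.\s]", "", text).lower()


literal_hits = [l for l in lines if literal_match("J.S.", l)]
normalised_hits = [l for l in lines if normalise("J.S.") in normalise(l)]
```

Here `literal_hits` holds only the first line, while `normalised_hits` holds all three. The second result is not better matching; it is a decision about which differences count.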

We have built a suite of cases in the hyphen-society of the smallest units of matter.

We have the historic character, phonetic and graphic, which takes quite a lot of explaining. We have the morpheme, grapheme, sememe, which we usually call the syllable. This is the smallest unit of meaning? Then we have words, dictionaries. Sentences, paragraphs, graphs, (paragraph is not like parachute), images, sections, hyperlinks, documents, collections, records, titles, pages, indexes, foot/endnotes, references, bibliographies, citations, colophon. ISBN. ISSN. ISRC. (That’s enough).

Historically, before google, we had the catalogue, the abstracting and indexing service, the bibliography of bibliography.

Now we have grep.

Actually we still have the others but we also have a new infopolecon.

We had potentially the X.500 but it went away.

We have Z39.19 but it runs to 188 pages.

We have Dublin Core but at what level, and what is the subject?

6.1 Person

The Oxford Dictionary of National Biography shows us one thing we can do.

There is a person field on which you search.

This produces a list of persons who match that string.

This is more than grep.

It means they have built a person field and populated it with the authoritative form of the names the editors have decided upon. This contains the form of the surname, family name, the form and order, and the dates. With this standard you may then retrieve.
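The difference between a populated person field and grep can be sketched as follows. This is a hedged illustration in Python, not a description of how the Oxford Dictionary of National Biography is implemented; the persons, the dates and the function names are all invented.

```python
import re

# Hypothetical authority records: one editorially chosen form per person,
# with surname, forenames and dates held in separate fields.
PERSONS = [
    {"surname": "Smith", "forenames": "John", "born": 1800, "died": 1870},
    {"surname": "Smith", "forenames": "Jane", "born": 1850, "died": 1920},
    {"surname": "Smithson", "forenames": "Ann", "born": 1790, "died": 1860},
]

TEXT = "Mr Smith met the blacksmith near Smithfield."


def grep(pattern, text):
    """String matching only: every occurrence of the characters, any case."""
    return re.findall(pattern, text, flags=re.IGNORECASE)


def person_search(surname, alive_in=None):
    """Fielded search: match the surname field, optionally filter by date."""
    hits = [p for p in PERSONS if p["surname"] == surname]
    if alive_in is not None:
        hits = [p for p in hits if p["born"] <= alive_in <= p["died"]]
    return [
        f'{p["surname"]}, {p["forenames"]} ({p["born"]}-{p["died"]})'
        for p in hits
    ]
```

`grep("smith", TEXT)` finds three strings, two of which are not persons at all. `person_search("Smith", alive_in=1900)` returns a single heading, because the field knows that a surname is a surname and that dates are dates.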

This could be built into metafind?

Getty has done something similar.

6.2 Encyclopedia Britannica

They weren’t at the meeting, but I suppose they could have been. They have built a rudimentary navigation tool based on their Propædia; in implementation it is really rudimentary, but it at least shows what could be done. Developments of the idea, or something similar, are available at the London Business School, but again, rudimentary. Whether these concepts genuinely aren’t scalable is something we can only find out.

6.3 Ulrichs

I take this as an example of a tool of which there are others, but it is possibly the most important for the matter in hand?

I will for the moment presume that all readers know what it is and know what I am referring to in terms of its potential significance. We will want to map its category structure as a reverse index into Dewey (which is already there).

This is a little test, for if you don’t know what Ulrichs is, then we have a case of an absence of a common sense. We could do the same thing with the ASLIB directory but it is much weaker. I am not sure how many general things like this there are. We do not want specific things as that is a subject issue.

6.4 Googlemaps

The scale in googlemaps lets you move across spatial orders of magnitude, then choose areas, and a simple map then brings in the place names relevant to that combination.

What we can immediately imagine is that we can harvest those strings and run them against any combination of databases.
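A minimal sketch of that harvesting step, in Python. It assumes a small gazetteer of place names with approximate coordinates and two toy databases of record titles; none of this uses or describes the actual googlemaps interface, and every name and record here is hypothetical.

```python
# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Tivoli": (41.96, 12.80),
    "Rome": (41.90, 12.50),
    "London": (51.51, -0.13),
}

# Hypothetical databases, each a list of invented record titles.
DATABASES = {
    "catalogue": ["Statues from the villa near Tivoli", "A history of London docks"],
    "index": ["Rome and its collectors", "Tivoli excavations"],
}


def names_in_box(south, west, north, east):
    """Harvest the place names that fall inside the chosen map area."""
    return sorted(
        name
        for name, (lat, lon) in GAZETTEER.items()
        if south <= lat <= north and west <= lon <= east
    )


def run_against(names, databases):
    """Run the harvested strings against each database of records."""
    return {
        db: [rec for rec in records if any(n in rec for n in names)]
        for db, records in databases.items()
    }


harvested = names_in_box(41.5, 12.0, 42.5, 13.5)
results = run_against(harvested, DATABASES)
```

The area chosen on the map does the selecting; the second step is still string matching, so it inherits every weakness of grep.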

This cracks something I tried to crack in the 1980s with the Times Gazetteer and failed. But no one else seems to have succeeded.

6.5 RDN

I’m working hard here to point to things which actually work and make a difference.

You can search across the topic fields, but I think actually it is still only doing grep? But it does come between grep and google.

Whether it is worth the editorial work is open to question?

6.6 Metafind

Though I fear this is only a grand grep, at least there are the following exercises in discrimination.

Metafind has been implemented by the ULRLS (try googling on that one) and we may best see it as having three classes of objects. The first is the library catalogues which constitute the ULRLS. This makes a difference.

The second is they have grouped the abstracting and indexing services to which they subscribe. This is already in danger of becoming too long a list in which you have to know too much. You click their buttons and they grep.

This will work only within Senate House, unless you have special privileges, but you have to have special privileges to work in Senate House anyhow. This is an infopolecon matter.

Already we have the collapsing of categories, for the abstract is an element of the electronic text and we might have the full text of the document rather than simply the abstract?

Then we have what are called electronic journals or electronic books. These have aggregators. This turns back to the previous paragraph, for we might have had a paper based document which has been digitised, indeed this might be the whole corpus, or something like Project Gutenberg which has now been around for thirty years, or we might have things which are designed for the new media.

6.7 ACM

Now let’s bring this nearer to home, and consider the digital library of the ACM in terms of what we have, or let us consider INSPEC?

We need to test these with cases which are matter to hand, so consider the Semantic Web? Now try near-field communication. Or metadata. In what sense may these be said to work?

6.8 BCS

We have had the experiment of building a taxonomy. So this will have to be run and then we will learn what the subsequent steps need to be.

7. Wrapping

It seems to me indisputable that very small, very tightly defined, and very simple things can be done. The ISBN mapped into the ANA and that works.
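A minimal sketch of one such small, tightly defined thing, assuming the mapping meant here is the conversion of a ten-character ISBN into a thirteen-digit EAN article number by prefixing 978 and recomputing the check digit. The function name is my own, and the sketch does not validate the incoming ISBN's check character.

```python
def isbn10_to_ean13(isbn10):
    """Convert an ISBN-10 (hyphens allowed) to its 978-prefixed EAN-13 form."""
    chars = [c for c in isbn10 if c.isdigit() or c in "Xx"]
    if len(chars) != 10:
        raise ValueError("expected ten ISBN characters")
    # Drop the old check character and add the book prefix.
    body = "978" + "".join(chars[:9])
    # EAN-13 check digit: weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    check = (10 - total % 10) % 10
    return body + str(check)
```

For example, `isbn10_to_ean13("0-306-40615-2")` returns `"9780306406157"`. The rule is a prefix and a checksum, which is much of why it works.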

But simple things have to be made interoperable with other simple things if we are going to KIDMM.

Hadrian’s bottom is a case.

All I want to do is work out the statues from Hadrian’s Villa near Tivoli which are known, who took them, where they have ended up, and perhaps, who had them in between, plus perhaps where they have been written about? I know about LIMC. I know what I have written in the previous six sections. I know about the British Museum Enlightenment Gallery, indeed that is partly why I have developed this case. I know about the V&A, Getty, and Antinous. This is KIDMM.

 
John Lindsay, University of Kingston