The addition of metadata to the contents of ncse was challenging. Not only were there around 100,000 pages of content to which we had to add information, but this content was structured in a complex hierarchy, creating tiers of metadata at edition, title, volume, department and article level. This page gives an account of how we addressed this challenge, as well as full details of the resulting ncse metadata schema.


ncse Metadata Schema

The existing ncse metadata schema is as follows:

| Field | Description | Method | Applies to |
| --- | --- | --- | --- |
| Uniform periodical title | Describes the label given to the whole of a title, regardless of changes over the run. | hand | issue > department > item |
| Actual periodical title | Gives the title as it appears on the masthead. Equivalent to Dublin Core ‘Title.’ | hand | issue > department > item |
| Date | Gives the date of publication (as far as known). Equivalent to Dublin Core ‘Date.’ | hand | issue > department > item |
| Source | Labels content as part of ncse. Equivalent to Dublin Core ‘Source.’ | hand | issue > department > item |
| Volume number | Gives the volume and series number of the issue. | hand | issue > department > item |
| Issue number | Gives the number of the issue. | hand | issue > department > item |
| Edition | Labels the issue as being a particular edition (town, country, 1-9). | hand | issue > department > item |
| Number of editions | Gives the number of editions of an issue. | hand | issue > department > item |
| Image description | Gives a number of keywords that describe what the image contains. Drawn from DMVI schema. | hand | item |
| Price | Gives the price of an issue if known. | hand | issue > department > item |
| Bibliographic location | Labels an article as appearing in the wrapper or in the issue itself. | hand | department > item |
| Size | Gives the paper size of an issue. Equivalent to Dublin Core ‘Size.’ | hand | issue > department > item |
| Editor | Labels issues with the name of their editor. | hand | not used |
| Publisher | Labels issues with the name of their publisher. | hand | not used |
| Department genre | Labels departments with a genre descriptor. | hand / text mining | not used |
| Department title | Labels departments with a title. | hand | not used |
| Page number | Provides a page label that corresponds to that printed. | hand / script | page |
| Persons | Identifies all the people mentioned in the edition. | named entity extraction | item |
| Institutions | Identifies all the institutions mentioned in the edition. | named entity extraction | item |
| Places | Identifies all the places mentioned in the edition. | named entity extraction | item |
| Genre | Marks up items that correspond to a list of predefined genres. | text mining | item |
| Events | Marks up items that correspond to a list of predefined events. | text mining | item |
| Subject | Labels items with subject keywords. Drawn from USAS. | USAS semantic tagger | item |

While deciding whether or not to include multiple editions (for information on this decision click here), we also explored a variety of strategies to automate as much of the metadata entry as possible. There were three moments when metadata was added to content, and at each moment a combination of human and automated input was employed. Where a field was not used, this was usually because experiments proved unsuccessful or we did not have the time to do the necessary research. Full details of these methods are below; however, one of the labour-saving strategies we implemented is relevant here. To reduce the amount of manual data entry, we designed a facility for metadata to be inherited from issue to department and then to item level. The ‘Applies to’ column of the schema above shows where we took advantage of this so as not to have to fill in the same field for each item within a department or issue.
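
The inheritance from issue to department to item can be pictured with a minimal sketch. This is an illustration only, not the mechanism used in ncse: the `Node` class, its `resolve` method and all sample values are hypothetical, and the sketch simply assumes that a field left unset at one level takes its value from the nearest level above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One level of the hierarchy: an issue, a department, or an item (hypothetical model)."""
    level: str
    metadata: dict = field(default_factory=dict)
    parent: Optional["Node"] = None

    def resolve(self, field_name: str) -> Optional[str]:
        """Return the field's value, walking up through the parents when it is unset here."""
        node = self
        while node is not None:
            if field_name in node.metadata:
                return node.metadata[field_name]
            node = node.parent
        return None


# Sample values are invented for illustration.
issue = Node("issue", {"Uniform periodical title": "Example Title", "Date": "1850-01-05"})
department = Node("department", {"Bibliographic location": "issue"}, parent=issue)
item = Node("item", {"Image description": "portrait"}, parent=department)

print(item.resolve("Uniform periodical title"))  # entered once, at issue level
print(item.resolve("Bibliographic location"))    # entered once, at department level
print(item.resolve("Image description"))         # entered on the item itself
```

In this sketch a field such as ‘Uniform periodical title’ is entered once for the issue and is then available to every department and item beneath it, which is the saving in manual entry described above.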


History of the ncse metadata schema

In the early stages of the project we undertook a survey of the material to try to understand what data categories we could identify in serials and how they related to each other. The result was a very complex diagram that can be downloaded here. Even given our initial estimates of the scale of ncse, we recognized that this was an unreasonable amount of data and set about creating a more manageable schema.

Our early designs broke the diagram down into six areas: bibliographic metadata; structural metadata; formal metadata; generic metadata; advanced metadata; and concept mapping.

Bibliographic metadata

This applied both to the serials and the digital resource. Fields included ‘title’ (of article), ‘creator’ (of article or digital resource), ‘date’ (of article or digital resource), ‘publisher’ (name and place of publisher of article), ‘printer’ (name and place of printer of article), ‘editor’ (of article or digital resource), ‘pagination’ (span of the article), ‘price’ (of the issue), ‘creator’ (of digital resource), ‘format’ (of digital resource), ‘origin’ (of digital resource, i.e. ncse). In creating these fields we referred to the Dublin Core schema in an attempt to ensure compatibility.

Structural metadata

The structural metadata fields were designed to indicate where a particular item occurred in the edition and what its relationships to other constituent parts were. Fields included ‘given title’ (of whole run), ‘actual title’ (i.e. that printed on the masthead of a single issue), ‘series number’, ‘volume number’, ‘edition’ (intended to label which edition a particular component occurred within), ‘prelims / numbers’ (designed to distinguish between items that appeared on front matter and those within issues themselves), ‘department’, ‘item’.

Formal metadata

This category was principally designed to accommodate images. Initially we were concerned simply to specify a field that would mark up images; combined with the bibliographic and advanced metadata, this field would give a fuller description of each image.

Generic metadata

This was a single field that would label each item with a genre such as advertisements, obituaries, correspondence, leading articles, news etc. We kept this field and used it to explore text mining techniques for metadata entry.

Advanced metadata

Advanced metadata referred to those categories that described the content of articles. Initially we conceived these fields as being a form of index, populated by the content of the periodicals. The categories were ‘people’, ‘places’, ‘events’, ‘objects’, ‘publications’, and ‘institutions.’ Although we attempted some experiments to see how long these indexes would take to create by hand, our decision to include multiple editions, and so edit an edition of c. 100,000 pages, rendered this impractical. We pursued the advanced metadata categories through other means, however, as you can read in named entity extraction, text mining and semantic tagging.

Concept mapping

Concept mapping was intended to map thematic concepts across different types of content in the edition. For a description of our work in this area click here. For an account of how we actually implemented thematic metadata go to named entity extraction, text mining and semantic tagging.

This metadata system was fairly loosely conceived as we designed it alongside experiments in segmentation. Without a clear idea of how we were going to produce digital copy, the form this would take, and the way different components were linked to each other, we could not design a metadata schema or the means for implementing it.

As we began to progress with the preparation of content and the segmentation, we also began to refine the metadata schema. Over the course of Autumn 2006 we developed it into a form that more closely resembles the one given above. To download this earlier schema, click here; to see a visual representation of it click here. As you may note, we had already separated the advanced metadata categories out, and had begun to think carefully about the values that would appear in each field. This was especially problematic for subject and image. At this stage we were not sure how we could classify the content for each category and were exploring various existing schemes, as well as our own concept maps, in order to decide on a strategy that was suitable for the project’s requirements. Accounts of how we developed subject metadata can be read here and image metadata here.


Adding metadata during segmentation

Once we had established the methodology for the production of ncse, we began to analyze the points at which we could implement our metadata schema. Having a more developed workflow allowed us to begin to conceptualize which elements of the structure would be encoded into the XML, and what relationships needed to be labelled with metadata. The first place where we could begin to add metadata was when working with the segmented PDFs to amend the segmentation applied by Olive Software. At this stage we were mainly working to correct the content: making sure the right pages were bound into the right issues; that items were correctly distinguished from each other; and that departments were correctly labelled. The Olive plugin for Adobe Acrobat permits the addition of metadata but, rather than add all the metadata at this stage, we simply used the plugin to make any changes to the page numbers that were allocated to the pages of each PDF document. For more information about page numbers, click here.

The Olive Administrator Application is a web application that allows you both to organize the content of data repositories and to add metadata to parts of it. While the PDFs were being segmented and output by Olive in Israel, and the resulting data uploaded onto the server at King’s, we went over the content in order to finalize any outstanding lists of permitted values. These lists of values were then loaded into the Olive Administrator Application, allowing our editorial assistants to insert metadata into the XML through a relatively easy-to-use interface and reducing the amount of metadata entered without a controlled vocabulary. We conceived of the metadata task as two sweeps: one for bibliographic metadata and one for image metadata. We began adding metadata at the end of November 2007, and it took a team of six part-time editorial assistants until March 2008 to complete. For more information about the generation of the vocabulary for the addition of image metadata, click here.
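
The effect of entering metadata against lists of permitted values can be sketched as follows. This is not the Olive Administrator Application's interface, which is not documented on this page; the `VOCABULARY` mapping and the `validate` function are hypothetical, and the permitted values shown are only those stated in the schema above (edition: town, country, 1-9; bibliographic location: wrapper or issue).

```python
# Hypothetical controlled vocabularies, limited to values named in the schema table.
VOCABULARY = {
    "Edition": {"town", "country"} | {str(n) for n in range(1, 10)},
    "Bibliographic location": {"wrapper", "issue"},
}


def validate(record, vocabulary=VOCABULARY):
    """Return the (field, value) pairs whose value is not in that field's vocabulary.

    Fields without a controlled vocabulary (for example 'Price') are accepted as entered.
    """
    errors = []
    for name, value in record.items():
        allowed = vocabulary.get(name)
        if allowed is not None and value not in allowed:
            errors.append((name, value))
    return errors


good = {"Edition": "town", "Bibliographic location": "wrapper", "Price": "free text"}
bad = {"Edition": "city", "Bibliographic location": "cover"}

print(validate(good))
print(validate(bad))
```

A check of this kind catches variant spellings and stray values at the point of entry, which is the purpose a controlled vocabulary serves when several people are entering data in parallel.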

The fields completed at this stage were ‘Uniform periodical title’, ‘Actual periodical title’, ‘Date’, ‘Source’, ‘Volume number’, ‘Issue number’, ‘Edition’, ‘Number of editions’, ‘Image description’, ‘Price’, ‘Bibliographic location’, ‘Size.’ As you can see from the schema above and from Viewpoint, this is the bulk of the metadata in ncse, and all the metadata entered by hand. These fields attach bibliographic metadata to every item within the edition and label the items' relationships to one another, which allows complex searches across the edition and the production of bibliographic citations in search results.
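
As an illustration of how these fields could yield a citation for a search result, here is a minimal sketch. The citation format, the `format_citation` function and the sample values are all assumptions made for the example; they do not reproduce the format ncse actually displays.

```python
def format_citation(meta):
    """Join whichever bibliographic fields are present into a one-line citation.

    The layout is hypothetical; only the field names come from the schema.
    """
    title = meta.get("Actual periodical title") or meta.get("Uniform periodical title")
    parts = [title]
    if meta.get("Volume number"):
        parts.append(f"vol. {meta['Volume number']}")
    if meta.get("Issue number"):
        parts.append(f"no. {meta['Issue number']}")
    if meta.get("Edition"):
        parts.append(f"{meta['Edition']} edition")
    if meta.get("Date"):
        parts.append(meta["Date"])
    # Skip the title slot if neither title field was supplied.
    return ", ".join(part for part in parts if part)


# Sample values are invented for illustration.
record = {
    "Uniform periodical title": "Example Title",
    "Volume number": "3",
    "Issue number": "12",
    "Edition": "town",
    "Date": "1850-01-05",
}

print(format_citation(record))
```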


Creating advanced metadata through named entity extraction, text mining and semantic tagging

The fields that remained were those that started out as advanced metadata and concept mapping. For an account of concept mapping click here. Of the advanced metadata categories (‘people’, ‘places’, ‘events’, ‘objects’, ‘publications’, and ‘institutions’) we had selected ‘people’ (as ‘persons’), ‘places’, ‘institutions’, ‘genre’, ‘events’, and ‘subject’ to pursue. We used GATE to extract a list of proper nouns, which we sorted using a combination of sources including the indices from John North’s Waterloo Directory of English Periodicals and Newspapers, 1800-1900. On the basis of this work we were confident of producing indexes of persons, places and institutions. We attempted to use text mining techniques to see if we could find a way of marking up ‘events’ and ‘genre’, but were unable to obtain results that could be applied across the edition. For ‘subjects’ we used the USAS (UCREL Semantic Analysis System) tagger from UCREL (University Centre for Computer Corpus Research on Language) to provide semantic tags for individual articles, which we then refined to present to users. For a fuller account of this research click here.
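
The step of sorting extracted proper nouns against reference sources can be sketched in outline. This does not use GATE's API and is not the project's actual procedure: the `sort_entities` function, the reference lists and the sample names are hypothetical, and the sketch assumes simple case-insensitive lookup of each name in pre-compiled lists.

```python
def sort_entities(proper_nouns, reference_lists):
    """Assign each extracted name to the first category whose reference list contains it.

    reference_lists maps a category name to a set of case-folded known names.
    Names found in no list are kept under 'unresolved' for manual review.
    """
    sorted_names = {category: [] for category in reference_lists}
    sorted_names["unresolved"] = []
    for name in proper_nouns:
        key = name.casefold()
        for category, known in reference_lists.items():
            if key in known:
                sorted_names[category].append(name)
                break
        else:
            sorted_names["unresolved"].append(name)
    return sorted_names


# Invented reference lists standing in for indices such as those in a directory.
reference_lists = {
    "persons": {"charles dickens"},
    "places": {"manchester"},
    "institutions": {"house of commons"},
}
extracted = ["Charles Dickens", "Manchester", "House of Commons", "Wednesday"]

result = sort_entities(extracted, reference_lists)
print(result)
```

Names that match no list are retained rather than discarded, on the assumption that a list of proper nouns produced automatically will contain items needing a human decision.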