ncse attempts to represent its serials in their own terms: in other words, to edit them in digital form in ways that maintain their links with nineteenth-century print culture. In practice, this was not straightforward, as conflicts arose between our desire not to misrepresent the complexities of the source material and the technical demands of another medium: the digital. Below are a number of short accounts of editorial decisions that attempt to reconcile the forms of the source material with the necessity of imposing order upon them for digitization.

 

Editorial Rationale

Serials are not well-ordered objects from which to begin a digitization project. The historical production and preservation of serials means that it is difficult to identify one distinct manifestation of their form. Serials consist of articles, often themselves consisting of different types of print and images, gathered into departments, published in issues, and often bound in volumes. There is a tendency to consider volumes as discrete uniform objects that gather together diverse content. However, as many critics have noted, the process of binding individual issues, plus any front or end matter that might also be issued, involved a considerable transformation. Part of this was the common practice of stripping out matter that pertains to the selling of individual issues, most notoriously advertisements, but also multiple editions, supplements, and wrappers. Equally, just as a volume is something different than a combination of the year’s issues, so issues are more than a series of articles. Items in the periodical and newspaper press are remarkably diverse, including advertisements, filler, title pages and imprints, headings and signatures, news, leaders, correspondence, literature, reviews, essays, science, and tables. Visual material might comprise the cover design, mastheads, illustrations, graphic devices such as ornaments, rules, or space, typography, and layout. Not only are they diverse in terms of their genre, but their significance is also affected by where in the issue they appear and how they are presented on the page. Editors were keenly aware of how different constituents of an issue related to each other, and considered both how an item should look and where it should appear.

Serial texts become problematic when one attempts to identify them as a single object, rather than as texts that differ remarkably between titles and that change over time. It is the seriality of these texts that is often overlooked, even though it is this that determines both the form of individual issues and the various ways in which individual issues were archived by institutions such as libraries, as well as by readers themselves. Our methodology responded to this by considering serials as processes: as objects with histories that meant that they did not always look like the forms in which they have survived into the present. When deciding on how to edit them – i.e. when considering which aspects of their text were important – we tried to ensure that we maintained those parts of the journal that gesture towards its prior forms. When you browse ncse, you will see that we have found a place for multiple editions, advertisements, and supplements; departments, issues and volumes. Although we often had to intervene in order to impose a degree of order onto the material for its translation into digital form, we edited it with a view to permitting the periodicals and newspapers to tell their own histories as far as possible.

 

Choice of the Cluster

For details about why these six titles were chosen, click here

 

Multiple editions

When we first began to explore the state of the British Library holdings of our six titles, we were pleased to find that the librarians had preserved more than one edition of each issue of two of our weeklies, the Northern Star and the Leader. The presence of these multiples created editorial challenges: to include them would expand the edition considerably; to exclude them would involve a complex process of pruning that would necessarily have to nominate one edition instead of another to represent that week’s issue. In accordance with the responsibility we felt to represent the serials as serials – i.e. as part of a broader print culture – we felt that excluding the multiples would misrepresent the production of the titles and so we decided to keep them in. For more about our decisions regarding multiples, click here.

 

Page numbering

The pages that you browse in ncse are obtained from pdf documents that correspond to individual issues. To make these pdf documents, we had to ‘bind’ together individual page images that were filmed from either microfilm or the hard copy itself. Each stage in this process interposed its own way of recording the order in which pages should appear: the page images (tiffs) had their own numbers that corresponded to the number of frames created from either a reel or bound volume; and the pdf documents labelled each page in an issue, then restarted the sequence for the next. This meant that in order for the page numbers in the edition to correspond to those printed on the pages, they would have to be entered as data rather than being generated as part of the production process. This represented a huge task, as every single one of our c. 100,000 pages would need a page number entered for it. Although it was common practice for periodicals to number their pages continuously throughout a volume (so that, for instance, vol. 1 no. 1 of the Leader is pp. 1-24 and vol. 1 no. 2 is pp. 25-48), this was not the case for newspapers such as the Northern Star, which tended to number each issue separately (i.e. vol. 1 no. 1 is pp. 1-8 and vol. 1 no. 2 is also pp. 1-8). Because both schemes are regular, it was possible – more or less – to enter page numbers automatically as a sequence, although we had to be careful. However, this was complicated by the nineteenth-century practice of paginating front and end matter, supplements, and advertising wrappers separately; and of course the presence of multiple editions meant that there would be two or more of every page for certain titles. Our policy here was to follow the page numbers printed upon the pages themselves. This way we followed the way in which the journals represented their contents to their readers.

This policy presented two interesting problems. The first was that many of the pages simply were not paginated. The digitisation medium required us to enter a value in order to show that this was the case (and not simply a page that we had missed out) and so we decided to use ‘unpag’ to indicate a page that had no page number. In this way we ensured that navigation within the edition still worked (the system knows which ‘unpag’ page is which and where it goes) while demonstrating to our users the way in which the pages were originally paginated. The second problem was what to do when the journals printed a page number that appeared to be incorrect. This was a difficult question as our overall policy meant that we tended to privilege the actual state of the material, but leaving incorrect numbers might be confusing for our users. Practical considerations finally persuaded us to keep the incorrect page numbers in the edition. We were concerned about users who would come to the edition to look up a reference from another source that referred to a page number that we had altered. We also thought it might be confusing to go to a page that was labelled differently on the page to how it was labelled within the edition. The most important consideration, however, was that if we were to correct what appeared to be an incorrect pagination sequence, we would have to trust that the journal itself eventually reverted to what we thought was the correct sequence. As this was not always the case, with incorrect page sequences sometimes continuing until the end of a volume, we felt we had no choice but to follow the page numbers printed on the journal.
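The pagination policy described above can be illustrated with a minimal sketch. This is not the project's actual tooling: the function names `propose_sequence` and `apply_printed` are hypothetical, and the only value taken from the edition is the ‘unpag’ label. The idea is to propose a consecutive sequence first and then let whatever is printed on the page override it, including numbers that look wrong.

```python
from typing import Dict, List, Optional

UNPAGINATED = "unpag"  # label the edition uses for a page with no printed number


def propose_sequence(start: int, page_count: int) -> List[str]:
    """First pass: consecutive page labels for one issue, beginning at `start`."""
    return [str(start + offset) for offset in range(page_count)]


def apply_printed(proposed: List[str],
                  printed: Dict[int, Optional[str]]) -> List[str]:
    """Second pass: let the printed page win over the proposed sequence.

    `printed` maps a zero-based position in the issue to what the page
    actually carries: a string (kept verbatim, even if it looks wrong)
    or None when the page has no printed number at all.
    """
    labels = list(proposed)  # copy, so the proposal itself is left untouched
    for position, value in printed.items():
        labels[position] = UNPAGINATED if value is None else value
    return labels


# Vol. 1 no. 2 of the Leader runs pp. 25-48, i.e. 24 pages starting at 25.
leader_no_2 = propose_sequence(25, 24)

# Hypothetical corrections: the first page is unnumbered and the fourth
# carries a misprinted number, which is preserved rather than corrected.
checked = apply_printed(leader_no_2, {0: None, 3: "82"})
```

Keeping the two passes separate mirrors the reasoning in the text: the sequence is only a labour-saving guess, and the printed page remains the authority.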

 

Structure of the Edition

In the early stages of the project we attempted to map all the types of data that we felt were important and how they related to each other. Needless to say, this resulted in an incredibly complex map, which you can download here. While we undertook this exercise, we were also working with Olive Software to investigate the degree of accuracy with which a computer program could identify units of content and their relationship according to a set of predefined rules. Although the system had some success distinguishing between items, we found that the typographic information in the original that dictated what an item was (and so how it related to other items) was not distinct enough and/or not consistent enough to be reliably detected by automatic processing. Considering both of these exercises, we decided to adopt a simple structure for ncse:

Edition > title > volume > issue > department > item

Not only was this an easy structure to model, but it was flexible enough to contain all the different types of content within ncse. Rather than ask the system to distinguish between types of content, we simply wrote rules that asked it to distinguish between items on a page. We included some rules to identify how to recognize an item that signalled a change in department, but these were largely for the guidance of human editors who checked the output. These sets of rules were called segmentation policies. For a definition of our terminology, click here.
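To make the hierarchy concrete, here is a minimal sketch of how the six levels might be modelled as nested containers. It is an illustration only, not the data model used by Olive Software or the project; every class and field name is an assumption introduced for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Item:
    label: str


@dataclass
class Department:
    name: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Issue:
    name: str
    departments: List[Department] = field(default_factory=list)


@dataclass
class Volume:
    name: str
    issues: List[Issue] = field(default_factory=list)


@dataclass
class Title:
    name: str
    volumes: List[Volume] = field(default_factory=list)


@dataclass
class Edition:
    name: str
    titles: List[Title] = field(default_factory=list)


def walk(edition: Edition) -> Iterator[Tuple[str, ...]]:
    """Yield one path (edition, title, volume, issue, department, item) per item."""
    for title in edition.titles:
        for volume in title.volumes:
            for issue in volume.issues:
                for department in issue.departments:
                    for item in department.items:
                        yield (edition.name, title.name, volume.name,
                               issue.name, department.name, item.label)


# A tiny, invented example: one item in one department of one issue.
sample = Edition("ncse", [
    Title("Leader", [
        Volume("vol. 1", [
            Issue("no. 1", [
                Department("Example department", [Item("Example item")]),
            ]),
        ]),
    ]),
])
paths = list(walk(sample))
```

Nothing in the model distinguishes between types of content: an item is simply whatever sits at the bottom of the tree, which is what allows one structure to hold articles, advertisements, and mastheads alike.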

 

Departments / Items

The organization of articles into sections that we call departments was common across nineteenth-century newspapers and magazines. As an important structural demarcation, we felt we had to encode this into the edition and, according to our editorial principles, it had to be done according to each title’s own categories. Olive Software’s applications supported a ‘Table of Components’ that would list each article in a journal and could also group them into sections. However, in order to do this, we would have to segment a page in a way that would identify whole articles. This was certainly possible but a page of a periodical or newspaper is not constituted solely by the articles upon it: rather, it contains a range of items including headlines, mastheads, lists of contents, advertisements, rules etc. Isolating articles from this miscellaneous collection of content would privilege one type of material above another. Instead, we nominated items as our basic unit within the edition, and defined articles as one type of item amongst others. A department then would be a collection of items, most of which would be articles, but might also be headlines, advertisements, or other types of content.

We investigated the potential for software to identify different items on a page with a view to automating as much of the classification and subsequent structuring into a hierarchy as possible. However, although some of these experiments were encouraging, they demanded a great deal of time from all involved. Not only was programming made difficult by the way the titles changed over time, but we had to carefully scope each issue before running the programmes and then review the results. Once we made the decision to keep multiple editions within the edition, we were committed to processing c. 100,000 pages (although to see a report on our proposed alternative – the core – click here). We had the greatest success setting parameters that would distinguish between every item on the page. This included all the articles, whatever their content, as well as things like advertisements, mastheads, filler, publisher’s imprints, etc. We recognized that some users would not find a ‘Table of Components’ that contained all this diverse material very useful and, as some issues contained over two hundred items, such a table would be too long to navigate easily. In addition, early pilots of the Olive application populated the Table of Components from the OCR transcript that was produced during processing. Because of the quality of the surviving newsprint, we knew our OCR transcript would be too inaccurate to display in an uncorrected form and, as we were dealing with c. 100,000 pages, we would not have time to undertake its correction. Instead, we decided to restrict the Table of Components to department level, and populate it with images of department headers rather than text strings from the OCR transcript. This also suited one of our primary aims in the edition, to foreground the structure of serials and not only what is normally defined as their ‘contents’.

The distinction between items and departments in ncse is encoded into the way each issue is structured. When we corrected the segmentation of each issue, we identified which items – usually in this case a headline – signalled the beginning of a department. These then appeared in the Table of Components, and all the items that followed it were associated with that department. This structure was vital in order to allow us to populate the edition with metadata.
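The grouping rule described above (an item that opens a department, followed by all the items that belong to it) can be sketched as follows. This is a hypothetical illustration rather than Olive's implementation; the function name and the representation of an item as a `(label, starts_department)` pair are assumptions.

```python
from typing import List, Optional, Tuple

SegmentedPage = List[Tuple[str, bool]]          # (item label, opens a department?)
Grouped = List[Tuple[Optional[str], List[str]]]  # (department header, its items)


def group_into_departments(items: SegmentedPage) -> Grouped:
    """Associate each item with the most recent department-opening item.

    Items are taken in reading order. An item flagged as opening a
    department becomes the header of a new group and is also its first
    member. Items that appear before any header are collected under a
    group whose header is None.
    """
    departments: Grouped = []
    for label, starts_department in items:
        if starts_department:
            departments.append((label, [label]))
        elif not departments:
            departments.append((None, [label]))
        else:
            departments[-1][1].append(label)
    return departments


# Invented labels for illustration only.
issue_items = [
    ("Masthead", True),
    ("Contents list", False),
    ("Department headline", True),
    ("Article one", False),
    ("Article two", False),
]
grouped = group_into_departments(issue_items)
table_of_components = [header for header, _ in grouped]
```

Only the headers reach the Table of Components in this sketch, which matches the decision to restrict that table to department level.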

 

Segmentation Policies

Segmentation policies were documents designed to guide a non-expert to decide on how to distinguish between items and then to decide on which items constituted a division between one department and the next. For each publication we wrote an overall description that laid out rules, and then prepared a selection of marked-up pdf documents to show how rules should be applied. Examples of our segmentation policies can be downloaded here:

Monthly Repository:

Northern Star:

Leader:

English Woman's Journal:

Tomahawk:

Publishers' Circular:

 

Volumes

Although we were committed to recording the histories of our titles up to the point at which we digitized them, we were also conscious that we had to make editorial interventions in order to present the material in a way that would make sense to users. Our proposed structure of the edition set out how each element in the hierarchy related to all the others, but the actual condition of the hard copy did not always fall neatly into the categories that we set out. This was particularly the case with the internal arrangement of content within volumes.

Even though nearly all the hard copy from which we sourced ncse was in volume form, these bound volumes were not consistent in terms of their contents or appearance. Often the volumes had been bound in different ways according to individual binders – both within the same institution and when compared between institutions – and binding practices altered with the content over time. Although we could leave the content in the condition in which we found it, thus preserving the history of the particular bound volume’s production and preservation, our decision to merge runs from different sources meant we were already committed to radically altering the ‘found’ state of large parts of the edition. For instance, our run of Tomahawk consists of two separate runs: one stored by the British Library at St Pancras and one by the British Library at Colindale. While the run at St Pancras is more complete in terms of issues than the one at Colindale, it lacks all of the advertising wrappers; the run at Colindale is not as complete as that at St Pancras, but it has the advertising wrappers and a few issues missing from the St Pancras run. By merging the two runs we were committed to reconstructing the form of the volume, and so had to come up with a model of what a volume might be. We adopted the following basic model:

  • Front matter
  • Issues
  • End matter

Front matter contains things like prefaces, volume title pages, volume tables of contents, or any illustrative material that was intended to be bound at the beginning of a volume. Issues contain the run of issues defined by the number that each issue carries on its front page. End matter contains the indices that it was common practice for publishers to provide for the end of a volume.
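As a rough sketch of the basic model, the components of a volume can be put into a consistent order regardless of how a particular copy happened to be bound. The code below is illustrative only: the function name, the tuple layout, and the sample labels are assumptions, and it covers only the three categories named in the model.

```python
from typing import List, Tuple

# (kind, issue number, label); the number is ignored unless kind is "issue".
Component = Tuple[str, int, str]

SECTION_ORDER = {"front matter": 0, "issue": 1, "end matter": 2}


def assemble_volume(components: List[Component]) -> List[str]:
    """Order components as front matter, then issues by number, then end matter.

    The sort is stable, so several pieces of front matter (or end matter)
    keep the relative order in which they were supplied.
    """
    def sort_key(component: Component) -> Tuple[int, int]:
        kind, number, _ = component
        return (SECTION_ORDER[kind], number if kind == "issue" else 0)

    return [label for _, _, label in sorted(components, key=sort_key)]


# Components listed in an arbitrary 'as found' order.
as_found = [
    ("issue", 2, "No. 2"),
    ("end matter", 0, "Index"),
    ("issue", 1, "No. 1"),
    ("front matter", 0, "Volume title page"),
    ("front matter", 0, "Preface"),
]
ordered = assemble_volume(as_found)
```

A title that issued no indices or volume title pages simply supplies no front or end matter, and the same function returns the issues alone.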

As you may notice, the way we have implemented this model varies from title to title. It was common for editors to issue, sometimes at extra cost, the textual apparatus (prefaces, indices, volume title pages) for collecting single issues into bound volumes. As these components often (but not always) survive within the bound volumes, we could be guided in designing our model of a volume by the way in which they were conceived by the titles themselves. Where possible, this is what we’ve done: for instance, we’ve placed the supplements to the Monthly Repository before the end matter as they are indexed within the index that it contains. The Northern Star, although labelling its issues with issue numbers and volume numbers, did not issue indices or volume title pages. As a result, neither front matter nor end matter appears in the folder tree for volumes of the Northern Star.

The decision to reorder content into consistent volumes particularly concerned two titles: Publishers’ Circular and Tomahawk. The editors of the Publishers’ Circular tended to issue a volume title page in the advertising section of one of the issues published in January for the volume that ended the previous December. In the run of Publishers’ Circular at the British Library, some of these volume title pages had been removed from the advertising sections and bound at the beginning of the previous volume, but some had not. In addition, the positions to which title pages were moved were not always consistent: the volume title page for 1882, for instance, was bound as the third page of the English Catalogue of Books, rather than placed at the front of the volume as a whole. By moving the volume title page so that it always appears as front matter at the start of a volume we might elide the histories of the specific bound volumes with which we worked, but we have provided a consistent resource that is based upon the printed intentions of its producers. The same is true of Tomahawk. Tomahawk published two volumes each year, one in January and one in July. However, in December they also published an almanack that looked forward to the coming year. Although these were separate publications (they were a penny more expensive than regular issues), the run in the British Library has them bound in with the rest of Tomahawk. They have not been bound in consistently, however: sometimes they appear at the beginning of the January volume and sometimes at the beginning of the July volume. Again, our decision to move the almanacks so that they appear at the beginning of the January volume of the year to which they apply masks the history of the particular hard copy with which we are dealing, but does make this content appear in a consistent position, easily located within the context of the rest of the run.

 

Division of content into issues

The decision to work with an abstract model of the constituents of a volume meant that it was necessary to separate front and end matter from the issues themselves. Olive Software work with pdf documents that correspond to individual documents within their system. Each issue is a single pdf document, and it is the pdfs that are segmented in order to delineate and structure their contents. Just as every issue is therefore a single pdf document, so too are pieces of front matter, indices, or supplements. Our policy with this material was to use a separate pdf document for every component that was issued separately. As such, we have made each individual supplement a separate issue, but have usually kept all front matter (i.e. title pages and prefaces) as a single issue.

 

Wrappers

Although wrappers are notoriously hard to find in bound volumes of periodicals, we have quite a few wrappers for two of our six titles: Publishers’ Circular and Tomahawk. The Leader also reorganises its advertising section as a wrapper in 1858, allowing it to be stripped off prior to binding should the reader choose to do so. For further information about wrappers, click here.

We were committed to including wrappers in ncse but had to devise a way in which content in wrappers could be distinguished from that within issues. As all content in ncse is taken from pdf documents that correspond to whole issues, there was no distinction between wrappers and issues at this level. We structured content within issues with segmentation that distinguished between items and then grouped items into departments. By inserting a metadata field called ‘Bibliographic location’ with two values, ‘number contents’ and ‘wrapper’, it was possible to mark up items with their location on the wrapper or in the issue. This still left us with a potential problem as this was a metadata field that needed to be applied to all the items in the edition. However, using a combination of segmentation and metadata, we devised a way to avoid having to enter this information manually for all the content. First of all we marked all content as ‘number contents’ and then we segmented wrappers so that we could easily change this value later. The way we did this was to segment wrappers so that the first item on a wrapper marked a new department. By ensuring that the first item in an issue marked the next department, it was possible to label the first item on a wrapper as ‘wrapper’ and have that information inherited by all the other items on the wrapper up to the department that marked the beginning of the number. For more information on metadata, click here.
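The inheritance mechanism described above can be sketched in a few lines. This is a hypothetical model, not the Olive metadata system: the class `SegmentedItem`, its fields, and the function `resolve_locations` are invented for the example, while the field values ‘number contents’ and ‘wrapper’ are those named in the text.

```python
from dataclasses import dataclass
from typing import List, Optional

NUMBER_CONTENTS = "number contents"  # default applied to all content
WRAPPER = "wrapper"


@dataclass
class SegmentedItem:
    label: str
    starts_department: bool = False
    # Entered by hand only on the item that opens a wrapper department.
    location: Optional[str] = None


def resolve_locations(items: List[SegmentedItem]) -> List[str]:
    """Return the 'Bibliographic location' of every item, in reading order.

    Each department-opening item resets the current value: to its own
    hand-entered location if it has one, otherwise to the default.
    Every following item inherits that value until the next department.
    """
    current = NUMBER_CONTENTS
    resolved = []
    for item in items:
        if item.starts_department:
            current = item.location or NUMBER_CONTENTS
        resolved.append(current)
    return resolved


# Invented issue: a front wrapper of advertisements, then the number itself.
issue = [
    SegmentedItem("Wrapper advertisement 1", starts_department=True, location=WRAPPER),
    SegmentedItem("Wrapper advertisement 2"),
    SegmentedItem("Wrapper advertisement 3"),
    SegmentedItem("Masthead", starts_department=True),
    SegmentedItem("Leading article"),
]
locations = resolve_locations(issue)
```

In this example only one value is entered by hand, on the first wrapper item; the remaining four are derived from the segmentation.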

 

Advertisements

Advertisements are a crucial part of serial texts, providing valuable income streams that might complement money raised by direct sale and subscription. Not only this, but they have considerable historical value as often they are the one means of gauging the readership of a particular title, or establishing how much things actually cost. Advertisements were also appealing to nineteenth-century readers: they often provided the most visually striking pages in an issue and it might also be argued that they provided an important news function. As such, we ensured that advertisements were distinguished from other items. Due to the size of the edition, we could not mark up all advertisements by hand. Instead, we used Olive’s segmentation process to make a first attempt at identifying advertisements, and then corrected this by hand during segmentation.

Although this strategy succeeded in demarcating all the advertisements, making them searchable through the Olive interface, it restricted their functionality. As advertisements often contain images we were keen that they interacted with our image metadata schema. However, Olive’s way of marking up content meant that items had to be either ‘Image’, ‘Article’ or ‘Advertisement’. Our solution was to retain the Olive distinctions between content at item level, but use our own metadata to permit one type of content to appear in a search for another. This means that although an advertisement is not recognized as an image in the Olive system, it will still be returned through an image metadata search in the 'keywords' section of the resource.
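A minimal sketch of this arrangement, assuming a simple record per item: the three type names are those given above, but the `Record` class, its fields, and the search function are hypothetical and do not describe the Olive system or the ncse metadata schema.

```python
from dataclasses import dataclass
from typing import FrozenSet, List

OLIVE_TYPES = ("Image", "Article", "Advertisement")  # mutually exclusive


@dataclass(frozen=True)
class Record:
    label: str
    olive_type: str                              # exactly one of OLIVE_TYPES
    image_keywords: FrozenSet[str] = frozenset()  # separate metadata layer


def image_keyword_search(records: List[Record], keyword: str) -> List[str]:
    """Return labels of all records whose image metadata carries `keyword`.

    The item-level type is deliberately not consulted, so an advertisement
    with image metadata is returned alongside items typed as images.
    """
    return [r.label for r in records if keyword in r.image_keywords]


# Invented records and keywords for illustration.
records = [
    Record("Engraved portrait", "Image", frozenset({"portrait"})),
    Record("Illustrated advertisement", "Advertisement", frozenset({"portrait", "bottle"})),
    Record("Leading article", "Article"),
]
hits = image_keyword_search(records, "portrait")
```

The item keeps its single type for display and browsing, while the second layer of metadata decides which searches it answers.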

 

Images

Images, like advertisements, are demarcated as one of three types of content (articles are the third) by Olive during segmentation. However, images might occur within articles and, if this is the case, this relationship should be preserved. Olive’s segmentation tool permits images to be embedded within articles: what this means for ncse is that it is possible for an image to be embedded into an item, allowing users to either view the item as a whole (with the image within it) in the component viewer, or to simply view the image.

For information on how we labelled images with image metadata, click here to view the schema, and here for how we applied it.

 

Front and end matter

See Division of content into issues above.