The keyword index allows users to search for keywords that have been assigned to items within ncse using a combination of controlled and free-text entry fields. The index was compiled by first algorithmically and manually applying keywords to articles in the ncse corpus. The outputs of these processes were then aggregated, and indexed using the high-performance text search engine Lucene. Persons, places, institutions, and subjects were generated by processing the uncorrected OCR output housed in the facsimile repository using a variety of data mining and natural language processing techniques. Image descriptors were manually applied to articles within the corpus using a controlled vocabulary. The index represents approximately 317,402 items within the facsimile repository, including titles, articles, advertisements, front matter, and images. To learn more about the processes undertaken in generating the keyword index, read the Technical Introduction.
This section describes how to enter search terms and refine search queries, and how to interpret and refine search results. It does not describe how to search the full text appearing within facsimiles. For this information, please read: Searching Facsimiles, below.
Free-text Entry Search Fields
To search for Persons, Places or Institutions, select the keyword type from the drop-down menu of the corresponding field, and then key in text. Multi-word terms may be entered in full (e.g. John Smith), or in a partial form (e.g. Smith).
Multiple Terms in One Search Box
Multiple terms typed into one search box are automatically interpreted as a Boolean AND, and are treated as separate queries. If conducting a Boolean search using different operators (e.g. OR, NOT), enter terms into separate search boxes. Boolean terms entered directly into the search fields will result in zero results being returned.
Search terms can be entered in upper or lower case.
Approximate Match Searching
By default, queries entered using the free-text fields are matched exactly against the terms in the keyword index. However, the user can select a percentage to control how similar an indexed term must be to the entered text in order to match. Entering a higher percentage (e.g. 90%) will return only matches that are very similar to the entered text, while entering a lower percentage (e.g. 50%) will also return results that are much less similar to the entered text.
Tip: Adjust this threshold to account for differences in the terms that result from the OCR extraction process on which this index is based.
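The effect of the percentage threshold can be illustrated with a minimal sketch. This guide does not say how ncse measures similarity, so the sketch assumes a stand-in measure, Python's difflib.SequenceMatcher; the function names and the list of indexed terms are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0, ignoring case (a stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def approximate_matches(query: str, indexed_terms: list[str], percent: int) -> list[str]:
    """Keep the indexed terms that are at least `percent` similar to the query."""
    threshold = percent / 100
    return [term for term in indexed_terms if similarity(query, term) >= threshold]


# Hypothetical indexed terms, including plausible OCR variants of 'darwin'.
indexed_terms = ["darwin", "darwjn", "darwen", "darling", "dickens"]

strict = approximate_matches("darwin", indexed_terms, 90)
loose = approximate_matches("darwin", indexed_terms, 50)
print(strict)
print(loose)
```

With this measure, the 90% threshold keeps only the exact term, while the 50% threshold also admits OCR variants such as darwjn along with unrelated but similar words.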
As an alternative to approximate matching, use an asterisk (*) or question mark (?) to broaden search results. You can use wildcards in the middle of a term; however, a wildcard symbol cannot be the first character of a search term.
Use an asterisk (*) to search for different words beginning with the same stem.
- e.g. place: waterl* will match waterloo and waterlow (a possible result based on OCR output).
The question mark (?) performs a single character wildcard search, and will return variants of the term entered with the single character replaced.
- e.g. person: darw?n will match darwin and darwjn (a possible result based on OCR output).
Tip: Experiment with wildcards to account for differences in the terms that result from the OCR extraction process on which this index is based.
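The two wildcard symbols can be sketched as follows. This is an illustration only, using Python's fnmatch module, whose * and ? happen to behave as described above; it is not the matching code used by ncse, and the function name and term lists are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_matches(pattern: str, indexed_terms: list[str]) -> list[str]:
    """Return the indexed terms matching a pattern containing * or ?."""
    # Mirror the rule stated above: no wildcard as the first character.
    if pattern[:1] in ("*", "?"):
        raise ValueError("a wildcard cannot be the first character of a search term")
    pattern = pattern.lower()
    return [term for term in indexed_terms if fnmatchcase(term.lower(), pattern)]


# '*' stands for any run of characters (including none).
places = wildcard_matches("waterl*", ["waterloo", "waterlow", "water", "walmer"])

# '?' stands for exactly one character.
persons = wildcard_matches("darw?n", ["darwin", "darwjn", "darwood", "darwinian"])

print(places)
print(persons)
```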
Thesaurus Search Fields
Subject and Image descriptors are searched by selecting terms from a controlled vocabulary, or "thesaurus". To begin, choose a vocabulary and then click the vocabulary icon next to the input field. This will launch a pop-up window that displays the entire hierarchical controlled vocabulary. Navigate through the hierarchy by expanding and contracting classes. To select a term, click the checkbox appearing next to it, and then press the 'Insert' button that appears at the bottom of the window. Your term will appear in the search form. You may select more than one term from different points in the hierarchy; these will be joined during the search with a Boolean AND. To close the vocabulary, click the close button at the bottom of the window.
Tip: Searching within the image vocabulary will only return items from the corpus that contain visual material.
Logical Operators: AND, OR, NOT
Separate search terms into individual fields to conduct searches using additional logical or Boolean operators (AND, OR, or NOT).
Tip: Multiple terms entered into a single field are automatically joined with a Boolean AND.
To generate fewer search results, add further terms to the query. For instance, the query bacon AND spedding returns more results than bacon AND spedding AND ellis.
- e.g. person: bacon spedding (entered in one person field)
- e.g. person: bacon AND person: spedding AND person: ellis
To broaden search results, use the OR operator. For instance, to find all articles with occurrences of oxford university or cambridge university, construct the following query.
- e.g. institution: oxford university OR institution: cambridge university
A NOT search can be used to filter items from results. For example, to distinguish between the person Darwin and the place Darwin, construct the following query.
- e.g. person: darwin NOT place: darwin
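The way the three operators narrow or broaden a result set can be pictured as set operations. The sketch below is an illustration under stated assumptions: the small index, its item numbers and the items() helper are all hypothetical, and the real index is queried through the search form, not through code like this.

```python
# Hypothetical index: (keyword type, term) -> numbers of the items tagged with it.
index = {
    ("person", "bacon"): {1, 2, 3, 4},
    ("person", "spedding"): {2, 3, 4},
    ("person", "ellis"): {3, 4, 9},
    ("institution", "oxford university"): {5, 6},
    ("institution", "cambridge university"): {6, 7},
    ("person", "darwin"): {10, 11, 12},
    ("place", "darwin"): {12, 13},
}


def items(keyword_type: str, term: str) -> set[int]:
    """Items tagged with the given keyword; terms are case-insensitive."""
    return index.get((keyword_type, term.lower()), set())


# AND keeps only items present in every set, so each added term narrows the result.
two_terms = items("person", "bacon") & items("person", "spedding")
three_terms = two_terms & items("person", "ellis")

# OR keeps items present in either set, so it broadens the result.
either = items("institution", "oxford university") | items("institution", "cambridge university")

# NOT removes items that also carry the excluded keyword.
person_only = items("person", "darwin") - items("place", "darwin")

print(two_terms, three_terms, either, person_only)
```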
Search queries can be constrained to examine only a subset of publication years; for example, to find articles referencing The Great Exhibition of 1851 that were published around the time of the event, try the following query.
- e.g. institution: glass house AND year from: 1850, year to: 1852
Queries can be constrained to examine only a subset of the six core titles in the corpus by selecting the filter criteria (AND, OR, or NOT) and selecting one or more titles from the select box. Front matter materials, supplements and end matter material will be searched according to the core title to which they belong. When more than one title is selected, these are treated within the query as a Boolean OR.
Tip: By default, searches operate across all publication titles.
- e.g. AND Leader (1850-1860) - only returns items appearing in the Leader.
- e.g. NOT Leader (1850-1860) - returns items appearing in titles other than the Leader.
- e.g. AND ( Leader (1850-1860) OR Northern Star (1838-1852)) - only returns items appearing in the Leader or the Northern Star.
Results may be sorted according to publication date, by publication title, or by relevance. The default sorting criterion is by relevance, a scoring mechanism used within the underlying search engine, Lucene. Within Lucene, relevance is determined by how many times a query term appears in a document relative to the number of times the term appears in all documents in the index. When sorting by date, results are ordered by year, month, and then day. Sorting by publication results in items that are ordered according to the actual publication title for the issue in which they occur, in ascending order.
Tip: With large result sets, try sorting by publication to page through results alphabetically.
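The idea behind relevance sorting can be sketched with a simplified term-frequency / inverse-document-frequency score. This is an assumption-laden illustration: Lucene's actual scoring includes further factors, the formula below is a textbook simplification, and the documents and function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(query_term: str, documents: dict[str, str]) -> dict[str, float]:
    """Score each document containing the term; higher means more relevant."""
    term = query_term.lower()
    tokenised = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    # Number of documents in which the term occurs at least once.
    doc_freq = sum(1 for tokens in tokenised.values() if term in tokens)
    if doc_freq == 0:
        return {}
    # Terms that are rare across the index weigh more than common ones.
    idf = math.log(len(tokenised) / doc_freq) + 1.0
    scores = {}
    for doc_id, tokens in tokenised.items():
        count = Counter(tokens)[term]
        if count:
            # Frequency of the term within this document, scaled by its rarity.
            scores[doc_id] = (count / len(tokens)) * idf
    return scores


# Hypothetical items standing in for entries in the index.
documents = {
    "a": "wellington spoke and wellington left",
    "b": "the duke of wellington visited walmer castle yesterday",
    "c": "napoleon in exile",
}

scores = tf_idf_scores("wellington", documents)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Item "a" ranks above item "b" because the term makes up a larger share of its text; item "c" does not contain the term and is not returned at all.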
Search results are returned for each item in the keyword index matching the supplied search criteria. Each result includes a bibliographic citation for the item, links to the facsimile repository, and a panel containing keywords that have been assigned to the item. View the keywords by clicking the '+' sign next to each item.
Within the panel that displays, keywords of each type (persons, places, institutions, subjects and images) are alphabetically arranged, and include the frequency with which they appear in the item. Each keyword is also a link.
Click a keyword appearing with an item to add it to the search query as a Boolean AND. This will reduce the number of results returned, as the refined result set will only include items that met the original query terms AND the keyword term. Multiple filters may be added in this way, and may be removed in any order, by clicking the 'x' symbol next to the term as displayed in the search results summary.
To view items from search results in the facsimile repository, use the links provided with the bibliographic citation for each result. The 'item' link will display the individual item, while the page and issue links display the item within a larger context.
Browsing indexes have been created out of the hierarchical schemes used to assign controlled subject and image descriptors to items in the facsimile repository. Subjects have been automatically assigned using the UCREL Semantic Analysis System (USAS) tagset; image descriptors have been manually assigned using an adapted form of the Database of Mid-Victorian Illustration (DMVI). To learn more about these controlled vocabularies and the process used to assign keywords, please read the Technical Introduction.
Begin browsing by selecting a top-level term from the chosen vocabulary. When a term is selected, the items in the corpus assigned the term will be displayed. Items are ordered by relevance, a scoring mechanism used within the underlying search engine, Lucene. Within Lucene, relevance is determined by how many times a query term appears in a document relative to the number of times the term appears in all documents in the index.
To the right of the intermediate results, the hierarchy will display. Use the tree display to navigate and combine terms from different sections of the hierarchy in order to narrow the search results returned. Terms added to queries may be removed by clicking the 'x' symbol next to the term as displayed in the search results summary.
For example, to find images of children at the sea, browse by image as follows:
- Settings NARROWED TO Natural environments > The sea COMBINED WITH People > Figures > Children
To search for items about war activities at sea, try the following query:
- Government & the Public Domain NARROWED TO Warfare, defence and the army; weapons COMBINED WITH Movement, Location, Travel & Transport > Sailing, swimming, etc.
Tip: The keyword index includes over a million terms from the subject hierarchy. Combine terms to return a manageable result set.
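The browse examples above can be read as two operations: NARROWED TO moves down one branch of the hierarchy, and COMBINED WITH requires a term from another branch as well. The sketch below illustrates that reading with hypothetical data; the item identifiers, the 'Interiors' term and the function names are invented for the example and do not describe how ncse stores its vocabularies.

```python
Path = tuple[str, ...]


def under(path: Path, prefix: Path) -> bool:
    """True if `path` is the term `prefix` itself or lies beneath it."""
    return path[: len(prefix)] == prefix


def browse(items: dict[str, list[Path]], *prefixes: Path) -> list[str]:
    """Items carrying at least one descriptor under every chosen term."""
    return [
        item_id
        for item_id, paths in items.items()
        if all(any(under(path, prefix) for path in paths) for prefix in prefixes)
    ]


# Hypothetical image items with their descriptors as paths through the hierarchy.
images = {
    "img-1": [("Settings", "Natural environments", "The sea"), ("People", "Figures", "Children")],
    "img-2": [("Settings", "Natural environments", "The sea")],
    "img-3": [("People", "Figures", "Children"), ("Settings", "Interiors")],
}

top_level = browse(images, ("Settings",))
narrowed = browse(images, ("Settings", "Natural environments", "The sea"))
combined = browse(
    images,
    ("Settings", "Natural environments", "The sea"),
    ("People", "Figures", "Children"),
)
print(top_level, narrowed, combined)
```

Each step returns a subset of the one before, which is why combining terms is an effective way to bring a large result set down to a manageable size.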
The Facsimile component of ncse is delivered using Olive Software's Viewpoint software. This comes at the end of a sequence of processing carried out by Olive which involves a number of stages:
- digitization of each page from microfilm
- optical character recognition (OCR) to produce a text version of each page
- segmentation of each issue and page into 'articles', which may often be quite small elements, e.g. an obituary notice
- integration of the page image and text materials and the segmentation information into a repository that can be made available to readers via the Viewpoint software
The Viewpoint software provides a number of different types of browse and search functions to aid readers in reading and exploring the digitized materials.
Searching in the Facsimile component
The Olive software provides very sophisticated search facilities, which are described, with examples, below. There are a number of important factors to bear in mind in relation to the results that are shown:
- The searches are carried out on the text produced in the OCR process. The success rate of the OCR process is often quite remarkable, but it is constrained by the quality of the source materials. In some cases the source materials cannot even be read by the human eye, and in a number of other cases there is smudging or other distortion of the print that makes reliable OCR impossible even where it is decipherable to the human eye. It is important to remember in carrying out searches that the results can only be as good as the OCR. For any given search, therefore, the results will show all the instances where the search term has been correctly converted to text, but there may of course be other cases where the term was indeed in the original print but cannot be retrieved because the software has no way of identifying it.
- Results are displayed in the same format, and the result list has the same features, for all the different types of search.
- Up to 1,000 'hits' will be returned. By default the results are shown in the form of snippets, with up to five results per screen, and with the search term highlighted in each display.
- The hits are shown in order of 'relevancy', based on an algorithm in the Olive system.
- The search results are based on 'segments', with one result included for each segment in which the search term is found.
- Full bibliographic information is shown for each item, and where the display is in the form of snippets, it is possible to view the OCR text of the snippet (i.e. the text produced in the OCR process).
- Click on the snippet itself to see the full text of the item, which opens in a new window.
- It is also possible to view the full page in which the item appears, or the cover of the issue in which it appears, by clicking on the relevant thumbnail.
- The full ncse resource includes multiple editions where these exist. This may have a distorting effect on the number of hits, with segments apparently repeated when in fact it is the same item appearing in a number of editions.
- In general it would be true to say that the simple browse and search functions require little prior knowledge or understanding of the ncse publications and their history, but effective use of the more sophisticated search techniques tends to require a greater degree of familiarity and understanding.
The reader may enter any term in the Search box, and may either search in all types of material (the default), or may restrict the search to Articles, Pictures or Advertisements.
Example: If you type 'wellington' in the search box, 1,000 hits are returned, the exact number 1,000 suggesting that there are more than 1,000 occurrences of Wellington in the repository. Keeping the same term, and restricting the search to Articles has little effect on the number of hits, not surprisingly. Restricting the same search to Advertisements, however, reduces the number of hits to 807.
Changing the search term to 'walmer' reduces the number of hits to 100, many from the coverage of the death of the Duke of Wellington at Walmer Castle. Changing the search term again, to 'Walmer Castle' reduces the hits to 44: the software looks for instances of 'Walmer' and 'Castle' in reasonably close proximity. It is also possible to search for this as an exact term by using "Walmer Castle" - i.e. enclosed in double quotes - and this reduces the number of hits further, to 28.
A search for 'darwin' produces 539 hits. 'charles darwin' reduces the hits to 337. With the exact-term form "charles darwin" there are only 47. Searching on 'erasmus darwin' produces 29 hits, but since this is not the exact-term form a number of the hits refer in fact to Erasmus's grandson Charles. The exact-term form "Erasmus Darwin" produces 6 hits.
For more on 'exact term' searches, and other features of the search function, see Extending your use of the Search Function below.
The advanced search screen provides a number of ways to filter the search results in order to help readers find what they are looking for more precisely. Note that in order to refine an advanced search you have already executed, you need to click again on the Advanced Search link in the results screen. The filter mechanisms include:
- restricting the search to a single title and type of material
- providing a 'metadata' profile, in which a combination of metadata elements can be specified: Issue title; Volume/Series; Issue number; Edition; Page; Price; Bibliographic Location; Size. Where feasible, the metadata terms can be selected from a drop-down list, so the reader is guided on what are the valid search terms. To use this feature effectively, some understanding is required of the ncse publications and their structures.
- search within a specified date range. E.g. a search for 'dickens' within the date range 1/1/1806-31/12/1830 produces 4 hits, none to do with Charles Dickens. The same search in the range 1/1/1831-31/12/1850 gives 361 and for the range 1/1/1851-31/12/1870 there are 997 hits; most of the hits in the later periods are of course to do with the author. Note that in using the date range filter, it is important to take into account the date ranges of the publications in ncse.
- the Edition buttons allow you to choose whether to search across all the editions, or to limit it to the Main editions, thereby excluding the 'extra' editions, e.g. of the Northern Star, which are listed in the Select Titles box.
- you may also choose to view your results as 'Snippets' (the default), or in the form of full Articles, or as full pages. Snippets is recommended until your filtering has produced a very small set of results.
- finally, you may choose to sort the results list by relevance (the default), title, date or size.
Extending your use of the Search Function
Click on 'Advanced search tips' in the Advanced Search screen for detailed information on the range of search techniques that are available within the Facsimile system, enabling you to take advantage of the sophisticated search facilities supported by the Olive software. Although these additional features are indeed advanced, in every case they extend or modify the search phrase, and will therefore be typed in the main 'search' box, and do not in fact require the reader to be in the advanced search screen. Where the Advanced Search screen will be needed is when these techniques are combined with other advanced search features, such as filtering by date. The examples below may help to make this clearer.
The additional features described are: Wildcards; Logical Operators; Word Modifiers; and Escaping. The last is rather technical and will be very seldom used. The first three, however, may sometimes or often be useful. Rather than repeating what is described in detail in the Olive 'tips' sections, below are some examples to suggest ways in which the advanced features may be used.
The use of the asterisk to represent any or no characters may be familiar from use of other computer systems. In ncse it may help to compensate for OCR problems, but at the same time it is likely to introduce a certain amount of 'noise' into the results. For example, as noted above, a search for 'darwin' produces 539 hits. A search for 'darw*' - meaning the characters 'darw' followed by any number of further characters or by none - gives 913 hits, including not only some additional 'genuine' Darwin references, but items also for 'Darwood' and 'Darwen' and so on.
Logical operators, also known as 'Boolean Operators', may be useful in expanding or narrowing a set of search results while increasing the degree of accuracy. The < OR > operator increases the number of hits by including all items which contain any one of the search terms, while the < AND > operator decreases the number by requiring that both/all terms are present simultaneously within each item.
For example, a search for 'austen < OR > eliot' produces over 1,000 hits, including references to Sir John Eliot and Admiral Austen, while the exact-terms search for "jane austen" < OR > "george eliot" gives 209 hits, but loses such references as 'something ... here reminds us of the incomparable Miss Austen...'.
The search 'austen < AND > eliot' gives 54 hits, all from catalogues in the Publishers' Circular. A search for 'wellington < OR > napoleon' produces over 1,000 hits, but 'wellington < AND > napoleon' brings the number down to 684.
The < NOT > operator allows you to exclude terms from your search. More useful perhaps is the < NEAR > operator, which allows you to search for combinations of terms within a range of up to 9 words of each other. The search 'wellington < NEAR/9 > napoleon' reduces the 684 hits found above to 124 items - i.e. there are 124 items in which the names Wellington and Napoleon occur within 9 words of each other.
The search 'palmerston < AND > grey' gives 738 hits, while 'palmerston < NEAR/9 > grey' produces 76. Similarly, 'palmerston < AND > gladstone < AND > disraeli' provides 310 items, while 'palmerston < NEAR/9 > (gladstone < OR > disraeli)' reduces this to 162 hits. The search 'palmerston < NEAR/9 > (gladstone < NOT > disraeli)' produces 79 items - i.e. those in which the names Palmerston and Gladstone are within nine words of each other but Disraeli's name is not within this range.
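The difference between < AND > and < NEAR/9 > can be sketched as follows. This is an illustration, not Olive's implementation: it assumes that 'within 9 words' means the word positions differ by at most 9, it uses a simple tokeniser, and the function names and sample texts are hypothetical.

```python
import re


def tokenise(text: str) -> list[str]:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def both_present(text: str, first: str, second: str) -> bool:
    """AND: both terms occur somewhere in the item."""
    tokens = tokenise(text)
    return first.lower() in tokens and second.lower() in tokens


def near(text: str, first: str, second: str, max_distance: int = 9) -> bool:
    """NEAR/n: some pair of occurrences lies within n word positions."""
    tokens = tokenise(text)
    first_positions = [i for i, tok in enumerate(tokens) if tok == first.lower()]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second.lower()]
    return any(abs(i - j) <= max_distance for i in first_positions for j in second_positions)


# Hypothetical item texts: one with the names close together, one far apart.
close_item = "Wellington defeated Napoleon at Waterloo"
distant_item = "Wellington " + "and then " * 10 + "Napoleon"

print(both_present(close_item, "wellington", "napoleon"), near(close_item, "wellington", "napoleon"))
print(both_present(distant_item, "wellington", "napoleon"), near(distant_item, "wellington", "napoleon"))
```

Every item that satisfies the proximity test also satisfies the AND test, but not the reverse, which is why the proximity searches above return fewer hits than the corresponding < AND > searches.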
The most useful of the word modifiers is likely to be the < FUZZY > operator, which allows the reader to indicate how many 'errors' to allow in matching the search term. It is an alternative and perhaps a bit more sophisticated method than wildcards for making allowance for OCR errors. As indicated above, a simple search for 'darwin' produces 539 hits; changing the search term to '< FUZZY/1 > darwin' tells the system you're prepared to accept one character being different, and the number of hits goes up to 758. Among these are some extra 'genuine' Darwin references, but also 'darlin' and 'darwen' (see the wildcard example above for comparison). Increasing the degree of latitude to '< FUZZY/2 > darwin' gives over 1,000 hits.
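One common way to count such 'errors' is edit distance: the number of single-character insertions, deletions or substitutions needed to turn one word into another. The sketch below assumes that definition, which this guide does not confirm for the Olive software; the function names and word list are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two words, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a character from a
                    current[j - 1] + 1,      # insert a character into a
                    previous[j - 1] + cost,  # substitute (or keep) a character
                )
            )
        previous = current
    return previous[-1]


def fuzzy_matches(term: str, words: list[str], max_errors: int) -> list[str]:
    """Words within `max_errors` edits of the search term, ignoring case."""
    return [w for w in words if edit_distance(term.lower(), w.lower()) <= max_errors]


# Hypothetical words as they might appear in the OCR text.
words = ["darwin", "darlin", "darwen", "darling", "darwood"]

one_error = fuzzy_matches("darwin", words, 1)
two_errors = fuzzy_matches("darwin", words, 2)
print(one_error)
print(two_errors)
```

Allowing one error admits 'darlin' and 'darwen' alongside the exact term; allowing two also admits 'darling', which shows how quickly a wider tolerance adds noise to the results.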
It is possible to combine more or less all the advanced search techniques in quite complex ways, and in creating these the Advanced Search screen will be needed.
The most frequent type of compound search is likely to involve adding a date filter to some other technique. For example you could carry out the same search for each of several decades: 'gladstone < NEAR/9 > disraeli' for the period 1/1/1840-31/12/1849 produces 8 hits; the same search for 1/1/1850-31/12/1859 provides 95 hits; and for 1/1/1860-31/12/1869 there are 13 hits; and so on.
There may be occasions when the reader wishes to search within a particular publication. The search for Gladstone and Disraeli in close proximity in the period 1/1/1850-31/12/1859 reduces the number of hits from 95 to 53 when only the Leader - Town Edition is searched. There is no point carrying out this search in the Northern Star since its coverage is 1837-1852, but in fact the search for Gladstone near Disraeli over the whole of the Northern Star - Main Edition yields only 4 hits.
It takes time to become familiar with the various ways in which the advanced search features in the Facsimile part of ncse can be used. Experimentation is the key, and with time and patience readers will be able to become expert in retrieving the information they wish to find and in developing techniques to explore the edition for new questions and lines of enquiry.
A further level of complexity is introduced by the parallel searching and browsing that is possible within the Keyword part of ncse. The final section of this guide will give some comparative examples of searching for the same kind of information in each of the two components.