SCIENTIFIC PRESENTATION

Head of the “Bibliothèques Virtuelles Humanistes” since September 1st, 2016:

  • Chiara LASTRAIOLI (Pr, CESR)
Head of BVH, 2002-2016:
  • Marie-Luce Demonet (Pr Emeritus, CESR, IUF)

The « Bibliothèques Virtuelles Humanistes » (Virtual Humanist Libraries in Tours): A Collection, or a Corpus?

(Paper read at the Digital Humanities Meeting, University of Maryland, June 23, 2009, and updated)

The Bibliothèques Virtuelles Humanistes (BVH, or Virtual Humanist Libraries) is a project run since 2002 by a research team in the Centre d’Etudes Supérieures de la Renaissance (Center for Renaissance Studies) at the University of Tours, France. This Center is a laboratory belonging to the National Center for Scientific Research, Department of Social Sciences and Humanities. The BVH team comprises four scholars and one PhD student in literature, history, book history, and classics, together with five research assistants, and we enjoy the help of the computing department of the University of Tours.

1. Project features

The goal of this team is to develop a digitization project begun in 2003 (http://www.bvh.univ-tours.fr/), in order to offer two types of digital representations: facsimile and reliable transcription.

Even if there is no limit to such a venture, the milestone for 2012-14 is a set of 2,000 books in facsimile, about 200 texts, 12,000 notarial acts and a few manuscripts and archives. They are selected mainly from regional collections, to be displayed on a single website with different levels of queries. In this virtual library (as of July 25, 2011, 700 volumes, 500 of which are already online), original collections may be somewhat dismembered, because we pick out the documents that are of major interest to us: in the first campaigns, 300 were digitized in Tours and Poitiers and 200 in Orléans and Vendôme; 150 will be digitized in Blois, and the same number in Bourges. The libraries and institutions holding the collections have signed an agreement in which they consent to have the images of the books digitized and published online, without any royalties, but with the assurance that the origin of the document will be recorded in the metadata and in the pdf, and that the quality will be satisfactory. The ID of each digitized item includes the national number of the institution and the call number of the original document.

In light of these two types of processing (as image and as text), and taking into account the reading habits and requests that recent navigational tools have encouraged, the BVH conducts research on two fronts: image processing (indexing text in image mode, extracting images from scanned pages, classifying and indexing them) and the acquisition of significant corpora of transcribed texts.

The BVH project has been built in response to the requests of large communities drawn from disciplines with varied requirements: historians, art historians, specialists of literature, philosophy, and languages, and historians of science. Their needs fall into four directions:

  1. Archive (document content-oriented)
  2. Book history (document form-oriented)
  3. Linguistics (language-oriented)
  4. Style (aesthetics-oriented)

The digital libraries as first conceived at the CESR would not be a “corpus,” but rather a “collection,” as their only common feature is their period of publication, from 1470 to 1650, identified as the “Renaissance” in the broadest sense, including ancient or medieval texts edited during that period. We try to satisfy everyone, with a kind of eclecticism that could lead to mass digitization. We do not aspire to “shelf” digitization, because we could have it done by Google, or by the Ministry of Culture, or by the BnF (French National Library), our partner for Renaissance collections.
Even if the database includes broad categories (classics of the Renaissance, sources of science, legal and political history, philosophy and theology), a priori each book on any shelf could fall under one of these categories, all the more so since a fifth category, “special projects,” allows additions to one or several subcategories (like the “Rabelais” database, the “Montaigne à l'œuvre” project, the dictionaries, etc.), the rest being composed of stand-alone books or curiosities.

The first release of the website (2003-2006) was mainly a facsimile and image database, searchable through the catalogue. The second one (2007-2009) was an aggregate of several sub-sites, including texts, manuscripts, and acts, without a single unified search engine. The fifth version of this heterogeneous website is now released (July 2011). It will become a virtual collection of documents created during the Renaissance, and located exclusively on the website: images of a book and images extracted from the facsimile on the one hand, and transcriptions on the other.

The transcriptions are displayed in two modes of rendering: a diplomatic version, without additions, and a second version that is regularized to conform to what we call “cultural heritage.” The latter inserts corrections or variations that are essential for understanding the text, but they are TEI encoded, so that internet users may switch between the diplomatic version and the regularized one at any time. This demand for high quality originates from the community of linguists and book historians; art historians and philosophers, for example, may not pay such attention to exact spelling, and they could accept transcriptions that are totally modernized.
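The switch between the two versions can be pictured with a small sketch. It assumes that each variant spelling is wrapped in the standard TEI pair `<orig>`/`<reg>` inside a `<choice>` element; the text above does not say which tags the BVH encoding actually uses, and the sample sentence and the `render` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: not taken from a BVH file.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">'
    '<choice><orig>Ie</orig><reg>Je</reg></choice> veux '
    '<choice><orig>auoir</orig><reg>avoir</reg></choice> '
    '<choice><orig>vn</orig><reg>un</reg></choice> livre'
    '</p>'
)


def render(elem, keep):
    """Flatten an element to plain text, keeping one reading per <choice>.

    keep="orig" yields the diplomatic text, keep="reg" the regularized one.
    """
    if keep not in ("orig", "reg"):
        raise ValueError("keep must be 'orig' or 'reg'")
    skip = "reg" if keep == "orig" else "orig"
    parts = [elem.text or ""]
    for child in elem:
        local_name = child.tag.split("}")[-1]  # drop the namespace prefix
        if local_name != skip:
            parts.append(render(child, keep))
        parts.append(child.tail or "")  # text that follows the child element
    return "".join(parts)


root = ET.fromstring(SAMPLE)
diplomatic = render(root, "orig")
regularized = render(root, "reg")
```

Because both readings live in the same file, neither rendering loses information: the diplomatic text remains available to linguists while other readers see the regularized one.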

We could say that libraries deal with collections, and research programs with corpora, processed by specific software. In this third version, the BVH project includes the criteria of actual corpora.

Selection is the first step towards creating a corpus. A reliable selection process rules out the risk of ending up with a haphazard heap of facsimiles. Each work is chosen after an analysis of its form and its content. The researchers in charge [1] examine its quality from the point of view of the history of the book and of the lines of research pursued by researchers at the Center or by their colleagues. They hope to make the object on which they are working accessible to the scientific community, in order to share knowledge. This method goes against the traditional editorial process, which consists of offering a “definitive” paper edition once the apparatus has selected the best lectiones. Selection is therefore a tool of collaboration for the researcher, who enriches the available collections in the process. For example, the library of the Museum of Sologne, in Romorantin, holds a copy of the first (1580) edition of Montaigne’s Essais: although it is not extraordinarily rare, this state of the text was worth offering to the public so that it can be compared with the BnF’s copy (Gallica), and it will be easier to set the facsimile opposite the transcription, page by page.


2. Renaissance linguistic corpora and TEI


Although this lengthy selection process is a challenge in itself, the goal of a public library of facsimiles differs from that of a research library: for old prints, queries generally run only on metadata and, at best, on the table of contents, which constitutes a minimal index (see Gallica 2). Sometimes they run on a quick pass of OCR (mrc format by Adobe), but the handling of the text remains superficial, patchy, and faulty. Literature and language specialists use linguistic systems (Hyperbase, TXM, PhiloLogic…) that must run on carefully prepared data, and the choice of software largely determines the nature and the encoding of the cluster. Technical improvements now allow these different approaches to come together, although they are not easy to combine. In 2009, the BVH team was asked by Europeana to join, as a content provider, the consortium of an “eContentplus” project that aims precisely at refining the way multimedia collections are searched: this might increase the availability of new software.

Linguistic corpora and text databases for works before 1800 are often made up of modern editions, which are under copyright and impossible to show next to their facsimiles; a matching facsimile often does not exist, since such editions were established from several different reference editions and do not respect the physical presentation. These editions have the obvious advantage of lending themselves easily to data searches and to detailed encoding. But our principles are to deal only with original copies, and to apply only minimal regularizations. The result of the transcription is a new work, identified with an ID that includes the call number of the original copy, with a specific extension depending on the format (TEI or other).

The obvious ambition of creating literary/linguistic corpora within the BVH revived an older project: the “Epistemon” website, created in Poitiers in 1998, offered a dozen sixteenth-century texts patiently keyworded by ourselves and published in html format. In Tours, I had the opportunity to highlight the rare book departments of the regional libraries, and a few years later the integration of Epistemon became possible. However, reconsidering the constitution of a linguistic corpus does not rule out other author-oriented or thematic corpora.
We also have to consider the status of new virtual objects such as the hybrid “text-and-image” screen with matching items, and the virtual sub-sets of manuscripts and of extracted data. Corpora multiply as time goes by.

To define the term “corpus” in the way we use it, I shall refer to the explanations given in the TEI Guidelines, which are quite helpful in this domain, and to the historical evolution of the term, specifically in the Early Modern period: inspired by the first “corpus” bearing this name in Antiquity, the Corpus Juris Civilis, Renaissance humanists adopted this term to denote either the complete works of an author such as Homer, or the thematic or formal unity of a collection. Our Rabelais database could be an example of such an author-oriented corpus: its name, FORSE, “Fonds Rabelais et ses Sources En ligne” (“Rabelais Collection and its Sources Online”), signals the gathering not only of the editions themselves, but also of the main sources and related texts or documents. This project is still waiting for sufficient funding…

Formal unity is characteristic of the text database “Epistemon.” In the Guidelines, “corpora” refers to a set of TEI tags used to encode linguistic data, in a rather wide definition, and the term “collection” is the genus of which the corpus is a species. For us, it is important to define both levels (generic for the collection, specific for the corpus) because this determines the choice of metadata, tags, and search engines used to process one cluster or another. As we are a research team, and not only a research library, we have specific research targets, while we also wish to disseminate the results of scholarship immediately, so that as wide a public as possible is able to retrieve the data and search within it using specific tools.

The components of our linguistic corpus (now about 31 encoded texts, out of a hundred) have been selected according to predetermined criteria, for example Middle French or Early Modern French (1450-1650), but other languages can be processed because they are also tagged (mainly Latin, regional patois or dialects). The choice of this “core” is obvious in my eyes: I have studied the French language of this period in particular (and I teach it), as well as ideas about language and the birth of a philosophical terminology in French. The ARTFL project and Frantext are so far not very rich in documents of this period in their original spelling. Collaboration with ARTFL and the University of Chicago is currently under agreement, mainly through the use of PhiloLogic™.

Latin, Italian, English, and Spanish texts are not excluded a priori. A multilingual corpus is all the more conceivable as many Renaissance texts contain large fragments of text in Latin (the Essays, for example). Even if we initially chose texts in French, originals or translations, leaving to researchers from other linguistic regions the care of developing their own corpora, the flexibility of TEI encoding makes it possible to build a “Renaissance” corpus that would not be limited to French and would even contain bilingual or multilingual “aligned” corpora (a text and its translation, for example). Thanks to the “TEI Renaissance and modern times” application (Tours, July 2008), appropriate encoding of different versions of a text provides a model that makes the physical description of a text compatible with its logical structure [2]. These procedures are taught in a specialized Master’s program and during workshops open to students, researchers and librarians. A larger French TEI program is now being developed by a French network, and we collaborate with the Ecole Nationale des Chartes, the Ecole Normale Supérieure in Lyons, the French Language Agency in Nancy, the French National Archives, the Machiavelli project in Lyons, the Institute for Humanities in Caen, and others.

The general structure of the BVH is designed to allow specific queries at three levels, and only the last two apply to what I call corpora:

  1. General website query. The BVH project is a website, not a portal, and it points to no external resources. It has recently been run on a search engine such as XTF, able to search within metadata, pdf files, and TEI-encoded texts. XTF allows the image to be displayed side by side with the text itself (not in a pop-up). We must adapt the original divisions (<div>) of the TEI encoding to align image and text properly, and to constitute a sub-site of these hybrids. Results of queries on character strings may vary in quality (from raw OCR to TEI-encoded files).
  2. Corpus query with basic filters and options: we have adapted PhiloLogic to the whole text database, encoded with the same DTD and Renaissance TEI application (thanks to Mark Olsen and Tim Allen’s invaluable help), to the “Epistemon” corpus: no lemmatization is offered, but concordances and KWIC, similarity searching, statistics, and searching by bibliographic metadata, genre, sub-genre, and division titles are supported.
  3. A “tailored” corpus query and annotation (off-line, for the present), with a specific piece of software named “Analog”: a lemmatization tool developed by a linguist, Marie-Hélène Lay, at the University of Poitiers, and a computing engineer. The corpus has been conceived, structured, customized, described and annotated to overcome the major problems facing linguists who study Early Modern texts: heterography, word segmentation, neology, and morphological variation. Using modern editions biases all statistics, since they display “arranged” spelling and punctuation, and at the same time hinders the full development of historical linguistic scholarship. The monitor corpus is presently the Rabelais database (much smaller than the “Rabelais corpus” itself), containing the major editions of the novels, with genuine spellings and punctuation marks [3]. The extent of this corpus is not definitive: any other text could be added, in order to compare the data already processed with new data. We did not choose to use the “corpora” specifications of the TEI, because our goals are not only linguistic, and because we would prefer to organize several types of combinations of texts. These documents are quite often composite themselves.

    We are also testing the current TXM platform prototype, which helps to build and analyze tagged and structured corpora (HTML edition for each textual unit of a corpus, import environment, search for complex lexical patterns, concordances, factorial analysis, etc.).
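To make the second level more concrete, here is a minimal keyword-in-context (KWIC) sketch. It is not PhiloLogic's implementation: the `kwic` function, its `window` parameter and the sample sentence are hypothetical, and matching is done on exact word forms only, which mirrors the absence of lemmatization mentioned above.

```python
import re


def kwic(text, keyword, window=3):
    """Return (left context, keyword, right context) for each exact-form hit.

    Contexts are measured in tokens; no lemmatization or spelling
    normalization is attempted.
    """
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


# Made-up sample in period-style spelling, for illustration only.
sample = "Pource que rire est le propre de l'homme. Vn homme de bien."
for left, word, right in kwic(sample, "homme", window=2):
    print(f"{left:>12} [{word}] {right}")
```

With exact-form matching, a variant spelling of the same word would have to be searched for separately, which is precisely the heterography problem that the third level addresses.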

The improvements of a “Renaissance” OCR (named “RETRO”), also developed in Tours by the laboratory of Computer Sciences [4], allow text acquisition from difficult-to-read printed matter and easier corpus extension, providing “diplomatic” spellings, with abbreviations and the old usage of i/j and u/v. The use of form dictionaries (thesauri), compiled from transcriptions and properly sorted, increases the possibilities of semi-automatic correction of a homogeneous corpus. Even if acquisition in text mode with an accuracy rate of over 97% still represents a considerable cost for these early printed books (post-correction is always necessary), it allows the growth of corpora that offer highly varying written forms.
Each thesaurus corresponds to a mode of transcription:

  1) diplomatic, which is the output of OCR processing or of keyboarding by non-Western operators (keyboarding is often subcontracted to offshore companies)
  2) “cultural heritage” transcription, which is a transformation of the previous one, or the direct result of scholarly keyboarding.

Lately, a “dissimilation” tool has made possible the automatic transformation from 1) to 2). At the same time it creates new thesauri. These acquisitions in turn facilitate the treatment of new texts and allow linguistic analyses of the usages found, which can be double-checked against the text: in context, of course, and in the facsimile facing the text [5].
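The passage from a diplomatic form to a regularized one can be sketched as follows. This is a toy illustration, not the algorithm of the actual tool: the three positional rules, the `dissimilate` function and the contents of `FORM_DICT` are assumptions made for the example. The rules are deliberately naive and over-generate (they would turn “queue” into “queve”), which is why a form dictionary is consulted first.

```python
import re

VOWELS = "aeiouyàâéèêëîïôûù"

# Hypothetical form dictionary ("thesaurus"): diplomatic form -> regularized
# form. It takes priority over the rules and catches their known failures.
FORM_DICT = {"queue": "queue", "ouuert": "ouvert"}


def dissimilate(word, form_dict=None):
    """Regularize i/j and u/v in one lower-case word (illustrative rules)."""
    form_dict = FORM_DICT if form_dict is None else form_dict
    if word in form_dict:
        return form_dict[word]
    # Rule 1: initial "v" before a consonant stands for a vowel: vn -> un.
    word = re.sub(rf"^v(?=[^{VOWELS}])", "u", word)
    # Rule 2: "u" between two vowels stands for a consonant: auoir -> avoir.
    word = re.sub(rf"(?<=[{VOWELS}])u(?=[{VOWELS}])", "v", word)
    # Rule 3: initial "i" before a vowel stands for a consonant: iour -> jour.
    word = re.sub(rf"^i(?=[{VOWELS}])", "j", word)
    return word


def build_thesaurus(words, form_dict=None):
    """Map each diplomatic form to its regularized form, as a new dictionary."""
    return {w: dissimilate(w, form_dict) for w in words}


regularized = [dissimilate(w) for w in "ie veux auoir vn iour".split()]
```

Each run of `build_thesaurus` yields pairs that can be reviewed by hand and merged back into the form dictionary, which is one way to picture how new thesauri accumulate as texts are processed.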


3. Manuscripts and archives


This could be enough for the team, but new research projects call for the integration of manuscripts. Originally conceived as a virtual library of books, the BVH database is currently being enriched with some manuscripts and with several collections of archives. We already use the “msDescription” module of the TEI to encode all kinds of sources. An example of a literary manuscript is that of the Cinquiesme Livre de Pantagruel (1564), a precious version of this doubtful work attributed to Rabelais, belonging to the Rothschild collection of the National Library in Paris. Even if it is not an autograph, it is certainly from the same period, and it is quite important for determining the authenticity of the novel. Neither the image of the manuscript nor its exact transcription has ever been made available to the public [6]. The facsimile of the manuscript will be included in the BVH project, in the Rabelais Corpus, and its transcription will be searchable through PhiloLogic and annotated with Analog. Other valuable manuscripts could be Marguerite de Navarre’s Heptameron, and collections of humanist poetry held by the Public Libraries of Orléans and Bourges. Some deserve exact transcription, others do not. Documents can be displayed only in facsimile, or offered in transcription, which demands funding on a large scale. The Queen’s accounting register, from a private library, could be processed as an archive, searched with XTF or PhiloLogic but not with Analog [7].
Other manuscripts, recently discovered, deserve a full display because they are not only cultural heritage items but also evidence for current research: for instance, the account register concerning the Royal Palace in Romorantin imagined and sketched by Leonardo da Vinci in France; documents about the Royal Entry of Louis XII in Bourges (1506), and so on. These do not constitute a corpus in themselves, obviously. Nevertheless, their transcription, if any, will be properly TEI encoded and will enrich the database processed with XTF and PhiloLogic.

The cluster of notarial acts is, at present, a plain database, built from the many paleographic transcriptions made by Dr. Pierre Aquilon. During his thirty-year career at the Center for Renaissance Studies, he transcribed a great many archival documents, to which we can add the summaries or transcriptions by Prof. Bernard Chevalier, a specialist in regional history. Together they stored nearly 12,000 items coming from notarial offices of the county (Indre-et-Loire and Tours). Even if these data have little literary interest, they provide many historical details and are widely used for genealogical records. They will be searched through XTF, and the whole database is migrating towards an XML standard.


4. Extracted images
A collection of images has been built up since the beginning of the BVH project. The exploitation of image mode, in particular of illustrated elements, is specific to the digitized collection: the AGORA program, also developed by the laboratory of computer sciences of the University, analyzes the page layout and allows semi-automatic extraction of illustrations, graphs, portraits, typographic material and initials. It provides searchable databases: the motifs of the graphics and the initials are indexed with the Iconclass Thesaurus that harvests our image database [8]. The data extracted in this way provide a corpus, which can be subject to precise queries. If the portrait gallery is one of the favorites, the most ambitious computing development is that of ornamental letters: a special program of the National Research Agency (“Navidomass”) deals with thousands of them. It aims at providing a sophisticated tool for “initial mining” based on motifs, features, letters, and background.
To be a corpus, and not only a collection, a set of data needs to provide added value beyond the mere gathering of data. The BVH project tests diverse configurations of corpora, and a close connection exists between the needs of the end-user, the available software, and the raw or structured data. Other developments can be foreseen: if there is a Rabelais corpus, why not a “Montaigne Library” that virtually brings together the content of his shelves? Why not search the texts with the same keywords that build the Iconclass thesaurus, offering in this way a selection of topoi present in the texts, in their tables of contents and in the illustrated elements?

These corpora are unified by the metadata that feed the main catalogue and the specific grouping, but they still maintain their independence: the accuracy of the results must be preserved. Extraction, TEI encoding and indexing are combined in this “cultural heritage” undertaking, the limits of which are expanding every day.


Marie-Luce Demonet, July 25, 2011


[1] Under the supervision of Toshinori Uetani, CESR, with the collaboration of Marie-Elisabeth Boutroue, IRHT.

[2] Nicole Dufournaud, CESR, Jean-Daniel Fekete, INRIA.

[3] Details can be found in “Sustainability of Language Resources and Tools for Natural Language Processing,” http://www.lrec-conf.org/proceedings/lrec2008/.

[4] Prof. Jean-Yves Ramel, Dr. Nicolas Ragot, Dr. Mathieu Delalandre.

[5] “Dissimilog” is developed in Perl by Thierry Vincent, Poitiers.

[6] The later Pléiade edition (Oeuvres de Rabelais, M. Huchon ed., Paris, 1994) takes up the old transcription of the Montaiglon edition (nineteenth century).

[7] The private owner has given authorization for digitizing and online publishing, but the tables, numbers and columns must be processed separately.

[8] See Hans Brandhorst’s presentation.