Humanist Discussion Group

Humanist Archives: Feb. 25, 2026, 6:46 a.m. Humanist 39.342 - 18C text normalisation & correction

				
              Humanist Discussion Group, Vol. 39, No. 342.
        Department of Digital Humanities, University of Cologne
                      Hosted by DH-Cologne
                       www.dhhumanist.org
                Submit to: humanist@dhhumanist.org




        Date: 2026-02-24 12:04:37+00:00
        From: Róbert PÉTER <robert.peter@ieas-szeged.hu>
        Subject: Updated ECCO TCP and EVANS TCP: long s normalisation and harmonised metadata

Dear Colleagues,

The Text Creation Partnership transcriptions of Eighteenth-Century
Collections Online (ECCO TCP) and Evans Early American Imprints (EVANS TCP)
are accessible through several online platforms. Depending on the platform,
these texts either preserve the so-called long s, leaving it unnormalised
(e.g., ſuch rather than such), or replace the long s with s. They generally
do not correct even basic OCR-like errors (e.g., thefe for these). In
addition, the
transcriptions contain keying errors, including instances where a long s
has been entered in place of an f (e.g., ſor instead of for). On some
platforms, such as ARTFL’s ECCO TCP, the long s appears to have been
replaced with s in most cases. Where a long s had been keyed in place of an
f, this replacement introduces errors into the digital corpus (e.g., sirst
rather than first, sather rather than father). As a
consequence, the same word can occasionally occur in three different forms
within a single text or authorial corpus (e.g., theſe, thefe, these).
Tokens containing the long s account for 13.2% of ECCO TCP and 10.3% of
EVANS TCP. Because of these multiple spellings, transcription artefacts,
and OCR-related noise, many digital humanities analyses that rely on these
corpora risk producing inaccurate results.
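
As a rough illustration of how such a proportion can be measured, here is
a minimal Python sketch that computes the share of word tokens containing
a long s. The function name long_s_share and the simple regex tokenisation
are assumptions made for this example only; the tokenisation behind the
figures quoted above is not described here and may well differ.

```python
import re

LONG_S = "\u017f"  # the long s character (ſ)


def long_s_share(text: str) -> float:
    """Return the fraction of word tokens that contain at least one long s.

    Tokens are runs of Unicode word characters; this is an assumed,
    deliberately naive tokenisation, not the one used for the corpora.
    """
    tokens = re.findall(r"\w+", text)
    if not tokens:
        return 0.0
    return sum(LONG_S in token for token in tokens) / len(tokens)
```

Applied to a whole corpus, the same count would be accumulated over all
documents before dividing, rather than averaged per document.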

For this reason, we undertook a normalisation of long s forms and the
correction of a set of basic OCR errors. Using NLP methods, we have
normalised 98.8% of long s forms in ECCO TCP and 99.1% in EVANS TCP. In
practical terms, this means that, in the revised corpora, a single
modernised form is available for the overwhelming majority of cases (e.g.,
ſhall → shall, beſore → before). Full normalisation is not possible in
every instance, because some contexts remain genuinely ambiguous. For
example, given the ſ/f ambiguity in the transcription layer, sequences such
as “must be ſought” cannot always be disambiguated reliably (must be sought
vs must be fought).
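
To make the source of this ambiguity concrete, here is a minimal Python
sketch of a purely lexicon-based approach. It is not the method used for
the revised corpora, which is described above only as "NLP methods"; the
function names and the toy lexicon are hypothetical. Each long s is tried
as both s and f, and a token is normalised only when exactly one reading
is a known word.

```python
from itertools import product

LONG_S = "\u017f"  # the long s character (ſ)


def candidates(token: str) -> set[str]:
    """Expand every long s in the token to both 's' and 'f'."""
    slots = [("s", "f") if ch == LONG_S else (ch,) for ch in token]
    return {"".join(combo) for combo in product(*slots)}


def normalise(token: str, lexicon: set[str]) -> str:
    """Normalise a token only if exactly one reading is in the lexicon.

    Ambiguous tokens (several known readings) and unknown tokens
    (no known reading) are returned unchanged.
    """
    if LONG_S not in token:
        return token
    hits = [c for c in candidates(token) if c.lower() in lexicon]
    return hits[0] if len(hits) == 1 else token


# Hypothetical toy lexicon, for illustration only.
toy_lexicon = {"shall", "before", "sought", "fought"}
```

With this toy lexicon, ſhall and beſore resolve to shall and before, while
ſought is left unchanged because both sought and fought are known words.
Resolving such cases requires context beyond the single token.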

Both the original and the normalised versions of ECCO TCP and EVANS TCP
with harmonised metadata are now available in the AVOBMAT text-mining tool
(avobmat.hu), alongside additional collections of eighteenth-century novels
and drama. The platform also enables straightforward document-level
comparison between the two versions. If you notice any errors, please do
not hesitate to contact me.

Best regards,

Róbert Péter

**********************
Róbert Péter, Ph.D.
Associate Professor
Institute of English and American Studies
Head of Digital Humanities Laboratory
University of Szeged
Robert_Peter@Fedihum.org
Researchgate <https://www.researchgate.net/profile/Robert-Peter-3>,
Academia.edu <https://u-szeged.academia.edu/RobertPeter>


_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: humanist@dhhumanist.org
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php