DH-400 / 4 credits

Teacher: Ehrmann Maud

Language: English


Summary

This course introduces historical document processing, focusing on concepts and methods that enable the transformation of digitised materials into searchable information. Grounded in machine learning and document processing, it also covers data curation and copyright considerations.

Content

Over the past few decades, large-scale digitisation efforts have steadily produced a growing number of facsimiles of historical documents. Beyond their obvious value for preservation and access, these digitised sources also create opportunities for the automatic processing and analysis of their textual and visual content. How can the complex multimodal information contained in digitised historical documents, especially historical newspapers, be extracted and linked? This course considers the entire historical document processing pipeline, focusing on the concepts and methods that enable the transformation of digitised materials into structured, searchable information.

After an introduction to the main aspects of mass digitisation and the specifics of historical media, the course examines the major building blocks that transform scans into structured data. Topics covered include text acquisition challenges and approaches (optical layout recognition, OLR, and optical character recognition, OCR), text preprocessing techniques, information extraction (historical named entity processing and linking), collection characterisation (document clustering and topic modelling), visual content classification, and search systems. These approaches are grounded in the core concepts of machine learning and information retrieval that underpin them.

In addition to technical methods, the course introduces standards and best practices for data preparation, as well as copyright considerations. Finally, it situates historical information extraction within the broader contexts of digital scholarship and the cultural heritage ecosystem.

By the end of the course, students will have acquired a solid understanding of historical text processing, including key concepts, methods, and practical applications.

 

Outline (tentative)

  • Part 1 - Fundamentals and pipeline overview of historical document processing (two classes): introduction to the challenges and high-level stages of the historical document processing pipeline; review and illustration of core concepts of machine learning, information extraction and information retrieval.
  • Part 2 - Representation and access of historical text and image data (three classes): text acquisition (OCR, OLR), text and image representation, search systems.
  • Part 3 - Computational methods and practice (five classes): named entity processing, topic modelling, data curation, image classification, navigating the copyright landscape.

Keywords

historical document, natural language processing, machine learning, information extraction, information retrieval, digital humanities

Learning Prerequisites

Recommended courses

 

 

Important concepts to start the course

Basic knowledge of Machine Learning is recommended.

For those who wish to go deeper into ML, CS-433 Machine learning is recommended (to be followed in parallel or later).

Learning Outcomes

By the end of the course, the student must be able to:

  • Characterize the main steps of historical document processing.
  • Analyze a collection of historical documents to determine appropriate processing methods and workflows for a given information or research need.
  • Apply the presented methods to historical documents in practice.
  • Assemble a high-quality dataset for machine learning training purposes.
  • Contextualise the ecosystem surrounding historical text processing, including challenges, approaches, resources, actors, recent developments, and interdisciplinary aspects.
  • State the main questions surrounding copyright and the use of digitised archive collections.

Teaching methods

  • Lectures in the classroom
  • Exercise / lab sessions: hands-on practical work with historical document datasets.

Expected student activities

  • Attend lectures
  • Attend exercises and lab sessions; carry out hands-on practical work.
  • Engage with course materials.

Assessment methods

  • Mid-term: written exam, multiple-choice quiz (MCQ), or paper presentation
  • Small group project
  • Final written exam (during the semester)

Resources

Moodle Link

In the programs

  • Semester: Fall
  • Exam form: During the semester (winter session)
  • Subject examined: Historical Document and Media Processing
  • Courses: 2 Hour(s) per week x 14 weeks
  • Exercises: 2 Hour(s) per week x 14 weeks
  • Type: mandatory
  • Semester: Fall
  • Exam form: During the semester (winter session)
  • Subject examined: Historical Document and Media Processing
  • Courses: 2 Hour(s) per week x 14 weeks
  • Exercises: 2 Hour(s) per week x 14 weeks
  • Type: optional

Reference week

Monday, 8h - 10h: Lecture BC02

Monday, 10h - 12h: Exercise, TP BC02
