The 18th World Congress of Jewish Studies

From Documents to Data through Artificial Intelligence: The Case of the Jewish Community Archive

The Archive of the Jewish Community of Pisa is one of those hidden cultural assets that few people can consult and study directly. Today, thanks to the Internet and modern technologies, at least part of it can be consulted online. Moreover, modern Artificial Intelligence techniques can ease the scholar's tedious task of extracting information from texts.

In this lecture, I explain how I automatically extracted information from one of the archival documents, the document Nati Morti e Ballottati.


The document analyzed is a register of births, deaths, and ballottati (new converts to the Jewish religion) covering the years 1750-1850. It has the following structure:

* Newborns
  - Record 1
  - Record N
* Deaths
  - Record 1
  - Record N
* New Arrivals
  - Record 1
  - Record N

The document is written in Italian, with some parts in Hebrew. In this lecture, I deal only with the section on newborns.

Each record contains the following information: the name, sex, and date of birth of the newborn; the name and surname of the father; the name and surname of the mother (not always available); and the name of the paternal grandfather (not always available).

The goal of this lecture is to automatically extract the annotations described in Figure 1 for all records contained in the document. For this purpose, I developed a model that combines part-of-speech tagging techniques and regular expressions. In some cases the model cannot extract the information correctly; these cases were handled manually.
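To make the regular-expression part of this approach concrete, here is a minimal sketch in Python. It is not the model used in this work: the record wording, the pattern, and the function name `extract_record` are all invented for illustration, since the actual register entries are not reproduced here, and the part-of-speech tagging step is left out entirely. Entries that do not match return `None`, which mirrors the idea of setting them aside for manual handling.

```python
import re

# Hypothetical record layout, invented for illustration only.
# Example: "12 Marzo 1790 nacque un maschio a Salomone Levi e Ester Coen, nome Isacco"
RECORD_PATTERN = re.compile(
    r"(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]+)\s+(?P<year>\d{4})\s+"
    r"nacque\s+(?P<sex>un maschio|una femmina)\s+"
    r"a\s+(?P<father_name>\w+)\s+(?P<father_surname>\w+)"
    # The mother is optional, as she is not always recorded.
    r"(?:\s+e\s+(?P<mother_name>\w+)\s+(?P<mother_surname>\w+))?"
    r",\s+nome\s+(?P<name>\w+)"
)


def extract_record(text):
    """Return a dict of fields, or None if the entry needs manual handling."""
    match = RECORD_PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    # Normalise the sex to a single letter.
    fields["sex"] = "M" if fields["sex"] == "un maschio" else "F"
    return fields


if __name__ == "__main__":
    sample = "12 Marzo 1790 nacque un maschio a Salomone Levi e Ester Coen, nome Isacco"
    print(extract_record(sample))
```

A real register would need far more tolerant patterns (abbreviations, spelling variants, Hebrew passages), which is presumably where part-of-speech information helps to locate names.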

As output, the model produces a table.
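One possible shape for such a table is sketched below as CSV text, with one row per newborn. The column names and the helper `records_to_csv` are assumptions made for this example and follow the fields listed above; the actual output format is not specified here.

```python
import csv
import io

# Assumed column names, derived from the fields each record is said to contain.
COLUMNS = [
    "name", "sex", "birth_date",
    "father_name", "father_surname",
    "mother_name", "mother_surname",
    "paternal_grandfather",
]


def records_to_csv(records):
    """Serialise a list of record dicts to CSV text; missing fields stay empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


if __name__ == "__main__":
    # Invented example row, not taken from the register.
    print(records_to_csv([{
        "name": "Isacco", "sex": "M", "birth_date": "1790-03-12",
        "father_name": "Salomone", "father_surname": "Levi",
    }]))
```

Leaving optional fields empty rather than dropping the row keeps records with an unknown mother or grandfather in the table.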

The extracted information was then processed to calculate some statistics, such as the trend of births over five-year periods and the ten most common names.
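Statistics of this kind can be computed with a few lines of Python. The sketch below assumes that the birth trend is read as a count of births per five-year interval starting in 1750, and that birth years and given names have already been extracted as plain lists; the function names are hypothetical and the sample values are invented.

```python
from collections import Counter


def births_per_five_years(years, start=1750):
    """Count births in consecutive five-year intervals, keyed by first year."""
    counts = Counter()
    for year in years:
        # Map each year to the first year of its five-year interval.
        bucket = start + ((year - start) // 5) * 5
        counts[bucket] += 1
    return dict(sorted(counts.items()))


def most_common_names(names, n=10):
    """Return the n most frequent names with their counts."""
    return Counter(names).most_common(n)


if __name__ == "__main__":
    # Invented values, for illustration only.
    print(births_per_five_years([1750, 1752, 1755, 1759, 1760]))
    print(most_common_names(["Isacco", "Abramo", "Isacco"], n=2))
```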