Machine Learning Application for Hebrew Paleography Based on SfarData - The 18th World Congress of Jewish Studies

We present an ongoing project for the automatic classification of the fourteen subtypes of medieval Hebrew script. We aim to distinguish automatically between cursive and square scripts, regardless of the script's main type, by applying deep learning algorithms to a dataset of medieval Hebrew manuscripts. The manuscripts for the VML-HP dataset were chosen according to the criteria of contemporary Hebrew paleography, which is one of the most advanced in the world and the only one that possesses a complete database of all dated Hebrew manuscripts prior to 1540: SfarData (https://sfardata.nli.org.il/).

The main challenge of this project was the limited number of available digitized manuscripts. For some script types (Italian, Byzantine) the shortage was more pronounced; for others (Ashkenazi, Sephardic) we had manuscripts in abundance; for a few (Oriental square) we had to use black-and-white microfilms. Keeping the dataset balanced was a challenge in itself. At first, we ran experiments on a smaller dataset, but eventually we enlarged it threefold. The dataset is split into train, test, and blind test sets. The results of the blind test, in which the algorithm is tested on manuscripts it has never seen before, represent its real-life performance.
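The split described above can be illustrated with a minimal sketch. It assumes that the blind test set is formed by holding out whole manuscripts per script subtype, and that the remaining patches are divided into train and test at random; the function name, the fractions, and the tuple layout are hypothetical and are not taken from the project itself.

```python
import random
from collections import defaultdict


def split_patches(patches, blind_fraction=0.2, test_fraction=0.2, seed=0):
    """Split patches into train, test, and blind test lists.

    patches: list of (patch_id, manuscript_id, subtype) tuples (assumed layout).
    Whole manuscripts are held out for the blind set, so no blind manuscript
    contributes patches to train or test.
    """
    rng = random.Random(seed)

    # Group manuscript ids by script subtype.
    by_subtype = defaultdict(set)
    for _, manuscript_id, subtype in patches:
        by_subtype[subtype].add(manuscript_id)

    # Hold out some manuscripts of every subtype, keeping at least one
    # manuscript per subtype available for training.
    blind_manuscripts = set()
    for subtype in sorted(by_subtype):
        ids = sorted(by_subtype[subtype])
        rng.shuffle(ids)
        n_blind = min(max(1, round(len(ids) * blind_fraction)), len(ids) - 1)
        blind_manuscripts.update(ids[:n_blind])

    blind = [p for p in patches if p[1] in blind_manuscripts]
    seen = [p for p in patches if p[1] not in blind_manuscripts]

    # Patches from the remaining manuscripts are divided at random.
    rng.shuffle(seen)
    n_test = round(len(seen) * test_fraction)
    return seen[n_test:], seen[:n_test], blind
```

Holding out manuscripts rather than individual patches is what makes the blind score an estimate of performance on unseen hands, since patches from one manuscript tend to resemble each other closely.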

Another important challenge is the endless minor variation among individual hands within a single script subtype. This is especially pronounced in subtypes that were used over a long period of time and in numerous geographical locations (Sephardic semi-square is the most prominent example). We address this by running experiments with different algorithms and choosing the best-performing one.

We developed a clean patch generation algorithm that produces patches containing approximately five text lines, a size sufficient for human paleographers to classify the script type. The algorithm also handles noisy backgrounds, decorations, marginal drawings, and other irrelevant information. Originally, we ran experiments on approximately 170K patches; today we use 350K.
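As a rough illustration of the idea, and not the project's actual algorithm, the sketch below tiles a binarized page into square patches whose side equals about five text lines and keeps only those whose share of ink pixels looks like text. The line-height estimate, the ink thresholds, and all names are assumptions introduced for this example.

```python
def ink_ratio(binary, box):
    """Fraction of ink pixels (value 1) inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    area = (bottom - top) * (right - left)
    ink = sum(sum(row[left:right]) for row in binary[top:bottom])
    return ink / area


def candidate_patches(binary, line_height, lines_per_patch=5,
                      min_ink=0.05, max_ink=0.5):
    """Return boxes of square patches spanning about `lines_per_patch` lines.

    binary: page as a list of rows, 1 = ink, 0 = background (assumed input).
    line_height: estimated text line height in pixels (assumed to be known).
    Patches that are nearly empty (margins, blank background) or nearly
    solid (decorations, drawings) are discarded by the ink thresholds,
    which are illustrative values.
    """
    side = line_height * lines_per_patch
    height, width = len(binary), len(binary[0])
    kept = []
    for top in range(0, height - side + 1, side):
        for left in range(0, width - side + 1, side):
            box = (top, left, top + side, left + side)
            if min_ink <= ink_ratio(binary, box) <= max_ink:
                kept.append(box)
    return kept
```

A simple ink-density filter of this kind would not separate text from every kind of decoration; it only shows where such a rejection step sits in the pipeline.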

Our current results on the blind test are approaching the performance of a trained human paleographer, who achieved 70% accuracy at the patch level.