EDDA-Coordinata: An Annotated Dataset of Historical Geographic Coordinates

arXiv:2602.23941v1

Abstract: This paper introduces a dataset of enriched geographic coordinates retrieved from Diderot and d'Alembert's eighteenth-century Encyclopédie. Automatically recovering geographic coordinates from historical texts is a complex task, as they are expressed in a variety of ways and with varying levels of precision. To improve retrieval of coordinates from similar digitized early modern texts, we have created a gold standard dataset, trained models, published the resulting inferred and normalized coordinate data, and experimented applying these models to new texts. From 74,000 total articles in each of the digitized versions of the Encyclopédie from ARTFL and ENCCRE, we examined 15,278 geographical entries, manually identifying 4,798 containing coordinates, and 10,480 with descriptive but non-numerical references. Leveraging our gold standard annotations, we trained transformer-based models to retrieve and normalize coordinates. The pipeline presented here combines a classifier to identify coordinate-bearing entries and a second model for retrieval, tested across encoder-decoder and decoder architectures. Cross-validation yielded an 86% EM score. On an out-of-domain eighteenth-century Trévoux dictionary (also in French), our fine-tuned model had a 61% EM score, while for the nineteenth-century, 7th edition of the Encyclopaedia Britannica in English, the EM was 77%. These findings highlight the gold standard dataset's usefulness as training data, and our two-step method's cross-lingual, cross-domain generalizability.
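The EM figures in the abstract are exact match scores: the share of entries for which the predicted coordinate string is identical to the gold annotation. The sketch below shows one way such a score could be computed. The function name `exact_match`, the whitespace-collapsing step, and the toy strings are assumptions made for illustration; the abstract does not say how the authors normalize strings before comparison.

```python
def _collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace so spacing differences do not count as errors."""
    return " ".join(text.split())


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        _collapse_whitespace(p) == _collapse_whitespace(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


# Invented toy strings, not taken from the dataset.
gold = ["lat 48.83 long 20.25", "lat 43.33 long 23.25",
        "lat 51.5 long 17.0", "lat 40.0 long 14.0"]
pred = ["lat 48.83 long 20.25", "lat 43.33  long 23.25",
        "lat 51.5 long 17.5", "lat 40.0 long 14.0"]

score = exact_match(pred, gold)  # three of the four predictions match
```

Exact match is a strict metric: a prediction that is off by a single digit counts as a full miss, which is worth keeping in mind when reading the 61% and 77% out-of-domain figures.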

Executive Summary

This article presents EDDA-Coordinata, an annotated dataset of geographic coordinates retrieved from Diderot and d'Alembert's 18th-century Encyclopédie. The authors examined 15,278 geographical entries and manually identified 4,798 that contain coordinates and 10,480 that give only descriptive, non-numerical references. Using these gold standard annotations, they trained a two-step pipeline: a classifier that identifies coordinate-bearing entries, followed by a transformer-based model that retrieves and normalizes the coordinates. Cross-validation on the Encyclopédie yielded an 86% exact match (EM) score. Out of domain, the fine-tuned model reached 61% EM on the 18th-century French Trévoux dictionary and 77% EM on the 19th-century, English-language 7th edition of the Encyclopaedia Britannica. The authors read these results as evidence that the dataset is useful training data and that the two-step method generalizes across languages and domains.
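The two-step structure described above (first decide whether an entry carries coordinates, then retrieve and normalize them) can be sketched as follows. This is only an illustration of the control flow: the paper uses trained transformer-based models for both steps, whereas the sketch substitutes a regular expression as a stand-in. The pattern, the function names, and the sample entries are all hypothetical, and the conversion ignores hemispheres and reference meridians.

```python
import re
from typing import Optional

# Hypothetical pattern for expressions such as "Long. 20. 15. lat. 48. 50."
# (degrees, then minutes). Real entries vary far more than this.
COORD = re.compile(r"\b(long|lat)\.\s*(\d{1,3})\.\s*(\d{1,2})\.", re.IGNORECASE)


def has_coordinates(entry: str) -> bool:
    """Step 1 stand-in: a regex check in place of the trained classifier."""
    return COORD.search(entry) is not None


def to_decimal(degrees: int, minutes: int) -> float:
    """Convert degrees and minutes to decimal degrees, rounded to 4 places."""
    return round(degrees + minutes / 60, 4)


def retrieve(entry: str) -> dict[str, float]:
    """Step 2 stand-in: extract and normalize in place of the retrieval model."""
    return {
        kind.lower(): to_decimal(int(deg), int(minute))
        for kind, deg, minute in COORD.findall(entry)
    }


def pipeline(entry: str) -> Optional[dict[str, float]]:
    """Run step 2 only on entries that step 1 flags as coordinate-bearing."""
    if not has_coordinates(entry):
        return None
    return retrieve(entry)


# Invented sample entries, written for this sketch.
with_coords = "Ville de France. Long. 20. 15. lat. 48. 50."
without_coords = "Petite ville d'Italie, à trois lieues de Rome."

found = pipeline(with_coords)
skipped = pipeline(without_coords)
```

Separating the two steps means the retrieval model is only asked to produce coordinates for entries that are likely to contain them, which matters here because 10,480 of the 15,278 examined entries have no numerical coordinates at all.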

Key Points

  • EDDA-Coordinata is a gold standard dataset built from 15,278 geographical entries of the Encyclopédie, 4,798 of which contain coordinates.
  • The annotations were used to train a two-step pipeline: a classifier to identify coordinate-bearing entries, then a transformer-based model for retrieval and normalization.
  • The pipeline reached 86% EM in cross-validation, 61% EM on the Trévoux dictionary, and 77% EM on the 7th edition of the Encyclopaedia Britannica.

Merits

Strength in Annotated Dataset

The authors examined 15,278 geographical entries by hand, identifying 4,798 that contain coordinates and 10,480 with descriptive but non-numerical references. This provides a gold standard for training and evaluating models.

Cross-Lingual and Cross-Domain Generalizability

The fine-tuned model transfers to texts outside its training data: it scored 61% EM on the 18th-century French Trévoux dictionary and 77% EM on the 19th-century English 7th edition of the Encyclopaedia Britannica, compared with 86% in cross-validation on the Encyclopédie.

Demerits

Limited to 18th-Century Encyclopédie

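The gold standard annotations are drawn solely from the 18th-century Encyclopédie, which may not be representative of other historical texts or languages. The drop from 86% EM in cross-validation to 61% EM on the Trévoux dictionary suggests that transfer is imperfect.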

Dependence on Manual Annotations

The dataset's accuracy depends on manual annotation, which is time-consuming to produce and can introduce errors.

Expert Commentary

This article makes a useful contribution to historical geographic information retrieval. It pairs a manually annotated gold standard with a two-step method for coordinates that are expressed in varied forms and at varying levels of precision. Although the training data come only from the 18th-century Encyclopédie, the out-of-domain results on the Trévoux dictionary (61% EM) and the Encyclopaedia Britannica (77% EM) indicate that the method can be applied to historical texts in other languages and domains, with some loss of accuracy. Potential applications include cultural heritage, historical preservation, and education.

Recommendations

  • Future research should focus on expanding the EDDA-Coordinata dataset to include other historical texts and languages.
  • The proposed method should be tested on a wider range of historical texts and languages to further demonstrate its cross-lingual and cross-domain generalizability.

Sources