A Methodological Approach to Knowledge Representation in Historical Corpus Analysis

Authors

DOI:

https://doi.org/10.21814/h2d.6580

Keywords:

Digital Humanities, Corpus Analysis, Ontologies, Optical Character Recognition, SPARQL, Historical Newspapers

Abstract

This article presents a theoretical model for the semantic analysis of historical corpora using interoperable digital tools. It proposes a scalable approach to processing and exploring digitized journalistic sources that combines PaddleOCR for text extraction, an ontology based on the CIDOC CRM model, and manually defined SPARQL queries. The case study focuses on media coverage of the Apollo 11 mission in 20th-century Portuguese newspapers, within the context of the Estado Novo regime. The methodology enabled the identification of discursive patterns and the structuring of semantic relationships between events, actors, and publications, and it demonstrated the potential of ontological modelling for critical discourse analysis. Although still in an exploratory phase, the model shows promise for future applications in the field of Digital Humanities.
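To make the querying step concrete, the following is a minimal sketch of how events, actors, and publications might be linked as triples and then matched against a graph pattern. It is not the authors' ontology or query set: the `ex:` identifiers, the sample triples, and the helper functions `match` and `query` are hypothetical, and a plain-Python pattern matcher stands in for a real SPARQL engine. Only the CIDOC CRM names (`E7_Activity`, `E31_Document`, `E39_Actor`, `P14_carried_out_by`, `P70_documents`) come from the published model.

```python
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical sample data: three newspaper articles, two of which
# document an activity carried out by a given actor.
TRIPLES: List[Triple] = [
    ("ex:apollo11_landing", "rdf:type", "crm:E7_Activity"),
    ("ex:apollo11_landing", "crm:P14_carried_out_by", "ex:nasa"),
    ("ex:nasa", "rdf:type", "crm:E39_Actor"),
    ("ex:article_001", "rdf:type", "crm:E31_Document"),
    ("ex:article_001", "crm:P70_documents", "ex:apollo11_landing"),
    ("ex:article_002", "rdf:type", "crm:E31_Document"),
    ("ex:article_002", "crm:P70_documents", "ex:apollo11_landing"),
    ("ex:article_003", "rdf:type", "crm:E31_Document"),
    ("ex:article_003", "crm:P70_documents", "ex:other_event"),
]

# Roughly the graph pattern a SPARQL query such as the following expresses
# (illustrative only, prefixes omitted):
#   SELECT ?doc WHERE {
#     ?doc a crm:E31_Document ; crm:P70_documents ?event .
#     ?event crm:P14_carried_out_by ex:nasa .
#   }
PATTERNS: List[Triple] = [
    ("?doc", "rdf:type", "crm:E31_Document"),
    ("?doc", "crm:P70_documents", "?event"),
    ("?event", "crm:P14_carried_out_by", "ex:nasa"),
]


def match(pattern: Triple, triples: List[Triple], binding: Binding) -> Iterator[Binding]:
    """Yield every extension of `binding` under which `pattern` equals a triple."""
    for triple in triples:
        candidate = dict(binding)
        consistent = True
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against its earlier binding.
                if candidate.setdefault(term, value) != value:
                    consistent = False
                    break
            elif term != value:
                consistent = False
                break
        if consistent:
            yield candidate


def query(patterns: List[Triple], triples: List[Triple]) -> List[Binding]:
    """Join the patterns left to right, keeping only consistent bindings."""
    bindings: List[Binding] = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(pattern, triples, b)]
    return bindings


results = sorted(b["?doc"] for b in query(PATTERNS, TRIPLES))
print(results)
```

In a real pipeline, the triples would be derived from OCR output and stored in an RDF triple store, and the pattern would be written directly in SPARQL. The sketch only shows how a query over document, event, and actor relationships narrows a corpus to the relevant publications.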

References

Bekiari, C., Bruseker, G., Canning, E., Doerr, M., Michon, P., Ore, C.-E., Stead, S., & Velios, A. (2024). Definition of the CIDOC Conceptual Reference Model. CIDOC CRM SIG.

Du, Y., Li, C., Guo, R., Yin, X., Liu, W., Zhou, J., Bai, Y., Yu, Z., Yang, Y., Dang, Q., & Wang, H. (2020). PP-OCR: A practical ultra lightweight OCR system (arXiv:2009.09941). arXiv. https://doi.org/10.48550/arXiv.2009.09941

Fafalios, P., Marketakis, Y., Samaritakis, G., Patramanis, D., & Tzitzikas, Y. (2021). Towards Semantic Interoperability in Historical Research: Documenting Research Data and Knowledge with Synthesis. In Hotho, A., et al. (Eds.), The Semantic Web – ISWC 2021. Lecture Notes in Computer Science, Vol. 12922. Springer, Cham. https://doi.org/10.1007/978-3-030-88361-4_40

Fafalios, P., Kritsotaki, A., & Doerr, M. (2023). The SeaLiT Ontology – An Extension of CIDOC-CRM for the Modeling and Integration of Maritime History Information. ACM Journal on Computing and Cultural Heritage, 16(3), Article 60, 21 pages. https://doi.org/10.1145/3586080

Fairclough, N. (2009). Discourse and social change. Polity Press.

Kadilierakis, G., Fafalios, P., Marketakis, Y., Tzitzikas, Y., & Doerr, M. (2020). Keyword Search over RDF using Document-Centric Information Retrieval Systems. In A. Harth et al. (Eds.), The Semantic Web – ESWC 2020. Lecture Notes in Computer Science, Vol. 12123. Springer. https://doi.org/10.1007/978-3-030-49461-2_8

Liang, Y., Xie, B., Tan, W., & Zhang, Q. (2025). Ontology-based construction of embroidery intangible cultural heritage knowledge graph: A case study of Qingyang sachets. PLOS ONE, 20(1), e0317447. https://doi.org/10.1371/journal.pone.0317447

Meroño Peñuela, A., Ashkpour, A., van Erp, M. G. J., Mandemakers, K., Breure, L., Scharnhorst, A., Schlobach, K. S., & van Harmelen, F. A. H. (2015). Semantic Technologies for Historical Research: A Survey. Semantic Web, 6(6), 539–564. https://doi.org/10.3233/SW-140158

Smith, R. W. (2007). An overview of the Tesseract OCR engine. Proceedings of the Ninth International Conference on Document Analysis and Recognition, 629–633. https://doi.org/10.1109/ICDAR.2007.4376991

Published

2025-12-22

How to Cite

Prezado, R., & Vieira, R. (2025). A Methodological Approach to Knowledge Representation in Historical Corpus Analysis. H2D|Digital Humanities Journal, 7(1), e6580. https://doi.org/10.21814/h2d.6580
