A Methodological Approach to Knowledge Representation in Historical Corpus Analysis
DOI:
https://doi.org/10.21814/h2d.6580
Keywords:
Digital Humanities, Corpus Analysis, Ontologies, Optical Character Recognition, SPARQL, Historical Newspapers
Abstract
This article presents a theoretical model for the semantic analysis of historical corpora using interoperable digital tools. The proposed approach is designed to scale to the processing and exploration of digitized journalistic sources, and combines PaddleOCR for text extraction, an ontology based on the CIDOC CRM model, and manually defined SPARQL queries. The case study focuses on media coverage of the Apollo 11 mission in 20th-century Portuguese newspapers, within the context of the Estado Novo regime. The methodology made it possible to identify discursive patterns and to structure semantic relationships between events, actors, and publications, demonstrating the potential of ontological modelling for critical discourse analysis. Although still in an exploratory phase, the model shows promise for future applications in the field of Digital Humanities.
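To make the ontology-and-query step more concrete, the following is a minimal, dependency-free sketch of how events, actors, and publications might be linked and then queried. It is not the authors' ontology or query set. The `ex:` identifiers, the sample triples, and the helper names `unify`, `match`, and `find_coverage` are hypothetical, and the small pattern matcher only mimics the basic graph pattern that a SPARQL engine would evaluate. The class and property names (`E5_Event`, `E39_Actor`, `E31_Document`, `P11_had_participant`, `P70_documents`) come from CIDOC CRM, but whether the article uses these particular ones is an assumption.

```python
# Toy triple list standing in for an RDF graph. All ex: identifiers are
# hypothetical placeholders, not data from the article.
TRIPLES = [
    ("ex:apollo11_landing", "rdf:type", "crm:E5_Event"),
    ("ex:neil_armstrong", "rdf:type", "crm:E39_Actor"),
    ("ex:apollo11_landing", "crm:P11_had_participant", "ex:neil_armstrong"),
    ("ex:article_001", "rdf:type", "crm:E31_Document"),
    ("ex:article_001", "crm:P70_documents", "ex:apollo11_landing"),
    ("ex:article_002", "rdf:type", "crm:E31_Document"),
    ("ex:article_002", "crm:P70_documents", "ex:apollo11_landing"),
]

# Roughly the SPARQL query this sketch imitates (assumed, not from the article):
#   SELECT ?doc ?actor WHERE {
#     ?doc a crm:E31_Document ;
#          crm:P70_documents ?event .
#     ?event crm:P11_had_participant ?actor .
#   }
QUERY = [
    ("?doc", "rdf:type", "crm:E31_Document"),
    ("?doc", "crm:P70_documents", "?event"),
    ("?event", "crm:P11_had_participant", "?actor"),
]


def unify(pattern, triple, binding):
    """Return the binding extended to fit one triple, or None on a mismatch."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def match(patterns, triples, binding=None):
    """Yield every variable binding that satisfies all patterns jointly."""
    binding = {} if binding is None else binding
    if not patterns:
        yield binding
        return
    for triple in triples:
        extended = unify(patterns[0], triple, binding)
        if extended is not None:
            yield from match(patterns[1:], triples, extended)


def find_coverage(triples):
    """List (document, actor) pairs linked through a documented event."""
    return sorted((b["?doc"], b["?actor"]) for b in match(QUERY, triples))


if __name__ == "__main__":
    for doc, actor in find_coverage(TRIPLES):
        print(doc, "covers an event involving", actor)
```

In a real pipeline, the triples would be produced from OCR output and stored in an RDF triple store, and the query would be run by a SPARQL engine rather than this hand-written matcher.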
License
Copyright (c) 2025 Rafael Prezado, Renata Vieira

This work is licensed under a Creative Commons Attribution 4.0 International License.



