Transforming Classical Chinese Texts into Searchable Databases with AI
November 7 @ 12:00 pm – 1:00 pm
Speaker: Guenther Lomas, Founder, Sigtica
As artificial intelligence becomes integral to the digital humanities, it offers new methods for expanding research capabilities and uncovering insights into historical texts and cultural narratives. This talk will demonstrate how AI-powered pipelines can turn large volumes of unstructured classical Chinese texts, such as genealogies and Qing dynasty government employee records (including those from the Da Qing jin shen quan shu), into organized, searchable databases.
The pipeline addresses a longstanding challenge in classical Chinese studies: labor-intensive manual data entry. It is designed to process millions of pages of historical Chinese texts efficiently, handling complexities such as layout identification and precise text extraction. Central to this effort is customized Optical Character Recognition (OCR), which improves extraction accuracy, paired with Named Entity Recognition (NER) models that identify key fields. The results are clean, tabular databases that improve accessibility and allow researchers to analyze Chinese historical content far more efficiently than manual transcription permits. The methodology also has potential applications for other languages, including Japanese, Korean, Arabic and Latin.
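As a rough illustration of the last step described above, the Python sketch below shows how labeled spans could be assembled into tabular rows. It is not the speaker's implementation. It assumes that OCR and an NER model have already produced (text, label) pairs for each entry. The field names, labels, and sample entries are invented for this example and are not taken from the Da Qing jin shen quan shu.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class OfficialRecord:
    """One row of the output table. The field names are hypothetical."""
    office: str = ""
    name: str = ""
    native_place: str = ""
    degree: str = ""


FIELD_NAMES = [f.name for f in fields(OfficialRecord)]


def spans_to_record(spans):
    """Map (text, label) pairs, as an NER model might emit them, onto one record.

    Unknown labels are ignored, and the first value seen for a field wins.
    """
    record = OfficialRecord()
    for text, label in spans:
        key = label.lower()
        if key in FIELD_NAMES and not getattr(record, key):
            setattr(record, key, text)
    return record


def records_to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELD_NAMES)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Invented sample input standing in for OCR + NER output (illustration only).
tagged_entries = [
    [("知縣", "OFFICE"), ("王某", "NAME"), ("浙江錢塘", "NATIVE_PLACE"), ("進士", "DEGREE")],
    [("知府", "OFFICE"), ("李某", "NAME"), ("江蘇吳縣", "NATIVE_PLACE")],
]

records = [spans_to_record(entry) for entry in tagged_entries]
csv_text = records_to_csv(records)
print(csv_text)
```

A missing field, such as the degree in the second sample entry, is left as an empty cell rather than guessed, so gaps in the source stay visible in the resulting table.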
By exploring these methods and their implications, this presentation aims to show how advanced technological tools can enrich scholarly inquiry in the digital humanities, revealing patterns and narratives within Chinese history and beyond. The approach could substantially change how historical data is collected and open the way to new research practices across a range of languages.
Lunch will be provided. Registration is required.