Summary: The project aims to build a platform for large-scale content analysis of digitised Sri Lankan Tamil texts. It relates to the field of semantic culturomics, in which researchers mine large digital archives to investigate cultural phenomena reflected in language and word usage. Culturomics is a form of computational lexicology that studies human behaviour and cultural trends through the quantitative analysis of digitised texts. The underlying data comes from the Noolaham Foundation, a digital archive and digital library that undertakes the critical work of documenting, digitally preserving, and providing free and open access to the knowledge bases and cultural heritage of Sri Lankan Tamil-speaking communities. The archive contains digitised text from Sri Lankan newspapers, books, magazines, pamphlets, and other sources, totalling more than 100,000 documents.
The project consists of a language pre-processing layer, a language resource layer, a processing resource layer, and finally a knowledge engineering component, which together form an AI-based ecosystem for analysing the Noolaham Foundation’s content. It will also involve developing a custom GPT model, an intelligent assistant capable of answering questions from the content. The plan is to create detailed proposals for a number of sub-projects involved in each of these layers.
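As a rough illustration of the kind of quantitative analysis that culturomics relies on, the sketch below computes the relative frequency of a term per publication year. It is not the project's implementation: the function name `relative_frequency_by_year`, the `(year, text)` input format, the sample sentences, and the naive whitespace tokeniser are all assumptions made for illustration. Real Tamil text would first pass through a proper pre-processing layer.

```python
from collections import Counter

# Punctuation stripped from token edges (an illustrative, incomplete set).
PUNCTUATION = ".,;:!?\"'()[]"


def tokenise(text):
    """Naive tokeniser: split on whitespace and trim edge punctuation.

    This is a placeholder assumption; it does no stemming or normalisation.
    """
    tokens = (raw.strip(PUNCTUATION) for raw in text.split())
    return [token for token in tokens if token]


def relative_frequency_by_year(documents, term):
    """Return {year: term occurrences / total tokens} for each year.

    `documents` is an iterable of (year, text) pairs (hypothetical format).
    Years with no tokens are omitted from the result.
    """
    term_counts = Counter()
    total_counts = Counter()
    for year, text in documents:
        tokens = tokenise(text)
        total_counts[year] += len(tokens)
        term_counts[year] += sum(1 for token in tokens if token == term)
    return {
        year: term_counts[year] / total_counts[year]
        for year in sorted(total_counts)
        if total_counts[year] > 0
    }


if __name__ == "__main__":
    # Invented sample data, not taken from the Noolaham archive.
    sample = [
        (1985, "கல்வி வளர்ச்சி கல்வி"),
        (1985, "நூலகம்"),
        (2005, "நூலகம் கல்வி, நூலகம் வளர்ச்சி."),
    ]
    for year, freq in relative_frequency_by_year(sample, "கல்வி").items():
        print(year, freq)
```

Normalising by the total token count per year, rather than reporting raw counts, keeps years with many digitised documents comparable to years with few.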
Partner Institution | Noolaham Foundation, Sri Lanka |
Supervisors | Dr. Saatviga Sudhahar |
Researchers | Charangan Vasantharajan, Nilakshan Kunananthaseelan |