With tools such as Transkribus (https://www.transkribus.org/) and eScriptorium (https://www.sofer.info/), AI-based image processing techniques are unblocking what has been a hugely expensive and time-intensive process of turning historical documents into machine-readable transcriptions, in a manner that allows the process to scale. However, such processes are not error-free (for example, difficult handwriting can lead to the word ‘billing’ being mis-transcribed as ‘killing’), resulting in parts of transcriptions being ‘dirty’ or of poor quality. This naturally creates problems further down the pipeline, where embeddings derived from the transcribed text are used to capture meaning and support search.
The Virtual Record Treasury of Ireland (virtualtreasury.ie) is a major government-funded initiative making historical records available to researchers and the public in a manner that is easily navigated and searched.
The idea of this project is to extend an existing tool developed and used within VRTI (called Bookworm) that shows a historian the image of a page of a document and the output of the transcription tool side by side. The extension would allow the user to identify a part of the image (e.g. using a bounding box) and ask for that part to be re-transcribed. The challenge will be to explore whether this limited re-transcription process would be better implemented using an LLM or a more traditional CNN model.
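As a rough illustration of the re-transcription step described above, the following Python sketch replaces only those transcribed lines that fall mostly inside a user-drawn bounding box. It is a minimal sketch under stated assumptions and not a description of how Bookworm works: the `Line` record, the `(left, top, right, bottom)` pixel box format, the `min_overlap` threshold and the `retranscribe` callback (standing in for either an LLM or a CNN backend) are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumed box format: (left, top, right, bottom) in image pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Line:
    """One transcribed line: where it sits on the page image and its text."""
    box: Box
    text: str


def overlap_fraction(line_box: Box, selection: Box) -> float:
    """Return the fraction of the line's area that lies inside the selection."""
    left = max(line_box[0], selection[0])
    top = max(line_box[1], selection[1])
    right = min(line_box[2], selection[2])
    bottom = min(line_box[3], selection[3])
    intersection = max(0, right - left) * max(0, bottom - top)
    area = (line_box[2] - line_box[0]) * (line_box[3] - line_box[1])
    return intersection / area if area > 0 else 0.0


def retranscribe_region(
    lines: List[Line],
    selection: Box,
    retranscribe: Callable[[Box], str],
    min_overlap: float = 0.5,
) -> List[Line]:
    """Re-transcribe lines mostly covered by the selection; keep the rest unchanged."""
    updated = []
    for line in lines:
        if overlap_fraction(line.box, selection) >= min_overlap:
            # Hypothetical backend call: an LLM or a CNN model would go here.
            updated.append(Line(line.box, retranscribe(line.box)))
        else:
            updated.append(line)
    return updated


# Usage with a stub backend that always returns a fixed correction.
page = [
    Line((0, 0, 100, 20), "Paid for killing"),
    Line((0, 20, 100, 40), "of goods"),
]


def stub_backend(box: Box) -> str:
    return "Paid for billing"


result = retranscribe_region(page, (0, 0, 100, 22), stub_backend)
```

Keeping the backend behind a single callback means the same interface code could drive either model type, which is one way the LLM and CNN options might be compared on identical selections.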
This project provides a rich opportunity to research current state-of-the-art approaches to both user interface development and image-to-text techniques.
This project is suitable as a Final Year Project or MSc Dissertation project, with the scope and ambition of the challenge being tailored accordingly. The research will be co-supervised by Pallavit Aggarwal (a member of the VRTI Technical Team) and undertaken in collaboration with an expert on the data from the School of History.
KEYWORDS: User Interface, Image to Text techniques
PREREQUISITES: No prerequisites per se, but an interest in, or experience of, Image to Text techniques and web-based User Interface design/development would definitely be helpful.