Research Recap: How Much Latin Does ChatGPT "Know"?

Patrick J. Burns

Institute for the Study of the Ancient World, May 19, 2023

Abstract

ChatGPT’s ability to perform various tasks in Latin, such as grammar correction and sentence generation, results from its exposure to a vast amount of Latin text during training. The model’s training data includes billions of tokens, a significant portion of which are Latin; this Latin portion potentially exceeds the amount of Latin any human could reasonably read in a lifetime. While the exact amount of Latin in the training data is difficult to ascertain, estimates suggest it could be as high as 339.1 million tokens. This abundance of Latin data enables ChatGPT to develop a sophisticated understanding of the language, making it capable of generating coherent and grammatically correct Latin text. This study examines the scale of Latin data in ChatGPT’s training set and highlights the significance of this factor in the model’s performance. – AI-generated abstract.
