Bridging the data gap between children and AI models
Mike Frank
Benjamin Scott Crocker Professor of Human Biology, Stanford University
Large language models and vision-language models show intriguing emergent behaviors, yet they are trained on at least three to four (and sometimes as many as six) orders of magnitude more language data than human children receive. What accounts for this vast difference in sample efficiency? I will describe steps toward a paradigm for addressing this question. In particular, I'll discuss the use of child language and egocentric video data for model training, and the use of developmental data for model evaluation. This paradigm provides a model-based framework for exploring the nature of children's early language learning.