Do Children Texts Hold The Key To Commonsense Knowledge?

Julien Romero, Simon Razniewski


Abstract
Compiling comprehensive repositories of commonsense knowledge is a long-standing problem in AI. Many concerns revolve around the issue of reporting bias, i.e., that frequency in text sources is not a good proxy for relevance or truth. This paper explores whether children’s texts hold the key to commonsense knowledge compilation, based on the hypothesis that such content makes fewer assumptions about the reader’s knowledge, and therefore spells out commonsense more explicitly. An analysis with several corpora shows that children’s texts indeed contain many more, and more typical, commonsense assertions. Moreover, experiments show that this advantage can be leveraged in popular language-model-based commonsense knowledge extraction settings, where task-unspecific fine-tuning on small amounts of children’s texts (childBERT) already yields significant improvements. This provides a refreshing perspective different from the common trend of deriving progress from ever larger models and corpora.
Anthology ID:
2022.emnlp-main.752
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
10954–10959
URL:
https://aclanthology.org/2022.emnlp-main.752
DOI:
10.18653/v1/2022.emnlp-main.752
Cite (ACL):
Julien Romero and Simon Razniewski. 2022. Do Children Texts Hold The Key To Commonsense Knowledge?. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10954–10959, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Do Children Texts Hold The Key To Commonsense Knowledge? (Romero & Razniewski, EMNLP 2022)
PDF:
https://aclanthology.org/2022.emnlp-main.752.pdf
Dataset:
 2022.emnlp-main.752.dataset.zip