Navigating Data Scarcity: Pretraining for Medical Utterance Classification

Do June Min; Verónica Pérez-Rosas; Rada Mihalcea

doi:10.18653/v1/2023.clinicalnlp-1.8

Abstract

Pretrained language models leverage self-supervised learning to use large amounts of unlabeled text for learning contextual representations of sequences. However, in the domain of medical conversations, the availability of large, public datasets is limited due to issues of privacy and data management. In this paper, we study the effectiveness of dialog-aware pretraining objectives and multiphase training in using unlabeled data to improve LM training for medical utterance classification. Dialog-aware pretraining objectives involve tasks that take into account the structure of conversations, including features such as turn-taking and the roles of speakers. The multiphase training process uses unannotated data in a sequence that prioritizes similarities and connections between different domains. We empirically evaluate these methods on dialog classification tasks in the medical and counseling domains, and find that multiphase training can help achieve higher performance than standard pretraining or finetuning.

Anthology ID: 2023.clinicalnlp-1.8
Volume: Proceedings of the 5th Clinical Natural Language Processing Workshop
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Tristan Naumann, Asma Ben Abacha, Steven Bethard, Kirk Roberts, Anna Rumshisky
Venue: ClinicalNLP
Publisher: Association for Computational Linguistics
Pages: 59–68
URL: https://aclanthology.org/2023.clinicalnlp-1.8
DOI: 10.18653/v1/2023.clinicalnlp-1.8
Cite (ACL): Do June Min, Verónica Pérez-Rosas, and Rada Mihalcea. 2023. Navigating Data Scarcity: Pretraining for Medical Utterance Classification. In Proceedings of the 5th Clinical Natural Language Processing Workshop, pages 59–68, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): Navigating Data Scarcity: Pretraining for Medical Utterance Classification (Min et al., ClinicalNLP 2023)
PDF: https://aclanthology.org/2023.clinicalnlp-1.8.pdf
Video: https://aclanthology.org/2023.clinicalnlp-1.8.mp4
