This research investigates how many
question-answer pairs are required for fine-tuning
language models on question-answering (QA) tasks. We
fine-tuned nine different language models on subsets of
the SQuAD v2.0 dataset by Rajpurkar et al. (2018)
and measured the threshold at which the marginal
benefit of additional question-answer pairs diminishes.
We show that language models fine-tuned for QA often do not require the full SQuAD v2.0 dataset of 130,319 training samples to perform well; in most cases, 78,191 samples (60%) are enough to reach near-peak performance. Thus, smaller datasets may suffice to fine-tune QA models. Competitive performance on smaller datasets enables less resource-intensive model training, makes QA tasks more accessible without requiring powerful hardware, and helps dataset curators decide how much data to collect for a given task.
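As a rough illustration of the subsampling described above, the sketch below builds a 60% training subset of SQuAD v2.0 with the Hugging Face `datasets` library. The library choice, the dataset identifier `squad_v2`, and the random seed are illustrative assumptions, not the paper's reported setup.

```python
# Minimal sketch: draw a 60% random subset of the SQuAD v2.0 training split.
# Assumes the Hugging Face `datasets` library; seed and identifier are
# illustrative, as the paper does not specify its sampling configuration.
from datasets import load_dataset

train = load_dataset("squad_v2", split="train")   # 130,319 examples
n_subset = int(0.6 * len(train))                  # roughly 78,191 examples

# Shuffle before selecting so the subset is a random sample, not a prefix.
subset = train.shuffle(seed=42).select(range(n_subset))
print(len(train), len(subset))
```

The resulting `subset` object can be passed to any standard fine-tuning loop in place of the full training split.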