GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples. We show that this type of few-shot learning can be unstable: the choice of prompt format, training examples, and even the order of the training examples can cause accuracy to vary from near chance to near state-of-the-art. We demonstrate that this instability arises from the bias of language models towards predicting certain answers, e.g., those that are placed near the end of the prompt or are common in the pre-training data. To mitigate this, we first estimate the model's bias towards each answer by asking for its prediction when given the training prompt and a content-free test input such as "N/A". We then fit calibration parameters that cause the prediction for this input to be uniform across answers. On a diverse set of tasks, this contextual calibration procedure substantially improves GPT-3 and GPT-2's average accuracy (up to 30.0% absolute) and reduces variance across different choices of the prompt.
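The abstract does not spell out the calibration parameterization, but one minimal sketch of the idea is the following, assuming the calibration takes the affine form q = W p + b with W = diag(p_cf)^{-1} and b = 0, where p_cf is the model's (renormalized) probability over the answer labels for the content-free input; the function and variable names here are illustrative, not from the source.

```python
import numpy as np

def contextual_calibration(p_content_free: np.ndarray):
    """Build a calibration function from the model's answer probabilities
    on a content-free input (e.g., "N/A") appended to the training prompt.

    Illustrative sketch: p_content_free is assumed to be the model's
    renormalized probability vector over the answer labels.
    """
    # Choosing W = diag(p_cf)^{-1} and b = 0 makes the calibrated
    # prediction for the content-free input uniform across answers.
    W = np.diag(1.0 / p_content_free)
    b = np.zeros_like(p_content_free)

    def calibrate(p: np.ndarray) -> np.ndarray:
        q = W @ p + b
        return q / q.sum()  # renormalize to a probability distribution

    return calibrate

# Example: the model is biased toward answer 0 on the content-free input.
p_cf = np.array([0.7, 0.3])      # P(answer | prompt + "N/A")
calibrate = contextual_calibration(p_cf)
p_test = np.array([0.6, 0.4])    # raw probabilities on a real test input
print(calibrate(p_cf))           # -> [0.5, 0.5], uniform by construction
print(calibrate(p_test))         # -> [0.39, 0.61], surface bias corrected
```

In this toy example the raw prediction favors answer 0, but because the content-free input reveals a 0.7 bias toward that answer, calibration flips the prediction to answer 1.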