MMLU: Are We Really Done with It?

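The arrival of large language models (LLMs) marks major progress in natural language processing and lets us interact with computers through natural language. Evaluating these models, however, requires reliable benchmarks, and existing benchmarks have a number of problems.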

MMLU: A Popular but Flawed Benchmark

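The MMLU (Massive Multitask Language Understanding) benchmark has drawn wide attention because it covers knowledge from many fields, including mathematics, history, computer science, logic, and law. Despite its popularity, however, researchers have found that MMLU contains a large number of errors, and these errors can mislead model evaluation and comparison.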

Errors in MMLU: A Problem That Needs Solving

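Researchers found errors of many kinds in MMLU, ranging from simple parsing and scraping mistakes to more complex problems of context, interpretation, and dataset quality. In the Virology subset, for example, 57% of the analyzed questions contain errors, including one that suggests sending the American army to West Africa to prevent Ebola outbreaks.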

MMLU-Redux: A More Reliable Benchmark

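To address these errors, the researchers manually analyzed the MMLU dataset and created MMLU-Redux, a set of 3,000 manually re-annotated questions covering 30 MMLU subsets. Results on MMLU-Redux differ significantly from those of the original MMLU evaluation, which indicates that the errors in MMLU have a substantial effect on model evaluation results.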
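To make the re-annotation idea concrete, here is a minimal sketch of how a per-subset error rate could be computed from manual annotations. The record layout (`subset`, `has_error`) and the toy data are assumptions made for illustration; they are not the actual MMLU-Redux schema or results.

```python
from collections import defaultdict


def error_rate_by_subset(records):
    """Return {subset: fraction of questions flagged as erroneous}."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for rec in records:
        totals[rec["subset"]] += 1
        if rec["has_error"]:
            flagged[rec["subset"]] += 1
    return {subset: flagged[subset] / totals[subset] for subset in totals}


# Toy, made-up annotations (hypothetical; not real MMLU-Redux data).
toy_records = [
    {"subset": "virology", "has_error": True},
    {"subset": "virology", "has_error": True},
    {"subset": "virology", "has_error": False},
    {"subset": "virology", "has_error": False},
    {"subset": "astronomy", "has_error": True},
    {"subset": "astronomy", "has_error": False},
    {"subset": "astronomy", "has_error": False},
    {"subset": "astronomy", "has_error": False},
]

rates = error_rate_by_subset(toy_records)
for subset, rate in sorted(rates.items()):
    print(f"{subset}: {rate:.0%} of annotated questions flagged")
```

With 3,000 questions spread over 30 subsets, each subset would hold 100 questions on average, so a per-subset rate like the one above is read off a sample of roughly that size.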

Re-evaluating LLMs with MMLU-Redux

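MMLU-Redux gives us a tool for re-evaluating LLM performance. On MMLU-Redux, the measured performance of some LLMs differs significantly from their original MMLU results, which indicates that the errors in MMLU can affect model rankings.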
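The effect on rankings can be illustrated with a small sketch that scores the same predictions against the original labels and against corrected labels. The model names, predictions, and labels below are invented for the example and do not come from the study.

```python
def accuracy(predictions, labels):
    """Fraction of positions where the prediction matches the label."""
    matches = sum(p == g for p, g in zip(predictions, labels))
    return matches / len(labels)


def rank_models(preds_by_model, labels):
    """Return model names sorted from highest to lowest accuracy."""
    scores = {name: accuracy(preds, labels) for name, preds in preds_by_model.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical labels: two of the five original labels are wrong.
original_labels = ["A", "B", "C", "D", "A"]
corrected_labels = ["A", "B", "D", "D", "B"]

# Hypothetical model outputs.
preds = {
    "model_x": ["A", "B", "C", "A", "A"],
    "model_y": ["A", "C", "D", "D", "B"],
}

ranking_original = rank_models(preds, original_labels)
ranking_corrected = rank_models(preds, corrected_labels)
print("original labels: ", ranking_original)
print("corrected labels:", ranking_corrected)
```

In this toy case the model that reproduces the mislabelled answers ranks first under the original labels and second once the labels are corrected, which is the kind of reordering the re-evaluation is meant to expose.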

Automatically Fixing MMLU: A Challenge

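The researchers also tried using LLMs to fix MMLU's errors automatically. They tested several approaches, including zero-shot prompting, few-shot prompting, chain-of-thought prompting, and retrieval-augmented generation. Even the most advanced models, however, remain limited at automatic error detection.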
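As a rough illustration of zero-shot error detection, the sketch below builds a prompt for one multiple-choice item and scores a detector's flags against human annotations. The prompt wording, the function names, and the choice of precision, recall, and F1 as metrics are all assumptions for this example; the study's actual prompts and metrics may differ, and no model is called here.

```python
def build_zero_shot_prompt(question, choices, answer):
    """Format one multiple-choice item as a zero-shot error-detection prompt.

    The template is hypothetical, not the one used in the study.
    """
    lines = [
        "Decide whether the following multiple-choice item contains an error.",
        f"Question: {question}",
    ]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"{letter}. {choice}")
    lines.append(f"Labelled answer: {answer}")
    lines.append("Reply with exactly one word: ok or error.")
    return "\n".join(lines)


def precision_recall_f1(predicted, gold):
    """Score boolean error flags against boolean human annotations."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# A made-up item whose labelled answer is wrong.
prompt = build_zero_shot_prompt("What is 2 + 2?", ["3", "4", "5", "6"], "A")
print(prompt)

# Made-up detector flags versus made-up human annotations.
predicted_flags = [True, True, False, False]
gold_flags = [True, False, True, False]
scores = precision_recall_f1(predicted_flags, gold_flags)
print(scores)
```

A few-shot variant would prepend labelled examples to the same prompt, and a chain-of-thought variant would ask for reasoning before the final one-word verdict; the scoring function stays the same in either case.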

Conclusion: MMLU Needs Improvement

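MMLU is an important benchmark, but it has a number of problems. MMLU-Redux offers a more reliable alternative. The researchers call on the community to work together to improve MMLU so that it can serve as a reliable tool for evaluating the next generation of LLMs.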

