Source quote: "In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Analysis: The abstract states explicitly that this study investigates whether the ability of Large Language Models (LLMs) to explain humour differs across humour forms. The researchers compare model performance on explaining simple puns against complex topical humour that requires real-world knowledge. Option C therefore accurately captures the core aim of the study.
Question: How is the dataset introduced in the paper composed? Options:
A. 600 jokes in total, split into four categories of 150 each: homographic puns, heterographic puns, non-topical humour, and topical humour
B. 400 jokes in total, split into only two categories: puns and topical humour
C. 600 jokes in total, all sourced from the SemEval 2017 dataset
D. 150 jokes in total, split into four types of unequal size
Correct answer: A
Source quote: "Overall, our dataset consists of 600 jokes, containing 150 of each type: homographic puns, heterographic puns, non-topical Reddit humour, and topical Reddit humour." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 3)
Source quote: "We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Analysis: The conclusion of the abstract states that none of the tested models (including those designed for reasoning) could reliably generate adequate explanations for every joke type, revealing that current LLMs still have significant limitations in humour understanding. Option D is an accurate paraphrase of this core finding.
Source quote: "…where humour arises from the interpretation of the punning word (i.e., "dyed") with a phonetically similar word with different meaning (i.e., "died") in the case of heterographic puns…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Source quote: "…or the polysemy of a word for homographic puns (Attardo, 2008)." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Source quote: "…much of the humour…is based on contemporary topical knowledge of evolving pop-culture phenomena and news events, where a full appreciation of the humour relies on potentially esoteric knowledge, rather than common-sense…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Source quote: "H2. Heterographic puns will be more difficult for models to explain than homographic puns due to the former's reliance on phonetic similarity, which is not explicitly encoded in orthographic text." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 4)
Analysis: Hypothesis H2 explicitly predicts that heterographic puns will be harder for models to explain than homographic puns, because heterographic puns rely on phonetic similarity, which models cannot read directly off written text. Option D matches this hypothesis exactly.
Source quote: "H4. Larger model variants will perform better than smaller variants, particularly for topical humour, due to being able to store larger amounts of information in their parameters regarding specific events and individuals…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 4)
Analysis: Hypothesis H4 states that larger model variants will outperform smaller ones, particularly on topical humour, because larger models can store more information about specific events and individuals in their parameters. Option B accurately summarises this hypothesis.
Source quote: "To assess the quality of joke explanations generated by the models, we propose a scoring rubric with two core criteria, accuracy and completeness, each evaluated on a 6-point scale (0-5)." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 5)
Question: Under the paper's binary categorisation, when does an explanation count as "good" quality?
D. Scoring 4 or above on both the accuracy and the completeness criteria
Correct answer: D
Source quote: "…in Figure 4 we present a binary categorisation of explanations, treating scores of 4 or above on both criteria as being "good" quality explanations…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 6)
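The binary categorisation quoted above reduces to a simple threshold rule. A minimal Python sketch (the function name is ours; the threshold of 4 on both 0-5 rubric scores comes from the quote):

```python
def is_good_explanation(accuracy: int, completeness: int, threshold: int = 4) -> bool:
    """Return True if both rubric scores (0-5) meet the 'good' threshold."""
    return accuracy >= threshold and completeness >= threshold

# An accurate but incomplete explanation fails the binary check.
print(is_good_explanation(5, 3))  # False
print(is_good_explanation(4, 4))  # True
```

Note that the rule is a conjunction: a perfect score on one criterion cannot compensate for a sub-threshold score on the other.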
Source quote: "In terms of overall performance, completeness scores are generally lower than accuracy scores across all models. This difference indicates that the models are more likely to miss key details in their explanations than to hallucinate incorrect information…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 5)
Source quote: "When comparing model performance, GPT-4o consistently outperforms all others, demonstrating the highest accuracy and completeness scores across joke types." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 5)
Source quote: "Homographic jokes, where a word has multiple meanings but identical spelling, consistently yield the highest proportion of successful explanations across nearly all models." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 6)
Source quote: "The challenge likely stems from the additional complexity introduced by needing to recognise the phonetic similarity of the pun word whilst being trained on orthographic tokens." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 6)
Topic: Findings of the "Tide Pod Challenge" case study. Question: In the case study of the "Tide Pod Challenge", what was the main difference in performance between the full-size models and their smaller counterparts? Options:
A. The smaller models gave more concise and accurate explanations
B. The full-size models successfully inferred the joke's connection to this viral internet event, whereas the smaller models failed to do so or misinterpreted it
C. None of the models understood the cultural background behind the joke
D. Both the full-size and the smaller models explained the joke as being about teenage eating habits
Correct answer: B
Source quote: "Interestingly, all of the full-size models inferred the reference to this challenge in their explanations. However, the smaller counterparts consistently omitted this information or presented misinterpretations in their explanations." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, pp. 7-8)
Source quote: "Consequently, our work focuses only on a subset of possible jokes and does not assess the ability of models to explain highly recent jokes, of which there may be only limited knowledge contained within a given model's training data…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 9)
Question: Where are the jokes in the dataset sourced from?
B. The puns come from SemEval 2017, while the topical and non-topical jokes come from Reddit's r/Jokes subreddit
C. All jokes come from The New Yorker's cartoon caption contest
D. The puns come from Reddit, and the other joke types come from Twitter
Correct answer: B
Source quote: "We source an additional 300 simple pun-based jokes from SemEval 2017 Task 7… Both the topical and non-topical subsets are checked by another linguist… the 300 jokes that form the topical and non-topical categories are the top 300 highest-scoring jokes based on upvote-downvote ratio…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 4)
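The selection step at the end of this quote (keeping the top 300 Reddit jokes by upvote-downvote ratio) can be sketched as below. The joke records, field names, and the zero-downvote guard are hypothetical; only the rank-by-ratio-and-truncate logic comes from the paper:

```python
# Hypothetical joke records; real entries would come from r/Jokes.
jokes = [
    {"text": "joke A", "upvotes": 900, "downvotes": 100},
    {"text": "joke B", "upvotes": 50, "downvotes": 50},
    {"text": "joke C", "upvotes": 300, "downvotes": 20},
]

def score(joke):
    # Upvote-downvote ratio; guard against division by zero (our assumption).
    return joke["upvotes"] / max(joke["downvotes"], 1)

# Rank by ratio, descending, and keep the top 300.
top = sorted(jokes, key=score, reverse=True)[:300]
print([j["text"] for j in top])  # → ['joke C', 'joke A', 'joke B']
```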
Source quote: "We therefore do not view the creation of an explanation for a joke to be equatable with an endorsement of said joke, but rather as an ideologically neutral task." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 9)
Source quote: "Lastly, topical jokes, which require contextual awareness of specific events, present the most notable challenges." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 6)
Analysis: In the analysis in Section 5.1, the paper states explicitly that topical jokes, which require contextual awareness of specific events, present "the most notable challenges". The logistic regression analysis likewise confirms that topical jokes are the hardest of all the types, as they depend on the model performing complex inference and retrieving implicit knowledge.
Source quote: "H3. Topical humour will be more difficult for models to explain than non-topical Reddit humour, due to the former's reliance on subtle references to contemporary pop culture and events, rather than common sense reasoning and general knowledge." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 4)
Analysis: Hypothesis H3 explicitly predicts that topical humour will be harder for models to explain than non-topical Reddit humour, on the grounds that the former relies on subtle references to contemporary pop culture and events rather than on common-sense reasoning and general knowledge. Option C accurately restates this hypothesis.
Loakman, T., Thorne, W., & Lin, C. (2025). Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes. arXiv:2507.13335v1 [cs.CL].