Source text: "In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Source text: "Overall, our dataset consists of 600 jokes, containing 150 of each type: homographic puns, heterographic puns, non-topical Reddit humour, and topical Reddit humour." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 3)
Source text: "We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Source text: "…where humour arises from the interpretation of the punning word (i.e., “dyed”) with a phonetically similar word with different meaning (i.e., “died”) in the case of heterographic puns…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Source text: "…or the polysemy of a word for homographic puns (Attardo, 2008)." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Source text: "…much of the humour…is based on contemporary topical knowledge of evolving pop-culture phenomena and news events, where a full appreciation of the humour relies on potentially esoteric knowledge, rather than common-sense…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 1)
Source text: "H2. Heterographic puns will be more difficult for models to explain than homographic puns due to the former’s reliance on phonetic similarity, which is not explicitly encoded in orthographic text." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 4)
Source text: "H4. Larger model variants will perform better than smaller variants, particularly for topical humour, due to being able to store larger amounts of information in their parameters regarding specific events and individuals…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 4)
Source text: "To assess the quality of joke explanations generated by the models, we propose a scoring rubric with two core criteria, accuracy and completeness, each evaluated on a 6-point scale (0-5)." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 5)
D. A score of 4 or above on both criteria, accuracy and completeness
Correct answer: D
Source text: "…in Figure 4 we present a binary categorisation of explanations, treating scores of 4 or above on both criteria as being “good” quality explanations…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 6)
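The binary categorisation quoted above can be illustrated with a minimal Python sketch. The threshold of 4 and the 0-5 range come from the quoted passages; the names `RubricScore`, `is_good`, and `good_proportion` are hypothetical and chosen only for illustration, and this is not the authors' implementation.

```python
from dataclasses import dataclass

# Threshold taken from the quoted passage: "scores of 4 or above on both criteria".
GOOD_THRESHOLD = 4


@dataclass(frozen=True)
class RubricScore:
    """One explanation's scores on the two rubric criteria (each 0-5)."""
    accuracy: int
    completeness: int

    def __post_init__(self):
        # Reject values outside the 6-point scale described in the rubric.
        for name in ("accuracy", "completeness"):
            value = getattr(self, name)
            if not 0 <= value <= 5:
                raise ValueError(f"{name} must be in 0..5, got {value}")


def is_good(score: RubricScore) -> bool:
    """An explanation counts as 'good' only if BOTH criteria reach the threshold."""
    return (score.accuracy >= GOOD_THRESHOLD
            and score.completeness >= GOOD_THRESHOLD)


def good_proportion(scores: list[RubricScore]) -> float:
    """Share of explanations categorised as 'good' (0.0 for an empty list)."""
    if not scores:
        return 0.0
    return sum(is_good(s) for s in scores) / len(scores)


if __name__ == "__main__":
    # Made-up scores, purely for demonstration; not data from the paper.
    demo = [RubricScore(5, 4), RubricScore(5, 3), RubricScore(4, 4), RubricScore(2, 2)]
    print(good_proportion(demo))
```

Note that a high accuracy score alone is not enough: an explanation scoring 5 for accuracy but 3 for completeness falls outside the "good" category.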
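Topic: Model performance trend: accuracy vs. completeness
Question: Overall, what trend do the models' scores show across the two evaluation dimensions, accuracy and completeness?
Options:
A. Accuracy and completeness scores are roughly the same
B. Accuracy scores are generally lower than completeness scores
C. Completeness scores are generally lower than accuracy scores
D. There is no clear correlation between the two scores
Correct answer: C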
Source text: "In terms of overall performance, completeness scores are generally lower than accuracy scores across all models. This difference indicates that the models are more likely to miss key details in their explanations than to hallucinate incorrect information…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 5)
Source text: "When comparing model performance, GPT-4o consistently outperforms all others, demonstrating the highest accuracy and completeness scores across joke types." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 5)
Source text: "Homographic jokes, where a word has multiple meanings but identical spelling, consistently yield the highest proportion of successful explanations across nearly all models." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 6)
Source text: "The challenge likely stems from the additional complexity introduced by needing to recognise the phonetic similarity of the pun word whilst being trained on orthographic tokens." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 6)
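Topic: Findings of the "Tide Pod Challenge" case study
Question: In the case study on the "Tide Pod Challenge", what was the main performance difference between the full-size models and their smaller counterparts?
Options:
A. The smaller models gave more concise and accurate explanations
B. The full-size models successfully inferred the joke's link to this viral internet phenomenon, whereas the smaller models failed to do so or misinterpreted it
C. None of the models understood the cultural background behind the joke
D. Both the full-size and the smaller models explained the joke as being about teenagers' eating habits
Correct answer: B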
Source text: "Interestingly, all of the full-size models inferred the reference to this challenge in their explanations. However, the smaller counterparts consistently omitted this information or presented misinterpretations in their explanations." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, pp. 7-8)
Source text: "Consequently, our work focuses only on a subset of possible jokes and does not assess the ability of models to explain highly recent jokes, of which there may be only limited knowledge contained within a given model’s training data…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 9)
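B. The puns come from SemEval 2017, and the topical and non-topical jokes come from Reddit's r/Jokes subreddit
C. All of the jokes come from The New Yorker's cartoon caption contest
D. The puns come from Reddit, and the other joke types come from Twitter
Correct answer: B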
Source text: "We source an additional 300 simple pun-based jokes from SemEval 2017 Task 7… Both the topical and non-topical subsets are checked by another linguist… the 300 jokes that form the topical and non-topical categories are the top 300 highest-scoring jokes based on upvote-downvote ratio…" (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 4)
Source text: "We therefore do not view the creation of an explanation for a joke to be equatable with an endorsement of said joke, but rather as an ideologically neutral task." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 9)
Source text: "Lastly, topical jokes, which require contextual awareness of specific events, present the most notable challenges." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 6)
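Explanation: In its Section 5.1 analysis, the paper states explicitly that topical jokes, which require contextual awareness of specific events, present "the most notable challenges". The logistic regression analysis also confirms that topical jokes are the hardest of all the joke types. This is because they depend on the model performing complex reasoning and retrieving implicit knowledge.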
Source text: "H3. Topical humour will be more difficult for models to explain than non-topical Reddit humour, due to the former’s reliance on subtle references to contemporary pop culture and events, rather than common sense reasoning and general knowledge." (from: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes, p. 4)
Loakman, T., Thorne, W., & Lin, C. (2025). Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes. arXiv:2507.13335v1 [cs.CL].