NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
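Q: What problem does this paper try to solve? A: This paper aims to address the limitations of current methods for evaluating the reasoning ability of large language models (LLMs). […]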