Brown, N., & Sandholm, T. (2019). Superhuman AI for multiplayer poker. Science, 365(6456), 885-890.
Zinkevich, M., Johanson, M., Bowling, M., & Piccione, C. (2008). Regret minimization in games with incomplete information. Advances in Neural Information Processing Systems, 20.
Lanctot, M., Waugh, K., Zinkevich, M., & Bowling, M. (2009). Monte Carlo sampling for regret minimization in extensive games. Advances in Neural Information Processing Systems, 22.
Brown, N., Sandholm, T., & Amos, B. (2018). Depth-limited solving for imperfect-information games. Advances in Neural Information Processing Systems, 31.
Zha, D., Lai, K. H., Cao, Y., Huang, S., Wei, R., Guo, J., & Hu, X. (2021). RLCard: A Platform for Reinforcement Learning in Card Games. IJCAI.
import rlcard
from rlcard.agents.dmc_agent import DMCTrainer
Creating the DouDizhu environment
Create the DouDizhu game environment with RLCard:
env = rlcard.make("doudizhu")
print("Number of actions:", env.num_actions)
print("Number of players:", env.num_players)
print("Shape of state:", env.state_shape)
print("Shape of action:", env.action_shape)
Output:
Number of actions: 27472
Number of players: 3
Shape of state: [[790], [901], [901]]
Shape of action: [[54], [54], [54]]
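Building on the environment above, the DMCTrainer imported earlier can train Deep Monte Carlo agents on DouDizhu via self-play. The snippet below is a minimal sketch modeled on RLCard's DMC example; the exact keyword arguments (experiment id, save directory, number of actors) may differ across RLCard versions, so treat them as illustrative defaults rather than the definitive API.

# Minimal sketch: training DMC agents on DouDizhu with RLCard's DMCTrainer.
# Keyword arguments are illustrative and may vary across RLCard versions.
import rlcard
from rlcard.agents.dmc_agent import DMCTrainer

env = rlcard.make("doudizhu")

trainer = DMCTrainer(
    env,
    xpid="doudizhu_dmc",               # experiment id used for checkpoints and logs
    savedir="experiments/dmc_result",  # where model checkpoints are written
    num_actors=5,                      # parallel actor processes generating episodes
)

# Launch the self-play training loop (runs until interrupted).
trainer.start()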
Zha, D., Lai, K. H., Cao, Y., Huang, S., Wei, R., Guo, J., & Hu, X. (2021). RLCard: A Platform for Reinforcement Learning in Card Games. IJCAI.
Zha, D., Xie, J., Ma, W., Zhang, S., Lian, X., Hu, X., & Liu, J. (2021). DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning. International Conference on Machine Learning (ICML). arXiv preprint arXiv:2106.06135.
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., … & Hassabis, D. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676), 354-359.
# Initialize the neural network
initialize_neural_network()
# Set the number of iterations, simulations per iteration, and other hyperparameters (illustrative values)
num_iterations = 100
num_simulations = 1000
discount_factor = 0.99
for iteration in range(num_iterations):
    samples = []
    # Sample generation
    for simulation in range(num_simulations):
        state = initial_state()
        episode = []
        while not is_terminal(state):
            action = select_action(state)  # Select an action using the current policy
            next_state, reward = take_action(state, action)
            episode.append((state, action, reward))
            state = next_state
        # Compute the discounted cumulative return for each step of the episode
        G = 0
        for state, action, reward in reversed(episode):
            G = reward + discount_factor * G
            samples.append((state, action, G))
    # Train the neural network on the collected samples
    train_neural_network(samples)
    # Update the policy
    update_policy()
    # Check the convergence condition
    if check_convergence():
        break
Counterfactual Regret Minimization (CFR)
# Initialize the strategy and regret values
initialize_strategy_and_regret()
# Hyperparameters (illustrative values)
num_iterations = 100
num_games = 1000
for iteration in range(num_iterations):
    # Update the strategy from the accumulated regrets
    update_strategy()
    # Simulate games under the current strategy
    for game in range(num_games):
        play_game_and_update_regret()
    # Check the convergence condition
    if check_convergence():
        break
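The pseudocode above leaves update_strategy abstract. In CFR, the strategy at each information set is obtained by regret matching: actions are played in proportion to their positive cumulative regrets, falling back to a uniform distribution when no regret is positive. The function below is a small self-contained sketch of that rule; the names regrets and regret_matching are illustrative, not taken from RLCard or the original text.

# Regret matching: derive a strategy from cumulative regrets (illustrative sketch).
def regret_matching(regrets):
    # Keep only positive regrets; negative regret means the action underperformed.
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        # Play each action in proportion to its positive regret.
        return [p / total for p in positive]
    # No positive regret: fall back to a uniform strategy.
    n = len(regrets)
    return [1.0 / n for _ in range(n)]

# Example: regrets accumulated for three actions at one information set.
print(regret_matching([2.0, -1.0, 1.0]))  # -> [0.666..., 0.0, 0.333...]

In this picture, play_game_and_update_regret accumulates counterfactual regrets at every information set it visits, and a rule like the one above turns those regrets into the next strategy; averaging the strategies across iterations is what converges toward a Nash equilibrium in two-player zero-sum games.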
Deep Monte Carlo (DMC) and Monte Carlo Tree Search (MCTS) are both algorithms that combine Monte Carlo methods with deep learning techniques to solve complex decision-making problems. Although they share some foundations, they differ significantly in implementation details, application scenarios, and algorithmic flow. A detailed comparison of the two algorithms follows; as a reference point, a high-level sketch of the MCTS loop, in the same pseudocode style as the DMC loop above, is given first.
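A minimal, hedged sketch of one MCTS decision; helper names such as select_child, expand, evaluate, and backpropagate are placeholders, not from the original text or any specific library.

# High-level MCTS loop for choosing one move (pseudocode-style sketch).
def mcts_search(root_state, num_simulations):
    root = create_node(root_state)
    for _ in range(num_simulations):
        node = root
        # 1. Selection: descend the tree using a selection rule such as UCB.
        while node.is_fully_expanded() and not is_terminal(node.state):
            node = select_child(node)
        # 2. Expansion: add one unexplored child, unless the node is terminal.
        if not is_terminal(node.state):
            node = expand(node)
        # 3. Simulation / evaluation: roll out to the end of the game,
        #    or score the leaf with a value network as in AlphaGo Zero.
        value = evaluate(node.state)
        # 4. Backpropagation: update visit counts and values up to the root.
        backpropagate(node, value)
    # Play the most-visited action at the root.
    return best_action(root)

In contrast to the DMC loop, which trains a value network from complete-episode returns and reuses it across all states, MCTS spends its simulation budget building a search tree around the current state and re-plans at every move.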
Dao, T., & Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. International Conference on Machine Learning (ICML).
Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv preprint arXiv:2312.00752.
In recent years, the rapid development of large language models (LLMs) has brought revolutionary change to the field of artificial intelligence. From OpenAI's GPT series to Google's PaLM and Anthropic's Claude, these powerful language models have demonstrated astonishing capabilities and can perform a wide variety of complex natural-language tasks. However, how to use these models effectively and unlock their potential has become a major challenge for researchers and practitioners. Against this backdrop, prompt engineering (prompting) emerged and quickly became a hot topic in the AI field.
[1] Schulhoff, S., Ilie, M., et al. (2024). The Prompt Report: A Systematic Survey of Prompting Techniques. arXiv preprint arXiv:2406.06608.
[2] Zheng, M., Pei, J., & Jurgens, D. (2023). Is “A Helpful Assistant” the Best Role for Large Language Models? A Systematic Evaluation of Social Roles in System Prompts. arXiv preprint arXiv:2311.10054.
[3] Willison, S. (2023). Personal communication on Twitter regarding role prompting effectiveness.
Huang, Y. J., & Hadfi, R. (2024). How Personality Traits Influence Negotiation Outcomes? A Simulation based on Large Language Models. arXiv preprint arXiv:2407.11549.
Goldberg, L. R. (1992). The development of markers for the Big-Five factor structure. Psychological assessment, 4(1), 26.
Costa Jr, P. T., & McCrae, R. R. (1995). Domains and facets: Hierarchical personality assessment using the Revised NEO Personality Inventory. Journal of personality assessment, 64(1), 21-50.
Falcão, P. F., Saraiva, L. A. S., & dos Santos, E. A. (2018). The influence of personality traits on negotiation performance. International Journal of Business and Management, 13(8), 75-84.
Barry, B., & Friedman, R. A. (1998). Bargainer characteristics in distributive and integrative negotiation. Journal of personality and social psychology, 74(2), 345.
Researchers at Meta AI have proposed a new mechanism called System 2 Attention (S2A), designed to improve the reasoning ability of LLMs. The work is inspired by the behavioral psychology that Daniel Kahneman, drawing on his long collaboration with Amos Tversky, explores in Thinking, Fast and Slow, which divides human thinking into two systems: a fast, intuitive "System 1" and a slow, deliberate "System 2". S2A imitates this "System 2" mode of thinking, processing information in a more deliberate, considered way.
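In practice, S2A is a two-step prompting procedure: the model is first asked to rewrite the context, keeping only material relevant to the question and dropping distracting or leading content, and is then asked to answer using only that regenerated context. The snippet below sketches this flow; call_llm and the prompt wording are hypothetical placeholders, not the prompts from the Meta AI paper.

# Two-step System 2 Attention (S2A) flow (sketch; call_llm is a hypothetical
# helper that sends a prompt to some LLM API and returns its text response).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your LLM client here.")

def s2a_answer(context: str, question: str) -> str:
    # Step 1: ask the model to regenerate the context, removing irrelevant
    # or opinionated material that could bias the final answer.
    rewrite_prompt = (
        "Rewrite the following text, keeping only the information that is "
        "relevant and objective for answering the question. Remove opinions "
        "and irrelevant details.\n\n"
        f"Text: {context}\n\nQuestion: {question}"
    )
    cleaned_context = call_llm(rewrite_prompt)

    # Step 2: answer the question using only the regenerated context.
    answer_prompt = f"Context: {cleaned_context}\n\nQuestion: {question}\n\nAnswer:"
    return call_llm(answer_prompt)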