Concept Compositionality: A New Lens for Interpreting Foundation Models

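In recent years, foundation models have achieved remarkable results across many domains, but their black-box nature makes them hard to debug, monitor, control, and trust. Concept-based explanation is an emerging approach that explains a model's behavior in terms of individual concepts, such as an object attribute (e.g., stripes) or a linguistic sentiment (e.g., happiness). These concepts are obtained by decomposing the model's learned representations into concept vectors. For example, a model's embedding of a dog image can be decomposed into the sum of concept vectors representing its fur, nose, and tail.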
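To make the decomposition concrete, here is a minimal sketch using NumPy. It assumes the concept vectors are already known and uses ordinary least squares to recover how strongly each one contributes to an embedding. The concept names, the random vectors, and the `decompose` helper are all hypothetical and only serve as illustration.

```python
import numpy as np


def decompose(embedding, concept_vectors):
    """Return least-squares weights w such that embedding ≈ w @ concept_vectors.

    concept_vectors has one concept per row (shape: n_concepts x dim).
    """
    weights, *_ = np.linalg.lstsq(concept_vectors.T, embedding, rcond=None)
    return weights


# Hypothetical concept vectors (rows): fur, nose, tail, wing.
rng = np.random.default_rng(0)
concepts = rng.normal(size=(4, 16))

# A synthetic "dog" embedding built as fur + nose + tail (no wing).
dog = concepts[0] + concepts[1] + concepts[2]

weights = decompose(dog, concepts)
print(np.round(weights, 3))
```

On this synthetic input the recovered weights are close to 1 for fur, nose, and tail and close to 0 for wing. Real embeddings would only be approximated by such a sum.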

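Shortcomings of Existing Methods

Existing work based on methods such as PCA or KMeans extracts vector representations of basic concepts well. For example, Figure 1 shows images from the CUB dataset that contain concepts PCA learned from the CLIP model. These techniques correctly extract representations for concepts such as "white birds" and "small birds", but adding the two representations does not yield the representation of the concept "small white birds".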
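One simple way to quantify this failure is to compare the sum of two concept vectors with the vector of the composed concept using cosine similarity. The sketch below is an illustrative check on made-up vectors, not necessarily the metric used by the authors; `composition_score` and all vectors are assumptions.

```python
import numpy as np


def composition_score(a, b, ab):
    """Cosine similarity between (a + b) and the composed concept vector ab."""
    s = a + b
    return float(s @ ab / (np.linalg.norm(s) * np.linalg.norm(ab)))


# Hypothetical concept vectors for "white birds" and "small birds".
white = np.array([1.0, 0.0, 0.0])
small = np.array([0.0, 1.0, 0.0])

# Case 1: the "small white birds" vector is exactly the sum (compositional).
small_white_good = white + small
# Case 2: the "small white birds" vector points elsewhere (not compositional).
small_white_bad = np.array([0.0, 0.0, 1.0])

score_good = composition_score(white, small, small_white_good)
score_bad = composition_score(white, small, small_white_bad)
print(score_good, score_bad)
```

A score near 1 means the extracted vectors compose additively; a score near 0 corresponds to the failure described above.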

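Why Concept Compositionality Matters

Compositional concepts are essential for several use cases:

  • Explaining model predictions: predictions can be explained by composing concepts.
  • Editing model behavior: compositional concepts allow fine-grained edits to model behavior, such as improving the truthfulness of a large language model without affecting its other behaviors.
  • Training for new tasks: a model can be trained to compose basic concepts to solve a new task, such as classifying birds using concepts like beak shape, wing color, and environment.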

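Evaluating Concept Compositionality

To evaluate concept compositionality, we first verified, in a controlled setting, that the ground-truth representations of concepts are compositional. We observed that concepts can be grouped into attributes, where each attribute contains concepts about some common property, such as the color or shape of an object. Concepts from different attributes (e.g., blue and cube) can be composed, while concepts from the same attribute (e.g., red and green) cannot. We also observed that concepts from different attributes are roughly orthogonal, while concepts from the same attribute are not.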
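The orthogonality observation can be checked with pairwise cosine similarities. The following sketch uses hand-made concept vectors in which color concepts occupy two dimensions and shape concepts occupy two others; the vectors, the attribute labels, and the `mean_abs_cosine` helper are hypothetical.

```python
from itertools import combinations

import numpy as np

# Hypothetical concept vectors: colors use dims 0-1, shapes use dims 2-3.
concepts = {
    "red": np.array([1.0, 0.2, 0.0, 0.0]),
    "green": np.array([0.3, 1.0, 0.0, 0.0]),
    "cube": np.array([0.0, 0.0, 1.0, 0.1]),
    "sphere": np.array([0.0, 0.0, 0.2, 1.0]),
}
attribute = {"red": "color", "green": "color", "cube": "shape", "sphere": "shape"}


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def mean_abs_cosine(concepts, attribute, same_attribute):
    """Mean |cosine| over concept pairs within (or across) attributes."""
    values = [
        abs(cosine(concepts[a], concepts[b]))
        for a, b in combinations(concepts, 2)
        if (attribute[a] == attribute[b]) == same_attribute
    ]
    return sum(values) / len(values)


within = mean_abs_cosine(concepts, attribute, same_attribute=True)
across = mean_abs_cosine(concepts, attribute, same_attribute=False)
print(f"within-attribute: {within:.3f}, across-attribute: {across:.3f}")
```

In this toy setup the across-attribute value is zero by construction, while the within-attribute value is clearly positive, mirroring the pattern described above.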

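Compositional Concept Extraction (CCE)

To extract compositional concepts, we propose the CCE method. Its key idea is to search for an entire subspace of concepts at once rather than for individual concepts, which lets CCE enforce the properties of compositional concepts described above. The CCE algorithm consists of the following steps:

  1. Learn the subspace (LearnSubspace): optimize a subspace so that the data projected into it clusters well around a fixed set of cluster centroids.
  2. Learn the concepts (LearnConcepts): run spherical K-means clustering in the learned subspace to identify concepts.
  3. Iterate: alternate between the LearnSubspace and LearnConcepts steps until convergence.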
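The alternating structure of these steps can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: the subspace step is replaced by a closed-form orthogonal Procrustes update that aligns the projected data with the centroids of their assigned clusters, and the loop runs for a fixed number of rounds instead of testing convergence. All function names and parameters (`cce_sketch`, `m`, `k`, `n_rounds`) are assumptions.

```python
import numpy as np


def _unit(v):
    """Normalize rows to unit length."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-12)


def learn_concepts(Z, k, n_iter=20, seed=0):
    """Spherical k-means on unit-norm rows of Z; returns centroids and labels."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(Z @ centers.T, axis=1)  # assign by cosine similarity
        for j in range(k):
            members = Z[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centers[j] = _unit(members.sum(axis=0))
    labels = np.argmax(Z @ centers.T, axis=1)
    return centers, labels


def learn_subspace(X, centers, labels):
    """Stand-in subspace step (assumption): orthonormal P (d x m) maximizing
    sum_i <P^T x_i, c_{label_i}> for fixed centroids, via orthogonal Procrustes."""
    M = X.T @ centers[labels]  # d x m cross-correlation matrix
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt


def cce_sketch(X, m, k, n_rounds=10, seed=0):
    """Alternate LearnConcepts and LearnSubspace for a fixed number of rounds."""
    X = _unit(X)
    # Initialize the subspace with the top-m right singular vectors of X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:m].T
    for _ in range(n_rounds):
        Z = _unit(X @ P)
        centers, labels = learn_concepts(Z, k, seed=seed)
        P = learn_subspace(X, centers, labels)
    concepts = _unit(centers @ P.T)  # map centroids back to embedding space
    return P, concepts, labels


# Usage on random stand-in "embeddings" (60 samples, 8 dimensions).
X_demo = np.random.default_rng(1).normal(size=(60, 8))
P, concepts, labels = cce_sketch(X_demo, m=3, k=4)
print(P.shape, concepts.shape, labels.shape)
```

The Procrustes update is chosen here only because it keeps the projection orthonormal and has a closed form; a gradient-based clustering objective could be substituted in the same place.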

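Experimental Results

We ran extensive experiments on vision and language datasets. The results show that:

  • In controlled settings, CCE composes concepts more effectively than existing methods.
  • On real-world data, CCE successfully discovers new, meaningful compositional concepts.
  • The compositional concepts extracted by CCE can improve performance on downstream tasks.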

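Conclusion

This work studies concept-based explanations of foundation models from the perspective of compositionality. We verified that the ground-truth concepts extracted from these models are compositional, whereas existing unsupervised concept extraction methods generally do not guarantee compositionality. To address this, we first identified two salient properties of compositional concept representations and then designed a new concept extraction method, CCE, that respects these properties by construction. Through extensive experiments on vision and language datasets, we showed that CCE not only learns compositional concepts but can also improve downstream task performance.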
