An increasingly common socio-technical problem is people being taken in by offers that sound “too good to be true”, where persuasion and trust shape decision-making. This paper investigates how AI can help detect these deceptive scenarios. We analyze how humans strategically deceive each other in Diplomacy, a board game that requires both natural language communication and strategic reasoning. Detecting this deception requires extracting logical forms of proposals—agreements that players suggest during communication—and computing their relative rewards using agents’ value functions. Combined with text-based features, these reward signals improve deception detection. Our method detects human deception with high precision, whereas a Large Language Model baseline flags many truthful messages as deceptive. Future human-AI interaction tools can build on our approach by introducing friction that gives users a chance to interrogate suspicious proposals.
Diplomacy is a unique environment that combines natural language communication with strategic gameplay, making it a compelling testbed for studying deception. Deception in this game has real consequences—players who fall for misleading proposals often lose key territories, while deceivers gain strategic advantages. The game’s structured yet open-ended setting provides a controlled environment to explore trust, persuasion, and betrayal, all of which are relevant to real-world scenarios involving misinformation. As AI systems increasingly participate in human-facing interactions, detecting and mitigating deception—especially when it arises strategically—becomes a necessary and impactful problem.
The paper introduces three novel metrics to characterize deception in strategic proposals: Bait, Switch, and Edge. "Bait" measures how much better off a victim appears if they trust the deceiver; "Switch" captures how much worse the victim becomes if the deceiver betrays them instead of cooperating; and "Edge" quantifies the benefit the deceiver gains from betrayal. These signals are grounded in value differences from counterfactual outcomes. For example, Figure 2 in the paper demonstrates how Austria's seemingly supportive message to Italy is actually a setup for betrayal, with all three metrics quantifying this deception.
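To make these counterfactual value differences concrete, here is a minimal sketch in Python of how the three signals could be computed from a generic value function. The `value` and `simulate` callables, the state representation, and the choice of baseline move are hypothetical placeholders rather than the paper's actual CICERO interface.

```python
# Minimal sketch of the Bait / Switch / Edge signals. `value(state, power)`
# scores a game state for a given power, and `simulate(state, moves)` returns
# the successor state; both are hypothetical stand-ins for CICERO's actual
# value function and game engine.

def deception_signals(value, simulate, state, victim, deceiver,
                      proposal, victim_baseline, deceiver_betrayal):
    """Compute Bait, Switch, and Edge from counterfactual outcomes.

    proposal          -- (victim_move, deceiver_move) suggested by the deceiver
    victim_baseline   -- what the victim would plausibly do without the proposal
    deceiver_betrayal -- what the deceiver plays if they renege on the proposal
    """
    victim_move, deceiver_move = proposal

    # Counterfactual successor states.
    s_cooperate = simulate(state, {victim: victim_move, deceiver: deceiver_move})
    s_ignore = simulate(state, {victim: victim_baseline, deceiver: deceiver_move})
    s_betray = simulate(state, {victim: victim_move, deceiver: deceiver_betrayal})

    # Bait: how much better off the victim appears if they trust the proposal.
    bait = value(s_cooperate, victim) - value(s_ignore, victim)

    # Switch: how much worse off the victim is if the deceiver betrays
    # rather than cooperates.
    switch = value(s_cooperate, victim) - value(s_betray, victim)

    # Edge: how much the deceiver gains by betraying rather than cooperating.
    edge = value(s_betray, deceiver) - value(s_cooperate, deceiver)

    return bait, switch, edge
```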
CTRL-D (CounTerfactual RL against Deception) is a novel pipeline that begins by parsing negotiation messages into logical proposals using Abstract Meaning Representation (AMR). These proposals are then evaluated with CICERO’s reinforcement-learning value function to compute Bait, Switch, and Edge. The three numerical indicators are combined with BERT-based textual embeddings and fed into a neural network classifier. This hybrid model enables robust detection of deceptive proposals, surfacing hidden strategies that language-only models would miss.
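As a rough illustration of this fusion step, the sketch below concatenates the three signals with a BERT [CLS] embedding and passes them through a small feed-forward head. The layer sizes, the `bert-base-uncased` checkpoint, and the concatenation-based fusion are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class HybridDeceptionClassifier(nn.Module):
    """Fuse message embeddings with (Bait, Switch, Edge) and classify.

    Hidden sizes and concatenation-based fusion are illustrative choices,
    not necessarily the architecture used in the paper.
    """
    def __init__(self, encoder_name="bert-base-uncased", num_signals=3, hidden=128):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(dim + num_signals, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "deceptive"
        )

    def forward(self, messages, signals):
        # messages: list[str]; signals: FloatTensor [batch, 3] of (bait, switch, edge)
        batch = self.tokenizer(messages, padding=True, truncation=True,
                               return_tensors="pt")
        text_emb = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] embedding
        return self.head(torch.cat([text_emb, signals], dim=-1)).squeeze(-1)
```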
To evaluate CTRL-D, the authors use two datasets: one from human-only Diplomacy games annotated with deception labels (Peskov et al., 2020), and a large-scale human-AI interaction dataset from webdiplomacy.net. For comparison, they benchmark against both traditional baselines (such as an LSTM with power-dynamics features) and LLM-based deception detection using LLaMA 3.1. Importantly, CTRL-D focuses on high-precision alerts to create strategic friction, helping players reevaluate risky decisions. While deceptive messages are rare (<2%), the framework is designed to flag suspicious messages without overwhelming users with false positives.
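Because the alerts are meant to add friction without crying wolf, the classifier's decision threshold can be tuned on validation data to hit a precision target before recall is considered. The sketch below shows one way to do that with scikit-learn; the 0.95 target and the helper name are illustrative assumptions, not a procedure taken from the paper.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_high_precision_threshold(val_scores, val_labels, target_precision=0.95):
    """Return the lowest score threshold whose validation precision meets the target.

    Alerting only above this threshold keeps false positives rare, so flagged
    messages are almost always worth a second look.
    """
    precision, recall, thresholds = precision_recall_curve(val_labels, val_scores)
    # precision/recall have one more entry than thresholds; align by dropping the last.
    for p, t in zip(precision[:-1], thresholds):
        if p >= target_precision:
            return t
    return np.inf  # never alert if the target precision is unreachable
```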
CTRL-D achieves remarkably high precision (0.95) in detecting human-labeled lies, significantly outperforming both LLM baselines and prior methods. Though its recall is modest (0.238), this tradeoff is acceptable given the rarity and difficulty of identifying deceptive acts. In large-scale experiments, CTRL-D remains conservative but precise, triggering alerts only when deception is highly likely. The results validate that deception in language can be detected more reliably when contextualized through game dynamics and reinforcement learning signals, offering a new direction for safer human-AI collaboration in negotiation settings.
@misc{wongkamjan2025itrustyoudetecting,
title={Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL},
author={Wichayaporn Wongkamjan and Yanze Wang and Feng Gu and Denis Peskoff and Jonathan K. Kummerfeld and Jonathan May and Jordan Lee Boyd-Graber},
year={2025},
eprint={2502.12436},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.12436},
}