An increasingly common socio-technical problem is people being taken in by offers that sound “too good to be true”, where persuasion and trust shape decision-making. This paper investigates how AI can help detect these deceptive scenarios. We analyze how humans strategically deceive each other in Diplomacy, a board game that requires both natural language communication and strategic reasoning. Detecting this deception requires extracting logical forms of proposals—agreements that players suggest during communication—and computing their relative rewards using agents’ value functions. Combined with text-based features, these reward signals improve deception detection. Our method detects human deception with high precision, whereas a Large Language Model baseline flags many truthful messages as deceptive. Future human-AI interaction tools can build on our approach by introducing friction that gives users a chance to interrogate suspicious proposals.
Diplomacy is a unique environment that combines natural language communication with strategic gameplay, making it a compelling testbed for studying deception. Deception in this game has real consequences—players who fall for misleading proposals often lose key territories, while deceivers gain strategic advantages. The game’s structured yet open-ended setting provides a controlled environment to explore trust, persuasion, and betrayal, all of which are relevant to real-world scenarios involving misinformation. As AI systems increasingly participate in human-facing interactions, detecting and mitigating deception—especially when it arises strategically—becomes a necessary and impactful problem.
The paper introduces three novel metrics to characterize deception in strategic proposals: Bait, Switch, and Edge. "Bait" measures how much better off a victim appears if they trust the deceiver; "Switch" captures how much worse the victim becomes if the deceiver betrays them instead of cooperating; and "Edge" quantifies the benefit the deceiver gains from betrayal. These signals are grounded in value differences from counterfactual outcomes. For example, Figure 2 in the paper demonstrates how Austria's seemingly supportive message to Italy is actually a setup for betrayal, with all three metrics quantifying this deception.
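To make these counterfactual value differences concrete, here is a minimal sketch in Python of how the three signals could be computed from a generic value function. The `value` and `simulate` callables, the state representation, and the choice of baseline move are hypothetical placeholders rather than the paper's actual CICERO interface.

```python
# Minimal sketch of the Bait / Switch / Edge signals. `value(state, power)`
# scores a game state for a given power, and `simulate(state, moves)` returns
# the successor state; both are hypothetical stand-ins for CICERO's actual
# value function and game engine.

def deception_signals(value, simulate, state, victim, deceiver,
                      proposal, victim_baseline, deceiver_betrayal):
    """Compute Bait, Switch, and Edge from counterfactual outcomes.

    proposal          -- (victim_move, deceiver_move) suggested by the deceiver
    victim_baseline   -- what the victim would plausibly do without the proposal
    deceiver_betrayal -- what the deceiver plays if they renege on the proposal
    """
    victim_move, deceiver_move = proposal

    # Counterfactual successor states.
    s_cooperate = simulate(state, {victim: victim_move, deceiver: deceiver_move})
    s_ignore = simulate(state, {victim: victim_baseline, deceiver: deceiver_move})
    s_betray = simulate(state, {victim: victim_move, deceiver: deceiver_betrayal})

    # Bait: how much better off the victim appears if they trust the proposal.
    bait = value(s_cooperate, victim) - value(s_ignore, victim)

    # Switch: how much worse off the victim is if the deceiver betrays
    # rather than cooperates.
    switch = value(s_cooperate, victim) - value(s_betray, victim)

    # Edge: how much the deceiver gains by betraying rather than cooperating.
    edge = value(s_betray, deceiver) - value(s_cooperate, deceiver)

    return bait, switch, edge
```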
CTRL-D (CounTerfactual RL against Deception) is a novel pipeline that begins by parsing negotiation messages into logical proposals using Abstract Meaning Representation (AMR). These proposals are then evaluated with CICERO’s reinforcement-learning value function to compute Bait, Switch, and Edge. The three numerical indicators are combined with BERT-based textual embeddings and fed into a neural network classifier. This hybrid model enables robust detection of deceptive proposals, surfacing hidden strategies that language-only models would miss.
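As a rough illustration of this fusion step, the sketch below concatenates the three signals with a BERT [CLS] embedding and passes them through a small feed-forward head. The layer sizes, the `bert-base-uncased` checkpoint, and the concatenation-based fusion are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class HybridDeceptionClassifier(nn.Module):
    """Fuse message embeddings with (Bait, Switch, Edge) and classify.

    Hidden sizes and concatenation-based fusion are illustrative choices,
    not necessarily the architecture used in the paper.
    """
    def __init__(self, encoder_name="bert-base-uncased", num_signals=3, hidden=128):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        dim = self.encoder.config.hidden_size
        self.head = nn.Sequential(
            nn.Linear(dim + num_signals, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "deceptive"
        )

    def forward(self, messages, signals):
        # messages: list[str]; signals: FloatTensor [batch, 3] of (bait, switch, edge)
        batch = self.tokenizer(messages, padding=True, truncation=True,
                               return_tensors="pt")
        text_emb = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] embedding
        return self.head(torch.cat([text_emb, signals], dim=-1)).squeeze(-1)
```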
To evaluate CTRL-D, the authors use two datasets: one from human-only Diplomacy games annotated with deception labels (Peskov et al., 2020), and a large-scale human-AI interaction dataset from webdiplomacy.net. For comparison, they benchmark against both traditional baselines (such as an LSTM with power-dynamics features) and LLM-based deception detection using LLaMA 3.1. Importantly, CTRL-D focuses on high-precision alerts to create strategic friction, helping players reevaluate risky decisions. While deceptive messages are rare (<2%), the framework is designed to flag suspicious messages without overwhelming users with false positives.
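Because the alerts are meant to add friction without crying wolf, the classifier's decision threshold can be tuned on validation data to hit a precision target before recall is considered. The sketch below shows one way to do that with scikit-learn; the 0.95 target and the helper name are illustrative assumptions, not a procedure taken from the paper.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def pick_high_precision_threshold(val_scores, val_labels, target_precision=0.95):
    """Return the lowest score threshold whose validation precision meets the target.

    Alerting only above this threshold keeps false positives rare, so flagged
    messages are almost always worth a second look.
    """
    precision, recall, thresholds = precision_recall_curve(val_labels, val_scores)
    # precision/recall have one more entry than thresholds; align by dropping the last.
    for p, t in zip(precision[:-1], thresholds):
        if p >= target_precision:
            return t
    return np.inf  # never alert if the target precision is unreachable
```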
CTRL-D achieves remarkably high precision (0.95) in detecting human-labeled lies, significantly outperforming both LLM baselines and prior methods. Though its recall is modest (0.238), this tradeoff is acceptable given the rarity and difficulty of identifying deceptive acts. In large-scale experiments, CTRL-D remains conservative but precise, triggering alerts only when deception is highly likely. The results validate that deception in language can be detected more reliably when contextualized through game dynamics and reinforcement learning signals, offering a new direction for safer human-AI collaboration in negotiation settings.
@misc{wongkamjan2025itrustyoudetecting,
title={Should I Trust You? Detecting Deception in Negotiations using Counterfactual RL},
author={Wichayaporn Wongkamjan and Yanze Wang and Feng Gu and Denis Peskoff and Jonathan K. Kummerfeld and Jonathan May and Jordan Lee Boyd-Graber},
year={2025},
eprint={2502.12436},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.12436},
}