RuleSmith

Multi-Agent LLMs for Automated Game Balancing

Game balancing is a longstanding challenge requiring repeated playtesting, expert intuition, and extensive manual tuning. We introduce RuleSmith, the first framework that achieves automated game balancing by leveraging the reasoning capabilities of multi-agent LLMs. It couples a game engine, multi-agent LLM self-play, and Bayesian optimization operating over a multi-dimensional rule space. As a proof of concept, we instantiate RuleSmith on CivMini, a simplified civilization-style game containing heterogeneous factions, economy systems, production rules, and combat mechanics, all governed by tunable parameters. LLM agents interpret textual rulebooks and game states to generate actions, enabling fast evaluation of balance metrics such as win-rate disparities. To search the parameter landscape efficiently, we integrate Bayesian optimization with acquisition-based adaptive sampling and discrete projection: promising candidates receive more evaluation games for accurate assessment, while exploratory candidates receive fewer games for efficient exploration. Experiments show that RuleSmith converges to highly balanced configurations and provides interpretable rule adjustments that can be directly applied to downstream game systems. Our results illustrate that LLM simulation can serve as a powerful surrogate for automating design and balancing in complex multi-agent environments.

LLM self-play + Bayesian optimization can automatically balance asymmetric strategy games from natural-language rulebooks.

Overview of RuleSmith

Figure 1: Overview of RuleSmith. Multi-agent LLMs perform zero-shot self-play from the rulebook alone under parameterized rule sets, automatically balancing asymmetric strategy games and other rule-driven systems.

Ziyao Zeng1, Chen Liu1, Tianyu Liu1, Hao Wang2, Xiatao Sun1, Fengyu Yang1, Xiaofeng Liu1, Zhiwen Fan2  

1 Yale University     2 Texas A&M University

For any questions, please contact: ziyao.zeng@yale.edu

Method Overview

We consider balancing an asymmetric, parameterized, turn-based strategy game by optimizing its rule parameters so that two roles (Empire and Nomads) achieve approximately equal win rates when controlled by LLM agents. RuleSmith uses two LLM agents to play the game from a natural-language rulebook; a Bayesian optimizer with acquisition-based adaptive sampling searches the rule space, allocating more evaluation games to promising candidates. Continuous proposals are discretized to valid rule configurations (using the precision listed in the parameter table below) before evaluation.

RuleSmith method overview

Figure 2: Overview of the RuleSmith method. We represent CivMini as a parameterized rule space θ. Given a candidate θ_t, two LLM agents (Empire and Nomads) play N_t self-play games, producing a balance loss L(θ). A Bayesian optimizer maintains a surrogate g(θ) and selects new candidates by maximizing an acquisition function. The number of games N_t is adaptively set by Expected Improvement; continuous proposals are mapped to discrete rulesets via D(·) before evaluation.
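The outer loop can be sketched as follows. This is a simplified stand-in, not the paper's implementation: `simulate_games` replaces the LLM self-play with a toy synthetic game, candidates are proposed at random instead of by maximizing an acquisition function over a GP surrogate g(θ), and the two-budget screen is a crude proxy for setting N_t via Expected Improvement. All function and parameter names are illustrative.

```python
import math
import random

def simulate_games(theta, n):
    """Stand-in for n self-play games between the Empire and Nomads LLM
    agents under ruleset theta; returns the Empire win rate.
    Here: a toy synthetic game driven by the damage gap."""
    p = 1 / (1 + math.exp(theta["empire_damage"] - theta["nomads_damage"]))
    return sum(random.random() < p for _ in range(n)) / n

def balance_loss(win_rate):
    """Balance objective L(theta): distance of the Empire win rate from 50%."""
    return abs(win_rate - 0.5)

def optimize(iterations=30, n_min=16, n_max=64, seed=0):
    random.seed(seed)
    best_theta, best_loss = None, float("inf")
    for _ in range(iterations):
        # Propose a candidate ruleset (a real implementation maximizes an
        # acquisition function over a surrogate model of the loss).
        theta = {"empire_damage": random.uniform(1, 5),
                 "nomads_damage": random.uniform(1, 5)}
        # Quick screen with the minimum game budget.
        loss = balance_loss(simulate_games(theta, n_min))
        # Adaptive sampling: promising candidates get more games for a
        # lower-variance estimate (the paper sets N_t via Expected Improvement).
        if loss < best_loss:
            loss = balance_loss(simulate_games(theta, n_max))
        if loss < best_loss:
            best_theta, best_loss = theta, loss
    return best_theta, best_loss
```

The key design point carried over from the method is the asymmetric budget: cheap low-N evaluations screen the bulk of candidates, and only candidates that look competitive earn the full N_max games.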

The Game: CivMini

CivMini is a 7×7 grid, turn-based asymmetric game. Empire has Farmers (gather only) and Soldiers (combat only); Nomads have Cavalry (combat + move 2 cells/turn, gain resources by killing). Each turn, each unit takes one action: GATHER, MOVE, BATTLE, PRODUCE_RESOURCE, PRODUCE_UNIT, or PASS. Win: destroy the opponent’s city, or else highest score at turn limit (score = weighted resources + battles won + surviving units).
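The win condition and tie-break score above can be sketched as follows (field names and the default weights are illustrative; the weights are among the tunable parameters listed below):

```python
def final_score(resources, battles_won, surviving_units,
                resource_w=0.3, battle_w=2, unit_w=1):
    """Tie-break score at the turn limit: weighted resources plus
    battles won plus surviving units (all weights are tunable)."""
    return resource_w * resources + battle_w * battles_won + unit_w * surviving_units

def winner(empire, nomads):
    """Each side is a dict with 'city_destroyed', 'resources',
    'battles_won', and 'units'. A side whose city is destroyed loses
    immediately; otherwise the higher score wins at the turn limit."""
    if nomads["city_destroyed"]:
        return "Empire"
    if empire["city_destroyed"]:
        return "Nomads"
    s_e = final_score(empire["resources"], empire["battles_won"], empire["units"])
    s_n = final_score(nomads["resources"], nomads["battles_won"], nomads["units"])
    return "Empire" if s_e > s_n else "Nomads" if s_n > s_e else "Draw"
```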

12 tunable parameters (RuleSmith optimizes these for balance):

Parameter                     Range        Precision
Economy
  Initial resources           [2, 10]      integer
  Empire farmer gather        [1, 5]       integer
  Nomads kill resource gain   [1, 10]      integer
Combat
  Empire damage               [1, 5]       integer
  Nomads damage               [1, 5]       integer
  Empire soldier HP           [4, 16]      integer
  Nomads cavalry HP           [4, 16]      integer
Production
  Empire unit cost            [2, 10]      integer
  Nomads unit cost            [2, 10]      integer
Scoring
  Resource weight             [0.1, 0.5]   0.1
  Battle weight               [1, 5]       integer
  Unit weight                 [1, 5]       integer
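The table above can be encoded directly as a search space, together with the projection D(·) that snaps continuous optimizer proposals onto valid discrete rulesets. A minimal sketch (the parameter keys are illustrative identifiers, not the paper's actual names):

```python
# 12-parameter rule space from the table: name -> (low, high, step).
RULE_SPACE = {
    "initial_resources":         (2, 10, 1),
    "empire_farmer_gather":      (1, 5, 1),
    "nomads_kill_resource_gain": (1, 10, 1),
    "empire_damage":             (1, 5, 1),
    "nomads_damage":             (1, 5, 1),
    "empire_soldier_hp":         (4, 16, 1),
    "nomads_cavalry_hp":         (4, 16, 1),
    "empire_unit_cost":          (2, 10, 1),
    "nomads_unit_cost":          (2, 10, 1),
    "resource_weight":           (0.1, 0.5, 0.1),
    "battle_weight":             (1, 5, 1),
    "unit_weight":               (1, 5, 1),
}

def project(theta):
    """D(.): clip each continuous proposal to its range and snap it to
    the nearest point on the parameter's grid (missing keys default to
    the lower bound)."""
    out = {}
    for name, (lo, hi, step) in RULE_SPACE.items():
        v = min(max(theta.get(name, lo), lo), hi)       # clip to [lo, hi]
        snapped = lo + round((v - lo) / step) * step    # snap to the grid
        out[name] = round(snapped, 1) if step < 1 else int(snapped)
    return out
```

For example, a proposal with `resource_weight = 0.34` and `empire_damage = 7.6` projects to 0.3 and 5, respectively.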

Experimental Results

Win rates after optimization, with the training agent pairing in rows and the evaluation pairing in columns. E = Empire, N = Nomads. Each cell: Empire wins | Nomads wins. Near-balanced cells (50% ± 5%) are marked with *. Model sizes: 2B, 8B.

Train \ Eval   E2B vs N2B   E2B vs N8B   E8B vs N2B   E8B vs N8B
E2B vs N2B     48 | 52 *    32 | 68      27 | 73      55 | 45 *
E2B vs N8B     81 | 19      47 | 53 *    91 | 9       75 | 25
E8B vs N2B     37 | 63      6 | 94       52 | 48 *    29 | 71
E8B vs N8B     53 | 47 *    24 | 76      81 | 19      51 | 49 *

Ablation on optimization methods. Random Search and (1+1)-ES use fixed N=64 games per iteration. BO with adaptive sampling uses N ∈ [16, 64]. Win rates as Empire | Nomads.

Random Search   (1+1)-ES   BO (adaptive)
13 | 87         26 | 74    51 | 49

BO (N=16)   BO (N=32)   BO (N=64)
34 | 66     61 | 39     48 | 52

Ablation on game designs. RuleSmith achieves balanced win rates across map sizes and turn limits (turns in parentheses).

5×5 (16)   7×7 (16)   9×9 (32)   11×11 (32)
53 | 47    51 | 49    48 | 52    51 | 49

Citation

If you find our work useful, please cite:

@article{zeng2026rulesmith,
  title = {{RuleSmith}: Multi-Agent {LLMs} for Automated Game Balancing},
  author = {Zeng, Ziyao and Liu, Chen and Liu, Tianyu and Wang, Hao and Sun, Xiatao and Yang, Fengyu and Liu, Xiaofeng and Fan, Zhiwen},
  journal = {arXiv preprint arXiv:2602.06232},
  year = {2026}
}