Multi-Agent LLMs for Automated Game Balancing
Game balancing is a longstanding challenge that requires repeated playtesting, expert intuition, and extensive manual tuning. We introduce RuleSmith, the first framework to achieve automated game balancing by leveraging the reasoning capabilities of multi-agent LLMs. It couples a game engine, multi-agent LLM self-play, and Bayesian optimization operating over a multi-dimensional rule space. As a proof of concept, we instantiate RuleSmith on CivMini, a simplified civilization-style game containing heterogeneous factions, an economy system, production rules, and combat mechanics, all governed by tunable parameters. LLM agents interpret the textual rulebook and game state to generate actions, enabling fast evaluation of balance metrics such as win-rate disparities. To search the parameter landscape efficiently, we integrate Bayesian optimization with acquisition-based adaptive sampling and discrete projection: promising candidates receive more evaluation games for accurate assessment, while exploratory candidates receive fewer games for efficient exploration. Experiments show that RuleSmith converges to highly balanced configurations and yields interpretable rule adjustments that can be applied directly to downstream game systems. Our results illustrate that LLM simulation can serve as a powerful surrogate for automating design and balancing in complex multi-agent environments.
LLM self-play + Bayesian optimization can automatically balance asymmetric strategy games from natural-language rulebooks.
Figure 1: Overview of RuleSmith. Multi-agent LLMs perform zero-shot self-play under parameterized rule sets, using only the rulebook, to automatically balance asymmetric strategy games and other rule-driven systems.
We consider balancing an asymmetric, parameterized, turn-based strategy game by optimizing its rule parameters so that two roles (Empire and Nomads) achieve approximately equal win rates when controlled by LLM agents. RuleSmith uses two LLM agents to play the game from a natural-language rulebook; a Bayesian optimizer with acquisition-based adaptive sampling searches the rule space, allocating more evaluation games to promising candidates. The game CivMini exposes 12 tunable parameters (economy, combat, production, scoring); continuous proposals are discretized to valid rule configurations before evaluation.
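To make the evaluation step concrete, the sketch below shows one plausible way to implement the discrete projection and the self-play balance evaluation. The parameter names, the `project_to_ruleset` and `estimate_balance_loss` helpers, and the specific loss form |win rate − 0.5| are illustrative assumptions rather than the paper's exact definitions; `play_game` stands in for one LLM-vs-LLM match in the CivMini engine.

```python
import numpy as np

# Hypothetical bounds for the 12-dimensional rule space; the actual CivMini
# parameter names and ranges are not specified here.
PARAM_BOUNDS = {
    "gold_per_turn": (1, 10),
    "unit_attack": (1, 8),
    "unit_cost": (2, 12),
    # ... remaining economy / production / scoring parameters
}

def project_to_ruleset(theta):
    """Discrete projection D(.): clip and round a continuous proposal
    to the nearest valid rule configuration."""
    return {
        name: int(round(np.clip(value, lo, hi)))
        for (name, (lo, hi)), value in zip(PARAM_BOUNDS.items(), theta)
    }

def estimate_balance_loss(theta, n_games, play_game):
    """Estimate a balance loss from n_games LLM self-play matches.

    play_game(ruleset) runs one Empire-vs-Nomads match and returns the
    winner's name; the loss used here is the win-rate disparity
    |p_empire - 0.5| (one possible balance metric, assumed for illustration).
    """
    ruleset = project_to_ruleset(theta)
    empire_wins = sum(play_game(ruleset) == "empire" for _ in range(n_games))
    return abs(empire_wins / n_games - 0.5)
```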
Figure 2: Overview of the RuleSmith method. We represent CivMini as a parameterized rule space θ. Given a candidate θ_t, two LLM agents (Empire and Nomads) play N_t self-play games, producing a balance loss L(θ). A Bayesian optimizer maintains a surrogate g(θ) and selects new candidates by maximizing an acquisition function. The number of games N_t is adaptively set by Expected Improvement; continuous proposals are mapped to discrete rulesets via D(·) before evaluation.
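The optimization loop of Figure 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Gaussian-process surrogate, the 512-point candidate set, and the rule that scales the game budget N_t by the chosen candidate's Expected Improvement relative to the largest EI seen so far are assumptions made to keep the example runnable.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(cand, gp, best_loss, xi=0.01):
    """EI acquisition for minimizing the balance loss."""
    mu, sigma = gp.predict(cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (best_loss - mu - xi) / sigma
    return (best_loss - mu - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def optimize_rules(bounds, balance_loss, n_iter=40, n_init=5,
                   n_min=16, n_max=64, seed=0):
    """Minimal BO loop. bounds is a (12, 2) array of parameter ranges;
    balance_loss(theta, n_games) runs LLM self-play and returns the loss."""
    rng = np.random.default_rng(seed)
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    X, y, ei_max_seen = [], [], 1e-12

    for t in range(n_iter):
        if t < n_init:  # initial random design, evaluated with the minimum budget
            theta, n_games = rng.uniform(bounds[:, 0], bounds[:, 1]), n_min
        else:
            gp.fit(np.array(X), np.array(y))
            cand = rng.uniform(bounds[:, 0], bounds[:, 1],
                               size=(512, bounds.shape[0]))
            ei = expected_improvement(cand, gp, best_loss=min(y))
            theta, ei_best = cand[np.argmax(ei)], float(ei.max())
            # Acquisition-based adaptive sampling: candidates with higher EI
            # (relative to the largest EI seen so far) get more self-play games.
            ei_max_seen = max(ei_max_seen, ei_best)
            n_games = int(n_min + (ei_best / ei_max_seen) * (n_max - n_min))
        X.append(theta)
        y.append(balance_loss(theta, n_games))

    best = int(np.argmin(y))
    return X[best], y[best]
```

Under these assumptions, a call such as `optimize_rules(bounds, lambda th, n: estimate_balance_loss(th, n, play_game))` would return the best ruleset found together with its estimated balance loss.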
Optimized win rates (%) under different training (rows) and evaluation (columns) matchups. E = Empire, N = Nomads. Each cell reports Empire wins / Nomads wins. Near-balanced results (50% ± 5%) are in bold. Model sizes: 2B and 8B.
| Train \ Eval | E2B vs N2B | E2B vs N8B | E8B vs N2B | E8B vs N8B |
|---|---|---|---|---|
| E2B vs N2B | **48 / 52** | 32 / 68 | 27 / 73 | **55 / 45** |
| E2B vs N8B | 81 / 19 | **47 / 53** | 91 / 9 | 75 / 25 |
| E8B vs N2B | 37 / 63 | 6 / 94 | **52 / 48** | 29 / 71 |
| E8B vs N8B | **53 / 47** | 24 / 76 | 81 / 19 | **51 / 49** |
Ablation on optimization methods. Random Search and (1+1)-ES use a fixed N = 64 games per iteration; BO with adaptive sampling uses N ∈ [16, 64]. Win rates reported as Empire / Nomads (%).
| Random Search | (1+1)-ES | BO (adaptive) |
|---|---|---|
| 13 / 87 | 26 / 74 | 51 / 49 |

| BO (N=16) | BO (N=32) | BO (N=64) |
|---|---|---|
| 34 / 66 | 61 / 39 | 48 / 52 |
Ablation on game designs. RuleSmith achieves balanced win rates across map sizes and turn limits (turns in parentheses).
| 5×5 (16) | 7×7 (16) | 9×9 (32) | 11×11 (32) |
|---|---|---|---|
| 53 / 47 | 51 / 49 | 48 / 52 | 51 / 49 |
Representative CivMini games under balanced parameters (InternVL3.5-8B for both factions).