Multi-Agent LLMs for Automated Game Balancing
Game balancing is a longstanding challenge that requires repeated playtesting, expert intuition, and extensive manual tuning. We introduce RuleSmith, the first framework to achieve automated game balancing by leveraging the reasoning capabilities of multi-agent LLMs. RuleSmith couples a game engine, multi-agent LLM self-play, and Bayesian optimization over a multi-dimensional rule space. As a proof of concept, we instantiate RuleSmith on CivMini, a simplified civilization-style game with heterogeneous factions, economy systems, production rules, and combat mechanics, all governed by tunable parameters. LLM agents interpret textual rulebooks and game states to generate actions, enabling fast evaluation of balance metrics such as win-rate disparity. To search the parameter landscape efficiently, we combine Bayesian optimization with acquisition-based adaptive sampling and discrete projection: promising candidates receive more evaluation games for accurate assessment, while exploratory candidates receive fewer games, keeping exploration cheap. Experiments show that RuleSmith converges to highly balanced configurations and produces interpretable rule adjustments that can be applied directly to downstream game systems. Our results illustrate that LLM simulation can serve as a powerful surrogate for automating design and balancing in complex multi-agent environments.
LLM self-play + Bayesian optimization can automatically balance asymmetric strategy games from natural-language rulebooks.
Figure 1: Overview of RuleSmith. Multi-agent LLMs perform zero-shot self-play from the rulebook alone under parameterized rule sets, enabling automatic balancing of asymmetric strategy games and other rule-driven systems.
We consider balancing an asymmetric, parameterized, turn-based strategy game by optimizing its rule parameters so that the two roles (Empire and Nomads) achieve approximately equal win rates when controlled by LLM agents. RuleSmith uses two LLM agents to play the game from a natural-language rulebook; a Bayesian optimizer with acquisition-based adaptive sampling searches the rule space, allocating more evaluation games to promising candidates. Continuous proposals are discretized to valid rule configurations (at each parameter's specified precision) before evaluation.
Figure 2: Overview of the RuleSmith method. We represent CivMini as a parameterized rule space θ. Given a candidate θ_t, two LLM agents (Empire and Nomads) play N_t self-play games, producing a balance loss L(θ). A Bayesian optimizer maintains a surrogate g(θ) and selects new candidates by maximizing an acquisition function. The number of games N_t is adaptively set by Expected Improvement; continuous proposals are mapped to discrete rulesets via D(·) before evaluation.
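The acquisition step described above can be sketched concretely. Below is a minimal stdlib-only illustration of Expected Improvement for minimizing the balance loss, plus one plausible way to map a candidate's EI to its game budget N_t ∈ [16, 64] (the paper specifies that range but not the exact mapping, so `games_for_candidate` is our assumption):

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal for the EI closed form

def expected_improvement(mu: float, sigma: float, best: float, xi: float = 0.01) -> float:
    """EI under a Gaussian surrogate prediction (mu, sigma), minimizing the
    balance loss: improvement means mu falling below the incumbent `best`."""
    if sigma <= 0:
        return max(best - mu - xi, 0.0)
    z = (best - mu - xi) / sigma
    return (best - mu - xi) * _STD_NORMAL.cdf(z) + sigma * _STD_NORMAL.pdf(z)

def games_for_candidate(ei: float, ei_max: float, n_min: int = 16, n_max: int = 64) -> int:
    """Hypothetical allocation rule: scale a candidate's EI against the batch
    maximum, so promising candidates get more self-play games."""
    frac = 0.0 if ei_max <= 0 else max(0.0, min(1.0, ei / ei_max))
    return n_min + round(frac * (n_max - n_min))
```

With this rule, a candidate whose EI matches the batch maximum plays 64 games, while a zero-EI exploratory candidate plays only 16.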
CivMini is a turn-based asymmetric game on a 7×7 grid. The Empire fields Farmers (gather only) and Soldiers (combat only); the Nomads field Cavalry (combat, movement of 2 cells/turn, and resource gain from kills). Each turn, every unit takes one action: GATHER, MOVE, BATTLE, PRODUCE_RESOURCE, PRODUCE_UNIT, or PASS. A player wins by destroying the opponent's city, or by holding the higher score at the turn limit (score = weighted resources + battles won + surviving units).
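The tiebreak score above is a simple weighted sum. A minimal sketch, with the function name ours and the default weights chosen as illustrative values from the tunable ranges in the table below (not the paper's optimized values):

```python
def tiebreak_score(resources: int, battles_won: int, surviving_units: int,
                   resource_weight: float = 0.3,  # in [0.1, 0.5], precision 0.1
                   battle_weight: int = 2,        # in [1, 5], integer
                   unit_weight: int = 1) -> float:  # in [1, 5], integer
    """Score used to decide the winner at the turn limit:
    weighted resources + battles won + surviving units."""
    return (resource_weight * resources
            + battle_weight * battles_won
            + unit_weight * surviving_units)
```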
12 tunable parameters (RuleSmith optimizes these for balance):
| Parameter | Range | Precision |
|---|---|---|
| **Economy** | | |
| Initial resources | [2, 10] | integer |
| Empire farmer gather | [1, 5] | integer |
| Nomads kill resource gain | [1, 10] | integer |
| **Combat** | | |
| Empire damage | [1, 5] | integer |
| Nomads damage | [1, 5] | integer |
| Empire soldier HP | [4, 16] | integer |
| Nomads cavalry HP | [4, 16] | integer |
| **Production** | | |
| Empire unit cost | [2, 10] | integer |
| Nomads unit cost | [2, 10] | integer |
| **Scoring** | | |
| Resource weight | [0.1, 0.5] | 0.1 |
| Battle weight | [1, 5] | integer |
| Unit weight | [1, 5] | integer |
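The per-parameter precision above implies a discrete projection D(·) that maps the optimizer's continuous proposals onto valid rule configurations. A hypothetical sketch (the parameter names and `(low, high, step)` encoding are our own) that snaps each value to its grid and clamps it to range:

```python
# (low, high, step) per the parameter table; names are illustrative.
PARAM_SPACE = {
    "initial_resources":    (2, 10, 1),
    "empire_farmer_gather": (1, 5, 1),
    "nomads_kill_gain":     (1, 10, 1),
    "empire_damage":        (1, 5, 1),
    "nomads_damage":        (1, 5, 1),
    "empire_soldier_hp":    (4, 16, 1),
    "nomads_cavalry_hp":    (4, 16, 1),
    "empire_unit_cost":     (2, 10, 1),
    "nomads_unit_cost":     (2, 10, 1),
    "resource_weight":      (0.1, 0.5, 0.1),
    "battle_weight":        (1, 5, 1),
    "unit_weight":          (1, 5, 1),
}

def project(theta: dict) -> dict:
    """D(.): snap each continuous proposal to the nearest grid point
    and clamp it into the parameter's valid range."""
    out = {}
    for name, value in theta.items():
        low, high, step = PARAM_SPACE[name]
        snapped = low + round((value - low) / step) * step
        out[name] = round(min(max(snapped, low), high), 10)  # trim float noise
    return out
```

For example, a proposed resource weight of 0.27 projects to 0.3, and a soldier HP of 17.6 clamps to 16.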
Win rates of optimized configurations across training (rows) and evaluation (columns) agent pairings. E = Empire, N = Nomads. Each cell shows Empire / Nomads win rate (%). Near-balanced results (50% ± 5%) in bold. Model sizes: 2B and 8B.

| Train \ Eval | E2B vs N2B | E2B vs N8B | E8B vs N2B | E8B vs N8B |
|---|---|---|---|---|
| E2B vs N2B | **48 / 52** | 32 / 68 | 27 / 73 | **55 / 45** |
| E2B vs N8B | 81 / 19 | **47 / 53** | 91 / 9 | 75 / 25 |
| E8B vs N2B | 37 / 63 | 6 / 94 | **52 / 48** | 29 / 71 |
| E8B vs N8B | **53 / 47** | 24 / 76 | 81 / 19 | **51 / 49** |
Ablation on optimization methods. Random Search and (1+1)-ES use a fixed N = 64 games per iteration; BO with adaptive sampling uses N ∈ [16, 64]. Win rates shown as Empire / Nomads (%).
| Random Search | (1+1)-ES | BO (adaptive) |
|---|---|---|
| 13 / 87 | 26 / 74 | 51 / 49 |

| BO (N=16) | BO (N=32) | BO (N=64) |
|---|---|---|
| 34 / 66 | 61 / 39 | 48 / 52 |
Ablation on game designs. RuleSmith achieves balanced win rates across map sizes and turn limits (turn limit in parentheses). Win rates shown as Empire / Nomads (%).

| 5×5 (16) | 7×7 (16) | 9×9 (32) | 11×11 (32) |
|---|---|---|---|
| 53 / 47 | 51 / 49 | 48 / 52 | 51 / 49 |
Representative CivMini games under balanced parameters (InternVL3.5-8B for both factions).
If you find our work useful, please cite:
@article{zeng2026rulesmith,
title = {RuleSmith: Multi-Agent {LLMs} for Automated Game Balancing},
author = {Zeng, Ziyao and Liu, Chen and Liu, Tianyu and Wang, Hao and Sun, Xiatao and Yang, Fengyu and Liu, Xiaofeng and Fan, Zhiwen},
journal = {arXiv preprint arXiv:2602.06232},
year = {2026}
}