RuleSmith

Multi-Agent LLMs for Automated Game Balancing

Game balancing is a longstanding challenge requiring repeated playtesting, expert intuition, and extensive manual tuning. We introduce RuleSmith, the first framework that achieves automated game balancing by leveraging the reasoning capabilities of multi-agent LLMs. It couples a game engine, multi-agent LLM self-play, and Bayesian optimization operating over a multi-dimensional rule space. As a proof of concept, we instantiate RuleSmith on CivMini, a simplified civilization-style game containing heterogeneous factions, economy systems, production rules, and combat mechanics, all governed by tunable parameters. LLM agents interpret textual rulebooks and game states to generate actions, enabling fast evaluation of balance metrics such as win-rate disparities. To search the parameter landscape efficiently, we integrate Bayesian optimization with acquisition-based adaptive sampling and discrete projection: promising candidates receive more evaluation games for accurate assessment, while exploratory candidates receive fewer games for efficient exploration. Experiments show that RuleSmith converges to highly balanced configurations and provides interpretable rule adjustments that can be directly applied to downstream game systems. Our results illustrate that LLM simulation can serve as a powerful surrogate for automating design and balancing in complex multi-agent environments.

LLM self-play + Bayesian optimization can automatically balance asymmetric strategy games from natural-language rulebooks.

Overview of RuleSmith

Figure 1: Overview of RuleSmith. Multi-agent LLMs perform zero-shot self-play from the rulebook alone under parameterized rule sets, automatically balancing asymmetric strategy games and other rule-driven systems.

Ziyao Zeng1, Chen Liu1, Tianyu Liu1, Hao Wang2, Xiatao Sun1, Fengyu Yang1, Xiaofeng Liu1, Zhiwen Fan2  

1 Yale University     2 Texas A&M University

For any questions, please contact: ziyao.zeng@yale.edu

Method Overview

We consider balancing an asymmetric, parameterized, turn-based strategy game by optimizing its rule parameters so that two roles (Empire and Nomads) achieve approximately equal win rates when controlled by LLM agents. RuleSmith uses two LLM agents to play the game from a natural-language rulebook; a Bayesian optimizer with acquisition-based adaptive sampling searches the rule space, allocating more evaluation games to promising candidates. Continuous proposals are discretized to valid rule configurations (using the precision listed in the parameter table below) before evaluation.

RuleSmith method overview

Figure 2: Overview of the RuleSmith method. We represent CivMini as a parameterized rule space θ. Given a candidate θ_t, two LLM agents (Empire and Nomads) play N_t self-play games, producing a balance loss L(θ). A Bayesian optimizer maintains a surrogate g(θ) and selects new candidates by maximizing an acquisition function. The number of games N_t is adaptively set by Expected Improvement; continuous proposals are mapped to discrete rulesets via D(·) before evaluation.
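The outer loop can be sketched as follows. This is a simplified stand-in, not the paper's implementation: `simulate_games` replaces the LLM self-play with a toy synthetic game, candidates are proposed at random instead of by maximizing an acquisition function over a GP surrogate g(θ), and the two-budget screen is a crude proxy for setting N_t via Expected Improvement. All function and parameter names are illustrative.

```python
import math
import random

def simulate_games(theta, n):
    """Stand-in for n self-play games between the Empire and Nomads LLM
    agents under ruleset theta; returns the Empire win rate.
    Here: a toy synthetic game driven by the damage gap."""
    p = 1 / (1 + math.exp(theta["empire_damage"] - theta["nomads_damage"]))
    return sum(random.random() < p for _ in range(n)) / n

def balance_loss(win_rate):
    """Balance objective L(theta): distance of the Empire win rate from 50%."""
    return abs(win_rate - 0.5)

def optimize(iterations=30, n_min=16, n_max=64, seed=0):
    random.seed(seed)
    best_theta, best_loss = None, float("inf")
    for _ in range(iterations):
        # Propose a candidate ruleset (a real implementation maximizes an
        # acquisition function over a surrogate model of the loss).
        theta = {"empire_damage": random.uniform(1, 5),
                 "nomads_damage": random.uniform(1, 5)}
        # Quick screen with the minimum game budget.
        loss = balance_loss(simulate_games(theta, n_min))
        # Adaptive sampling: promising candidates get more games for a
        # lower-variance estimate (the paper sets N_t via Expected Improvement).
        if loss < best_loss:
            loss = balance_loss(simulate_games(theta, n_max))
        if loss < best_loss:
            best_theta, best_loss = theta, loss
    return best_theta, best_loss
```

The key design point carried over from the method is the asymmetric budget: cheap low-N evaluations screen the bulk of candidates, and only candidates that look competitive earn the full N_max games.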

The Game: CivMini

CivMini is a 7×7 grid, turn-based asymmetric game. Empire has Farmers (gather only) and Soldiers (combat only); Nomads have Cavalry (combat + move 2 cells/turn, gain resources by killing). Each turn, each unit takes one action: GATHER, MOVE, BATTLE, PRODUCE_RESOURCE, PRODUCE_UNIT, or PASS. Win: destroy the opponent’s city, or else highest score at turn limit (score = weighted resources + battles won + surviving units).
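The win condition and tie-break score above can be sketched as follows (field names and the default weights are illustrative; the weights are among the tunable parameters listed below):

```python
def final_score(resources, battles_won, surviving_units,
                resource_w=0.3, battle_w=2, unit_w=1):
    """Tie-break score at the turn limit: weighted resources plus
    battles won plus surviving units (all weights are tunable)."""
    return resource_w * resources + battle_w * battles_won + unit_w * surviving_units

def winner(empire, nomads):
    """Each side is a dict with 'city_destroyed', 'resources',
    'battles_won', and 'units'. A side whose city is destroyed loses
    immediately; otherwise the higher score wins at the turn limit."""
    if nomads["city_destroyed"]:
        return "Empire"
    if empire["city_destroyed"]:
        return "Nomads"
    s_e = final_score(empire["resources"], empire["battles_won"], empire["units"])
    s_n = final_score(nomads["resources"], nomads["battles_won"], nomads["units"])
    return "Empire" if s_e > s_n else "Nomads" if s_n > s_e else "Draw"
```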

12 tunable parameters (RuleSmith optimizes these for balance):

Parameter                     Range        Precision
Economy
  Initial resources           [2, 10]      integer
  Empire farmer gather        [1, 5]       integer
  Nomads kill resource gain   [1, 10]      integer
Combat
  Empire damage               [1, 5]       integer
  Nomads damage               [1, 5]       integer
  Empire soldier HP           [4, 16]      integer
  Nomads cavalry HP           [4, 16]      integer
Production
  Empire unit cost            [2, 10]      integer
  Nomads unit cost            [2, 10]      integer
Scoring
  Resource weight             [0.1, 0.5]   0.1
  Battle weight               [1, 5]       integer
  Unit weight                 [1, 5]       integer
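The table above can be encoded directly as a search space, together with the projection D(·) that snaps continuous optimizer proposals onto valid discrete rulesets. A minimal sketch (the parameter keys are illustrative identifiers, not the paper's actual names):

```python
# 12-parameter rule space from the table: name -> (low, high, step).
RULE_SPACE = {
    "initial_resources":         (2, 10, 1),
    "empire_farmer_gather":      (1, 5, 1),
    "nomads_kill_resource_gain": (1, 10, 1),
    "empire_damage":             (1, 5, 1),
    "nomads_damage":             (1, 5, 1),
    "empire_soldier_hp":         (4, 16, 1),
    "nomads_cavalry_hp":         (4, 16, 1),
    "empire_unit_cost":          (2, 10, 1),
    "nomads_unit_cost":          (2, 10, 1),
    "resource_weight":           (0.1, 0.5, 0.1),
    "battle_weight":             (1, 5, 1),
    "unit_weight":               (1, 5, 1),
}

def project(theta):
    """D(.): clip each continuous proposal to its range and snap it to
    the nearest point on the parameter's grid (missing keys default to
    the lower bound)."""
    out = {}
    for name, (lo, hi, step) in RULE_SPACE.items():
        v = min(max(theta.get(name, lo), lo), hi)       # clip to [lo, hi]
        snapped = lo + round((v - lo) / step) * step    # snap to the grid
        out[name] = round(snapped, 1) if step < 1 else int(snapped)
    return out
```

For example, a proposal with `resource_weight = 0.34` and `empire_damage = 7.6` projects to 0.3 and 5, respectively.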

Experimental Results

Win rates after optimization, with the training agent pairing in rows and the evaluation pairing in columns. E = Empire, N = Nomads. Each cell: Empire wins | Nomads wins. Near-balanced cells (50% ± 5%) are marked with *. Model sizes: 2B, 8B.

Train \ Eval   E2B vs N2B   E2B vs N8B   E8B vs N2B   E8B vs N8B
E2B vs N2B     48 | 52 *    32 | 68      27 | 73      55 | 45 *
E2B vs N8B     81 | 19      47 | 53 *    91 | 9       75 | 25
E8B vs N2B     37 | 63      6 | 94       52 | 48 *    29 | 71
E8B vs N8B     53 | 47 *    24 | 76      81 | 19      51 | 49 *

Ablation on optimization methods. Random Search and (1+1)-ES use fixed N=64 games per iteration. BO with adaptive sampling uses N ∈ [16, 64]. Win rates as Empire | Nomads.

Random Search   (1+1)-ES   BO (adaptive)
13 | 87         26 | 74    51 | 49

BO (N=16)   BO (N=32)   BO (N=64)
34 | 66     61 | 39     48 | 52

Ablation on game designs. RuleSmith achieves balanced win rates across map sizes and turn limits (turns in parentheses).

5×5 (16)   7×7 (16)   9×9 (32)   11×11 (32)
53 | 47    51 | 49    48 | 52    51 | 49

Citation

If you find our work useful, please cite:

@article{zeng2026rulesmith,
  title = {{RuleSmith}: Multi-Agent {LLMs} for Automated Game Balancing},
  author = {Zeng, Ziyao and Liu, Chen and Liu, Tianyu and Wang, Hao and Sun, Xiatao and Yang, Fengyu and Liu, Xiaofeng and Fan, Zhiwen},
  journal = {arXiv preprint arXiv:2602.06232},
  year = {2026}
}