Mathematics | Computer Science

 

Tymur Haivoronskyi, 2006 | Benken, SG

 

This work investigates Transformer-based language models for Hex, a game with a simple action space and deep strategy. Our goal is to determine how best to train such models. We compile a dataset of over 25,000 complete games and train separate GPT-style models using several labeling techniques. We then refine these models with an iterative self-play pipeline, fine-tuning on both self-generated winning games and the original dataset. Evaluated on test cases with seen and unseen openings, the refined agents outperform earlier generations and are comparable to baseline algorithms.

Research question

Recent breakthroughs such as BERT and ChatGPT have shown that Transformer models can master complex tasks that previously seemed computationally out of reach. However, their use in strategic games remains largely unexplored. Inspired by AlphaGo Zero, which achieved its performance with enormous compute, this project examines whether a small, compute-efficient Transformer can learn Hex, a game with simple rules but deep strategy, by leveraging self-play to iteratively refine its decision-making. We train a Transformer model on sequences of moves from Hex games and explore three ways of improving it. First, we split the full dataset (A) into games won by blue (B) and games won by red (C) and train a separate agent for each color; we hypothesize that this mitigates the influence of the bad moves made by the eventual loser. Second, we add an indicator token (D) at the beginning of each sequence to denote whether the game was won or lost, thereby encouraging the model to play optimally. Third, we use self-play to continuously generate additional training data for the models.
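
The three labeling variants can be illustrated with a minimal sketch. This is a hypothetical illustration only; the field names (moves, winner) and the exact form of the indicator token are assumptions, not the project's actual data format.

# Sketch of the dataset variants (A)-(D), assuming each game is stored as a
# dict with its move sequence and the winning color.
games = [
    {"moves": ["a1", "b2", "c3"], "winner": "blue"},   # toy example
    {"moves": ["f6", "e5", "d4"], "winner": "red"},
]

dataset_A = [g["moves"] for g in games]                           # all games
dataset_B = [g["moves"] for g in games if g["winner"] == "blue"]  # blue wins only
dataset_C = [g["moves"] for g in games if g["winner"] == "red"]   # red wins only
dataset_D = [[f"<{g['winner']}-win>"] + g["moves"] for g in games]  # prepended indicator token (assumed format)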

Methods

We collect human-played Hex games and use a deterministic algorithm (https://www.lutanho.net/play/hex.html) to complete unfinished games, yielding over 25,000 games. In addition to this base dataset (A), we derive three refined datasets, (B), (C), and (D), and train four corresponding GPT-style Transformer models for 60 epochs using a move-based tokenizer.
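
A move-based tokenizer can be sketched as one token per board cell plus a few special tokens. This is a sketch under the assumption of an 11x11 board; the actual board size, vocabulary, and special tokens used in the project are not specified in the abstract.

import string

# One token id per board cell, plus special tokens (assumed vocabulary layout).
BOARD_SIZE = 11
cells = [f"{col}{row}" for col in string.ascii_lowercase[:BOARD_SIZE]
         for row in range(1, BOARD_SIZE + 1)]
vocab = {"<pad>": 0, "<start>": 1, **{cell: i + 2 for i, cell in enumerate(cells)}}

def encode(moves):
    # Convert a move sequence such as ["a1", "k11", "f6"] into token ids.
    return [vocab["<start>"]] + [vocab[m] for m in moves]

print(encode(["a1", "k11", "f6"]))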

We then improve the (B) and (C) models via self-play: at each move, instead of picking only the most likely move, we sample from several high-probability legal options to increase diversity. After filtering the resulting games by outcome, we fine-tune each model only on the winning games for that model's color, combined with the initial dataset as a regularizer. We repeat this cycle five times, using the previous iteration's models to generate new data, and obtain new models (B1)–(B5) and (C1)–(C5).
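
The sampling step could look roughly as follows. This is a hypothetical top-k style sketch; the function name, the value of k, and the exact masking scheme are assumptions rather than the project's implementation.

import torch

def sample_move(logits, legal_moves, k=5):
    # logits: model scores over the move vocabulary for the next position;
    # legal_moves: token ids of cells that are still empty.
    # Instead of taking the argmax, sample among the k most likely legal moves.
    mask = torch.full_like(logits, float("-inf"))
    mask[legal_moves] = logits[legal_moves]          # restrict to legal moves
    topk = torch.topk(mask, k=min(k, len(legal_moves)))
    probs = torch.softmax(topk.values, dim=-1)       # renormalise over the top k
    choice = torch.multinomial(probs, num_samples=1)
    return topk.indices[choice].item()

Restricting the candidates to legal moves during self-play keeps the generated games valid, while sampling rather than taking the single best move provides the diversity needed for new training data.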

Results

We find that labeling games does not improve performance; Fisher’s exact test (min. P = 0.1507) shows no significant difference among the four baseline agents. In contrast, iterative fine-tuning over five generations significantly boosts performance. For seen openings, the red model’s win rate increases from 42.4% (212/500) to 53.4% (267/500) (P = 0.0006), while illegal moves drop from 928 to 236; the blue model’s win rate rises from 43.4% to 54.4%, with illegal moves decreasing from 866 to 226. For unseen openings, the red model improves from about 35% (219/625) to 44% (275/625) and reduces illegal moves from 1478 to 537, while the blue model’s win rate increases from roughly 28.5% to 39–40% and illegal moves fall from 1731 to below 1000. Notably, the proportion of games without any illegal move grows from approximately 50% to nearly 70% by generation 5.
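
For reference, the significance of the seen-opening win-rate increase can be checked with a standard Fisher's exact test on the reported counts. This is a sketch using scipy; the exact test configuration used in the project may differ.

from scipy.stats import fisher_exact

# 2x2 contingency table for the red model on seen openings:
# wins/losses after five generations (267/233 of 500) vs. before (212/288 of 500).
table = [[267, 500 - 267],
         [212, 500 - 212]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(p_value)  # roughly 6e-4, consistent with the reported P = 0.0006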

Discussion

The iterative self-improvement approach boosts both win rate and move legality, despite minor fluctuations between iterations. While self-play effectively augments the training data, it may not be the most efficient method; exploring new data augmentation techniques or adaptive sampling strategies could yield larger gains at lower computational cost.

Conclusions

The findings confirm that Transformer language models, trained on purely sequential representations of a board game, can learn strategies strong enough to win against the initial algorithm. However, it remains an open question whether more sophisticated approaches, such as enhanced data preparation or additional architectural features, could further improve performance. We believe this project offers an intriguing perspective on how advanced NLP techniques can be extended beyond strictly linguistic domains.

 

 

Appraisal by the expert

Jakub Adamek

This work presents an interesting investigation into training a transformer model for playing Hex, tackling a relevant research question. The student gave impressively careful consideration to data collection, modeling, and evaluation. Both the core ideas of separating the red and blue models, and using self-play to improve them, make perfect sense, and were rigorously tested in spite of limited access to computational resources. The statistical analysis and inclusion of negative results both show a dedication to scientific principles. Overall, an impressive piece of work.

Rating:

excellent

Special prize «Taiwan International Science Fair (TISF)», donated by the Odd Fellows, Helvetia Loge Nr. 1

 

 

 

Kantonsschule Wattwil
Teacher: Emil Müller