Solving Hi-Q with AlphaZero and Curriculum Learning

On a snowy day at my parents’ house, I was passing time with family and reading a blog post about planner programming when my grandmother invited me to play Othello at her apartment. While getting the game out of her cabinet, I noticed a box for Hi-Q. I owned this game as a child and had played it to the point where I could reliably win with one peg left every time. To my disappointment, I could only get down to 4 pegs this time. While my Hi-Q skills had atrophied, my programming skills had improved considerably, and I wondered if I could find an optimal solution using Picat. I couldn’t. But I did find a way to solve it with deep learning.

Attempt 1: Constraint-Based Approaches

I installed Picat and vibe-coded some scripts to find a sequence of moves leading to one peg in the center using the planning module. The scripts ran indefinitely without producing output—the search space was simply too large.

I’m the author of a MiniZinc MCP server connected to my Claude account. Since planners are essentially a DSL on top of constraints, it seemed like a constraint-based solution should work. But Claude’s solutions also timed out.

Looking for hints, I found a paper, “Integer Programming Based Algorithms for Peg Solitaire Problems.” The key insight: the search space is large enough that you need sophisticated pruning methods. I could have fed this paper to Claude and been done with it, but I wanted a solution I actually understood. The integer programming approach would have been another black box. This motivated my pivot to deep learning—the loss function was intuitive, and I suspected a neural network would naturally learn to prune the search space.

Attempt 2: PPO

My previous experience with reinforcement learning was using Proximal Policy Optimization in PyTorch to solve simple Gym environments like CartPole and LunarLander. I vibe-coded a Gym environment for Hi-Q and a basic PPO network, then trained for 100k timesteps. The resulting policy reliably got down to 4 pegs. Not bad!
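
For concreteness, here’s a rough sketch of what that environment looks like. It’s illustrative rather than my exact code, and it’s written against the Gymnasium API: the board is the 33-hole English cross stored as a 7×7 boolean grid, an action is a (row, column, direction) triple, and every legal jump earns +1.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

# 7x7 grid; True marks the 33 playable holes of the English cross board.
MASK = np.zeros((7, 7), dtype=bool)
MASK[2:5, :] = True
MASK[:, 2:5] = True

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


class HiQEnv(gym.Env):
    """Peg solitaire as a single-agent episodic task (illustrative sketch)."""

    def __init__(self):
        # One discrete action per (row, col, direction) triple; illegal ones are masked.
        self.action_space = spaces.Discrete(7 * 7 * 4)
        self.observation_space = spaces.MultiBinary(7 * 7)
        self.board = None

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.board = MASK.copy()   # a peg in every hole...
        self.board[3, 3] = False   # ...except the centre
        return self.board.flatten().astype(np.int8), {}

    def _legal(self, r, c, d):
        dr, dc = DIRS[d]
        jr, jc = r + dr, c + dc          # the peg being jumped over
        tr, tc = r + 2 * dr, c + 2 * dc  # the landing hole
        return (0 <= tr < 7 and 0 <= tc < 7 and MASK[r, c] and MASK[tr, tc]
                and self.board[r, c] and self.board[jr, jc] and not self.board[tr, tc])

    def legal_actions(self):
        return [(r * 7 + c) * 4 + d
                for r in range(7) for c in range(7) for d in range(4)
                if self._legal(r, c, d)]

    def step(self, action):
        # Assumes the agent only picks from legal_actions() (action masking in the policy).
        r, c, d = action // 28, (action % 28) // 4, action % 4
        dr, dc = DIRS[d]
        self.board[r, c] = False
        self.board[r + dr, c + dc] = False
        self.board[r + 2 * dr, c + 2 * dc] = True
        done = len(self.legal_actions()) == 0
        reward = 1.0   # +1 per valid jump
        return self.board.flatten().astype(np.int8), reward, done, False, {}
```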

But the policy would converge on a solution and refuse to deviate; it was stuck in a local maximum. When I analyzed the model’s behavior, it was playing the exact same 30 moves every single game, ending with two pegs that couldn’t jump each other. The +1 reward per move meant it had found a local optimum: making 30 valid moves and getting stuck (+30 total reward) was good enough that it never explored further.

Attempt 3: Breaking Out of Local Maxima

Claude proposed four approaches to escape the local optimum. I tested them in parallel:

  1. Remove per-move reward — Only reward at game end based on remaining pegs. Result: Didn’t help.
  2. Progressive reward shaping — Exponential scaling where fewer remaining pegs = much higher reward. Result: Marginal improvement.
  3. Higher exploration — Increased entropy coefficient from 0.01 to 0.05. Result: Didn’t help.
  4. Curriculum learning — Start the agent on easier board states (fewer pegs) and gradually increase difficulty. Result: Significant improvement!

Only curriculum learning significantly decreased the loss. After 1 million timesteps with curriculum learning, I got down to 2 pegs, but I still couldn’t solve the game completely.
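
The heart of the curriculum is just how you reset the environment. Here’s a sketch of the idea, reusing MASK and DIRS from the environment sketch above. The reverse-jump construction is one convenient way to generate easier states (every board it produces is solvable by construction), not necessarily what my code does, and the schedule for ramping num_pegs up toward 32 is the remaining knob.

```python
import random
import numpy as np

def reverse_jumps(board):
    """All (r, c, d) where a jump could be 'undone': a peg at (r, c) with two
    empty holes behind it in direction d."""
    moves = []
    for r in range(7):
        for c in range(7):
            if not board[r, c]:
                continue
            for d, (dr, dc) in enumerate(DIRS):
                jr, jc = r + dr, c + dc
                tr, tc = r + 2 * dr, c + 2 * dc
                if (0 <= tr < 7 and 0 <= tc < 7 and MASK[jr, jc] and MASK[tr, tc]
                        and not board[jr, jc] and not board[tr, tc]):
                    moves.append((r, c, d))
    return moves

def curriculum_start(num_pegs, rng=random):
    """A board with num_pegs pegs that is solvable by construction: start from the
    winning position (one peg in the centre) and undo random jumps."""
    while True:
        board = np.zeros((7, 7), dtype=bool)
        board[3, 3] = True
        while board.sum() < num_pegs:
            candidates = reverse_jumps(board)
            if not candidates:            # rare dead end: start over
                break
            r, c, d = rng.choice(candidates)
            dr, dc = DIRS[d]
            board[r, c] = False                    # the landing hole empties again...
            board[r + dr, c + dc] = True           # ...the jumped peg comes back...
            board[r + 2 * dr, c + 2 * dc] = True   # ...and the jumper returns to its start
        if board.sum() == num_pegs:
            return board
```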

Attempt 4: AlphaZero

The neural network needed to see examples of winning games before it could converge on a solution, so I switched to the AlphaZero architecture, which combines a policy/value network with Monte Carlo Tree Search (MCTS). MCTS runs simulations of possible move sequences before committing to a move, giving the agent the ability to plan ahead and occasionally stumble onto wins during training.
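
If MCTS is unfamiliar, the search step looks roughly like this. It’s a stripped-down, single-player sketch of the AlphaZero-style search (PUCT selection, expansion with network priors, value backup), not my exact code: net is assumed to return a (policy_logits, value) pair as NumPy values, and state is assumed to expose copy, apply, legal_actions, encode, and terminal_value helpers.

```python
import math
import numpy as np


class Node:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the policy head
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # action -> Node

    def value(self):              # Q(s, a)
        return self.value_sum / self.visits if self.visits else 0.0


def puct(parent, child, c_puct=1.5):
    # AlphaZero selection rule: exploit Q, explore in proportion to the prior
    # and how under-visited the child is relative to its parent.
    return child.value() + c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)


def expand(node, state, net):
    """Add one child per legal move, weighted by the policy head; return a value estimate."""
    legal = state.legal_actions()
    if not legal:                               # terminal: score by pegs remaining
        return state.terminal_value()
    logits, value = net(state.encode())
    priors = np.exp(logits[legal] - np.max(logits[legal]))
    priors /= priors.sum()                      # softmax over legal moves only
    for a, p in zip(legal, priors):
        node.children[a] = Node(prior=float(p))
    return float(value)


def run_mcts(root_state, net, num_simulations=200):
    """Return root visit counts after num_simulations simulations; the self-play
    policy is derived from these counts."""
    root = Node(prior=1.0)
    expand(root, root_state, net)

    for _ in range(num_simulations):
        node, state, path = root, root_state.copy(), [root]
        # 1. Selection: descend the tree, always taking the highest-PUCT child.
        while node.children:
            parent = node
            action, node = max(parent.children.items(),
                               key=lambda kv: puct(parent, kv[1]))
            state = state.apply(action)
            path.append(node)
        # 2. Expansion + evaluation at the leaf.
        value = expand(node, state, net)
        # 3. Backup: single-player game, so no sign flip along the path.
        for n in path:
            n.visits += 1
            n.value_sum += value

    return {a: child.visits for a, child in root.children.items()}
```

The num_simulations argument is what I’m loosely calling MCTS depth below.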

Two changes made the difference:

  1. MCTS search depth (the number of simulations per move) became the most important hyperparameter: more simulations meant better planning and faster convergence
  2. Full-board mixing—even during curriculum learning, I mixed in 20% full-board games so the model always trained on the actual problem
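
The mixing itself is only a few lines in the reset logic. Here’s a sketch, reusing MASK and curriculum_start from the earlier snippets; the 20% ratio is the value I settled on.

```python
import random

FULL_BOARD_PROB = 0.2   # keep 20% of self-play games on the real starting position

def training_start_board(curriculum_pegs, rng=random):
    """Pick a start state for a self-play game: usually an easier curriculum board,
    but sometimes the actual 32-peg Hi-Q opening."""
    if rng.random() < FULL_BOARD_PROB:
        board = MASK.copy()
        board[3, 3] = False      # the real problem: full board, centre empty
        return board
    return curriculum_start(curriculum_pegs, rng)
```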

I fired off a 6-hour training job on an H100 and went to my family Christmas party. When I woke up, Santa had delivered an optimal solution to Hi-Q.

Reflections

The integer programming paper solved this in 15 minutes on a 2001 laptop CPU. Mine took 6+ hours on a 2025 H100. Hardly an improvement in efficiency! But I learned about curriculum learning, and I ended up with an artifact I actually understand—something I wouldn’t have gotten from the integer programming route, even with vibe coding.

This project taught me that vibe coding lets me punch above my weight class with neural network architectures. For one-off projects, it’s great: I can vibe-code slop, and as long as my network converges on something useful, that’s all that matters. I’m skeptical this strategy scales to more sophisticated projects; deep learning has a lot of plumbing and a lot of failure modes. But for off-the-shelf algorithms in PyTorch, Claude Code acted as faster arms and legs to execute on my ideas and test hypotheses.

The code is on GitHub, and you can view the solution on GitHub Pages.