Can a 600M parameter language model learn to play chess? Not quite - but it can learn the rules. I fine-tuned Qwen3-0.6B on 776k chess positions to predict whether a move is legal. The goal isn’t to build a chess engine (Stockfish exists), but to teach a small LLM to reason about board states and constraints.
This is a stepping stone toward building models that can generate best moves with chain-of-thought reasoning. First teach the rules, then teach strategy.
Chess engines like Stockfish use tree search and hand-coded evaluation functions. They’re superhuman but not general-purpose - you can’t ask Stockfish to explain its reasoning or apply chess concepts to analogous problems.
Language models, on the other hand, can learn patterns from data and potentially generate human-like reasoning. But they need to understand the fundamentals first. Move legality is that foundation:
If a model can’t learn move legality, it won’t learn strategy.
Each training example has three components:
1. Board Position - A text description of piece placement:
a8:black rook
b8:black knight
c8:black bishop
d8:black queen
e8:black king
...
e2:white pawn
f1:white king
2. Query - A question about move legality:
Is it legal for the white bishop at c4 to move to (or capture) f7?
Answer only yes or no
3. Output - The answer, optionally with reasoning:
<think>
The white bishop on c4 can move diagonally. f7 is on the diagonal from c4.
There's a black pawn on f7, so this is a capture. The move is legal.
</think>
yes
The <think> tags are for chain-of-thought training. Not all examples include them - some just answer “yes” or “no” directly.
I generated the 776k examples using python-chess, starting with real game positions from the Lichess database and synthetically creating move queries against each position (both legal and illegal).
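The generation loop can be sketched roughly as follows with python-chess. This is my reconstruction, not the actual script: `describe_board` and `make_example` are hypothetical helper names, and a real pipeline would also sample illegal targets systematically rather than hard-coding one.

```python
# Sketch of the data-generation step, assuming python-chess is installed
# (pip install python-chess). Helper names are mine, not from the project.
import random

import chess

PIECE_NAMES = {
    chess.PAWN: "pawn", chess.KNIGHT: "knight", chess.BISHOP: "bishop",
    chess.ROOK: "rook", chess.QUEEN: "queen", chess.KING: "king",
}

def describe_board(board):
    """Render a position as 'square:color piece' lines, a8 first."""
    squares = sorted(board.piece_map(),
                     key=lambda s: (-chess.square_rank(s), chess.square_file(s)))
    lines = []
    for square in squares:
        piece = board.piece_at(square)
        color = "white" if piece.color == chess.WHITE else "black"
        lines.append(f"{chess.square_name(square)}:{color} {PIECE_NAMES[piece.piece_type]}")
    return "\n".join(lines)

def make_example(board, move):
    """Build one (prompt, output) legality example for a candidate move."""
    piece = board.piece_at(move.from_square)
    color = "white" if piece.color == chess.WHITE else "black"
    query = (f"Is it legal for the {color} {PIECE_NAMES[piece.piece_type]} at "
             f"{chess.square_name(move.from_square)} to move to (or capture) "
             f"{chess.square_name(move.to_square)}?\nAnswer only yes or no")
    answer = "yes" if move in board.legal_moves else "no"
    return {"prompt": describe_board(board) + "\n" + query, "output": answer}

board = chess.Board()
legal = random.choice(list(board.legal_moves))
illegal = chess.Move(chess.E2, chess.E5)  # a pawn cannot advance three squares
print(make_example(board, legal)["output"])    # -> "yes"
print(make_example(board, illegal)["output"])  # -> "no"
```

Checking candidate moves against `board.legal_moves` is what makes the labels trustworthy: python-chess accounts for pins, checks, and turn order, so no hand-written rule logic is needed.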
Base Model: Qwen3-0.6B - a 600M parameter causal language model. Small enough to train on consumer hardware, large enough to learn patterns.
Framework: TRL (Transformer Reinforcement Learning) for supervised fine-tuning, Unsloth for memory optimization.
Hardware: AMD RX 7900 XT (24GB VRAM). Full fine-tuning uses ~20GB, LoRA uses ~8GB.
BATCH_SIZE = 64
GRADIENT_ACCUMULATION_STEPS = 2 # Effective batch = 128
LEARNING_RATE = 2e-4
NUM_EPOCHS = 3
MAX_SEQ_LENGTH = 1024
WEIGHT_DECAY = 0.001
OPTIMIZER = "paged_adamw_8bit"
PRECISION = "bfloat16"
I used full parameter fine-tuning, not LoRA. The model is small enough that full tuning fits in VRAM and trains in reasonable time (~6 hours for 3 epochs).
Qwen3 uses a chat template for instruction following:
<|im_start|>user
Consider the position below and answer the query:
[board position]
Query: Is it legal for the white knight at f3 to move to g5?
Answer only yes or no
<|im_end|>
<|im_start|>assistant
yes
<|im_end|>
The tokenizer handles this formatting via tokenizer.apply_chat_template(), which wraps the prompt and output in the correct structure.
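To make the structure concrete without loading the tokenizer, here is a minimal stand-in for what the template produces. `format_chatml` is my own helper, and the real Qwen3 template may add extra pieces (for example a system turn), so treat this as illustrative only.

```python
# Minimal stand-in for the ChatML wrapping that apply_chat_template emits.
# format_chatml is a hypothetical helper, not part of transformers.
def format_chatml(messages):
    turns = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(turns) + "\n"

example = format_chatml([
    {"role": "user", "content": "Query: ...\nAnswer only yes or no"},
    {"role": "assistant", "content": "yes"},
])
print(example)
```

Each turn is delimited by `<|im_start|>{role}` and `<|im_end|>`, which is exactly the framing shown in the snippet above.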
The training loop is straightforward with TRL’s SFTTrainer:
import torch

from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"unsloth/Qwen3-0.6B",
torch_dtype=torch.bfloat16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-0.6B")
# Load dataset
dataset = load_dataset("json", data_files="data.jsonl", split="train")
# Format into chat template
def format_example(example):
messages = [
{"role": "user", "content": example["prompt"]},
{"role": "assistant", "content": example["output"]}
]
return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
dataset = dataset.map(format_example)
# Train
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=SFTConfig(
output_dir="./output",
num_train_epochs=3,
per_device_train_batch_size=64,
gradient_accumulation_steps=2,
learning_rate=2e-4,
bf16=True,
logging_steps=10,
save_strategy="steps",
save_steps=500,
),
dataset_text_field="text",
max_seq_length=1024,
)
trainer.train()
The key is dataset_text_field="text" - SFTTrainer expects the formatted text in a field called “text”.
Loss curve: Dropped quickly in the first epoch (from 1.2 to 0.3), then slowly decreased in epochs 2-3. By epoch 3, loss was ~0.15.
Overfitting: Minimal. The task is large enough (776k examples) and diverse enough that the model didn’t memorize. Validation loss tracked training loss closely.
Speed: With Unsloth optimizations, training ran at ~120 examples/second on the RX 7900 XT. Full 3-epoch run took about 6 hours.
Memory: Full fine-tuning used 19GB VRAM. LoRA would have used ~7GB but I wanted to avoid the merge step later.
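For reference, the LoRA route would look roughly like this with Hugging Face peft. The rank, alpha, and target modules below are illustrative values, not settings I actually ran.

```python
# Sketch of the LoRA alternative, using Hugging Face peft.
# The hyperparameter values here are illustrative, not tested settings.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                # low-rank dimension of the adapter matrices
    lora_alpha=32,       # scaling factor applied to the adapter output
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)   # wraps the base model
# After training, model.merge_and_unload() folds the adapters back into
# the base weights - the extra merge step I wanted to avoid.
```

With only the adapter matrices receiving gradients, optimizer state shrinks dramatically, which is where the ~7GB figure comes from.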
Using the trained model:
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"./output/chess-sft-qwen3-0.6b",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./output/chess-sft-qwen3-0.6b")
# Format a chess position question
position = """a8:black rook
b8:black knight
...
[full board]
"""
prompt = f"""Consider the position below and answer the query:
{position}
Query: Is it legal for the white bishop at c4 to move to f7?
Answer only yes or no"""
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(text, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=50, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
# => "yes"
The model responds with just “yes” or “no” most of the time. Occasionally it includes reasoning in <think> tags if the position is complex.
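Because the reply may or may not carry a <think> block, downstream code needs a small normalizer. A sketch, with `parse_answer` being my own name for it:

```python
# Normalize a model reply to a yes/no label, stripping any <think> block.
# parse_answer is a hypothetical helper, not part of the training repo.
import re

def parse_answer(response):
    # Drop chain-of-thought reasoning, if present
    text = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # Return the first bare yes/no token, or None if neither appears
    for token in text.strip().lower().split():
        if token in ("yes", "no"):
            return token
    return None

print(parse_answer("<think>f7 is on the diagonal.</think>\nyes"))  # -> yes
```

Returning `None` for unparseable replies makes it easy to count them separately during evaluation instead of silently scoring them wrong.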
I held out 10k examples for testing, on which the model reached roughly 85% accuracy (a 15% error rate). The errors cluster in edge cases; for basic moves (pawn pushes, knight jumps, bishop diagonals), accuracy is above 95%.
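The evaluation itself is just exact-match scoring over the held-out set. A sketch, assuming a list of dicts with "prompt" and "output" fields and a `generate_answer` callable that wraps the inference code shown earlier (both names are mine):

```python
# Exact-match evaluation over held-out legality examples.
# generate_answer is a hypothetical callable wrapping model inference.
def evaluate(examples, generate_answer):
    correct = 0
    for ex in examples:
        prediction = generate_answer(ex["prompt"])   # "yes" or "no"
        if prediction == ex["output"]:
            correct += 1
    return correct / len(examples)

# Stub "model" that always answers "yes", just to exercise the loop:
examples = [{"prompt": "p1", "output": "yes"},
            {"prompt": "p2", "output": "no"}]
print(evaluate(examples, lambda p: "yes"))  # -> 0.5
```

Splitting the score by move type (pawn push, knight jump, castling, en passant) is a one-line extension and is how per-category accuracies like the >95% figure above are obtained.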
This model is a foundation for:
1. Best move generation: Fine-tune on (position, best_move) pairs. The model already understands legal moves, now teach it which are good.
2. Chain-of-thought reasoning: The <think> examples teach the model to explain its reasoning. Scaling this up could produce models that explain why a move is best.
3. RLHF/DPO: Use this as a reward model for reinforcement learning. The model can score move quality, which can train a policy model.
4. Multimodal chess: Extend to take board images as input instead of text. Vision models + this text understanding = models that “see” chess positions.
Not a chess engine: This model verifies legality, it doesn’t play well. It has no concept of strategy, tactics, or winning positions. It’s like teaching someone the rules without teaching them how to win.
Small model: At 600M parameters, there’s limited capacity for complex reasoning. Scaling to 1.5B or 7B would likely improve both accuracy and reasoning quality.
No position evaluation: The model can’t tell you if a position is winning or losing. It only knows legal vs illegal.
Inference speed: Generating text is slower than Stockfish’s tree search. For move generation in real games, you’d need batching or faster inference.
I’m planning to pursue these extensions next.
The trained model is at chess-sft-qwen3-0.6b on Hugging Face. Training code is at github - it’s a simple TRL script that anyone can run on consumer hardware.