Can a 600M parameter language model learn to play chess? Not quite - but it can learn the rules. I fine-tuned Qwen3-0.6B on 776k chess positions to predict whether a move is legal. The goal isn’t to build a chess engine (Stockfish exists), but to teach a small LLM to reason about board states and constraints.
This is a stepping stone toward building models that can generate best moves with chain-of-thought reasoning. First teach the rules, then teach strategy.
Chess engines like Stockfish use tree search and hand-coded evaluation functions. They’re superhuman but not general-purpose - you can’t ask Stockfish to explain its reasoning or apply chess concepts to analogous problems.
Language models, on the other hand, can learn patterns from data and potentially generate human-like reasoning. But they need to understand the fundamentals first. Move legality is that foundation:
If a model can’t learn move legality, it won’t learn strategy.
Each training example has three components:
1. Board Position - A text description of piece placement:
a8:black rook
b8:black knight
c8:black bishop
d8:black queen
e8:black king
...
e2:white pawn
f1:white king
2. Query - A question about move legality:
Is it legal for the white bishop at c4 to move to (or capture) f7?
Answer only yes or no
3. Output - The answer, optionally with reasoning:
<think>
The white bishop on c4 can move diagonally. f7 is on the diagonal from c4.
There's a black pawn on f7, so this is a capture. The move is legal.
</think>
yes
The <think> tags are for chain-of-thought training. Not all examples include them - some just answer “yes” or “no” directly.
I generated the 776k examples using python-chess, starting with real game positions from the Lichess database and synthetically creating move queries against each position (both legal and illegal).
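The generation loop can be sketched roughly as follows with python-chess. This is my reconstruction, not the actual script: `describe_board` and `make_example` are hypothetical helper names, and a real pipeline would also sample illegal targets systematically rather than hard-coding one.

```python
# Sketch of the data-generation step, assuming python-chess is installed
# (pip install python-chess). Helper names are mine, not from the project.
import random

import chess

PIECE_NAMES = {
    chess.PAWN: "pawn", chess.KNIGHT: "knight", chess.BISHOP: "bishop",
    chess.ROOK: "rook", chess.QUEEN: "queen", chess.KING: "king",
}

def describe_board(board):
    """Render a position as 'square:color piece' lines, a8 first."""
    squares = sorted(board.piece_map(),
                     key=lambda s: (-chess.square_rank(s), chess.square_file(s)))
    lines = []
    for square in squares:
        piece = board.piece_at(square)
        color = "white" if piece.color == chess.WHITE else "black"
        lines.append(f"{chess.square_name(square)}:{color} {PIECE_NAMES[piece.piece_type]}")
    return "\n".join(lines)

def make_example(board, move):
    """Build one (prompt, output) legality example for a candidate move."""
    piece = board.piece_at(move.from_square)
    color = "white" if piece.color == chess.WHITE else "black"
    query = (f"Is it legal for the {color} {PIECE_NAMES[piece.piece_type]} at "
             f"{chess.square_name(move.from_square)} to move to (or capture) "
             f"{chess.square_name(move.to_square)}?\nAnswer only yes or no")
    answer = "yes" if move in board.legal_moves else "no"
    return {"prompt": describe_board(board) + "\n" + query, "output": answer}

board = chess.Board()
legal = random.choice(list(board.legal_moves))
illegal = chess.Move(chess.E2, chess.E5)  # a pawn cannot advance three squares
print(make_example(board, legal)["output"])    # -> "yes"
print(make_example(board, illegal)["output"])  # -> "no"
```

Checking candidate moves against `board.legal_moves` is what makes the labels trustworthy: python-chess accounts for pins, checks, and turn order, so no hand-written rule logic is needed.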
Base Model: Qwen3-0.6B - a 600M parameter causal language model. Small enough to train on consumer hardware, large enough to learn patterns.
Framework: TRL (Transformer Reinforcement Learning) for supervised fine-tuning, Unsloth for memory optimization.
Hardware: AMD RX 7900 XT (24GB VRAM). Full fine-tuning uses ~20GB, LoRA uses ~8GB.
BATCH_SIZE = 64
GRADIENT_ACCUMULATION_STEPS = 2 # Effective batch = 128
LEARNING_RATE = 2e-4
NUM_EPOCHS = 3
MAX_SEQ_LENGTH = 1024
WEIGHT_DECAY = 0.001
OPTIMIZER = "paged_adamw_8bit"
PRECISION = "bfloat16"
I used full parameter fine-tuning, not LoRA. The model is small enough that full tuning fits in VRAM and trains in reasonable time (~6 hours for 3 epochs).
Qwen3 uses a chat template for instruction following:
<|im_start|>user
Consider the position below and answer the query:
[board position]
Query: Is it legal for the white knight at f3 to move to g5?
Answer only yes or no
<|im_end|>
<|im_start|>assistant
yes
<|im_end|>
The tokenizer handles this formatting via tokenizer.apply_chat_template(), which wraps the prompt and output in the correct structure.
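To make the structure concrete without loading the tokenizer, here is a minimal stand-in for what the template produces. `format_chatml` is my own helper, and the real Qwen3 template may add extra pieces (for example a system turn), so treat this as illustrative only.

```python
# Minimal stand-in for the ChatML wrapping that apply_chat_template emits.
# format_chatml is a hypothetical helper, not part of transformers.
def format_chatml(messages):
    turns = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    return "\n".join(turns) + "\n"

example = format_chatml([
    {"role": "user", "content": "Query: ...\nAnswer only yes or no"},
    {"role": "assistant", "content": "yes"},
])
print(example)
```

Each turn is delimited by `<|im_start|>{role}` and `<|im_end|>`, which is exactly the framing shown in the snippet above.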
The training loop is straightforward with TRL’s SFTTrainer:
import torch

from trl import SFTTrainer, SFTConfig
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"unsloth/Qwen3-0.6B",
torch_dtype=torch.bfloat16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-0.6B")
# Load dataset
dataset = load_dataset("json", data_files="data.jsonl", split="train")
# Format into chat template
def format_example(example):
messages = [
{"role": "user", "content": example["prompt"]},
{"role": "assistant", "content": example["output"]}
]
return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
dataset = dataset.map(format_example)
# Train
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=SFTConfig(
output_dir="./output",
num_train_epochs=3,
per_device_train_batch_size=64,
gradient_accumulation_steps=2,
learning_rate=2e-4,
bf16=True,
logging_steps=10,
save_strategy="steps",
save_steps=500,
),
dataset_text_field="text",
max_seq_length=1024,
)
trainer.train()
The key is dataset_text_field="text" - SFTTrainer expects the formatted text in a field called “text”.
Loss curve: Dropped quickly in the first epoch (from 1.2 to 0.3), then slowly decreased in epochs 2-3. By epoch 3, loss was ~0.15.
Overfitting: Minimal. The task is large enough (776k examples) and diverse enough that the model didn’t memorize. Validation loss tracked training loss closely.
Speed: With Unsloth optimizations, training ran at ~120 examples/second on the RX 7900 XT. Full 3-epoch run took about 6 hours.
Memory: Full fine-tuning used 19GB VRAM. LoRA would have used ~7GB but I wanted to avoid the merge step later.
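For reference, the LoRA route would look roughly like this with Hugging Face peft. The rank, alpha, and target modules below are illustrative values, not settings I actually ran.

```python
# Sketch of the LoRA alternative, using Hugging Face peft.
# The hyperparameter values here are illustrative, not tested settings.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                # low-rank dimension of the adapter matrices
    lora_alpha=32,       # scaling factor applied to the adapter output
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
# model = get_peft_model(model, lora_config)   # wraps the base model
# After training, model.merge_and_unload() folds the adapters back into
# the base weights - the extra merge step I wanted to avoid.
```

With only the adapter matrices receiving gradients, optimizer state shrinks dramatically, which is where the ~7GB figure comes from.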
Using the trained model:
from transformers import AutoTokenizer, AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"./output/chess-sft-qwen3-0.6b",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("./output/chess-sft-qwen3-0.6b")
# Format a chess position question
position = """a8:black rook
b8:black knight
...
[full board]
"""
prompt = f"""Consider the position below and answer the query:
{position}
Query: Is it legal for the white bishop at c4 to move to f7?
Answer only yes or no"""
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(text, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=50, do_sample=True, temperature=0.1)
# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(response)
# => "yes"
The model responds with just “yes” or “no” most of the time. Occasionally it includes reasoning in <think> tags if the position is complex.
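Because the reply may or may not carry a <think> block, downstream code needs a small normalizer. A sketch, with `parse_answer` being my own name for it:

```python
# Normalize a model reply to a yes/no label, stripping any <think> block.
# parse_answer is a hypothetical helper, not part of the training repo.
import re

def parse_answer(response):
    # Drop chain-of-thought reasoning, if present
    text = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    # Return the first bare yes/no token, or None if neither appears
    for token in text.strip().lower().split():
        if token in ("yes", "no"):
            return token
    return None

print(parse_answer("<think>f7 is on the diagonal.</think>\nyes"))  # -> yes
```

Returning `None` for unparseable replies makes it easy to count them separately during evaluation instead of silently scoring them wrong.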
I held out 10k examples for testing, on which the model reached roughly 85% accuracy (a 15% error rate). The errors cluster in edge cases; for basic moves (pawn pushes, knight jumps, bishop diagonals), accuracy is above 95%.
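The evaluation itself is just exact-match scoring over the held-out set. A sketch, assuming a list of dicts with "prompt" and "output" fields and a `generate_answer` callable that wraps the inference code shown earlier (both names are mine):

```python
# Exact-match evaluation over held-out legality examples.
# generate_answer is a hypothetical callable wrapping model inference.
def evaluate(examples, generate_answer):
    correct = 0
    for ex in examples:
        prediction = generate_answer(ex["prompt"])   # "yes" or "no"
        if prediction == ex["output"]:
            correct += 1
    return correct / len(examples)

# Stub "model" that always answers "yes", just to exercise the loop:
examples = [{"prompt": "p1", "output": "yes"},
            {"prompt": "p2", "output": "no"}]
print(evaluate(examples, lambda p: "yes"))  # -> 0.5
```

Splitting the score by move type (pawn push, knight jump, castling, en passant) is a one-line extension and is how per-category accuracies like the >95% figure above are obtained.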
This model is a foundation for:
1. Best move generation: Fine-tune on (position, best_move) pairs. The model already understands legal moves, now teach it which are good.
2. Chain-of-thought reasoning: The <think> examples teach the model to explain its reasoning. Scaling this up could produce models that explain why a move is best.
3. RLHF/DPO: Use this as a reward model for reinforcement learning. The model can score move quality, which can train a policy model.
4. Multimodal chess: Extend to take board images as input instead of text. Vision models + this text understanding = models that “see” chess positions.
Not a chess engine: This model verifies legality, it doesn’t play well. It has no concept of strategy, tactics, or winning positions. It’s like teaching someone the rules without teaching them how to win.
Small model: At 600M parameters, there’s limited capacity for complex reasoning. Scaling to 1.5B or 7B would likely improve both accuracy and reasoning quality.
No position evaluation: The model can’t tell you if a position is winning or losing. It only knows legal vs illegal.
Inference speed: Generating text is slower than Stockfish’s tree search. For move generation in real games, you’d need batching or faster inference.
I’m planning to pursue these extensions next.
The trained model is at chess-sft-qwen3-0.6b on Hugging Face. Training code is at github - it’s a simple TRL script that anyone can run on consumer hardware.