
Neural Network Design for Tetris
This guide outlines several neural network architectures and training strategies for building a strong Tetris-playing agent. Each variant specifies inputs, outputs, training objectives, and trade-offs so you can choose the approach that best matches your constraints and goals.
Overview
- Objective: Maximize survival time and/or score (lines cleared; bonuses for Tetris, T-spins if supported).
- Environment: 10×20 board, 7 tetrominoes (I, O, T, S, Z, J, L), preview queue (next-N), hold piece (optional), gravity/lock rules.
- Key design choice: Action granularity.
- High-level (recommended): Choose final placement (x, rotation, possibly use-hold). Requires a legal move generator but dramatically simplifies learning.
- Low-level: Left/right/rotate/soft-drop/hard-drop per frame. Closer to raw control but harder to explore and learn.
Common Building Blocks
Action Space Options
- High-level placements (preferred): Enumerate all legal (rotation, column) landings given current piece, ghost drop, and optional hold toggle. Typical count: 10–40 moves per state depending on piece and board.
- Low-level primitives: 5–8 discrete actions; requires long horizons to realize outcomes.
State Encoding (choose a subset)
- Board grid: 10×20 binary or integer heights; often encoded as channels.
- Occupied cells (1/0).
- Recent clears/lock delay (optional).
- Column features: heights (10), holes per column, bumpiness, wells.
- Piece features: one-hot for current piece (7), rotation index, spawn column; next-k queue (k∈[1,5]); hold available flag and held piece.
- Rule flags: lock delay, combo count, back-to-back status (if scoring uses these).
- Normalization: scale counts to [0,1] using board height; standardize scalar features.
Rewards and Training Signals
- Sparse: +lines_cleared per placement; +bonus for Tetris/T-spins; small -1 per piece for survival pressure.
- Shaped (careful): penalties for holes, high aggregate height, bumpiness; bonuses for perfect clears; avoid over-shaping that causes cheese strategies.
- Discount factor: γ ∈ [0.95, 0.999].
Legal Move Generator (for high-level actions)
Implement a deterministic enumerator for all distinct final placements:
- For each rotation, scan all columns, slide piece, hard-drop to collision, and record landing if no overlap and inside bounds.
- Optionally include an action that toggles hold and places the held piece.
- Deduplicate symmetric landings.
Evaluation Metrics
- Average lines cleared or score per game.
- Survival time (pieces placed, seconds at given gravity).
- Structural metrics: hole count, average height, bumpiness, well depth distribution.
- Decision latency (ms/action) for real-time play.
Variant 1 — Feature-Based Afterstate Value MLP (Lightweight, Fast)
A compact MLP that scores "afterstates" (the board after placing the current piece). The agent picks the placement with highest predicted value.
- Inputs (engineered features; per-afterstate):
- Aggregate height, max height, holes, row/column transitions, bumpiness, wells, eroded cells, lines cleared by the move, parity features, hidden surface area.
- Optional: next-1 piece one-hot; hold availability.
- Network:
- MLP: 32–128 hidden units × 2–3 layers, ReLU/SiLU; layer norm optional.
- Output: scalar V(s′) predicting long-term return from the afterstate.
- Action selection:
- Enumerate legal placements; compute afterstates s′.
- Evaluate V(s′) for each; pick argmax.
- Training:
- RL: TD(λ) or n-step returns on episodes of self-play; reward = lines cleared minus structural penalties; γ≈0.99.
- Imitation: Regress to scores from a strong hand-coded heuristic, then fine-tune with RL.
- Loss: L = (Vθ(s′) − G)^2 with target G (bootstrapped or Monte Carlo).
- Hyperparameters:
- Batch 1k–16k afterstates; Adam 1e-3 → 1e-4; λ ∈ [0.7, 0.95].
- Data augmentation: mirror board (swap columns) and swap L/J, S/Z where valid.
- Pros:
- Extremely fast; low-latency; easy to stabilize; minimal compute.
- Cons:
- Depends on feature quality; limited ability to exploit nuanced spatial motifs (e.g., T-spin setups) unless engineered.
Pseudo-code (per move):
A = enumerate_legal_afterstates(s)
values = [Vθ(a.afterstate) for a in A]
choose argmax(values)
Variant 2 — CNN Actor-Critic on Grid (End-to-End, Flexible)
Convolutional policy-value network over the raw board with optional feature channels.
- Inputs:
- Channels: occupied(1), holes mask(1, optional), current piece mask at spawn(4 rotations, optional), next-k queue one-hots (k×7 scalars or small embeddings), hold flags.
- Shape: (C, 20, 10).
- Network:
- 4–8 conv blocks, 32–64 filters, 3×3 kernels, residual connections; flatten to MLP head.
- Two heads:
- Policy head: logits for each legal placement index (dynamic mask).
- Value head: scalar V(s).
- Action encoding:
- Precompute a catalog of (rotation, x) landing bins per piece; map enumerated legal moves to indices; mask invalid.
- Training (PPO/A2C):
- Collect rollouts with on-policy sampling; advantage estimation (GAE λ≈0.95).
- Loss: L = Lpolicy + c1·(V−R)^2 − c2·entropy.
- Reward: lines cleared (+ bonuses) − α·holes − β·bumpiness; anneal shaping terms.
- Tricks:
- Symmetry augmentation (horizontal mirror, piece symmetries).
- Curriculum: start with slow gravity and short episodes; ramp difficulty.
- Action masking to avoid illegal moves; value normalization.
- Pros:
- Learns spatial patterns (T-spins, cavity setups) without heavy feature engineering.
- Cons:
- Heavier compute than MLP; careful action indexing/masking required.
Variant 3 — AlphaZero-Style Policy-Value + MCTS (Strong, Lookahead)
Combine a policy-value CNN with Monte Carlo Tree Search over high-level placements.
- Inputs/Network:
- Same grid CNN backbone as Variant 2 with policy and value heads.
- MCTS Details:
- Nodes: board states after placements; edges: legal placements.
- PUCT selection: a = argmax(Q + U), with U ∝ P·sqrt(N)/(1+N_a).
- Expansion: use policy head to initialize priors over child moves; mask illegal.
- Evaluation: use value head as leaf evaluation; backups with discount γ.
- Exploration: Dirichlet noise on root priors; temperature τ>0 early in games.
- Training Loop:
- Self-play using MCTS-improved policy π̂.
- Collect (state, π̂, z) where z is final/bootstrapped return.
- Optimize cross-entropy between π and π̂, plus (V−z)^2, plus L2.
- Pros:
- Strong performance via search; robust long-horizon planning.
- Cons:
- Highest latency; requires efficient move generator and batching.
Variant 4 — Recurrent Agent with Next-Piece Queue (Temporal Context)
Use an LSTM/GRU to capture temporal dynamics (combos, back-to-back, stacking plans) and next-piece sequence.
- Inputs:
- Per-step features: compact board embedding (small CNN or features), current/next-k pieces, hold state, combo/back-to-back flags.
- Network:
- Encoder: small CNN or feature MLP → 64–128 dims.
- Recurrent core: 1–2 layers LSTM/GRU (128–256 hidden).
- Heads: policy over placements (masked) and value.
- Training:
- Actor-critic with truncated BPTT; sequence length 32–128 placements.
- Auxiliary losses: predict future hole count, lines to encourage representation learning.
- Pros:
- Captures plans spanning multiple pieces; better at timing holds and T-spin setups.
- Cons:
- More complex credit assignment; careful sequence packing needed.
Variant 5 — Evolutionary Feature-Weight Search (NE/CMA-ES)
Optimize a small value network or linear heuristic using evolutionary strategies.
- Representation:
- Genome = weights of a linear model or 1–2 layer MLP over engineered features (as in Variant 1).
- Optimization:
- CMA-ES or NES over continuous weights; evaluate fitness as average score across seeds.
- Parallel evaluation across many environments; early stopping via statistical tests.
- Pros:
- No gradients; robust to non-stationary rewards and sparse signals.
- Cons:
- Sample-inefficient; needs parallel compute to be practical.
Practical Implementation Notes
- Move generator correctness is critical; unit-test against known placements and symmetry cases.
- Mask illegal actions in the policy head to avoid NaNs and wasted exploration.
- Use board mirroring for data augmentation; also rotate S/Z and L/J symmetries when legal.
- Stabilize RL with reward normalization, advantage clipping, gradient clipping, and value target normalization.
- Logging: track lines/piece, holes, bumpiness, max height, decision latency, and policy entropy.
- Inference speed targets:
- Variant 1: <0.1 ms/move on CPU.
- Variant 2: 0.2–2 ms/move on CPU/GPU.
- Variant 3: 5–50 ms/move depending on simulations.
Example Training Curricula
- Phase 1: Slow gravity, no T-spin bonuses, short episodes; encourage survival and hole avoidance.
- Phase 2: Add scoring complexity (combos, back-to-back), increase gravity.
- Phase 3: Real-time constraints and full rules; tune exploration and entropy.
Minimal Feature Set (if engineering features)
- Aggregate height, holes, bumpiness, well depth sum, row transitions, column transitions, lines cleared by move.
Choosing a Variant
- Low latency or limited compute: Variant 1.
- End-to-end learning and strong spatial reasoning: Variant 2.
- Maximum strength with planning: Variant 3.
- Planning across piece sequences: Variant 4.
- Black-box optimization or non-differentiable scoring: Variant 5.
Extensions
- Graph/convolution hybrids: GNN over columns with local CNN patches.
- Transformer decoder that auto-regresses placements conditioned on next-k queue.
- Auxiliary heads for predicting structural metrics to improve representation.
Pseudo-APIs (sketches)
Action selection with masking (Variants 2–4):
logits = policy_head(encoder(state))
mask = legal_moves_mask(state) # 1 for legal, 0 for illegal
logits[mask==0] = -inf
action = sample_or_argmax(softmax(logits))
TD learning on afterstates (Variant 1):
for (s, a, r, s_next) in rollouts:
target = r + γ * max_a' Vθ(s_next_afterstate_a')
loss += (Vθ(s_afterstate_a) - target)^2
Benchmarking Checklist
- Fixed seeds and gravity settings; report mean ± std over N≥100 games.
- Ablations: with/without next-piece info, with/without hold, reward shaping weight sweeps.
- Latency budget tests: cap ms/move and measure strength degradation.
Summary
- High-level placement action spaces simplify learning and enable strong play across all variants.
- Start simple (Variant 1), then scale to CNN actor-critic (Variant 2), and finally add search (Variant 3) if you need top performance.
- Recurrent and evolutionary approaches offer useful alternatives depending on constraints.