Neural Network Design for Tetris

This guide outlines several neural network architectures and training strategies for building a strong Tetris-playing agent. Each variant specifies inputs, outputs, training objectives, and trade-offs so you can choose the approach that best matches your constraints and goals.

Overview

Objective: Maximize survival time and/or score (lines cleared; bonuses for Tetris, T-spins if supported).
Environment: 10×20 board, 7 tetrominoes (I, O, T, S, Z, J, L), preview queue (next-N), hold piece (optional), gravity/lock rules.
Key design choice: Action granularity.
- High-level (recommended): Choose final placement (x, rotation, possibly use-hold). Requires a legal move generator but dramatically simplifies learning.
- Low-level: Left/right/rotate/soft-drop/hard-drop per frame. Closer to raw control but harder to explore and learn.

Common Building Blocks

Action Space Options

High-level placements (preferred): Enumerate all legal (rotation, column) landings given current piece, ghost drop, and optional hold toggle. Typical count: 10–40 moves per state depending on piece and board.
Low-level primitives: 5–8 discrete actions; requires long horizons to realize outcomes.

State Encoding (choose a subset)

Board grid: 10×20 binary or integer heights; often encoded as channels.
- Occupied cells (1/0).
- Recent clears/lock delay (optional).
Column features: heights (10), holes per column, bumpiness, wells.
Piece features: one-hot for current piece (7), rotation index, spawn column; next-k queue (k∈[1,5]); hold available flag and held piece.
Rule flags: lock delay, combo count, back-to-back status (if scoring uses these).
Normalization: scale counts to [0,1] using board height; standardize scalar features.

Rewards and Training Signals

Sparse: +lines_cleared per placement; +bonus for Tetris/T-spins; small -1 per piece for survival pressure.
Shaped (careful): penalties for holes, high aggregate height, bumpiness; bonuses for perfect clears; avoid over-shaping that causes cheese strategies.
Discount factor: γ ∈ [0.95, 0.999].

Legal Move Generator (for high-level actions)

Implement a deterministic enumerator for all distinct final placements:

For each rotation, scan all columns, slide piece, hard-drop to collision, and record landing if no overlap and inside bounds.
Optionally include an action that toggles hold and places the held piece.
Deduplicate symmetric landings.

Evaluation Metrics

Average lines cleared or score per game.
Survival time (pieces placed, seconds at given gravity).
Structural metrics: hole count, average height, bumpiness, well depth distribution.
Decision latency (ms/action) for real-time play.

Variant 1 — Feature-Based Afterstate Value MLP (Lightweight, Fast)

A compact MLP that scores "afterstates" (the board after placing the current piece). The agent picks the placement with highest predicted value.

Inputs (engineered features; per-afterstate):
- Aggregate height, max height, holes, row/column transitions, bumpiness, wells, eroded cells, lines cleared by the move, parity features, hidden surface area.
- Optional: next-1 piece one-hot; hold availability.
Network:
- MLP: 32–128 hidden units × 2–3 layers, ReLU/SiLU; layer norm optional.
- Output: scalar V(s′) predicting long-term return from the afterstate.
Action selection:
1. Enumerate legal placements; compute afterstates s′.
2. Evaluate V(s′) for each; pick argmax.
Training:
- RL: TD(λ) or n-step returns on episodes of self-play; reward = lines cleared minus structural penalties; γ≈0.99.
- Imitation: Regress to scores from a strong hand-coded heuristic, then fine-tune with RL.
- Loss: L = (Vθ(s′) − G)^2 with target G (bootstrapped or Monte Carlo).
Hyperparameters:
- Batch 1k–16k afterstates; Adam 1e-3 → 1e-4; λ ∈ [0.7, 0.95].
- Data augmentation: mirror board (swap columns) and swap L/J, S/Z where valid.
Pros:
- Extremely fast; low-latency; easy to stabilize; minimal compute.
Cons:
- Depends on feature quality; limited ability to exploit nuanced spatial motifs (e.g., T-spin setups) unless engineered.

Pseudo-code (per move):

A = enumerate_legal_afterstates(s)
values = [Vθ(a.afterstate) for a in A]
choose argmax(values)

Variant 2 — CNN Actor-Critic on Grid (End-to-End, Flexible)

Convolutional policy-value network over the raw board with optional feature channels.

Inputs:
- Channels: occupied(1), holes mask(1, optional), current piece mask at spawn(4 rotations, optional), next-k queue one-hots (k×7 scalars or small embeddings), hold flags.
- Shape: (C, 20, 10).
Network:
- 4–8 conv blocks, 32–64 filters, 3×3 kernels, residual connections; flatten to MLP head.
- Two heads:
  - Policy head: logits for each legal placement index (dynamic mask).
  - Value head: scalar V(s).
Action encoding:
- Precompute a catalog of (rotation, x) landing bins per piece; map enumerated legal moves to indices; mask invalid.
Training (PPO/A2C):
- Collect rollouts with on-policy sampling; advantage estimation (GAE λ≈0.95).
- Loss: L = Lpolicy + c1·(V−R)^2 − c2·entropy.
- Reward: lines cleared (+ bonuses) − α·holes − β·bumpiness; anneal shaping terms.
Tricks:
- Symmetry augmentation (horizontal mirror, piece symmetries).
- Curriculum: start with slow gravity and short episodes; ramp difficulty.
- Action masking to avoid illegal moves; value normalization.
Pros:
- Learns spatial patterns (T-spins, cavity setups) without heavy feature engineering.
Cons:
- Heavier compute than MLP; careful action indexing/masking required.

Variant 3 — AlphaZero-Style Policy-Value + MCTS (Strong, Lookahead)

Combine a policy-value CNN with Monte Carlo Tree Search over high-level placements.

Inputs/Network:
- Same grid CNN backbone as Variant 2 with policy and value heads.
MCTS Details:
- Nodes: board states after placements; edges: legal placements.
- PUCT selection: a = argmax(Q + U), with U ∝ P·sqrt(N)/(1+N_a).
- Expansion: use policy head to initialize priors over child moves; mask illegal.
- Evaluation: use value head as leaf evaluation; backups with discount γ.
- Exploration: Dirichlet noise on root priors; temperature τ>0 early in games.
Training Loop:
1. Self-play using MCTS-improved policy π̂.
2. Collect (state, π̂, z) where z is final/bootstrapped return.
3. Optimize cross-entropy between π and π̂, plus (V−z)^2, plus L2.
Pros:
- Strong performance via search; robust long-horizon planning.
Cons:
- Highest latency; requires efficient move generator and batching.

Variant 4 — Recurrent Agent with Next-Piece Queue (Temporal Context)

Use an LSTM/GRU to capture temporal dynamics (combos, back-to-back, stacking plans) and next-piece sequence.

Inputs:
- Per-step features: compact board embedding (small CNN or features), current/next-k pieces, hold state, combo/back-to-back flags.
Network:
- Encoder: small CNN or feature MLP → 64–128 dims.
- Recurrent core: 1–2 layers LSTM/GRU (128–256 hidden).
- Heads: policy over placements (masked) and value.
Training:
- Actor-critic with truncated BPTT; sequence length 32–128 placements.
- Auxiliary losses: predict future hole count, lines to encourage representation learning.
Pros:
- Captures plans spanning multiple pieces; better at timing holds and T-spin setups.
Cons:
- More complex credit assignment; careful sequence packing needed.

Variant 5 — Evolutionary Feature-Weight Search (NE/CMA-ES)

Optimize a small value network or linear heuristic using evolutionary strategies.

Representation:
- Genome = weights of a linear model or 1–2 layer MLP over engineered features (as in Variant 1).
Optimization:
- CMA-ES or NES over continuous weights; evaluate fitness as average score across seeds.
- Parallel evaluation across many environments; early stopping via statistical tests.
Pros:
- No gradients; robust to non-stationary rewards and sparse signals.
Cons:
- Sample-inefficient; needs parallel compute to be practical.

Practical Implementation Notes

Move generator correctness is critical; unit-test against known placements and symmetry cases.
Mask illegal actions in the policy head to avoid NaNs and wasted exploration.
Use board mirroring for data augmentation; also rotate S/Z and L/J symmetries when legal.
Stabilize RL with reward normalization, advantage clipping, gradient clipping, and value target normalization.
Logging: track lines/piece, holes, bumpiness, max height, decision latency, and policy entropy.
Inference speed targets:
- Variant 1: <0.1 ms/move on CPU.
- Variant 2: 0.2–2 ms/move on CPU/GPU.
- Variant 3: 5–50 ms/move depending on simulations.

Example Training Curricula

Phase 1: Slow gravity, no T-spin bonuses, short episodes; encourage survival and hole avoidance.
Phase 2: Add scoring complexity (combos, back-to-back), increase gravity.
Phase 3: Real-time constraints and full rules; tune exploration and entropy.

Minimal Feature Set (if engineering features)

Aggregate height, holes, bumpiness, well depth sum, row transitions, column transitions, lines cleared by move.

Choosing a Variant

Low latency or limited compute: Variant 1.
End-to-end learning and strong spatial reasoning: Variant 2.
Maximum strength with planning: Variant 3.
Planning across piece sequences: Variant 4.
Black-box optimization or non-differentiable scoring: Variant 5.

Extensions

Graph/convolution hybrids: GNN over columns with local CNN patches.
Transformer decoder that auto-regresses placements conditioned on next-k queue.
Auxiliary heads for predicting structural metrics to improve representation.

Pseudo-APIs (sketches)

Action selection with masking (Variants 2–4):

logits = policy_head(encoder(state))
mask = legal_moves_mask(state)  # 1 for legal, 0 for illegal
logits[mask==0] = -inf
action = sample_or_argmax(softmax(logits))

TD learning on afterstates (Variant 1):

for (s, a, r, s_next) in rollouts:
    target = r + γ * max_a' Vθ(s_next_afterstate_a')
    loss += (Vθ(s_afterstate_a) - target)^2

Benchmarking Checklist

Fixed seeds and gravity settings; report mean ± std over N≥100 games.
Ablations: with/without next-piece info, with/without hold, reward shaping weight sweeps.
Latency budget tests: cap ms/move and measure strength degradation.

Summary

High-level placement action spaces simplify learning and enable strong play across all variants.
Start simple (Variant 1), then scale to CNN actor-critic (Variant 2), and finally add search (Variant 3) if you need top performance.
Recurrent and evolutionary approaches offer useful alternatives depending on constraints.