Neural Network Design for Tetris

This guide outlines several neural network architectures and training strategies for building a strong Tetris-playing agent. Each variant specifies inputs, outputs, training objectives, and trade-offs so you can choose the approach that best matches your constraints and goals.

Overview

Common Building Blocks

Action Space Options

State Encoding (choose a subset)

Rewards and Training Signals

Legal Move Generator (for high-level actions)

Implement a deterministic enumerator for all distinct final placements:

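  1. For each rotation, try every column: slide the piece to that column, hard-drop it until it collides, and record the landing if the piece stays inside the board and overlaps no filled cell.
  2. Optionally include an action that toggles hold and places the held piece.
  3. Deduplicate landings that occupy the same cells (for example, repeated rotations of symmetric pieces such as O, I, S, and Z).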
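
The enumeration steps above can be sketched as follows. This is a minimal illustration, not a complete move generator: it assumes a board stored as rows of 0/1 with row 0 at the top, pieces given as sets of (row, col) offsets, and straight vertical drops only (no sliding under overhangs, no wall kicks, no hold). The names `rotations`, `fits`, and `enumerate_placements` are hypothetical.

```python
from typing import FrozenSet, Iterable, List, Tuple

Cell = Tuple[int, int]      # (row, col); row 0 is the top of the board
Shape = FrozenSet[Cell]
Board = List[List[int]]     # 0 = empty, 1 = filled


def normalize(cells: Iterable[Cell]) -> Shape:
    """Shift cells so the smallest row and column are both 0."""
    cells = list(cells)
    min_r = min(r for r, _ in cells)
    min_c = min(c for _, c in cells)
    return frozenset((r - min_r, c - min_c) for r, c in cells)


def rotations(shape: Iterable[Cell]) -> List[Shape]:
    """Return the distinct 90-degree rotations of a shape."""
    seen, out, cur = set(), [], normalize(shape)
    for _ in range(4):
        if cur not in seen:
            seen.add(cur)
            out.append(cur)
        cur = normalize([(c, -r) for r, c in cur])
    return out


def fits(board: Board, shape: Shape, top: int, left: int) -> bool:
    """True if the shape placed at (top, left) is in bounds and overlaps nothing."""
    rows, cols = len(board), len(board[0])
    for r, c in shape:
        rr, cc = top + r, left + c
        if not (0 <= rr < rows and 0 <= cc < cols) or board[rr][cc]:
            return False
    return True


def enumerate_placements(board: Board, shape: Iterable[Cell]):
    """Return (rotation_index, left_col, top_row, cells) per distinct landing."""
    cols = len(board[0])
    seen, placements = set(), []
    for rot_idx, rot in enumerate(rotations(shape)):
        width = 1 + max(c for _, c in rot)
        for left in range(cols - width + 1):
            if not fits(board, rot, 0, left):
                continue  # blocked at the top row in this column
            top = 0
            while fits(board, rot, top + 1, left):  # hard drop
                top += 1
            cells = frozenset((top + r, left + c) for r, c in rot)
            if cells not in seen:  # deduplicate identical landings
                seen.add(cells)
                placements.append((rot_idx, left, top, cells))
    return placements
```

Deduplicating on the set of occupied cells, rather than on (rotation, column) pairs, keeps the action list free of landings that lead to the same afterstate.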

Evaluation Metrics


Variant 1 — Feature-Based Afterstate Value MLP (Lightweight, Fast)

A compact MLP that scores "afterstates" (the board after placing the current piece). The agent picks the placement with highest predicted value.

Pseudo-code (per move):

A = enumerate_legal_afterstates(s)        # one entry per legal placement
values = [Vθ(a.afterstate) for a in A]
best = A[argmax(values)]                  # placement with the highest predicted value
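
One way the per-move loop might look in plain Python is sketched below. The four features (aggregate height, holes, bumpiness, maximum height) are commonly used hand-crafted Tetris features chosen here as an assumption, not a set this guide prescribes. The one-hidden-layer MLP and the names `board_features`, `mlp_value`, and `choose_afterstate` are likewise hypothetical.

```python
from typing import List, Sequence, Tuple

Board = List[List[int]]  # row 0 is the top; 0 = empty, 1 = filled


def column_heights(board: Board) -> List[int]:
    """Height of each column, measured from the floor to its top filled cell."""
    rows, cols = len(board), len(board[0])
    heights = []
    for c in range(cols):
        h = 0
        for r in range(rows):
            if board[r][c]:
                h = rows - r
                break
        heights.append(h)
    return heights


def board_features(board: Board) -> List[float]:
    """Assumed feature set: aggregate height, holes, bumpiness, max height."""
    rows = len(board)
    heights = column_heights(board)
    holes = 0
    for c, h in enumerate(heights):
        # Empty cells lying below the top filled cell of the column.
        holes += sum(1 for r in range(rows - h, rows) if not board[r][c])
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    return [float(sum(heights)), float(holes), float(bumpiness), float(max(heights))]


def mlp_value(x: Sequence[float], w1, b1, w2, b2) -> float:
    """One hidden ReLU layer followed by a linear output."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def choose_afterstate(afterstates: Sequence[Board], w1, b1, w2, b2) -> Tuple[int, List[float]]:
    """Score every afterstate and return the index of the best one."""
    values = [mlp_value(board_features(a), w1, b1, w2, b2) for a in afterstates]
    best = max(range(len(values)), key=values.__getitem__)
    return best, values
```

Because only the afterstates of legal placements are scored, the network needs a single scalar output and no action masking.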

Variant 2 — CNN Actor-Critic on Grid (End-to-End, Flexible)

Convolutional policy-value network over the raw board with optional feature channels.


Variant 3 — AlphaZero-Style Policy-Value + MCTS (Strong, Lookahead)

Combine a policy-value CNN with Monte Carlo Tree Search over high-level placements.
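
As a rough illustration of the search side, the sketch below shows a PUCT-style child selection and value backup of the kind used in AlphaZero-like search. It assumes each child is one high-level placement, that priors come from the policy head, and that values are single-player returns (so no sign flip during backup). Chance nodes for the random next piece are not modelled, and `Node`, `select_child`, `backup`, and the `c_puct` default are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    prior: float                      # policy probability of the move leading here
    visit_count: int = 0
    value_sum: float = 0.0
    children: Dict[int, "Node"] = field(default_factory=dict)

    def q(self) -> float:
        """Mean backed-up value; 0.0 for an unvisited node."""
        return self.value_sum / self.visit_count if self.visit_count else 0.0


def select_child(node: Node, c_puct: float = 1.5) -> int:
    """Pick the action maximizing Q + c_puct * P * sqrt(N_total + 1) / (1 + N)."""
    total = sum(child.visit_count for child in node.children.values())

    def score(child: Node) -> float:
        explore = c_puct * child.prior * math.sqrt(total + 1) / (1 + child.visit_count)
        return child.q() + explore

    return max(node.children, key=lambda a: score(node.children[a]))


def backup(path: List[Node], value: float) -> None:
    """Add a leaf value to every node on the path from the root to the leaf."""
    for node in path:
        node.visit_count += 1
        node.value_sum += value
```

The `+ 1` under the square root is one common way to let priors break ties before any child has been visited.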


Variant 4 — Recurrent Agent with Next-Piece Queue (Temporal Context)

Use an LSTM/GRU to capture temporal dynamics (combos, back-to-back, stacking plans) and next-piece sequence.
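
Before the queue reaches a recurrent layer it has to be turned into numbers. A minimal sketch of one possible encoding is given below, assuming the seven standard tetrominoes and a fixed preview length; `PIECES`, `encode_queue`, and the default length of 5 are illustrative choices, not part of the design above.

```python
from typing import List, Sequence

PIECES = "IOTSZJL"  # the seven standard tetrominoes; the ordering is arbitrary


def encode_queue(queue: Sequence[str], length: int = 5) -> List[float]:
    """Concatenate one-hot vectors for the next `length` pieces.

    Slots beyond the end of the queue are left as all zeros.
    """
    vec: List[float] = []
    for i in range(length):
        one_hot = [0.0] * len(PIECES)
        if i < len(queue):
            one_hot[PIECES.index(queue[i])] = 1.0
        vec.extend(one_hot)
    return vec
```

The resulting vector can be concatenated with the board encoding at each time step before it is fed to the LSTM/GRU.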


Variant 5 — Evolutionary Feature-Weight Search (NE/CMA-ES)

Optimize a small value network or linear heuristic using evolutionary strategies.
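
To make the idea concrete, here is a sketch of a very simple evolution strategy: sample Gaussian perturbations around a mean weight vector, keep the best few, and move the mean to their average. It is deliberately simpler than CMA-ES (no covariance or step-size adaptation). The `fitness` callable is an assumption standing in for "average score or lines cleared over several games", and all hyperparameter defaults are placeholders.

```python
import random
from typing import Callable, List, Sequence, Tuple


def evolve_weights(
    fitness: Callable[[Sequence[float]], float],
    dim: int,
    generations: int = 50,
    population: int = 20,
    elite: int = 5,
    sigma: float = 0.5,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Elite-averaging evolution strategy; returns (best_weights, best_fitness)."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    best_w, best_f = list(mean), fitness(mean)
    for _ in range(generations):
        # Sample candidates around the current mean.
        candidates = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(population)
        ]
        scored = sorted(
            ((fitness(w), w) for w in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        if scored[0][0] > best_f:
            best_f, best_w = scored[0]
        # Move the mean to the average of the top candidates.
        elites = [w for _, w in scored[:elite]]
        mean = [sum(w[i] for w in elites) / elite for i in range(dim)]
    return best_w, best_f
```

In a Tetris setting each fitness call would play one or more full games, so the evaluation budget (population times generations times games per candidate) tends to dominate the cost.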


Practical Implementation Notes

Example Training Curricula

Minimal Feature Set (if engineering features)

Choosing a Variant

Extensions

Pseudo-APIs (sketches)

Action selection with masking (Variants 2–4):

logits = policy_head(encoder(state))
mask = legal_moves_mask(state)  # 1 for legal, 0 for illegal
logits[mask==0] = -inf
action = sample_or_argmax(softmax(logits))
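
A dependency-free version of the masking step is sketched below. It assumes logits and mask are plain Python sequences of equal length; `masked_softmax` and `select_action` are hypothetical helper names. Illegal actions receive probability exactly 0, and the maximum is taken over legal logits only so the exponentials stay numerically stable.

```python
import math
import random
from typing import List, Optional, Sequence


def masked_softmax(logits: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Softmax over legal actions only; illegal actions get probability 0."""
    legal = [l for l, m in zip(logits, mask) if m]
    if not legal:
        raise ValueError("no legal actions")
    peak = max(legal)  # subtract the max legal logit for stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(
    logits: Sequence[float],
    mask: Sequence[int],
    greedy: bool = False,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the most probable legal action (greedy) or sample one."""
    probs = masked_softmax(logits, mask)
    if greedy:
        return max(range(len(probs)), key=probs.__getitem__)
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Sampling is typically used during training for exploration and the greedy mode during evaluation.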

TD learning on afterstates (Variant 1):

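for (s, a, r, s_next, done) in rollouts:
    # bootstrap from the best next afterstate; no bootstrap at game over
    target = r if done else r + γ * max_a' Vθ(s_next_afterstate_a')
    loss += (Vθ(s_afterstate_a) - stop_gradient(target))^2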
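
For a concrete single-step version, the sketch below applies the same target with a linear value function over afterstate features, using a semi-gradient update (the target is treated as a constant). The linear model, the `done` flag, and the names `value` and `td_update` are assumptions made for illustration; an MLP would replace the manual weight update with an optimizer step.

```python
from typing import List, Sequence, Tuple


def value(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear afterstate value: dot product of weights and features."""
    return sum(w * f for w, f in zip(weights, features))


def td_update(
    weights: Sequence[float],
    afterstate_feats: Sequence[float],
    reward: float,
    next_afterstate_feats: Sequence[Sequence[float]],
    done: bool,
    gamma: float = 0.99,
    lr: float = 0.01,
) -> Tuple[List[float], float]:
    """One semi-gradient TD(0) step; returns (new_weights, td_error)."""
    if done or not next_afterstate_feats:
        target = reward  # no bootstrap at game over
    else:
        # Bootstrap from the best afterstate reachable from the next state.
        target = reward + gamma * max(
            value(weights, f) for f in next_afterstate_feats
        )
    error = target - value(weights, afterstate_feats)
    # Gradient of the linear value w.r.t. the weights is the feature vector.
    new_weights = [w + lr * error * f for w, f in zip(weights, afterstate_feats)]
    return new_weights, error
```

For example, `td_update([0.0, 0.0], [1.0, 2.0], 1.0, [[1.0, 0.0], [0.0, 1.0]], done=False, lr=0.1)` moves the weights toward the feature vector in proportion to the TD error.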

Benchmarking Checklist

Summary