BrainTris
Welcome to the BrainTris Lab!
In this application, we can teach a neural network to play Tetris. Will it learn to play better
than a human? The goal of the application is to demonstrate the learning capabilities of neural
networks, specifically of a Deep Q-Network (DQN), and to help you understand the Q-Learning method.
We are given a neural network and a Tetris game. During training, the network "sees" the game board:
- the fixed, fallen elements (the stack content)
- the position and orientation of the current element in play
- all the holes that have formed during the game
- the next piece
The network's rewards and punishments can be controlled through the specified metrics of the game
board and through the main goal: the number of lines cleared.
Tetris
Tetris - a timeless classic
Tetris is not just a video game; it's a cultural phenomenon that has captivated players
worldwide for more than four decades. Its simple yet addictive mechanics, as well as its ability
to combine strategic thinking and quick reflexes, make it one of the most iconic and influential
games in history.
The Origins and Concept
Tetris was created in 1984 by Alexey Pajitnov, a Soviet software engineer. The idea originated
from pentominoes, which are geometric shapes consisting of five identical squares. Pajitnov
decided to work with shapes made of four squares, hence the "tetra" prefix, which means four in
Greek. Combined with "tennis," Pajitnov's favorite sport, the name "Tetris" was formed.
The basic concept of the game is extremely simple: blocks of various shapes, called tetrominos,
fall from the top of the screen. The player must rotate and place these blocks to create
complete, gap-free lines at the bottom of the screen. As soon as a line is completed, it
disappears, scoring points for the player and making room for new blocks. The game ends when the
blocks reach the top of the screen, and there is no more space for new ones.
Gameplay and Strategy
Tetris's addictive appeal lies in its ease of learning, but difficulty in mastering. As the game
progresses, the blocks fall faster and faster, putting increasing pressure on the player to make
quick and precise decisions. Successful gameplay requires not only fast fingers but also
strategic planning. The player must think ahead, considering the shapes of the next few incoming
blocks, and place the current block accordingly to maximize the chances of clearing lines.
Particularly important is the T-spin maneuver, where a T-shaped block is rotated into a tight
space, giving special points. In addition, the Tetris clear (clearing four lines simultaneously
with a straight "I" block) results in the highest score, and is often the goal of professional
players.
The Impact and Legacy of Tetris
Tetris achieved instant success in the Soviet Union, then quickly spread worldwide after being
licensed and released on various platforms. With the launch of the Nintendo Game Boy handheld
console in 1989, Tetris became explosively popular. The game is often cited as one of the main
reasons the Game Boy was a huge success, selling millions of copies worldwide.
Over the years, Tetris has seen countless incarnations, appearing on almost every imaginable
platform, from arcades to mobile phones, modern consoles, and PCs. Competitions are held, it is
the subject of scientific research (for example, the phenomenon known as the "Tetris effect,"
where people see blocks in their minds even while sleeping), and it has become deeply embedded
in popular culture.
Tetris's enduring appeal lies in its universality. No language knowledge is needed to understand
it, and despite its simplicity, it can be endlessly deep and challenging. This is the kind of
game that can be played for five minutes on a bus, or for hours at a time, immersed in the
meditative rhythm of arranging blocks. Tetris is not just a game; it's an enduring puzzle that
has entertained and challenged people for generations, and will likely remain with us for a very
long time.
Game and Learning Modes
The application offers four distinct game modes:
- Human play
- Network training
- Network play
- Heuristic play
Human play
In this game mode, we can play Tetris ourselves. The game is controlled with the arrow keys;
rotation works with the up arrow or the space bar.
- ←: move element left
- →: move element right
- ↓: drop element fast
- ↑ or SPACE: rotate element
We get points for dropped elements and cleared lines. After a certain number of cleared lines,
the game levels up, and the falling speed of the elements increases. If the game board (stack)
fills up, the game ends.
Network training (Train mode)
This mode provides an opportunity to train a neural network with the specified parameters. The
network receives visual input from the game board, including fallen elements, the position and
orientation of the current element, the holes that have formed, and the next piece. The
network's reward and punishment depend on the game board's metrics and the number of lines
cleared.
The Train mode includes the following sub-menu items:
- Start: Starts the network training process.
- Save: Allows saving the current training state to a file.
- Load: Allows loading a saved training state from a .zip file.
- Select File: In the Load menu, you can browse for the file to be loaded, or drag and drop it into the drop-zone area.
- Options: Opens a modal window for setting training parameters. (See: Detailed description of Training Options.)
- Reset: Resets the training state to default settings.
Play mode
The Play menu offers two options:
- Network play: lets you watch the trained neural network make its own decisions in real conditions. Holding the down arrow key during network play speeds up its execution.
- Heuristic play: skips neural inference entirely and picks placements using only the raw shaping weights (e.g., aggregate height, holes, bumpiness). Use this to test how the specified weights alone behave, without any learned model.
Stop
The running process of any game mode can be interrupted with the stop function. The system then
switches to idle, waiting for further commands.
Training Options
Clicking the Options button opens a modal window where parameters for training the
neural network can be set. These parameters influence the reward weights and training behavior.
Reward and Punishment
Positive parameter values indicate reward, negative values indicate punishment. When designing
the reward system, we can tune the following characteristics:
- Completed Lines: The weight of the reward received for completed lines.
- Lines Cleared Exponent: The exponent applied to the number of lines cleared, which controls how steeply the reward grows when more lines are cleared at once.
- Aggregate Height: Reward/punishment for the total height of the stack (usually negative).
- Holes: Reward/punishment for the number of holes (usually negative).
- Bumpiness: Reward/punishment for the unevenness of the stack (usually negative).
- Well Depth: Reward/punishment for the depth of "wells" (either negative or positive).
- Penalty Row: The weight of the punishment for penalty rows. If no line is cleared from the stack for a while, the game punishes the player by inserting a randomly filled bottom row. A negative reward can also be associated with this during training.
- Game End Reward: Reward/punishment received at the end of the game (usually negative).
- Survival Reward: The weight of the reward received for survival (successful placement of an element without ending the game). (Usually positive.)
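As a rough illustration, these weights might combine into a single per-step reward as in the
following sketch. The interface and function names are illustrative only and do not necessarily
match the application's internal code.

```typescript
// Hypothetical reward weights mirroring the options above (names are
// illustrative, not the application's actual identifiers).
interface RewardWeights {
  completedLines: number;       // Completed Lines
  linesClearedExponent: number; // Lines Cleared Exponent
  aggregateHeight: number;      // Aggregate Height (usually negative)
  holes: number;                // Holes (usually negative)
  bumpiness: number;            // Bumpiness (usually negative)
  wellDepth: number;            // Well Depth
  penaltyRow: number;           // Penalty Row
  gameEnd: number;              // Game End Reward (usually negative)
  survival: number;             // Survival Reward (usually positive)
}

// Board metrics measured after placing an element.
interface StepMetrics {
  linesCleared: number;
  aggregateHeight: number;
  holes: number;
  bumpiness: number;
  wellDepth: number;
  penaltyRows: number;
  gameOver: boolean;
}

// One possible way to combine the weights into a per-step reward.
function stepReward(w: RewardWeights, m: StepMetrics): number {
  let r = w.completedLines * Math.pow(m.linesCleared, w.linesClearedExponent);
  r += w.aggregateHeight * m.aggregateHeight;
  r += w.holes * m.holes;
  r += w.bumpiness * m.bumpiness;
  r += w.wellDepth * m.wellDepth;
  r += w.penaltyRow * m.penaltyRows;
  r += m.gameOver ? w.gameEnd : w.survival;
  return r;
}
```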
Statistical Moving Average Windows
Moving average windows help us observe the long-term direction of the network's development. If
progress stalls or deteriorates over a large number of games, the network has not found a
solution with the current reward system settings.
- Moving Avg. Window: The size of the short-term moving average window for
statistics.
- Long-Term Window: The size of the long-term window for statistics.
The ratio of these two moving averages indicates the direction of progress.
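A minimal sketch of how such a short-term/long-term ratio could be computed (the function names
and window handling are illustrative, not the application's actual code):

```typescript
// Average of the last `window` values.
function movingAverage(values: number[], window: number): number {
  const slice = values.slice(-window);
  if (slice.length === 0) return 0;
  return slice.reduce((sum, v) => sum + v, 0) / slice.length;
}

// A ratio above 1 suggests recent games clear more rows than the long-term
// average, i.e. the network is still improving.
function trend(rowsPerGame: number[], shortWindow: number, longWindow: number): number {
  const longAvg = movingAverage(rowsPerGame, longWindow);
  return longAvg === 0 ? 0 : movingAverage(rowsPerGame, shortWindow) / longAvg;
}
```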
Other Settings
- Penalty in Training: Activates penalty rows during training (making the game harder). This added difficulty means the network needs longer to learn.
- Extra Tetrominoes: Adds additional tetrominos to the game that are not part of classic Tetris. A significant difficulty that extends the training process.
- Force Render to Learn: Forces the full Tetris game to be rendered during training, which can affect performance.
Operations
- Reset to Defaults: Resets all options to their default values.
- Apply: Applies the current settings, and starts or continues training.
- Cancel: Closes the options window without saving changes.
Training Panel and Statistics
Training Panel
When training mode is active, the "AI Panel" appears on the screen, providing real-time feedback
on the network's performance and the game's status.
AI Vision
Shows how the neural network "sees" the game board.
Numerical statistical data
- Games: Number of games played.
- Lines: Number of lines cleared so far in the current game.
- Avg Rows: The average number of rows cleared per game, tracked as training progresses.
- Trend: Shows the learning trend based on short-term and long-term moving averages.
- Max Lines: The maximum number of lines cleared in a single game so far.
- Max Level: Maximum level reached.
- Upd./sec: Number of network updates per second.
Diagrams
- Average Rows Cleared: A graph showing the evolution of the average number of cleared rows. An upward curve means learning is progressing well.
- Loss: A graph of the network's loss function (Mean Absolute Error).
- Convergence: A graph of the convergence of the average loss. A downward curve means the network is approaching a solution.
- Learning Rate: A graph of the learning rate, which changes automatically to facilitate the network's learning. The lower it is, the finer the steps the network takes to find the correct path.
Stability panel
Monitors the network's stability during training. It surfaces key metrics (done ratio,
rewards, TD error, prediction trends), current training state (buffer size, beta, epsilon,
learning rate), compact charts, and buffer-efficiency diagnostics so you can spot divergence or
regressions early and adjust parameters before the model destabilizes.
Reward options
Displays the currently set values of the network's reward system.
Training Methods Used
The training pipeline combines several techniques to stabilize and accelerate learning. Key
components include:
- Q-Learning with convolutional input encoding
- Adam optimizer with gradient clipping and gradient normalization
- Batch normalization
- L2 regularization
- Experience replay (PER) with alpha/beta annealing
- TD-error clipping
- Discounting
- Double DQN with soft target network updates
- Epsilon-greedy exploration
- Reward clipping and normalization
- Huber loss for value regression
- Learning-rate scheduling with warmup and plateau-based reduction
- Dropout for regularization
- Curiosity bonus shaping
- Target rewards, shaping rewards, extra rewards, and terminal rewards
- Heuristic tuning via a CMA tuner
Q-Learning with convolutional input encoding
Q-Learning estimates the action-value function Q(s,a) by iteratively minimizing the Bellman
error. For each transition the target is Q_θ(s,a) ← r + γ · max_{a′} Q_{θ⁻}(s′,a′), and the
update moves Q_θ(s,a) toward this target. Here γ is the discount factor and θ⁻ denotes the
slowly updated target network used for stable bootstrapping.
The convolutional encoder ingests only visual grid data, no handcrafted metrics. The network
must learn what the spatial cues mean and derive its own features from raw visuals.
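The target computation can be sketched as follows; the array of next-state values and the
function signature are assumptions for illustration, not the application's actual API:

```typescript
// Bellman target for one transition. `nextTargetQ` holds Q_{θ⁻}(s', a') for
// every admissible action in the next state, as produced by the target network.
function bellmanTarget(
  reward: number,
  done: boolean,
  nextTargetQ: number[],
  gamma: number
): number {
  // Terminal states bootstrap nothing: the target is just the reward.
  if (done || nextTargetQ.length === 0) return reward;
  return reward + gamma * Math.max(...nextTargetQ);
}
```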
Adam optimizer with gradient clipping and gradient normalization
Adam uses running estimates of first and second moments to adapt learning rates per weight,
while gradient clipping (value and norm) keeps updates bounded to avoid exploding steps. Gradient
normalization further scales updates to a consistent magnitude so the optimizer remains stable
across batches and training phases.
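A generic sketch of the two clipping steps (not the application's optimizer code):

```typescript
// Global-norm clipping: rescale the whole gradient vector if its L2 norm
// exceeds `maxNorm`.
function clipByGlobalNorm(gradients: number[], maxNorm: number): number[] {
  const norm = Math.sqrt(gradients.reduce((s, g) => s + g * g, 0));
  if (norm <= maxNorm || norm === 0) return gradients;
  const scale = maxNorm / norm;
  return gradients.map((g) => g * scale);
}

// Value clipping: bound each individual gradient component.
function clipByValue(gradients: number[], limit: number): number[] {
  return gradients.map((g) => Math.min(limit, Math.max(-limit, g)));
}
```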
Batch normalization
Batch normalization re-centers and re-scales layer activations during training, reducing
internal covariate shift. This helps gradients stay well-behaved across batches, speeds up
convergence, and allows higher learning rates without destabilizing the network.
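Conceptually, the forward pass for a single feature over a batch looks like this textbook
sketch (not the library implementation used by the application):

```typescript
// Batch normalization: re-center to zero mean, re-scale to unit variance,
// then apply the learned scale (gamma) and shift (beta).
function batchNorm(x: number[], gamma: number, beta: number, eps = 1e-5): number[] {
  const mean = x.reduce((s, v) => s + v, 0) / x.length;
  const variance = x.reduce((s, v) => s + (v - mean) ** 2, 0) / x.length;
  const invStd = 1 / Math.sqrt(variance + eps);
  return x.map((v) => gamma * (v - mean) * invStd + beta);
}
```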
L2 regularization
L2 regularization adds a weight decay term that penalizes large parameter values. By keeping
weights small, it reduces overfitting and encourages smoother, more generalizable policies.
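In loss terms this is simply an added penalty on the squared weights; `lambda` is a hypothetical
hyperparameter name used for illustration:

```typescript
// L2 penalty added to the training loss: lambda * sum(w^2). Its gradient,
// 2 * lambda * w, is what "decays" each weight toward zero.
function l2Penalty(weights: number[], lambda: number): number {
  return lambda * weights.reduce((s, w) => s + w * w, 0);
}
```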
Experience replay (PER) with alpha/beta annealing
Prioritized Experience Replay samples transitions with probability proportional to their TD
error, focusing learning on surprising or mis-predicted outcomes. Alpha controls how strongly
priorities skew sampling, while beta anneals from an initial value toward 1.0 to gradually
correct bias with importance-sampling weights. This keeps training both focused and unbiased as
learning progresses.
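A simplified sketch of prioritized sampling with importance-sampling correction; a real buffer
would use a sum-tree instead of this linear scan, and beta would be annealed toward 1.0 over
training:

```typescript
// Sample one transition index with probability proportional to priority^alpha
// and return the importance-sampling weight (N * P(i))^(-beta).
// In practice the weights are also normalized by their maximum.
function samplePrioritized(
  priorities: number[],
  alpha: number,
  beta: number
): { index: number; isWeight: number } {
  const scaled = priorities.map((p) => Math.pow(p, alpha));
  const total = scaled.reduce((s, p) => s + p, 0);
  let r = Math.random() * total;
  let index = scaled.length - 1;
  for (let i = 0; i < scaled.length; i++) {
    r -= scaled[i];
    if (r <= 0) {
      index = i;
      break;
    }
  }
  const prob = scaled[index] / total;
  const isWeight = Math.pow(priorities.length * prob, -beta);
  return { index, isWeight };
}
```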
TD-error clipping
TD errors are clipped to a bounded range so that a few outlier transitions cannot produce
runaway gradients. This keeps updates stable, especially when rewards spike or rare events
appear in the buffer.
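The clipping itself is a one-liner; the bound shown is an illustrative value:

```typescript
// Clamp the TD error to a symmetric range before it drives priorities and gradients.
function clipTdError(tdError: number, limit = 1.0): number {
  return Math.min(limit, Math.max(-limit, tdError));
}
```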
Discounting
Future rewards are geometrically discounted by γ, weighting immediate outcomes more than
distant ones. This balances short-term line clears against long-term survival and keeps the
value targets numerically stable.
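For a whole episode, the discounted return can be sketched as:

```typescript
// Discounted return: each reward at step t is weighted by gamma^t, so
// immediate rewards count more than distant ones.
function discountedReturn(rewards: number[], gamma: number): number {
  return rewards.reduce((sum, r, t) => sum + Math.pow(gamma, t) * r, 0);
}
```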
Double DQN with soft target network updates
Double DQN decouples action selection and evaluation to reduce overestimation: the online
network picks the argmax action, while the target network evaluates its value. Soft target
updates (tau) blend online weights into the target network gradually, avoiding sudden shifts
and improving stability.
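The two ideas can be sketched as below; the Q-value arrays are assumed to be aligned over the
same set of admissible next actions, and the names are illustrative:

```typescript
// Double DQN target: the online network selects the best next action, the
// target network evaluates it.
function doubleDqnTarget(
  reward: number,
  done: boolean,
  nextOnlineQ: number[],
  nextTargetQ: number[],
  gamma: number
): number {
  if (done || nextOnlineQ.length === 0) return reward;
  const bestAction = nextOnlineQ.indexOf(Math.max(...nextOnlineQ));
  return reward + gamma * nextTargetQ[bestAction];
}

// Soft (Polyak) target update: blend a small fraction tau of the online
// weights into the target weights every step.
function softUpdate(targetW: number[], onlineW: number[], tau: number): number[] {
  return targetW.map((tw, i) => (1 - tau) * tw + tau * onlineW[i]);
}
```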
Epsilon-greedy exploration
In Tetris the admissible actions (all legal placements/rotations) are enumerated each step, so
the agent follows a fully greedy policy without risking unseen moves. Epsilon is set to zero
and we always pick the highest-value action; exploration is inherent in the complete action
sweep rather than injected via random moves.
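Action selection therefore reduces to an argmax over the evaluated candidate placements, roughly
as in this sketch (the Placement shape is an assumption for illustration):

```typescript
// One evaluated candidate placement for the current piece.
interface Placement {
  rotation: number;
  column: number;
  qValue: number; // value predicted by the network for this placement
}

// Greedy selection over the full action sweep; assumes at least one legal
// placement exists.
function pickPlacement(candidates: Placement[]): Placement {
  return candidates.reduce((best, c) => (c.qValue > best.qValue ? c : best));
}
```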
Reward clipping and normalization
Rewards are clipped to a bounded range to prevent large spikes from dominating updates, then
normalized to keep scales consistent across batches. This stabilizes TD targets, keeps gradients
in a manageable range, and makes reward shaping tweaks safer to tune.
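A possible sketch, using a running estimate of the reward standard deviation (Welford's
algorithm); the clip limit is an illustrative value:

```typescript
// Clip the raw reward to a bounded range.
function clipReward(r: number, limit = 10): number {
  return Math.min(limit, Math.max(-limit, r));
}

// Normalize rewards by a running standard deviation estimate.
class RewardNormalizer {
  private count = 0;
  private mean = 0;
  private m2 = 0; // sum of squared deviations (Welford's algorithm)

  normalize(r: number): number {
    this.count++;
    const delta = r - this.mean;
    this.mean += delta / this.count;
    this.m2 += delta * (r - this.mean);
    const std = this.count > 1 ? Math.sqrt(this.m2 / (this.count - 1)) : 1;
    return r / Math.max(std, 1e-8);
  }
}
```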
Huber loss for value regression
The Huber loss behaves quadratically near zero error and linearly for large errors, reducing
sensitivity to outliers compared to MSE while remaining smooth. This makes value regression more
robust when rare transitions produce big TD errors.
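The loss for a single TD error, with delta shown as 1 (a common default):

```typescript
// Huber loss: quadratic for |error| <= delta, linear beyond it.
function huberLoss(error: number, delta = 1.0): number {
  const abs = Math.abs(error);
  return abs <= delta ? 0.5 * error * error : delta * (abs - 0.5 * delta);
}
```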
Learning-rate scheduling with warmup and plateau-based reduction
Training begins with a warmup phase, then monitors metrics to lower the learning rate when
progress stalls. This lets the optimizer take larger steps early, then refine with smaller
steps as the policy converges, improving stability and final performance.
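A rough sketch of such a schedule; the parameter names and the "higher is better" metric are
assumptions for illustration, not the application's actual scheduler:

```typescript
// Linear warmup for the first `warmupSteps` updates, then multiply the rate
// by `factor` whenever the monitored metric stalls for `patience` evaluations.
class LrScheduler {
  private best = -Infinity;
  private stale = 0;

  constructor(
    private lr: number,
    private warmupSteps: number,
    private patience: number,
    private factor: number,
    private minLr: number
  ) {}

  // Effective learning rate at a given update step.
  rateAt(step: number): number {
    if (step < this.warmupSteps) return (this.lr * (step + 1)) / this.warmupSteps;
    return this.lr;
  }

  // Call after each evaluation with a "higher is better" metric.
  onMetric(metric: number): void {
    if (metric > this.best) {
      this.best = metric;
      this.stale = 0;
    } else if (++this.stale >= this.patience) {
      this.lr = Math.max(this.minLr, this.lr * this.factor);
      this.stale = 0;
    }
  }
}
```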
Dropout for regularization
Dropout randomly silences a fraction of activations during training, preventing units from
co-adapting and improving generalization. At inference time all units are active, using the
learned ensemble-like weights.
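In "inverted dropout" form this looks like the following sketch:

```typescript
// Randomly zero a fraction `p` of activations during training and scale the
// survivors by 1/(1-p), so no rescaling is needed at inference time.
function dropout(activations: number[], p: number, training: boolean): number[] {
  if (!training || p <= 0) return activations;
  const keep = 1 - p;
  return activations.map((a) => (Math.random() < keep ? a / keep : 0));
}
```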
Curiosity bonus shaping
A lightweight count-based intrinsic reward boosts rarely visited states: each state hash gets a
bonus inversely proportional to its visit count (scaled by a tunable weight). This encourages
the agent to explore novel board configurations without overwhelming the task reward.
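A minimal count-based sketch; the state hashing and the exact form of the bonus are assumptions
for illustration:

```typescript
// Count-based curiosity: each hashed board state earns a bonus inversely
// proportional to its visit count, scaled by a tunable weight.
class CuriosityBonus {
  private visits = new Map<string, number>();

  constructor(private weight: number) {}

  bonusFor(stateHash: string): number {
    const n = (this.visits.get(stateHash) ?? 0) + 1;
    this.visits.set(stateHash, n);
    return this.weight / n;
  }
}
```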
Target rewards, shaping rewards, extra rewards, and terminal rewards
The reward model separates base/target rewards (lines cleared), shaping terms (height, holes,
bumpiness), auxiliary extras (e.g., well depth, survival, penalty rows), and terminal bonuses.
This modular structure makes it easier to tune learning signals and to combine dense shaping
with sparse end-of-game feedback.
Heuristic tuning via a CMA tuner
A CMA-ES tuner explores the shaping-weight space (e.g., aggregate height, holes, bumpiness) to
discover weight sets that score best. These tuned weights can then be fed back into shaping
during training, providing stronger hand-crafted priors and improving the heuristic (non-network)
play mode. The initialization typically follows Dellacherie-style weights as a starting point.
Lab
Well, here we are at the end of an exciting journey, where the simple yet profound world of
Tetris met the complex challenges of artificial intelligence. With the BrainTris application, my
intention was not just to create a game, but also a laboratory where we can explore the power of
Q-Learning and the learning capabilities of neural networks in practice.
I hope that the tools and detailed documentation I have provided will inspire you to experiment.
Feel free to modify the reward system parameters, observe how different parameters influence the
network's behavior, and discover what strategies the AI adopts to master Tetris.
I wish you much success in your experiments and in creating a neural network that plays as
perfectly as possible! I trust that this application will not only provide entertainment but
also a valuable learning opportunity in the fascinating field of artificial intelligence and
machine learning. Explore the possibilities, and let BrainTris help you understand how an AI
thinks and learns!
Thank you for trying your hand at creating a smart artificial intelligence!