This episode extends the last one, where the Minimax and Alpha-Beta Pruning algorithms were introduced. We will solve several tic-tac-toe problems on LeetCode, gathering intuition and building blocks for tic-tac-toe game logic, which can be naturally extended to the Connect-N game or Gomoku (N=5). Then we solve tic-tac-toe using Minimax and Alpha-Beta pruning for small N and analyze the state space. In the following episodes, based on the building blocks here, we will implement a Connect-N OpenAI Gym GUI environment, where we can play against the computer visually or compare different computer algorithms. Finally, we demonstrate how to implement Monte Carlo Tree Search for the Connect-N game.
Tic-tac-toe is played by two players, A and B, on a 3 x 3 grid. Here are the rules of Tic-Tac-Toe: Players take turns placing characters into empty squares (" "). The first player A always places "X" characters, while the second player B always places "O" characters. "X" and "O" characters are always placed into empty squares, never on filled ones. The game ends when there are 3 of the same (non-empty) character filling any row, column, or diagonal. The game also ends if all squares are non-empty. No more moves can be played if the game is over. Given an array moves, where each element is another array of size 2 corresponding to the row and column of the grid where A and B mark their respective characters in the order in which they play, return the winner of the game if it exists ("A" or "B"). If the game ends in a draw return "Draw", and if there are still moves to play return "Pending". You can assume that moves is valid (it follows the rules of Tic-Tac-Toe), the grid is initially empty, and A plays first.
Example 1:
Input: moves = [[0,0],[2,0],[1,1],[2,1],[2,2]]
Output: "A"
Explanation: "A" wins; A always plays first.
"X  "    "X  "    "X  "    "X  "    "X  "
"   " -> "   " -> " X " -> " X " -> " X "
"   "    "O  "    "O  "    "OO "    "OOX"
Example 3:
Input: moves = [[0,0],[1,1],[2,0],[1,0],[1,2],[2,1],[0,1],[0,2],[2,2]]
Output: "Draw"
Explanation: The game ends in a draw since there are no moves left to make.
"XXO"
"OOX"
"XOX"
Example 4:
Input: moves = [[0,0],[1,1]]
Output: "Pending"
Explanation: The game has not finished yet.
"X  "
" O "
"   "
The intuitive solution is to enumerate all 8 possible winning lines: 3 vertical, 3 horizontal, and 2 diagonal. We keep 8 variables, one per winning line, and a simple trick is to convert the board state into a 3x3 2D array whose cells take the values -1, 1, and 0. In this way, we traverse the board state exactly once and, in the process, determine all 8 variables by summing the corresponding cell values. For example, row[0] tracks the first-row winning line and is the sum of all 3 cells in the first row during board traversal. It indicates a win for the first player only when it equals 3 and a win for the second player when it equals -3.
from typing import List

class Solution:
    def tictactoe(self, moves: List[List[int]]) -> str:
        # Fill the board: player A (moves with even index) is 1, player B is -1.
        board = [[0] * 3 for _ in range(3)]
        for idx, xy in enumerate(moves):
            player = 1 if idx % 2 == 0 else -1
            board[xy[0]][xy[1]] = player

        # Accumulate sums for each row, column and diagonal in a single pass.
        row, col = [0, 0, 0], [0, 0, 0]
        diag1, diag2 = 0, 0
        for r in range(3):
            for c in range(3):
                row[r] += board[r][c]
                col[c] += board[r][c]
                if r == c:
                    diag1 += board[r][c]
                if r + c == 2:
                    diag2 += board[r][c]

        # A sum of 3 means three of A's marks in a line; -3 means three of B's marks.
        aWin = any(row[r] == 3 for r in range(3)) or any(col[c] == 3 for c in range(3)) or diag1 == 3 or diag2 == 3
        bWin = any(row[r] == -3 for r in range(3)) or any(col[c] == -3 for c in range(3)) or diag1 == -3 or diag2 == -3

        if aWin:
            return "A"
        if bWin:
            return "B"
        return "Draw" if len(moves) == 9 else "Pending"
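As a quick sanity check (hypothetical driver code, not part of the LeetCode submission), replaying Example 1 above should yield "A":

# Example 1 from the problem statement; expected output: "A"
print(Solution().tictactoe([[0, 0], [2, 0], [1, 1], [2, 1], [2, 2]]))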
Below we give another accepted solution. Despite having more code, it is more efficient than the previous one because, for a given game state, it does not need to visit every cell on the board. How is that achieved? The problem guarantees that each move is valid, so it is sufficient to examine the neighbours of the final move and check whether any line through the final move creates a winning condition. Later we will reuse the code in this solution to build the tic-tac-toe game logic.
class Solution:
    def checkWin(self, r: int, c: int) -> bool:
        # Count connected pieces of the same player in the four line directions
        # around the last move (r, c); the move itself contributes the "+ 1".
        north = self.getConnectedNum(r, c, -1, 0)
        south = self.getConnectedNum(r, c, 1, 0)
        east = self.getConnectedNum(r, c, 0, 1)
        west = self.getConnectedNum(r, c, 0, -1)
        south_east = self.getConnectedNum(r, c, 1, 1)
        north_west = self.getConnectedNum(r, c, -1, -1)
        north_east = self.getConnectedNum(r, c, -1, 1)
        south_west = self.getConnectedNum(r, c, 1, -1)
        if (north + south + 1 >= 3) or (east + west + 1 >= 3) or \
                (south_east + north_west + 1 >= 3) or (north_east + south_west + 1 >= 3):
            return True
        return False

    def getConnectedNum(self, r: int, c: int, dr: int, dc: int) -> int:
        # Walk from (r, c) in direction (dr, dc) and count consecutive cells
        # occupied by the same player.
        player = self.board[r][c]
        result = 0
        i = 1
        while True:
            new_r = r + dr * i
            new_c = c + dc * i
            if 0 <= new_r < 3 and 0 <= new_c < 3:
                if self.board[new_r][new_c] == player:
                    result += 1
                else:
                    break
            else:
                break
            i += 1
        return result

    def tictactoe(self, moves: List[List[int]]) -> str:
        self.board = [[0] * 3 for _ in range(3)]
        for idx, xy in enumerate(moves):
            player = 1 if idx % 2 == 0 else -1
            self.board[xy[0]][xy[1]] = player

        # only check last move
        r, c = moves[-1]
        win = self.checkWin(r, c)
        if win:
            return "A" if len(moves) % 2 == 1 else "B"
        return "Draw" if len(moves) == 9 else "Pending"
A Tic-Tac-Toe board is given as a string array board. Return true if and only if it is possible to reach this board position during the course of a valid tic-tac-toe game. The board is a 3 x 3 array, and
consists of characters " ", "X", and "O". The " " character represents
an empty square. Here are the rules of Tic-Tac-Toe: Players
take turns placing characters into empty squares (" "). The first
player A always places "X" characters, while the second player B always
places "O" characters. "X" and "O" characters are always placed
into empty squares, never on filled ones. The game ends when there
are 3 of the same (non-empty) character filling any row, column, or
diagonal. The game also ends if all squares are non-empty. No
more moves can be played if the game is over.
Example 1:
Input: board = ["O  ", "   ", "   "]
Output: false
Explanation: The first player always plays "X".

Example 2:
Input: board = ["XOX", " X ", "   "]
Output: false
Explanation: Players take turns making moves.

Example 4:
Input: board = ["XOX", "O O", "XOX"]
Output: true
Note: board is a length-3 array of strings, where each string
board[i] has length 3. Each board[i][j] is a character in the set
{" ", "X", "O"}.
Surely, it can be solved using DFS, checking whether the given state can be reached from the initial state. However, this involves searching a large number of states. Could we do better? There are obvious properties we can rely on. For example, the number of X's is either equal to the number of O's or exactly one more. If we can enumerate a combination of necessary and sufficient conditions for reachability, we can solve it in O(1) time complexity.
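As a sketch of this idea (my own illustration under the stated rules, not necessarily the minimal set of conditions), the checks below combine the count invariant with the fact that whoever completed a winning line must have made the last move; validTicTacToe and win are hypothetical names.

def validTicTacToe(board):
    def win(b, ch):
        # Collect the 3 rows, 3 columns, and 2 diagonals as strings.
        lines = [b[i] for i in range(3)]
        lines += [''.join(b[i][j] for i in range(3)) for j in range(3)]
        lines += [''.join(b[i][i] for i in range(3)), ''.join(b[i][2 - i] for i in range(3))]
        return any(line == ch * 3 for line in lines)

    x_cnt = sum(row.count('X') for row in board)
    o_cnt = sum(row.count('O') for row in board)
    # X moves first, so X's count equals O's count or exceeds it by exactly one.
    if o_cnt not in (x_cnt - 1, x_cnt):
        return False
    # If X has won, X made the last move, so X must lead by one.
    if win(board, 'X') and x_cnt != o_cnt + 1:
        return False
    # If O has won, O made the last move, so the counts must be equal.
    if win(board, 'O') and x_cnt != o_cnt:
        return False
    return True

print(validTicTacToe(["XOX", "O O", "XOX"]))  # True, matching Example 4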
Design a Tic-tac-toe game that is played between two players on an n x n grid. You may assume the following rules: A move is guaranteed to be valid and is placed on an empty block. Once a winning condition is reached, no more moves are allowed. A player who succeeds in placing n of their marks in a horizontal, vertical, or diagonal row wins the game.
Example: Given n = 3, assume that player 1 is "X" and player 2 is "O" on the board.

TicTacToe toe = new TicTacToe(3);

toe.move(0, 0, 1); -> Returns 0 (no one wins)
|X| | |
| | | |    // Player 1 makes a move at (0, 0).
| | | |

toe.move(0, 2, 2); -> Returns 0 (no one wins)
|X| |O|
| | | |    // Player 2 makes a move at (0, 2).
| | | |

toe.move(2, 2, 1); -> Returns 0 (no one wins)
|X| |O|
| | | |    // Player 1 makes a move at (2, 2).
| | |X|

toe.move(1, 1, 2); -> Returns 0 (no one wins)
|X| |O|
| |O| |    // Player 2 makes a move at (1, 1).
| | |X|

toe.move(2, 0, 1); -> Returns 0 (no one wins)
|X| |O|
| |O| |    // Player 1 makes a move at (2, 0).
|X| |X|

toe.move(1, 0, 2); -> Returns 0 (no one wins)
|X| |O|
|O|O| |    // Player 2 makes a move at (1, 0).
|X| |X|

toe.move(2, 1, 1); -> Returns 1 (player 1 wins)
|X| |O|
|O|O| |    // Player 1 makes a move at (2, 1).
|X|X|X|
Follow up: Could you do better than \(O(n^2)\) per move() operation?
348 is a locked problem. For each player's move, we could resort to the checkWin function from the second solution of 1275. Here we show another solution, based on the first solution of 1275, where per-line counters (one per row, one per column, and one for each diagonal) are kept and each move only updates the counters it touches.
class TicTacToe:
    def __init__(self, n: int):
        """
        Initialize your data structure here.
        :type n: int
        """
        self.row, self.col, self.diag1, self.diag2, self.n = [0] * n, [0] * n, 0, 0, n

    def move(self, row: int, col: int, player: int) -> int:
        """
        Player {player} makes a move at ({row}, {col}).
        @param row The row of the board.
        @param col The column of the board.
        @param player The player, can be either 1 or 2.
        @return The current winning condition, can be either:
                0: No one wins.
                1: Player 1 wins.
                2: Player 2 wins.
        """
        if player == 2:
            player = -1

        self.row[row] += player
        self.col[col] += player
        if row == col:
            self.diag1 += player
        if row + col == self.n - 1:
            self.diag2 += player

        if self.n in [self.row[row], self.col[col], self.diag1, self.diag2]:
            return 1
        if -self.n in [self.row[row], self.col[col], self.diag1, self.diag2]:
            return 2
        return 0
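Replaying the example above as a quick check (hypothetical driver code):

toe = TicTacToe(3)
print(toe.move(0, 0, 1))  # 0, no one wins
print(toe.move(0, 2, 2))  # 0
print(toe.move(2, 2, 1))  # 0
print(toe.move(1, 1, 2))  # 0
print(toe.move(2, 0, 1))  # 0
print(toe.move(1, 0, 2))  # 0
print(toe.move(2, 1, 1))  # 1, player 1 wins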
Optimal Strategy of Tic-Tac-Toe
Tic-tac-toe and Gomoku (Connect Five in a Row) share the same rules and are special cases of the m,n,k-game, where the board size is m x n and the winning condition is k marks in a row. The ConnectNGame class implements an m,n,k-game on a square board. It encapsulates the logic of checking each move and is also able to undo the last move, to facilitate backtracking in the game search algorithms later.
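The post only lists a few helper methods of ConnectNGame (shown right after this sketch). For orientation, here is a minimal sketch of what the constructor and the move/undo pair could look like; attribute names such as remainingPosNum and actionStack are assumptions, not necessarily the author's exact code.

class ConnectNGame:
    PLAYER_A = 1          # also used as the "A wins" result
    PLAYER_B = -1         # also used as the "B wins" result
    AVAILABLE = 0
    RESULT_TIE = 0

    def __init__(self, N: int = 3, board_size: int = 3):
        self.N = N                          # how many marks in a row win
        self.board_size = board_size
        self.board = [[ConnectNGame.AVAILABLE] * board_size for _ in range(board_size)]
        self.gameOver = False
        self.gameResult = None
        self.currentPlayer = ConnectNGame.PLAYER_A
        self.remainingPosNum = board_size * board_size
        self.actionStack = []

    def move(self, r: int, c: int):
        """Place the current player's piece at (r, c).
        Returns the game result (1, -1 or 0) if the game ends, otherwise None."""
        assert self.board[r][c] == ConnectNGame.AVAILABLE
        self.board[r][c] = self.currentPlayer
        self.actionStack.append((r, c))
        self.remainingPosNum -= 1
        # checkWin is the same neighbour scan as the second solution of 1275,
        # generalized to board_size and N.
        if self.checkWin(r, c):
            self.gameOver = True
            self.gameResult = self.currentPlayer
            return self.currentPlayer
        if self.remainingPosNum == 0:
            self.gameOver = True
            self.gameResult = ConnectNGame.RESULT_TIE
            return ConnectNGame.RESULT_TIE
        self.currentPlayer *= -1
        return None

    def undo(self):
        """Undo the last move, restoring the board, the player to move, and the game-over state."""
        r, c = self.actionStack.pop()
        self.board[r][c] = ConnectNGame.AVAILABLE
        self.remainingPosNum += 1
        if self.gameOver:
            self.gameOver = False
            self.gameResult = None
        else:
            self.currentPlayer *= -1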
    def getAvailablePositions(self) -> List[Tuple[int, int]]:
        return [(i, j) for i in range(self.board_size) for j in range(self.board_size)
                if self.board[i][j] == ConnectNGame.AVAILABLE]

    def getStatus(self) -> Tuple[Tuple[int, ...]]:
        return tuple([tuple(self.board[i]) for i in range(self.board_size)])
Note that the checkWin code is identical to that of the second solution to 1275.
Minimax Strategy
Now that we have the Connect-N game logic, let's complete its Minimax algorithm to solve the game.

We define a generic strategy base class whose action method needs to be overridden. The action method takes a ConnectNGame instance describing the current game state and returns a tuple of 2 elements: the first element is the estimated or exact game result after taking the action specified by the second element, and the second element is of the form Tuple[int, int], denoting the position of the move, for instance, (1, 1).
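A minimal sketch of such a base class, together with a MinimaxStrategy wrapper whose action method delegates to the minimax method shown next; the exact class names are assumptions based on the description above.

from abc import ABC, abstractmethod
from typing import Tuple

class Strategy(ABC):
    @abstractmethod
    def action(self, game: 'ConnectNGame') -> Tuple[int, Tuple[int, int]]:
        """Return (game_result, (row, col)): the (estimated) result and the chosen move."""
        ...

class MinimaxStrategy(Strategy):
    def action(self, game: 'ConnectNGame') -> Tuple[int, Tuple[int, int]]:
        self.game = game
        return self.minimax()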
    def minimax(self) -> Tuple[int, Tuple[int, int]]:
        game = self.game
        bestMove = None
        assert not game.gameOver
        if game.currentPlayer == ConnectNGame.PLAYER_A:
            ret = -math.inf
            for pos in game.getAvailablePositions():
                move = pos
                result = game.move(*pos)
                if result is None:
                    assert not game.gameOver
                    result, oppMove = self.minimax()
                game.undo()
                ret = max(ret, result)
                bestMove = move if ret == result else bestMove
                if ret == 1:
                    return 1, move
            return ret, bestMove
        else:
            ret = math.inf
            for pos in game.getAvailablePositions():
                move = pos
                result = game.move(*pos)
                if result is None:
                    assert not game.gameOver
                    result, oppMove = self.minimax()
                game.undo()
                ret = min(ret, result)
                bestMove = move if ret == result else bestMove
                if ret == -1:
                    return -1, move
            return ret, bestMove
We plot up to the first 2 moves with the code above. For the first player, there are 9 possible positions, but due to symmetry there are only 3 kinds of moves, which we call corner, edge, and center, respectively. The following graph shows that whichever of the 9 positions the first player takes, the best achievable result is a draw, so the game-theoretic value of tic-tac-toe is a draw. We plot the first step for each of the 3 kinds of moves one by one below.
Tic-tac-toe Solution and Number of States
An interesting question is the number of game states of tic-tac-toe. A loose upper bound is \(3^9=19683\), which includes many unreachable states. The article Tic-Tac-Toe (Naughts and Crosses, Cheese and Crackers, etc.) lists the number of states after each move. The total number is 5478.
Moves   Positions   Terminal Positions
0       1
1       9
2       72
3       252
4       756
5       1260        120
6       1520        148
7       1140        444
8       390         168
9       78          78
Total   5478        958
We can verify this number by slightly modifying the existing code, as shown below.
    def minimax(self, gameStatus: Tuple[Tuple[int, ...]]) -> Tuple[int, Tuple[int, int]]:
        if gameStatus in self.dpMap:
            return self.dpMap[gameStatus]

        game = self.game
        bestMove = None
        assert not game.gameOver
        if game.currentPlayer == ConnectNGame.PLAYER_A:
            ret = -math.inf
            for pos in game.getAvailablePositions():
                move = pos
                result = game.move(*pos)
                if result is None:
                    assert not game.gameOver
                    result, oppMove = self.minimax(game.getStatus())
                    self.dpMap[game.getStatus()] = result, oppMove
                else:
                    self.dpMap[game.getStatus()] = result, move
                game.undo()
                ret = max(ret, result)
                bestMove = move if ret == result else bestMove
            self.dpMap[gameStatus] = ret, bestMove
            return ret, bestMove
        else:
            ret = math.inf
            for pos in game.getAvailablePositions():
                move = pos
                result = game.move(*pos)
                if result is None:
                    assert not game.gameOver
                    result, oppMove = self.minimax(game.getStatus())
                    self.dpMap[game.getStatus()] = result, oppMove
                else:
                    self.dpMap[game.getStatus()] = result, move
                game.undo()
                ret = min(ret, result)
                bestMove = move if ret == result else bestMove
            self.dpMap[gameStatus] = ret, bestMove
            return ret, bestMove
if __name__ == '__main__':
    tic_tac_toe = ConnectNGame(N=3, board_size=3)
    strategy = CountingMinimaxStrategy()
    strategy.action(tic_tac_toe)
    print(f'Game States Number {len(strategy.dpMap)}')
Running the code confirms that the total number of states is 5478. We also list results for some small-scale game configurations below.
        3x3            4x4
k=3     5478 (Draw)    6035992 (Win)
k=4                    9722011 (Draw)
k=5
According to the Wikipedia article on the m,n,k-game, below are the known results for some game configurations.
        3x3     4x4     5x5     6x6
k=3     Draw    Win     Win     Win
k=4             Draw    Draw    Win
k=5                     Draw    Draw
It's worth mentioning that Gomoku (Connect Five in a Row, k=5) on a board of size 15x15 or larger was proven by L. Victor Allis to be a win for the first player. Next, we apply the Alpha-Beta pruning introduced in the last episode to the same game; the code below mirrors the minimax implementation but additionally maintains alpha and beta bounds.
    def alpha_beta(self, gameStatus: Tuple[Tuple[int, ...]], alpha: float = -math.inf,
                   beta: float = math.inf) -> Tuple[int, Tuple[int, int]]:
        game = self.game
        bestMove = None
        assert not game.gameOver
        if game.currentPlayer == ConnectNGame.PLAYER_A:
            ret = -math.inf
            for pos in game.getAvailablePositions():
                move = pos
                result = game.move(*pos)
                if result is None:
                    assert not game.gameOver
                    result, oppMove = self.alpha_beta(game.getStatus(), alpha, beta)
                game.undo()
                alpha = max(alpha, result)
                ret = max(ret, result)
                bestMove = move if ret == result else bestMove
                if alpha >= beta or ret == 1:
                    return ret, move
            return ret, bestMove
        else:
            ret = math.inf
            for pos in game.getAvailablePositions():
                move = pos
                result = game.move(*pos)
                if result is None:
                    assert not game.gameOver
                    result, oppMove = self.alpha_beta(game.getStatus(), alpha, beta)
                game.undo()
                beta = min(beta, result)
                ret = min(ret, result)
                bestMove = move if ret == result else bestMove
                if alpha >= beta or ret == -1:
                    return ret, move
            return ret, bestMove
Let's rewrite alpha-beta pruning with DP. We omit the alpha and beta parameters in alpha_beta_dp because lru_cache cannot exclude selected parameters from its cache key. Instead, we keep alpha and beta in a stack variable and maintain that stack in lockstep with the alpha_beta_dp call stack.
    @lru_cache(maxsize=None)
    def alpha_beta_dp(self, gameStatus: Tuple[Tuple[int, ...]]) -> Tuple[int, Tuple[int, int]]:
        alpha, beta = self.alphaBetaStack[-1]
        game = self.game
        bestMove = None
        assert not game.gameOver
        if game.currentPlayer == ConnectNGame.PLAYER_A:
            ret = -math.inf
            for pos in game.getAvailablePositions():
                move = pos
                result = game.move(*pos)
                if result is None:
                    assert not game.gameOver
                    self.alphaBetaStack.append((alpha, beta))
                    result, oppMove = self.alpha_beta_dp(game.getStatus())
                    self.alphaBetaStack.pop()
                game.undo()
                alpha = max(alpha, result)
                ret = max(ret, result)
                bestMove = move if ret == result else bestMove
                if alpha >= beta or ret == 1:
                    return ret, move
            return ret, bestMove
        else:
            ret = math.inf
            for pos in game.getAvailablePositions():
                move = pos
                result = game.move(*pos)
                if result is None:
                    assert not game.gameOver
                    self.alphaBetaStack.append((alpha, beta))
                    result, oppMove = self.alpha_beta_dp(game.getStatus())
                    self.alphaBetaStack.pop()
                game.undo()
                beta = min(beta, result)
                ret = min(ret, result)
                bestMove = move if ret == result else bestMove
                if alpha >= beta or ret == -1:
                    return ret, move
            return ret, bestMove
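For this to work, the stack must be seeded before the first recursive call; a hedged sketch of the surrounding action method (names assumed) is:

    def action(self, game: ConnectNGame) -> Tuple[int, Tuple[int, int]]:
        self.game = game
        # Seed the stack with the widest possible (alpha, beta) window.
        self.alphaBetaStack = [(-math.inf, math.inf)]
        return self.alpha_beta_dp(game.getStatus())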
This is the fifth episode of the series TSP: From DP to Deep Learning. In this episode, we turn to reinforcement learning, in particular a model-free policy gradient method that embeds a pointer network to learn minimal tours without supervised best-tour labels in the dataset. The full list of episodes in this series is given below.
The stochastic policy \(p(\pi | s; \theta)\), parameterized by \(\theta\), aims to assign high probabilities to short tours and low probabilities to long tours. The joint probability of a tour is factorized over the decoding steps.
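One common way to write this factorization, following the pointer-network formulation, decomposes the tour probability with the chain rule over decoding steps:
\[
p(\pi | s ; \theta)=\prod_{i=1}^{n} p\left(\pi(i) | \pi(1), \ldots, \pi(i-1), s ; \theta\right)
\]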
The loss of the model is the cross-entropy between the network's output probabilities \(\pi\) and the best tour \(\hat{\pi}\) generated by a TSP solver.
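Written out (a sketch in the standard maximum-likelihood form, not necessarily the paper's exact notation), this supervised objective maximizes the log-likelihood of the solver-generated tours:
\[
\theta^{*}=\underset{\theta}{\operatorname{argmax}} \sum_{i} \log p\left(\hat{\pi}_{i} | s_{i} ; \theta\right)
\]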
The contribution made by pointer networks is that they remove this constraint: the output indices refer dynamically to positions in the input of the particular test case, instead of coming from a fixed-size vocabulary.
Reinforcement Learning
Neural Combinatorial Optimization with Reinforcement Learning combines the power of reinforcement learning (RL) and deep learning to further eliminate the constraint, required by pointer networks, that the training dataset must contain supervised labels of best tours. With deep RL, test cases do not need to come with known solutions, which is a common pattern in deep RL. In the paper, a model-free, policy-based RL method is adopted.
Model-Free Policy Gradient Methods
In the authoritative RL book, chapter 8, Planning and Learning with Tabular Methods, two major approaches in RL are distinguished: model-based RL and model-free RL. The distinction between the two relies on the concept of a model, which is stated as follows:

By a model of the environment we mean anything that an agent can use to predict how the environment will respond to its actions.

So model-based methods demand a model of the environment, and hence dynamic programming and heuristic search fall into this category. With a model at hand, the utility of a state can be computed in various ways, and a planning stage that essentially builds the policy is needed before the agent can take any action. In contrast, model-free methods, without building a model, are more direct, ignoring irrelevant information and focusing only on the policy, which is what is ultimately needed. Typical examples of model-free methods are Monte Carlo control and temporal-difference learning.

Model-based methods rely on planning as their primary component, while model-free methods primarily rely on learning.
In the TSP problem, the model is fully determined by the given points, and no intermediate feedback is generated for each decision made, so it is unclear how to map a state value onto a tour. Therefore, we turn to model-free methods. Chapter 13, Policy Gradient Methods, describes a family of approximate model-free methods that learn a parameterized policy that can select actions without consulting a value function. This approach fits perfectly with the aforementioned pointer networks, where the parameterized policy \(p(\pi | s; \theta)\) is already defined.
The training objective is the expected tour length of \(\pi_\theta\) which, given an input graph \(s\), is
\[
J(\theta | s)=\mathbb{E}_{\pi \sim p_{\theta}(. | s)}[L(\pi | s)]
\]
Monte Carlo Policy Gradient: REINFORCE with Baseline
In order to maximize the reward, a typical way is to optimize the parameters \(\theta\) in the direction of the gradient \(\nabla_{\theta} J(\theta | s)\):
\[
\nabla_{\theta} J(\theta | s)=\nabla_{\theta} \mathbb{E}_{\pi \sim p_{\theta}(. | s)}[L(\pi | s)]
\]
The RHS of the equation above is the gradient of an expectation, which we have no direct way to compute or approximate. Here comes the well-known REINFORCE trick, which turns it into an expectation of a gradient,
\[
\nabla_{\theta} J(\theta | s)=\mathbb{E}_{\pi \sim p_{\theta}(. | s)}\left[L(\pi | s) \nabla_{\theta} \log p_{\theta}(\pi | s)\right]
\]
which can be approximated easily with Monte Carlo sampling, where the expectation is replaced by averaging over sampled tours.
Another common trick, subtracting a baseline \(b(s)\), leads to the following form of the gradient. Note that \(b(s)\) denotes a baseline function that must not depend on \(\pi\).
\[
\nabla_{\theta} J(\theta | s)=\mathbb{E}_{\pi \sim p_{\theta}(. |
s)}\left[(L(\pi | s)-b(s)) \nabla_{\theta} \log p_{\theta}(\pi |
s)\right]
\]
The trick is explained in the RL book as follows:
Because the baseline could be uniformly zero, this update is a strict
generalization of REINFORCE. In general, the baseline leaves the
expected value of the update unchanged, but it can have a large effect
on its variance.
Finally, the equation can be approximated with Monte Carlo sampling, assuming we draw \(B\) i.i.d. graphs \(s_{1}, s_{2}, \ldots, s_{B} \sim \mathcal{S}\) and sample a single tour per graph, \(\pi_{i} \sim p_{\theta}(. | s_{i})\), as follows:
\[
\nabla_{\theta} J(\theta) \approx \frac{1}{B}
\sum_{i=1}^{B}\left(L\left(\pi_{i} |
s_{i}\right)-b\left(s_{i}\right)\right) \nabla_{\theta} \log
p_{\theta}\left(\pi_{i} | s_{i}\right)
\]
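As an illustration only (my own hedged PyTorch-style sketch, not the paper's code), this is how the update above can be expressed, assuming per-graph tensors tour_len, baseline, and summed log-probabilities log_prob of shape [B]:

import torch

B = 16
tour_len = torch.rand(B)                       # L(pi_i | s_i), e.g. from a reward function
baseline = torch.rand(B)                       # b(s_i), e.g. a critic output or moving average
log_prob = torch.randn(B, requires_grad=True)  # sum of log p_theta over the sampled tour

# REINFORCE with baseline: the advantage is a constant w.r.t. theta, so detach it.
advantage = (tour_len - baseline).detach()
loss = (advantage * log_prob).mean()
loss.backward()  # gradients flow only through log_prob, matching the equation above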
Actor Critic Methods
REINFORCE with baseline works quite well, but it also has a disadvantage:

REINFORCE with baseline is unbiased and will converge asymptotically to a local minimum, but like all Monte Carlo methods it tends to learn slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing problems.
A typical improvement is the family of actor-critic methods, which not only learn an approximate policy (the actor's job) but also learn an approximate value function (the critic's job). A bootstrapping critic reduces variance and accelerates learning, at the cost of introducing bias, which is often beneficial. The detailed algorithm from the paper is illustrated below.
    def forward(self, batch_input: Tensor) -> Tuple[Tensor, List[Tensor], List[Tensor], List[Tensor]]:
        """
        Args:
            batch_input: [batch_size * 2 * seq_len]
        Returns:
            R: Tensor of shape [batch_size]
            action_prob_list: List of [seq_len], tensor shape [batch_size]
            action_list: List of [seq_len], tensor shape [batch_size * 2]
            action_idx_list: List of [seq_len], tensor shape [batch_size]
        """
        batch_size = batch_input.size(0)
        seq_len = batch_input.size(2)
        prob_list, action_idx_list = self.actor(batch_input)

        action_list = []
        batch_input = batch_input.transpose(1, 2)
        for action_id in action_idx_list:
            action_list.append(batch_input[[x for x in range(batch_size)], action_id.data, :])
        action_prob_list = []
        for prob, action_id in zip(prob_list, action_idx_list):
            action_prob_list.append(prob[[x for x in range(batch_size)], action_id.data])

        R = self.reward(action_list)

        return R, action_prob_list, action_list, action_idx_list

    def reward(self, sample_solution: List[Tensor]) -> Tensor:
        """
        Computes total distance of tour
        Args:
            sample_solution: list of size N, each tensor of shape [batch_size * 2]
        Returns:
            tour_len: [batch_size]
        """
        batch_size = sample_solution[0].size(0)
        n = len(sample_solution)
        tour_len = Variable(torch.zeros([batch_size]))