Tag: reinforcement

# Beating Battleships with Algorithms and AI

Past days, I discovered this series of blogs on how to win the classic game of Battleships (gameplay explanation) using different algorithmic approaches. I thought they might amuse you as well : )

The story starts with this 2012 Datagenetics blog where Nick Berry constrasts four algorithms’ performance in the game of Battleships. The resulting levels of artificial intelligence (AI) seem to compare respectively to a distracted baby, two sensible adults, and a mathematical progidy.

The first, stupidest approach is to just take Random shots. The AI resulting from such an algorithm would just pick a random tile to shoot at each turn. Nick simulated 100 million games with this random apporach and computed that the algorithm would require 96 turns to win 50% of games, given that it would not be defeated before that time. At best, the expertise level of this AI would be comparable to that of a distracted baby. Basically, it would lose from the average toddler, given that the toddler would survive the boredom of playing such a stupid AI.

A first major improvement results in what is dubbed the Hunt algorithm. This improved algorithm includes an instruction to explore nearby spaces whenever a prior shot hit. Every human who has every played Battleships will do this intuitively. A great improvement indeed as Nick’s simulations demonstrated that this Hunt algorithm completes 50% of games within ~65 turns, as long as it is not defeated beforehand. Your little toddler nephew will certainly lose, and you might experience some difficulty as well from time to time.

Another minor improvement comes from adding the so-called Parity principle to this Hunt algorithm (i.e., Nick’s Hunt + Parity algorithm). This principle instructs the algorithm to take into account that ships will always cover odd as well as even numbered tiles on the board. This information can be taken into account to provide for some more sensible shooting options. For instance, in the below visual, you should avoid shooting the upper left white tile when you have already shot its blue neighbors. You might have intuitively applied this tactic yourself in the past, shooting tiles in a “checkboard” formation. With the parity principle incorporated, the median completion rate of our algorithm improves to ~62 turns, Nick’s simulations showed.

Now, Nick’s final proposed algorithm is much more computationally intensive. It makes use of Probability Density Functions. At the start of every turn, it works out all possible locations that every remaining ship could fit in. As you can imagine, many different combinations are possible with five ships. These different combinations are all added up, and every tile on the board is thus assigned a probability that it includes a ship part, based on the tiles that are already uncovered.

At the start of the game, no tiles are uncovered, so all spaces will have about the same likelihood to contain a ship. However, as more and more shots are fired, some locations become less likely, some become impossible, and some become near certain to contain a ship. For instance, the below visual reflects seven misses by the X’s and the darker tiles which thus have a relatively high probability of containing a ship part.

Nick simulated 100 million games of Battleship for this probabilistic apporach as well as the prior algorithms. The below graph summarizes the results, and highlight that this new probabilistic algorithm greatly outperforms the simpler approaches. It completes 50% of games within ~42 turns! This algorithm will have you crying at the boardgame table.

Reddit user /u/DataSnaek reworked this probablistic algorithm in Python and turned its inner calculations into a neat GIF. Below, on the left, you see the probability of each square containing a ship part. The brighter the color (white <- yellow <- red <- black), the more likely a ship resides at that location. It takes into account that ships occupy multiple consecutive spots. On the right, every turn the algorithm shoots the space with the highest probability. Blue is unknown, misses are in red, sunk ships in brownish, hit “unsunk” ships in light blue (sorry, I am terribly color blind).

This latter attempt by DataSnaek was inspired by Jonathan Landy‘s attempt to train a reinforcement learning (RL) algorithm to win at Battleships. Although the associated GitHub repository doesn’t go into much detail, the approach is elaborately explained in this blog. However, it seems that this specific code concerns the training of a neural network to perform well on a very small Battleships board, seemingly containing only a single ship of size 3 on a board with only a single row of 10 tiles.

Fortunately, Sue He wrote about her reinforcement learning approach to Battleships in 2017. Building on the open source phoenix-battleship project, she created a Battleship app on Heroku, and asked co-workers to play. This produced data on 83 real, two-person games, showing, for instance, that Sue’s coworkers often tried to hide their size 2 ships in the corners of the Battleships board.

Next, Sue scripted a reinforcement learning agent in PyTorch to train and learn where to shoot effectively on the 10 by 10 board. It became effective quite quickly, requiring only 52 turns (on average over the past 25 games) to win, after training for only a couple hundreds games.

However, as Sue herself notes in her blog, disappointly, this RL agent still does not outperform the probabilistic approach presented earlier in this current blog.

Reddit user /u/christawful faced similar issues. Christ (I presume he is called) trained a convolutional neural network (CNN) with the below architecture on a dataset of Battleships boards. Based on the current board state (10 tiles * 10 tiles * 3 options [miss/hit/unknown]) as input data, the intermediate convolutional layers result in a final output layer containing 100 values (10 * 10) depicting the probabilities for each tile to result in a hit. Again, the algorithm can simply shoot the tile with the highest probability.

Christ was nice enough to include GIFs of the process as well [via]. The first GIF shows the current state of the board as it is input in the CNN — purple represents unknown tiles, black a hit, and white a miss (i.e., sea). The next GIF represent the calculated probabilities for each tile to contain a ship part — the darker the color the more likely it contains a ship. Finally, the third picture reflects the actual board, with ship pieces in black and sea (i.e., miss) as white.

As cool as this novel approach was, Chris ran into the same issue as Sue, his approach did not perform better than the purely probablistic one. The below graph demonstrates that while Christ’s CNN (“My Algorithm”) performed quite well — finishing a simulated 9000 games in a median of 52 turns — it did not outperform the original probabilistic approach of Nick Berry — which came in at 42 turns. Nevertheless, Chris claims to have programmed this CNN in a couple of hours, so very well done still.

Interested by all the above, I searched the web quite a while for any potential improvement or other algorithmic approaches. Unfortunately, in vain, as I did not find a better attempt than that early 2012 Datagenics probability algorithm by Nick.

Surely, with today’s mass cloud computing power, someone must be able to train a deep reinforcement learner to become the Battleship master? It’s not all probability right, there must be some patterns in generic playing styles, like Sue found among her colleagues. Or maybe even the ability of an algorithm to adapt to the opponent’s playin style, as we see in Libratus, the poker AI. Maybe the guys at AlphaGo could give it a shot?

For starters, Christ’s provided some interesting improvements on his CNN approach. Moreover, while the probabilistic approach seems the best performing, it might not the most computationally efficient. All in all, I am curious to see whether this story will continue.

Join 365 other followers

# Open Source Visual Inspector for Neuroevolution (VINE)

In optimizing their transportation services, Uber uses evolutionary strategies and genetic algorithms to train deep neural networks through reinforcement learning. A lot of difficult words in one sentence; you can imagine the complexity of the process.

Because it is particularly difficult to observe the underlying dynamics of this learning process in neural network optimization, Uber built VINE – a Visual Inspector for NeuroEvolution. VINE helps to discover how evolutionary strategies and genetic optimizing are performing under the hood. In a recent article, they demonstrate how VINE works on the Mujoco Humanoid Locomotion task.

[…] In the Humanoid Locomotion Task, each pseudo-offspring neural network controls the movement of a robot, and earns a score, called its fitness, based on how well it walks. [Evolutionary principles] construct the next parent by aggregating the parameters of pseudo-offspring based on these fitness scores […]. The cycle then repeats.

VINE plots parent neural networks and their pseudo-offspring according to their performance. Users can then interact with these plots to:

• visualize parents, top performance, and/or the entire pseudo-offspring cloud of any generation,
• compare between and within generation performance,
• and zoom in on any pseudo-offspring (points) in the plot to display performance information.

The GIFs below demonstrate what VINE is capable of displaying:

VINE can be found at this link. It is lightweight, portable, and implemented in Python.

# Libratus: A Texas Hold-Em Poker AI

Four of the best professional poker players in the world – Dong Kim, Jason Les, Jimmy Chou, and Daniel McAulay – recently got beat by Libratus, a poker-playing AI developed at the Pittsburgh Supercomputing Center. During a period of 20 days of continuous play (10h/day), each of these four professionals lost to Libratus heads-up in a whopping total of 120.000 hands of No Limit Texas Hold-em Poker.

A player may face 10 to the power of 160 different situations in Texas Hold-em Poker: more than the number of atoms in the universe. It took extensive machine learning to compute and prioritize the computation of the most rewarding actions in these situations. Libratus works by running extensive simulations, taking into account the way the professionals play, and figuring out the best counter strategy. Although it is not without flaws, any “holes” the players found in Libratus’ strategy could not be exploited for long, as the algorithm would quickly learn and adapt to prevent further exploitation. The experience was completely different from playing a human player, the professionals argue, as Libratus would make both tiny and huge bets and would continuously change its strategy and plays.

The video below provides more detailed information and also shows the million-dollar margin by which Libratus won at the end of the twenty day poker (training) marathon:

# Video: Human-Computer Interactions in Reinforcement Learning

Reinforcement learning is an area of machine learning inspired by behavioral psychology, concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward (Wikipedia, 2017). Normally, reinforcement learning occurs autonomously. Here, algorithms will seek to minimize/maximize a score that is estimated via predefined constraints. As such, algorithms can thus learn to perform the most effective actions (those that minimize/maximize the score) by repeatedly experimenting and assessing strategies.

The approach in the video below is radically different. Instead of a pre-defined scoring, human-computer interaction is used to assign each action sequence (each iteration/experiment) a score. This approach is particularly useful for complex behaviors, such as a back-flip, for which it is hard to pre-define the constraints and actions that lead to the “most effective” back-flip. However, for us humans, it is relatively easy to recognize a good back-flip when we see one. The video below shows how the researchers therefore integrated a human-computer interaction in their reinforcement learning algorithm. After observing the algorithm perform a sequence of actions, a human actor indicates to what extent the goal (i.e., a backflip) is achieved or not. This human assessment thus functions as the score which the algorithm will try to minimize/maximize.

This approach can be really valuable for organizations seeking to improve their machine learning application. The paper on the principle (Deep Reinforcement Learning from Human Preferences) can be found here. The scholars conclude that this supervised approach based on human preferences has very good training results whereas the cost are similar the simple bulldozer approach of training a neural net from scratch using GPU servers.