# Pairing Reinforcement Learning and Machine Learning, an Enhanced Emergency Response Scenario

Practical walkthroughs on machine learning, data exploration and finding insight.

**On YouTube:**

**Companion Code on GitHub:**

*Art credit: Lucas Amunategui*

Imagine a scenario unfolding at a chemical factory right after an explosion causes a dangerous chemical leak. The alarms are blaring and the personnel are evacuated because the leak cannot be located. An autonomous robot is sent inside the empty factory, equipped with a camera, lights, and environmental sensors capable of capturing ambient humidity, in hopes of locating the chemical leak.

The robot can travel quickly and easily through each room of the factory without any prior knowledge of the layout. Its primary goal is to find the shortest path to the leak by exploring each room and using its humidity sensor and camera. The robot’s decision-making framework uses reinforcement learning: it starts by traveling at random throughout the factory floor many times, recording each path taken in detail, to eventually determine the shortest route to the leak. The goal isn’t only to find the leak but also to find the shortest path to it. This is of tremendous help to the emergency crew, who can quickly locate the danger and limit human exposure to chemicals.

As Q-Learning requires constant back-and-forth and trial-and-error to find the shortest path, why not make use of that discovery process to record as much environmental data as possible? With the rise in popularity of the Internet of Things (IoT), there are plenty of third-party sensor attachments and hardware and software management tools to leverage. This could be as simple as attaching a temperature-sensing gauge to the bot with built-in data storage or real-time transmission capabilities. The hope is to analyze the data after each incident and learn more than simply the shortest path.

We then extend this scenario to another factory facing a similar chemical-leak situation. This time the bot’s software is enhanced with an additional system that leverages the IoT lessons learned from the first experience. The bot won’t only look for the shortest path to the leak but will also apply the environmental lessons it learned from the first incident.

Let’s see how the process of search and discovery can be enhanced by combining **Reinforcement Learning (RL)**, **Machine Learning (ML)**, the **Internet of Things (IoT)**, and **Case-Based Reasoning (CBR)**.

**Note:** For a gentle introduction to RL and Q-Learning in Python, see a post I made on my GitHub blog: http://amunategui.github.io/reinforcement-learning/index.html

# Finding the Factory Floor with the Dangerous Chemical Leak

The goal of this experiment is to apply the Q-Learning method to map the shortest path to a chemical leak for the response team. The factory floor plan considered for this experiment comprises 57 rooms, with the entrance at room 0 and the goal at room 50 (see Figure 1).

**Figure 1:** Shows the graph representation of our chemical
factory. Each colored circle is a room. The entrance to the factory is room ‘0’
and the room with the chemical leak is room ‘50’.

Q-Learning is a model-free reinforcement learning process that will iterate thousands of times to map and learn the fastest route to a goal point, room ‘50’ in our case. The bot learns using a rewards-feedback mechanism. This is done with the help of a rewards table, a large matrix used to score each path the bot can follow: the matrix version of a road map. We initialize the matrix to be the height and width of all our points (64 in this example) and set all values to -1. We then change the viable paths to value 0 and goal paths to 100 (you can make the goal score anything you want as long as it’s large enough to propagate during Q-Learning).
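To make the rewards-table setup concrete, here is a short sketch of how such a matrix could be built with NumPy. The six-room map and its doorways below are purely illustrative, not the actual factory layout:

```python
import numpy as np

n_points = 6                      # tiny illustrative map, not the real factory
goal = 5                          # hypothetical goal room
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]  # hypothetical viable doorways

R = np.full((n_points, n_points), -1.0)  # -1 marks non-existing paths
for a, b in edges:
    R[a, b] = 0                   # viable path, no immediate reward
    R[b, a] = 0                   # doorways work in both directions
for a in range(n_points):
    if R[a, goal] == 0:           # any room that opens onto the goal...
        R[a, goal] = 100          # ...gets the goal reward
R[goal, goal] = 100               # staying at the goal also scores 100
```

Each row of `R` is a room, each column a possible next room; the bot only ever considers cells that aren’t -1.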

When the model starts, it creates a blank Q-matrix (hence the name Q-Learning), a matrix the same size as the rewards matrix but with all values initialized to 0. It is this Q-matrix that our model will use to keep track of and score how well the paths are doing. Each attempt is scored using the following formula:

Q(state, action) = R(state, action) + gamma * max[Q(next state, all actions)]    (1)

Equation (1) returns a score evaluating the move from one point to a new one, while also taking into consideration any previous scores the model has already seen. The term ‘*state*’ is the current room the bot is in, and ‘*action*’ is the next state, or next room. Gamma is a tunable variable between 0 and 1, where values closer to 0 make the model favor an immediate reward while values closer to 1 make it consider alternative paths and allow rewards from later moves. Finally, the formula multiplies gamma against the best experienced next action from the Q-matrix. This is an important part, as all new actions take into consideration previous lessons learned.
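The update rule in Equation (1) fits in a few lines of Python. The sketch below is a minimal illustration, assuming a NumPy rewards matrix `R` where -1 marks non-existing paths; the discount value 0.8 and the iteration count are illustrative choices, not values from the post:

```python
import numpy as np

def q_learn(R, gamma=0.8, iterations=1000, seed=0):
    """Minimal Q-Learning loop over a rewards matrix R (-1 = no path)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros_like(R, dtype=float)            # blank Q-matrix
    n = R.shape[0]
    for _ in range(iterations):
        state = rng.integers(n)                  # start from a random room
        actions = np.flatnonzero(R[state] >= 0)  # rooms reachable from here
        if actions.size == 0:
            continue                             # dead end, try another room
        action = rng.choice(actions)             # pick a viable move at random
        # Equation (1): immediate reward plus discounted best future reward
        Q[state, action] = R[state, action] + gamma * Q[action].max()
    return Q
```

After enough iterations the Q-matrix stabilizes, and the best next room from any state is simply the highest-scoring column in that state’s row.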

To understand how this works, let’s take a slight detour, and imagine that our factory has only 5 rooms as shown in the Figure 2:

**Figure 2:** All paths are initialized to 0 except for paths leading to the goal point, which are set to 100 (room 2 to room 4 and, reciprocally, room 4 to room 2).

Below is the matrix format of the rewards table (-1s are used to fill non-existing paths which the bot will ignore):

**Figure 3:** Reward matrix with path and goal-point scores.

The Q-matrix is the same size as the rewards matrix but all cells are initialized with zeros. For n iterations, the model will randomly select a state, a point on the map represented by a row of the rewards matrix, then move to another state and calculate the reward for that action. For example, say the model picks point 2 as its state; it can go to point 1, 3, or 4. According to the rewards matrix, the bot is on row 2 (third from the top) and the only cells that aren’t -1s are 1, 3, and 4. Point 4 is chosen at random. According to the Q-Learning algorithm, the score for this move is the reward for the current move plus gamma times the maximum value of the new point’s actions:

Q[2,4] = R[2,4] + gamma * max(Q[4,2], Q[4,4]) = 100 + gamma * 0 = 100

That move is valued at 100 points and entered into the Q-matrix. Why 100? R[2,4] holds 100, and gamma multiplied by max(Q[4,2], Q[4,4]) equals zero, as the bot hasn’t visited those rooms yet and our Q-matrix still only holds zeros. Had the model chosen point 3 instead, the move would have been worth 0 points, since R[2,3] is 0 and the Q-matrix is still all zeros:

Q[2,3] = R[2,3] + gamma * max(Q[3, all actions]) = 0 + gamma * 0 = 0
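The arithmetic for that first move can be checked in a couple of lines. Gamma is set to 0.8 here purely for illustration; any value gives the same result while the Q-matrix still holds only zeros:

```python
gamma = 0.8                          # illustrative discount value
Q = [[0.0] * 5 for _ in range(5)]    # blank 5x5 Q-matrix, all zeros
R_2_4 = 100                          # reward for moving from room 2 to room 4
score = R_2_4 + gamma * max(Q[4][2], Q[4][4])
print(score)  # → 100.0
```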

After running the model for a few hundred iterations, it converges and returns the following matrix:

**Figure 4:** Q-Matrix with converged scores from
learning process.

The matrix shows that, starting from point 0 (row 0, room 0), the next step with the highest score is room 1, at 215.14 points. Moving down to room 1’s row, room 3 has the highest score, at 268.93 points. Finally, dropping to room 3’s row, room 4 has the highest score, at 336.16 points, and room 4 is the goal point. But you can also pick any point you want and find the best path to the goal point from that vantage point (for example, starting from room/point 3).
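Reading the shortest path off a converged Q-matrix amounts to repeatedly following the highest-scoring column in the current room’s row, exactly as described above. Here is a minimal sketch; the Q values below are hand-made for a five-room map to keep the example self-contained, not the converged scores from Figure 4:

```python
import numpy as np

def best_path(Q, start, goal, max_steps=100):
    """Greedily follow the highest-scoring next room in the Q-matrix."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        path.append(int(np.argmax(Q[path[-1]])))
    return path

# Hand-made Q-matrix (hypothetical values, for illustration only):
Q = np.array([
    [0.,  80., 0., 0.,   0.],   # from room 0, best next room is 1
    [64., 0.,  0., 100., 0.],   # from room 1, best next room is 3
    [0.,  0.,  0., 0.,   100.],
    [0.,  80., 0., 0.,   125.], # from room 3, best next room is 4 (the goal)
    [0.,  0.,  100., 0., 125.],
])
print(best_path(Q, start=0, goal=4))  # → [0, 1, 3, 4]
```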

Let’s get back to our factory incident. We took this detour because the matrices for our factory are just too big to display, and the small example is hopefully easier to understand. We now let the bot run loose on the factory floor.