# Reinforcement learning ppt

Currently, I know very little about reinforcement learning, and I want to change that, so here is my first step in learning it. To start, I want to cover high-level overviews.

Please note that this post is for my future self and my learning process might be slow or different from yours.

Reinforcement Learning Tutorial by Peter Bodík, UC Berkeley

From this lecture, I learned that reinforcement learning is more general than supervised or unsupervised learning. However, there still seems to be a notion of a goal, so I assume there is going to be some cost function that measures how close we are to achieving that goal. Below is a very good summary of what reinforcement learning might be.

Right Image → Optimal Solution (No Reward For each step)
Middle Image → Solution when reward for each step is -0.1
Left Image → Solution when reward for each step is 0.01

The images above are a perfect example (for me) of how complex reinforcement learning can be. If we build a robot whose objective is to collect the most points, the optimal solution is the rightmost image. However, depending on how the reward is set up (here, the reward for each step), the solutions the robot learns are drastically different. From here the ppt explains quite a lot of math, which I won't include in a high-level overview, but the ppt had a very good summary page.

From that summary, I learned when reinforcement learning is used, and that the most challenging part of reinforcement learning is actually designing the features, states, and rewards.
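The effect of the per-step reward can be reproduced on a toy problem. The sketch below does not use the grid from the slides: it assumes a hypothetical six-cell corridor with a small exit (+1) on the left end and a big exit (+10) on the right end, and it uses plain value iteration, chosen here only for illustration. The names `greedy_policy`, `small_exit`, and `big_exit` are made up for this example.

```python
def greedy_policy(step_reward, n_states=6, small_exit=1.0, big_exit=10.0, sweeps=50):
    """Value iteration on a deterministic corridor with an exit at each end."""
    exits = {0: small_exit, n_states - 1: big_exit}
    V = [0.0] * n_states  # exit cells keep value 0: the episode ends there

    def backup(s, move):
        nxt = s + move
        # Every move pays step_reward; entering an exit adds its payoff.
        return step_reward + exits.get(nxt, 0.0) + V[nxt]

    interior = range(1, n_states - 1)
    for _ in range(sweeps):
        for s in interior:
            V[s] = max(backup(s, -1), backup(s, +1))
    # Read off the greedy move in every non-exit cell.
    return {s: "left" if backup(s, -1) > backup(s, +1) else "right" for s in interior}


print(greedy_policy(step_reward=-0.1))  # cheap steps
print(greedy_policy(step_reward=-4.0))  # expensive steps
```

With cheap steps the agent walks to the big exit from every cell. With expensive steps, the cell next to the small exit prefers to leave immediately, so the same world yields a different learned behaviour just because the step reward changed.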

International Conference on Machine Learning (ICML 2007) Tutorial

From the previous presentation, we already learned that the challenging part is designing the states and rewards. The acronyms at the bottom stand for Markov decision process (MDP) and partially observable Markov decision process (POMDP).

A perfect example of states, actions, and rewards is shown above. We can see that this setting can easily be applied to any game (chess, StarCraft) or even to real-world settings.

The slide lists different types of learning algorithms. From here I learned that there are different families of reinforcement learning algorithms, similar to how classification has SVMs, neural networks, or k-nearest neighbors.

Simple Beginner’s guide to Reinforcement Learning & its implementation (analyticsvidhya)

When we google reinforcement learning, we see images like the one above over and over again. So rather than thinking in terms of an agent and an environment, let's think of it as the process of a baby learning how to walk.

“The “problem statement” of the example is to walk, where the child is an agent trying to manipulate the environment (which is the surface on which it walks) by taking actions (viz walking) and he/she tries to go from one state (viz each step he/she takes) to another. The child gets a reward(let’s say chocolate) when he/she accomplishes a sub module of the task (viz taking couple of steps) and will not receive any chocolate (a.k.a negative reward) when he/she is not able to walk. This is a simplified description of a reinforcement learning problem.” (Faizan Shaikh)

The author actually gives a long explanation of how these algorithms differ; see the original article for the details. In one or two sentences:

Supervised vs RL: Both learn a mapping between input and output, but in RL there is a reward function that scores the actions the agent took, in addition to a cost function that measures whether we met the final goal (e.g., in chess, winning the game is what matters, but there are multiple ways to win).
Unsupervised vs RL: Unsupervised learning is (mostly) about finding patterns in the underlying data and clustering them.

Source: https://towardsdatascience.com/my-journey-to-reinforcement-learning-part-0-introduction-1e3aec1ee5bf

## Hierarchical Reinforcement Learning - PowerPoint PPT Presentation


Source: https://www.powershow.com/view4/71f777-MDFkM/Hierarchical_Reinforcement_Learning_powerpoint_ppt_presentation

• Reinforcement Learning. Instructor: Max Welling. Source: T. Mitchell, Machine Learning, Chapter 13.

• Overview
  • Supervised Learning: Immediate feedback (labels provided for every input).
  • Unsupervised Learning: No feedback (no labels provided).
  • Reinforcement Learning: Delayed scalar feedback (a number called reward).
  • RL deals with agents that must sense & act upon their environment.
  • This combines classical AI and machine learning techniques.
  • It is the most comprehensive problem setting.
  • Examples:
    • A robot cleaning my room and recharging its battery
    • Robot-soccer
    • How to invest in shares
    • Modeling the economy through rational agents
    • Learning how to fly a helicopter
    • Scheduling planes to their destinations
    • and so on

• The Big Picture Your action influences the state of the world which determines its reward

• Complications
  • The outcome of your actions may be uncertain.
  • You may not be able to perfectly sense the state of the world.
  • The reward may be stochastic.
  • Reward is delayed (e.g. finding food in a maze).
  • You may have no clue (model) about how the world responds to your actions.
  • You may have no clue (model) of how rewards are being paid off.
  • The world may change while you try to learn it.
  • How much time do you need to explore uncharted territory before you exploit what you have learned?

• The Task
  • To learn an optimal policy that maps states of the world to actions of the agent. E.g., if this patch of room is dirty, I clean it. If my battery is empty, I recharge it.
  • What is it that the agent tries to optimize? Answer: the total future discounted reward:
    $V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$, with $0 \le \gamma < 1$
  • Note: immediate reward is worth more than future reward. What would happen to a mouse in a maze with gamma = 0?
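To get a feel for the discount factor gamma, here is a minimal sketch that sums a finite reward sequence with weights gamma^i. The maze episode (three empty steps, then food worth 1) and the function name `discounted_return` are assumptions made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i] over a finite reward sequence."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical maze episode: no reward for three steps, then food worth 1.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.9))  # the food is discounted by 0.9**3
print(discounted_return(rewards, gamma=0.0))  # only the immediate reward counts
```

With gamma = 0 the return of this episode is zero, which hints at the answer to the mouse question: an agent that ignores all future reward sees no reason to head towards distant food.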

• Value Function
  • Let’s say we have access to the optimal value function $V^*(s)$ that computes the total future discounted reward.
  • What would be the optimal policy? Answer: we choose the action that maximizes: $\pi^*(s) = \arg\max_a \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$
  • We assume that we know what the reward will be if we perform action “a” in state “s”: $r(s,a)$
  • We also assume we know what the next state of the world will be if we perform action “a” in state “s”: $s_{t+1} = \delta(s_t, a_t)$
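As a rough illustration of reading a policy off a known value function, the sketch below assumes a hypothetical four-state chain (state 3 is the goal) with hand-written r(s,a), delta(s,a), and optimal values. None of these numbers come from the slides; they are only there to make the "pick the action that maximizes reward plus discounted next value" rule concrete.

```python
GAMMA = 0.9
GOAL = 3
ACTIONS = ("left", "right")


def delta(s, a):
    """Assumed deterministic dynamics: move one cell, walls at both ends."""
    if s == GOAL:
        return GOAL  # the goal is absorbing
    step = 1 if a == "right" else -1
    return min(max(s + step, 0), GOAL)


def r(s, a):
    """Assumed reward: 1 for stepping into the goal, 0 otherwise."""
    return 1.0 if s != GOAL and delta(s, a) == GOAL else 0.0


# Optimal values for this toy chain, worked out by hand.
V_STAR = {0: 0.81, 1: 0.9, 2: 1.0, 3: 0.0}


def greedy_action(s):
    # Choose the action maximizing r(s,a) + gamma * V*(delta(s,a)).
    return max(ACTIONS, key=lambda a: r(s, a) + GAMMA * V_STAR[delta(s, a)])


policy = {s: greedy_action(s) for s in range(GOAL)}
print(policy)
```

Note that `greedy_action` needs both `r` and `delta`, which is exactly the requirement the later slides try to get rid of.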

• Example I
  • Consider some complicated graph, and we would like to find the shortest path from a node $S_i$ to a goal node $G$.
  • Traversing an edge will cost you “length edge” dollars.
  • The value function encodes the total remaining distance to the goal node from any node s, i.e. V(s) = “1 / distance” to goal from s.
  • If you know V(s), the problem is trivial. You simply choose the node that has the highest V(s).

• Example II Find your way to the goal.

• Q-Function
  • Bellman equation: $V^*(s) = \max_a \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$
  • One approach to RL is then to try to estimate V*(s).
  • However, this approach requires you to know r(s,a) and delta(s,a). This is unrealistic in many real problems. What is the reward if a robot is exploring Mars and decides to take a right turn?
  • Fortunately we can circumvent this problem by exploring and experiencing how the world reacts to our actions. We need to learn r & delta.
  • We want a function that directly learns good state-action pairs, i.e. what action should I take in this state. We call this Q(s,a).
  • Given Q(s,a) it is now trivial to execute the optimal policy, without knowing r(s,a) and delta(s,a). We have: $\pi^*(s) = \arg\max_a Q(s,a)$ and $V^*(s) = \max_a Q(s,a)$
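The relationship between Q and V* can be checked numerically on a tiny example. The sketch below assumes the standard definition Q(s,a) = r(s,a) + gamma V*(delta(s,a)) and a hypothetical three-state world written as lookup tables; the tables `DELTA`, `REWARD`, and `V_STAR` are invented for illustration.

```python
GAMMA = 0.9

# Hypothetical 3-state world (state 2 is the goal), written as lookup tables.
DELTA = {(0, "left"): 0, (0, "right"): 1, (1, "left"): 0, (1, "right"): 2}
REWARD = {(0, "left"): 0.0, (0, "right"): 0.0, (1, "left"): 0.0, (1, "right"): 1.0}
V_STAR = {0: 0.9, 1: 1.0, 2: 0.0}

# Build the Q table once from the model.
Q = {sa: REWARD[sa] + GAMMA * V_STAR[DELTA[sa]] for sa in DELTA}


def best_action(s):
    """Acting only needs Q: no reward or transition lookups here."""
    return max((a for (state, a) in Q if state == s), key=lambda a: Q[(s, a)])


def value_from_q(s):
    """V*(s) recovered as the maximum of Q(s, a) over actions."""
    return max(q for (state, _), q in Q.items() if state == s)


for s in (0, 1):
    print(s, best_action(s), value_from_q(s), V_STAR[s])
```

Once the table `Q` exists, `best_action` never touches `REWARD` or `DELTA`, which is the point of working with Q instead of V*.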

• Example II Check that

• Q-Learning
  • $Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a), a')$
  • This still depends on r(s,a) and delta(s,a).
  • However, imagine the robot is exploring its environment, trying new actions as it goes.
  • At every step it receives some reward “r”, and it observes the environment change into a new state s’ for action a.
  • How can we use these observations, (s,a,s’,r), to learn a model? $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$, where $s' = s_{t+1}$

• Q-Learning
  • $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$, where $s' = s_{t+1}$
  • This equation continually estimates Q at state s consistent with an estimate of Q at state s’, one step in the future: temporal difference (TD) learning.
  • Note that s’ is closer to the goal, and hence more “reliable”, but still an estimate itself.
  • Updating estimates based on other estimates is called bootstrapping.
  • We do an update after each state-action pair. I.e., we are learning online!
  • We are learning useful things about explored state-action pairs. These are typically most useful because they are likely to be encountered again.
  • Under suitable conditions, these updates can actually be proved to converge to the real answer.
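A minimal tabular sketch of this idea, assuming the deterministic update "new Q(s,a) = r + gamma * max over a' of Q(s',a')" and a hypothetical four-state chain with the goal at state 3. The environment function `step`, the purely random exploration, and the episode counts are all assumptions chosen to keep the example small.

```python
import random

GAMMA = 0.9
GOAL = 3
ACTIONS = ("left", "right")


def step(s, a):
    """Hypothetical deterministic world: returns (next_state, reward)."""
    s_next = min(max(s + (1 if a == "right" else -1), 0), GOAL)
    return s_next, (1.0 if s_next == GOAL else 0.0)


def q_learning(episodes=500, max_steps=20, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randrange(GOAL)      # start in a random non-goal state
        for _ in range(max_steps):
            a = rng.choice(ACTIONS)  # pure exploration, no model of the world
            s_next, r = step(s, a)
            # Bootstrapped update: the estimate at s leans on the estimate at s'.
            Q[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
            if s_next == GOAL:
                break                # the goal is terminal; its Q-values stay 0
            s = s_next
    return Q


Q = q_learning()
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)}
print(greedy)
```

The learner only ever sees (s, a, s', r) tuples returned by `step`; it never reads the reward or transition rules directly.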

• Example Q-Learning Q-learning propagates Q-estimates 1-step backwards

• Exploration / Exploitation
  • It is very important that the agent does not simply follow the current policy when learning Q (off-policy learning). The reason is that you may get stuck in a suboptimal solution, i.e. there may be other solutions out there that you have never seen.
  • Hence it is good to try new things every now and then, e.g. by choosing actions with probability $P(a|s) = \frac{e^{\hat{Q}(s,a)/T}}{\sum_{a'} e^{\hat{Q}(s,a')/T}}$
  • If T is large there is lots of exploring; if T is small the agent follows the current policy.
  • One can decrease T over time.
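One common way to get this temperature behaviour is Boltzmann (softmax) action selection, where the probability of an action is proportional to exp(Q/T). That this is the rule intended on the slide is an assumption; the sketch below just shows how T trades exploring against following the current estimates. The function names and the Q-values are made up.

```python
import math
import random


def boltzmann_probs(q_values, T):
    """Action probabilities proportional to exp(Q / T)."""
    m = max(q_values)  # subtracting the max avoids overflow for small T
    weights = [math.exp((q - m) / T) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, T, rng=random):
    """Draw an action index according to the Boltzmann probabilities."""
    probs = boltzmann_probs(q_values, T)
    return rng.choices(range(len(q_values)), weights=probs)[0]


q = [1.0, 2.0, 0.5]  # hypothetical Q-values for three actions in one state
print(boltzmann_probs(q, T=100.0))  # nearly uniform: lots of exploring
print(boltzmann_probs(q, T=0.01))   # nearly all mass on the best action
```

Decreasing T over time then moves the agent smoothly from the first regime to the second.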

• Improvements
  • One can trade off memory and computation by caching (s,a,s’,r) for observed transitions. After a while, as Q(s’,a’) has changed, you can “replay” the update: $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$
  • One can actively search for state-action pairs for which Q(s,a) is expected to change a lot (prioritized sweeping).
  • One can do updates along the sampled path much further back than just one step (TD($\lambda$) learning).
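The replay idea can be sketched with a plain list as the cache. The example assumes each stored transition is a tuple (s, a, r, s') and reuses the deterministic update "Q(s,a) = r + gamma * max over a' of Q(s',a')". The three cached transitions describe one hypothetical walk to a goal and are invented for illustration.

```python
GAMMA = 0.9
ACTIONS = ("left", "right")


def update(Q, s, a, r, s_next):
    """One deterministic Q-learning backup for a stored transition."""
    Q[(s, a)] = r + GAMMA * max(Q.get((s_next, b), 0.0) for b in ACTIONS)


# Hypothetical cached experience from one walk 0 -> 1 -> 2 -> goal (state 3).
buffer = [(0, "right", 0.0, 1), (1, "right", 0.0, 2), (2, "right", 1.0, 3)]

Q = {}
for transition in buffer:  # the online pass, in the order things were visited
    update(Q, *transition)
after_online = dict(Q)

for _ in range(2):         # replay the cache: no new interaction with the world
    for transition in buffer:
        update(Q, *transition)

print(after_online[(0, "right")], Q[(0, "right")])
```

After the single online pass only the last step knows about the reward. Replaying the same three transitions lets that information travel back to the start state without the agent taking another step.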

• Extensions
  • To deal with stochastic environments, we need to maximize expected future discounted reward: $Q(s,a) = E[r(s,a)] + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a')$
  • Often the state space is too large to deal with all states. In this case we need to learn a function: $Q(s,a) \approx f_\theta(s,a)$
  • Neural networks with back-propagation have been quite successful.
  • For instance, TD-Gammon is a backgammon program that plays at expert level: the state space is very large, it is trained by playing against itself, it uses a NN to approximate the value function, and it uses TD(lambda) for learning.

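• More on Function Approximation
  • For instance, a linear function: $Q_\theta(s,a) = \sum_k \theta_{ka} \Phi_k(s)$
  • The features Phi are fixed measurements of the state (e.g. # stones on the board).
  • We only learn the parameters theta.
  • Update rule (start in state s, take action a, observe reward r and end up in state s’): $\theta_{ka} \leftarrow \theta_{ka} + \alpha \left[ r + \gamma \max_{a'} Q_\theta(s',a') - Q_\theta(s,a) \right] \Phi_k(s)$, where $\alpha$ is a learning rate and the bracketed term is the change in Q.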
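A rough sketch of such an update, assuming one weight vector per action, Q(s,a) computed as the dot product of that vector with the state features, and a learning rate `ALPHA` that the slide does not specify. The feature values and function names are hypothetical.

```python
GAMMA = 0.9
ALPHA = 0.1  # assumed learning rate


def q_value(theta, phi, a):
    """Linear Q: dot product of the action's weights with the state features."""
    return sum(t * f for t, f in zip(theta[a], phi))


def td_update(theta, phi_s, a, r, phi_next):
    """Move theta[a] so that Q(s,a) shifts towards r + gamma * max Q(s',a')."""
    target = r + GAMMA * max(q_value(theta, phi_next, b) for b in theta)
    change = target - q_value(theta, phi_s, a)  # the "change in Q" term
    theta[a] = [t + ALPHA * change * f for t, f in zip(theta[a], phi_s)]
    return change


# Two actions, two hypothetical features per state, weights start at zero.
theta = {"left": [0.0, 0.0], "right": [0.0, 0.0]}
phi_s, phi_next = [1.0, 0.5], [0.0, 1.0]
change = td_update(theta, phi_s, "right", r=1.0, phi_next=phi_next)
print(change, theta["right"], q_value(theta, phi_s, "right"))
```

Only the weights of the action that was actually taken move, and they move in proportion to the features that were active in state s.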

• Conclusion
  • Reinforcement learning addresses a very broad and relevant question: How can we learn to survive in our environment?
  • We have looked at Q-learning, which simply learns from experience. No model of the world is needed.
  • We made simplifying assumptions: e.g. the state of the world only depends on the last state and action. This is the Markov assumption. The model is called a Markov Decision Process (MDP).
  • We assumed deterministic dynamics and a deterministic reward function, but the world really is stochastic.
  • There are many extensions to speed up learning.
  • There have been many successful real world applications. http://elsy.gdan.pl/index.php?option=com_content&task=view&id=20&Itemid=39

Source: https://www.slideserve.com/stefan/reinforcement-learning-powerpoint-ppt-presentation