Today I have the great, great pleasure to teach you about hidden Markov models and filter algorithms. The reason why I'm so excited is in pretty much all of my scientific career, hidden Markov models and filters have played a major role. There's no robot that I program today that wouldn't extensively use hidden Markov models and things such as particle filters. In fact, when I applied for a job at Stanford University as a professor many years ago, my job talk that I used to market myself to Stanford was extensively about a version of hidden Markov models and particle filters applied to robotic mapping. Today I will teach you those algorithms so you can use them in many challenging problems. I can't quite promise you that once you have mastered the material you will get a job at Stanford, but you can really, really apply them to a vast array of problems in places such as finance, medicine, robotics, weather prediction, time series analysis, and many, many other domains. This is going to be a really fun class.
[Thrun] Hidden Markov models, or abbreviated HMMs, are used to analyze or to predict time series. Applications include robotics, medicine, finance, speech and language technologies, and many, many, many other domains. In fact, HMMs and filters are at the core of a huge number of deployed practical systems from elevators to airplanes. Every time there is a time series that involves noise or sensors or uncertainty, this is the method of choice. So today I'll teach you all about HMMs and filters so you can apply some of the basic algorithms in a wide array of practical problems.
[Thrun] The essence of HMMs is really simply characterized by the following Bayes network. There's a sequence of states that evolve over time, and each state depends only on the previous state in this Bayes network. Each state also emits what's called a measurement. It is this Bayes network that is the core of hidden Markov models and various probabilistic filters such as Kalman filters, particle filters, and many others. These are words that might sound cryptic and might not mean anything to you yet, but you will come across them as you study different disciplines of computer science and control theory. The real key here is the graphical model. If you look at the evolution of states, what you'll find is that these states evolve as what's called a Markov chain. In a Markov chain, each state only depends on its predecessor. So for example, state S3 is conditioned on S2 but not on S1. It's only mediated through S2 that S3 might be influenced by S1. That's called a Markov chain, and we're going to study Markov chains quite a bit in this class to understand them well. But what makes it a hidden Markov model, or hidden Markov chain if you wish, is the fact that there are measurement variables. So rather than being able to observe the state itself, what you get to see are measurements. Let me put this into perspective by showing you several of the robots I've built that possess hidden state, where I only get to observe certain measurements and must infer something about the hidden state.
[Thrun] What's shown here is the tour guide robot that I showed you earlier, but now I'll talk about what's called the localization problem--the problem of finding out where in the world this robot is. This problem is important because to find its way around the museum and to arrive at exhibits of interest, it must know where it is. The catch is that it doesn't have a sensor that tells it where it is. Instead, it's given what's called range finders. These are sensors that measure distances to surrounding objects. It's also given the map of the environment, and it can compare these range finder measurements with the map of the environment and infer from that where it might be. The process of inferring the hidden state of the robot's location from the measurements, the range sensor measurements, that's the problem of filtering. And the underlying model is exactly the same one I showed you before. It's a hidden Markov model where the state is the sequence of locations that the robot assumes in the museum and the measurements are the sequence of range measurements it perceives while it navigates the museum. A second example is the underground robotic mapping robot, which has pretty much the same problem--finding out where it is-- but now it is not given a map; it builds the map from scratch. What this animation here shows you is a so-called particle filter applied to robotic mapping. Intuitively--what you see is very simple-- as the robot descends into a mine, it builds a map. But the many black lines are hypotheses on where the robot might have been when building this map. It can't tell because of the noise in its motors and in its sensors. As the robot reconnects and closes the loop in this map, one of these black hypotheses--what we call particles in the trade-- is selected as the best one, and by virtue of having maintained many of those, the robot is able to build a coherent map.
In fact, this animation was a key animation in my job talk when I applied to become a professor at Stanford University. Here is one final example I'd like to discuss with you, which is called speech recognition. If you have a microphone that records speech and you want to make your computer recognize the speech, you will likely come across hidden Markov models. This is a typical speech signal over here. It's an oscillation for the words "speech lab" which I borrowed from Simon Arnfield. And if you blow up a small region over here, you'll find that there is an oscillation, and this oscillation in time is the speech signal. What speech recognition systems do is transform this signal over here back into letters like "speech lab." And you can see it's not an easy task. There is some signal here. The E, for example, has a certain shape. But different speakers speak differently, and there might be background noise, so decoding this back into speech is challenging. There's been enormous progress in the field, mostly due to hidden Markov models that have been researched for more than 20 years. And today's best speech recognizers all use variants of hidden Markov models. So once again, I can't teach you everything in this class, but I'll teach you the very basics that you can apply to things such as speech signals.
[Thrun] So let's begin by taking the hidden out of the Markov model and study Markov chains. We're going to use an example for which I will quiz you. Suppose there are 2 types of weather--rainy, which we call R, and sunny, which we call S-- and suppose we have the following state transition diagram. If it's rainy, it stays rainy with a 0.6 chance, while with 0.4 it becomes sunny. Sunny remains sunny with a 0.8 chance but moves to rainy with a 0.2 chance. This is obviously a temporal sequence, so the weather at time 1 will be called R1 or S1, at time 2, R2 or S2. Suppose in the beginning we happen to know it is rainy, which means at time 0 the probability of rain, R0, equals 1 and the probability of sun, S0, equals 0. I'd like to know from you what's the probability of rain on day 1, the same for day 2, and the same for day 3.
[Thrun] And the answer will be 0.6, 0.44, and 0.376. It's really an exercise in applying probability theory. In the very beginning we know we're in state R, and the probability of remaining there is 0.6, which is directly the value on the arc over here. At the second step we know that the probability of R is 0.6 and therefore the probability of sun is 0.4, and we compute the probability of rain on day 2 using total probability: the probability of rain on day 2 given rain on day 1 times the probability of rain on day 1, plus the probability of rain on day 2 given it was sunny on day 1 times the probability of sun on day 1. And if you plug in all these values, we get 0.6 times 0.6 plus rain following sun, which is this arc over here, 0.2, times 0.4 as the prior, and this results in 0.44. We can now do the same with the probability of rain on day 3, which is the same 0.6 over here, but now our prior is different--it's 0.44-- plus the same 0.2 over here with the prior of 0.56, which is 1 minus 0.44. And when you work this all out, it is 0.376 as indicated over here. So what we really learned here is that this is a temporal Bayes network to which we can apply conventional probability operations such as total probability, which was also known as variable elimination in the Bayes network lecture. All these fancy words aside, it's really easy to evaluate those. So if you want to do this and you ask yourself, given the probabilities at a certain time step like 1, how do they relate to time step 2, you ask yourself which states you might encounter in time step 1--there are 2 in this case--and what the transition probabilities are that lead to the desired state in time step 2, like the 0.6 if you started in R and 0.2 if you started in S, and you add all these cases up and you just get the right number. It's really an easy piece of mathematics if you think about it.
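The forward propagation just described is only a few lines of code. Here is a minimal Python sketch (the function and variable names are mine; the transition probabilities are the ones from the quiz):

```python
# Forward propagation for the rain/sun Markov chain from the quiz:
# P(rain stays rain) = 0.6, P(sun turns to rain) = 0.2.
def step(p_rain):
    # Total probability: rain today either persisted or followed sun.
    return 0.6 * p_rain + 0.2 * (1.0 - p_rain)

p = 1.0  # P(rain on day 0) = 1
probs = []
for _ in range(3):
    p = step(p)
    probs.append(p)
print([round(x, 3) for x in probs])  # [0.6, 0.44, 0.376]
```

Each call to `step` is one application of total probability, which is why the three quiz answers fall out of a simple loop.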
[Thrun] Let's practice this again with another 2-state Markov chain. States are A and B. A has a 50% chance of transitioning to B, and B always transitions into A. There is no loop from B to itself. Let's assume again at a time, 0, we know with certainty to be in state A. I would like to know the probability of A at time 1, at time 2, and at time 3.
[Thrun] And again the solution follows directly from the state diagram over here. In the beginning we know we're in state A, and the chance of remaining in A is 0.5. This is the 0.5 over here. We can just read this off. For the next state we find ourselves with a 0.5 chance to be in A and a 0.5 chance to be in B. If we're in B, we transition with certainty to A, and if we're in A, we stay in A with a 0.5 chance. So you put this together: 0.5 probability to be in A times 0.5 to stay in A, plus 0.5 probability to be in B times 1 probability to transition to A. That gives us 0.75. Following the same logic, now we're in A with 0.75 times a 0.5 probability of staying in A, plus 0.25 in B, which is 1 minus 0.75, times the transition probability back to A of 1, and we get 0.625. So now you should be able to take a Markov chain and compute by hand, or write a piece of software to compute, the probabilities of future states. You will be able to predict something. That's really exciting.
[Thrun] So one of the questions you might ask for a Markov chain like this is what happens if time becomes really large? What happens to the probability of A1000? Or let's go extreme. What about in the limit, A infinity, often written as the limit as time goes to infinity of P of At. That's the fancy math notation, but what it really means is we just wait a long, long time. What is going to happen to the Markov chain over here? What is that probability? This probability is called the stationary distribution, and a Markov chain settles to a stationary distribution--or sometimes a limit cycle, if the transitions are periodic, which we don't care about here. And the key to calculating this is to realize that the probability at any time t must be the same as the probability at time t minus 1. This can be resolved as follows. We know that P of At is P of At given At minus 1 times P of At minus 1 plus P of At given Bt minus 1 times the probability of Bt minus 1. This is just the theorem of total probability, or forward propagation rule, applied to this case over here, so nothing really new. But if you call this guy over here X, then we now have X equals the probability of At given At minus 1, which is 0.5, times--and this is the same X as this one over here, because you're looking for the stationary distribution, so it's X again. This probability over here, A following B, is 1 in this special case, and the probability of Bt minus 1 is 1 minus the probability of At minus 1, which, if you plug this in, is the same as 1 minus X. And we can now solve this for X. Let me just do this. X equals 0.5X plus 1 minus X, or, if you put these 2 Xs together, 1.5X equals 1. That means X equals 1 over 1.5, which is 2/3. So the answer here is the stationary distribution will have A occurring with 2/3 chance and B with 1/3 chance. It's still a Markov chain--it flips from A to B-- but these are the frequencies at which A occurs and at which B occurs.
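The fixed-point argument above can also be checked numerically: just iterate the total-probability update until it stops changing. A small Python sketch (the variable names are mine):

```python
# Fixed-point iteration for the stationary distribution of the A/B chain:
# x = P(A) must satisfy x = 0.5*x + 1.0*(1 - x).
x = 1.0  # any starting belief works; the limit is independent of it
for _ in range(100):
    x = 0.5 * x + 1.0 * (1.0 - x)
print(x)  # converges to 2/3
```

Because the update is a contraction, the error halves with every iteration, which is the "mixing" behavior discussed below.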
[Thrun] To see if you understood this, let me look at the rain-sun Markov chain again, and let me ask you for the stationary distribution or the limit distribution for rain to be the case after infinitely many steps.
[Thrun] And the answer is 1/3, as you can easily see if you call X the probability of rain at time T and also the probability of rain at time T minus 1. These 2 must be equivalent because we're looking for the stationary distribution. Then we get, by virtue of our expansion of the state at time T: the probability of transitioning from rain to rain is 0.6, the probability of having rain is X again, the probability of transitioning from sun to rain is 0.2, and the probability of having sun before is 1 minus X, so we get X equals 0.6X plus 0.2 times 1 minus X, which simplifies to X equals 0.4X plus 0.2. Or, differently, we have 0.6X equals 0.2, and when we work this out, X is 1/3, which is the probability of rain in the asymptote if you wait forever. One of the interesting things to observe here is that the stationary distribution does not depend on the initial distribution. In fact, I didn't even tell you what the initial state was. Markov chains that have that property, which are pretty much all Markov chains, are called ergodic. You can safely forget that word again, but people in the field use this word to express Markov chains that mix. And mix means that the knowledge of the initial distribution fades over time until it disappears in the end. The speed at which it gets lost is called the mixing speed.
[Thrun] You can also learn the transition probabilities of a Markov chain like this from actual data. Suppose you look out of the window and see sequences of rainy days followed by sunny days followed by rainy days, and you wonder what numbers to put here, here, here, and here. Let me assume you see the sequence rain, sun, sun, sun, rain, sun, and rain. These are, in total, 7 different days, and we wish to estimate all those probabilities over here, including the initial distribution for the first day, using maximum likelihood. You might remember all this work with Laplace smoothing, but for now we keep it simple, just maximum likelihood. We find for day 0 we had rain, and maximum likelihood would just say the probability of rain for day 0 is 1. That's the most likely estimate. Then for the transition probability we find we transition out of rain twice here. We could either move to sun or stay in rain, and in both of the observed transitions we go from rain to sun. There is no instance of rain to rain. So maximum likelihood gives us a 1 over here and a 0 over here. And finally, we can also ask the question what happens from a sunny state. We transition to a new sunny state or a rainy state, and those distributions are easily calculated. We have 4 transitions out of a sunny state to something else-- this one, this one, this one, and this one. Twice it goes to sunny, over here and over here, and twice it goes to rainy, over here and over here, so therefore the probability for either transition is 0.5. So we have 0.5 over here, 0.5 over here, 1 over here, and 0 over here for the transition probabilities.
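The counting argument above is easy to code up. Here is a minimal Python sketch of the maximum likelihood estimator (the helper name `ml_transitions` is mine, not from the lecture):

```python
from collections import Counter

def ml_transitions(seq):
    # Count (from, to) pairs and divide by outbound transitions per state.
    pair_counts = Counter(zip(seq, seq[1:]))
    outbound = Counter(seq[:-1])  # the last day has no outbound transition
    states = set(seq)
    return {(a, b): pair_counts[(a, b)] / outbound[a]
            for a in outbound for b in states}

seq = list("RSSSRSR")  # rain, sun, sun, sun, rain, sun, rain
t = ml_transitions(seq)
print(t[('R', 'S')], t[('S', 'S')], t[('S', 'R')])  # 1.0 0.5 0.5
```

Note that, as in the lecture, the final day contributes no outbound transition, so it never appears in the denominator.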
[Thrun] So in this quiz please do the same for me. Here is our sequence. There's a couple of sunny days--5 in total--a rainy day, 3 sunny days, 2 rainy days. Calculate using maximum likelihood the prior probability of rain and then the 4 transition probabilities as before. Please fill in those numbers over here.
[Thrun] The initial probability for rain is 0 because we are just encountering 1 initial day and it's sunny. The maximum likelihood estimate is therefore 0. We transition 8 times out of a sunny state--1, 2, 3, 4, 5, 6, 7, 8-- twice into a rainy state, and therefore 6 times we remain in a sunny state, so the probability of sun to sun is ¾, whereas sun to rain is ¼. From a rainy state we have 2 outbound transitions; the last R over here has no outbound transition, so it doesn't really count in our statistics. One goes to sun and one stays in rain, so the maximum likelihood estimate is 0.5, or ½, for each of those.
[Thrun] One of the oddities of the maximum likelihood estimator is overfitting. So for example, we only ever observe a single first day, and this becomes our prior probability. So in this case the prior probability for rain on day 0 would be 1, which kind of doesn't make sense, really. It should be more like the stationary distribution or something like that. Well, you might remember the work on Laplacian smoothing. This is a great moment where I can test whether you really think like an artificial intelligence person. I'm going to make you apply Laplacian smoothing in this new context of estimating the parameters of this Markov chain, using a smoother with K = 1. You might remember you add something to the numerator, like 1, and something to the denominator to make sure things normalize, and then you get different probabilities than you would get with the maximum likelihood estimator. So I'm going to ask you a quiz here, even though I haven't completely shown you the application of Laplacian smoothing in this context. But if you understood Laplacian smoothing, you might want to give it a try. What's the probability of rain on day 0, and what are the conditional probabilities? Sun goes to sun, sun goes to rain, rain goes to sun, and rain stays in rain. The way probabilities work, as you surely know, these 2 things over here have to add up to 1, and these 2 things over here have to add up to 1.
[Thrun] So in Laplacian smoothing we look at the relative counts. We know there is 1 instance of rain at time 0, and 1 out of 1 would give us 1 under the maximum likelihood estimator. But because we're smoothing, we add a pseudocount--1 rainy day and 1 sunny day--so we add 1 to the numerator, and we have to compensate for the 2 additional counts with a 2 in the denominator, and we get 2/3. So our probability under the Laplacian smoother is 2/3 for the first day to be rainy, which is really different from 1. Applying the same logic over here, we transition 3 times out of a sunny state, every time into a sunny state, so maximum likelihood would say 3 out of 3. We add a pseudo observation of 1, and because there are 2 possible outcomes, we add 2 to the denominator; hence 4/5. And the missing 1/5 shows up over here: zero of the 3 transitions from a sunny day resulted in a rainy day--in fact, they were all sunny--but we add 1 pseudo observation to the numerator and 2 to the denominator, giving 1/5. These 2 things surely add up to 1. The last one is analogous. We have 1 transition out of a rainy state and it led to a sunny state, so 1 out of 1, but we add 1 to the numerator and 2 to the denominator, so you get 2/3. And if you do the math over here, you get 1/3. I really want you to remember Laplacian smoothing. It's applicable to many estimation problems, and it will be important going forward in this class. Here we applied it to the estimation of a Markov chain. Please take a moment and study the logic so you'll be able to apply those things again.
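The same counting code extends to Laplacian smoothing by adding the pseudocounts. A short Python sketch, run on a hypothetical sequence of one rainy day followed by four sunny days that reproduces the counts worked out above (the helper name is mine):

```python
from collections import Counter

def laplace_transitions(seq, states=('R', 'S'), k=1):
    # Add k pseudocounts to each numerator and k * |states| to each denominator.
    pair_counts = Counter(zip(seq, seq[1:]))
    outbound = Counter(seq[:-1])
    return {(a, b): (pair_counts[(a, b)] + k) / (outbound[a] + k * len(states))
            for a in states for b in states}

seq = list("RSSSS")  # hypothetical: one rainy day, then four sunny days
t = laplace_transitions(seq)
print(t[('S', 'S')], t[('R', 'S')])  # 0.8 and 2/3
```

With k = 0 this reduces to the maximum likelihood estimator; with k = 1 it gives the 4/5, 1/5, 2/3, and 1/3 values above.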
[Thrun] So now let's return to hidden Markov models. Those are really the subject of this class. Let's again use the rainy and sunny example just to keep it simple. These are the transition probabilities as before. Let's assume for now that the initial probability of rain is 0.5; hence, the probability of sun at time 0 is 0.5. The key modification to get to a hidden Markov model is that the state is actually hidden. I cannot see whether it's raining or it's sunny. Instead I get to observe something else. Suppose I can be happy or grumpy, and happiness or grumpiness is caused by the weather. So rain might make me happy or grumpy, and sunshine makes me happy or grumpy, but with vastly different probabilities. If it's sunny, I'm mostly happy, 0.9. There's a 0.1 chance I might still be grumpy for some other reason. If it's rainy, I'm only happy with 0.4 probability, and with 0.6 I'm grumpy. In fact, living in California, I can attest that these are actually not wrong probabilities. I love the sun over here. Suppose I observe that I'm happy on day 1. A question that we can ask now is what is the so-called posterior probability for it raining on day 1, and what's the posterior probability for it being sunny on day 1? What's the probability of rain on day 1 given that I observed that I was happy on day 1? This is answered using Bayes rule: it's the probability of being happy given that it rains, times the probability that it rains, over the probability of being happy. We know the probability of rain on day 1 based on our Markov state transition model. In fact, let's just calculate it. The probability of rain on day 1 is the probability it was rainy on day 0 times the probability of a self transition from rain to rain from day 0 to day 1, plus the probability it was sunny on day 0 times the probability that sun led to rain over here. If you plug in all these numbers, you obtain 0.4; you can easily verify this. So we know this guy over here is 0.4.
This guy over here is 0.4 again, but now it's this 0.4 over here: the probability of being happy on a rainy day is 0.4. This guy over here resolves to 0.4 times 0.4 plus the same situation with sun at time 1, where the prior is 0.6 and the happiness probability is 0.9. And that gives us 0.229 for the entire expression. Let's interpret the 0.229 in the context of the question we asked. We know that at time 0 it was raining with half a chance. If you look at the state transition diagram, it's more likely to be sunny afterwards, because it's more likely to flip from rain to sun than from sun to rain. In fact, we worked out that the probability of rain one time step later was only 0.4, so it was 0.6 sunny. But now that I saw myself being happy, my probability of rain was further lowered from 0.4 to 0.229. And the reason the probability went down is that happiness is much more likely to occur on a sunny day than on a rainy day. And when you work this in using Bayes rule and total probability, you find that just the observation of happiness at time 1 makes your belief of it being rainy go down from 0.4 to 0.229. This is a wonderful example of applying Bayes rule in this relatively complicated hidden Markov model.
[Thrun] So let me use exactly the same hidden Markov model where we have rain and sun and happiness and grumpiness with 0.4 and 0.6 and 0.9 and 0.1 probabilities. The only change I will apply is I will tell you that for probability 1 it's raining on day 0; hence, the probability of sunny at day 0 is 0. I now observe another happy face on day 1, and I'd like to know the probability of it raining on day 1 given this observation. This is the same as before with the only difference that we have a different initial probability, but all the other probabilities should just be the same.
[Thrun] Once again let's calculate the probability of rain on day 1. This one is easy because we know it is raining on day 0, so it's 0.6, the 0.6 over here. This expression over here is expanded by Bayes rule as applied over here. The probability of happiness during rain is 0.4, the probability of rain was said to be just 0.6, and we divide by 0.4 times 0.6 plus 0.9 times 0.4, which is 1 minus 0.6. And that resolves simply to 0.4 if you work it all out. So the interesting thing here is if you were just to run the Markov chain, on day 1 we have a 0.6 chance of rain, but the fact that I observed myself to be happy reduces the chance of rain to 0.4.
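Both quizzes are the same Bayes rule computation, which can be captured in one small function. A Python sketch (the function name and keyword defaults are mine; 0.4 and 0.9 are the happiness probabilities from the example):

```python
def posterior_rain(prior_rain, p_happy_rain=0.4, p_happy_sun=0.9):
    # Bayes rule: P(rain | happy) = P(happy | rain) * P(rain) / P(happy).
    num = p_happy_rain * prior_rain
    return num / (num + p_happy_sun * (1.0 - prior_rain))

# Quiz 2: rain on day 0 for sure, so the prior P(R1) = 0.6 * 1 + 0.2 * 0 = 0.6.
print(round(posterior_rain(0.6), 3))  # 0.4
# Quiz 1: P(R0) = 0.5 gives the prior P(R1) = 0.4.
print(round(posterior_rain(0.4), 3))  # 0.229
```

The denominator is the normalizer P(happy), computed by total probability over the two weather states.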
[Thrun] So if you got those questions right, I'm in awe of you--wow-- because you understand the very basics of using a hidden Markov model for 2 things now. One is prediction, and one is called state estimation. State estimation is a really fancy term for just computing the probability of the internal or hidden state given measurements. In prediction we predict the next state, and you might also predict the next measurement.
[Thrun] I want to show you a little animation of hidden Markov models used for robot localization. This is obviously a little toy robot over here that lives in a grid world, and the grid world is composed of discrete cells where the robot may be located. This robot happens to know where north is at all times. It's given 4 sensors--a wall sensor to the left, to the right, to the top, and to the bottom over here--and it can sense whether in the adjacent cell there's a wall or not. Initially this robot has no clue where it is. It faces what we call a global localization problem. It now uses its sensors and its actuators to localize itself. So in the very first episode the robot senses a wall north and south of it but none west or east. And look what this does to the probabilities. The posterior probability is now increased in places that are consistent with this measurement--all of those places have a wall to the north and south, like these guys over here, and free space to the west and east--yet it has been decreased in places that are inconsistent, like this guy over here. These states over here are interesting. They are shaded gray and lighter gray. What this means is they still have a significant probability, yet not as much as over here, the reason being that this measurement over here would be characteristic for the state over here if there had been exactly 1 measurement error-- if the bottom sensor had erred and erroneously detected a wall. Errors are less likely than no errors, and as a result, the cell over here, which is completely consistent, ends up being more likely than the cell over here. Yet you can see the HMM does a nice job of capturing the posterior probability. Let's assume the robot moves right and senses again and gets the exact same measurement. Of course it has no clue that it is exactly over here. You can see how the probabilities have been revised.
Interestingly enough, this guy over here has a lower probability, and the reason is that by itself it is very consistent with the most recent measurement, but it's less consistent with the idea of having moved right after previously measuring a wall to the north and the south. And similarly, these places over here become less consistent. The only ones that are completely consistent are these 3 states over here and the 3 states over here. The robot keeps moving to the right, and now we get to the point where the sequence of measurements really makes 2 states equally likely--the ones over here. They are equally likely by symmetry. Those are still pretty likely, and those over here to the left are gradually less likely. As the robot now moves, it moves into a distinguishing state. It sees a wall in the north but free space in the 3 other directions, and that renders the symmetric state over here relatively unlikely, and now it has localized itself.
[Thrun] We discussed specific instances of hidden Markov model inference, or filtering, in our quizzes. Let me now give you the basic math. We know a hidden Markov model is a chain like this of hidden states that are Markovian and measurements that each depend only on the corresponding state. We know that this Bayes network entails certain independencies. For example, given X2, the past, the future, and the present measurement are all conditionally independent. The nice thing about this structure is it makes it possible to do inference efficiently. I'll give you the equations we used before here in a more explicit form. Let's look at the measurement side, and suppose we wish to know the probability of an internal state variable given a specific measurement, and that by Bayes rule becomes P of Z1 given X1 times P of X1 over P of Z1. When you start doing this, you'll find that the normalizer doesn't depend on the target variable X; therefore, we often write a proportionality sign and get an equation like this. This product over here is the basic measurement update of hidden Markov models. And the thing to remember when you apply it: you have to normalize. We already practiced all of this, so you know all of this. The other equation is the prediction equation, so let's go from X1 to X2. This is called prediction even though sometimes it has nothing to do with prediction. It's the traditional term, and it comes from the fact that we might want to predict the distribution of X2 given that we know the distribution of X1. Here we apply total probability. The probability of X2 is obtained by checking all states we might have come from in X1 and calculating the probability of going from X1 to X2. We also practiced this before. Any probability of X2 being in a certain state must have come from another state, X1, and then transitioned into X2, so we sum over all of those and we get the posterior probability of X2.
These 2 equations together form the math of a hidden Markov model where the next state distribution and the measurement distribution and the initial state distribution are all given as the parameters of a hidden Markov model.
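These 2 equations translate directly into code. Here is a minimal Python sketch of the measurement update and the prediction step for the rain/sun example (the function names are mine):

```python
def predict(belief, trans):
    # Prediction: total probability over all states we might have come from.
    return {x2: sum(trans[x1][x2] * belief[x1] for x1 in belief)
            for x2 in trans}

def update(belief, likelihood):
    # Measurement update: multiply by P(z | x), then normalize.
    unnorm = {x: likelihood[x] * belief[x] for x in belief}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}

# Rain/sun example: transition model, P(happy | state), and the day-0 prior.
trans = {'R': {'R': 0.6, 'S': 0.4}, 'S': {'R': 0.2, 'S': 0.8}}
happy = {'R': 0.4, 'S': 0.9}
belief = {'R': 0.5, 'S': 0.5}
belief = predict(belief, trans)   # P(R1) = 0.4
belief = update(belief, happy)    # P(R1 | happy) is about 0.229
print(round(belief['R'], 3))      # 0.229
```

Alternating `predict` and `update` as measurements arrive is the whole filtering loop.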
[Thrun] Here is the application of HMMs to a real robot localization example. This robot is in a world that's 1-dimensional, and it is lost. It has initial uncertainty about where it is, and it is actually located next to a door, but it doesn't know that. It's also given a map of the world, and the distribution over all possible states, here noted as s, is given by this histogram. We bin the world into small bins, and for each bin we assign a single numerical probability of the robot being there. The fact that they all have the same height means that the robot is maximally uncertain as to where it is. Let's assume this robot is going to sense, and it senses that it's next to a door. The red graph over here is the probability of seeing a door for different locations in the environment. There are 3 different doors, and seeing a door is more likely here than it is in between. It might still see a door here, but it's just less likely. We now apply Bayes rule. We multiply the prior with this measurement probability to obtain the posterior. That was our measurement update. It's that simple. So you can see how all these uniform values over here become nonuniform values over here, multiplied by this curve over here. The story progresses by the robot taking an action to the right, and this is now the next state prediction part--what we call the convolution part or state transition part--where these little bumps over here get shifted along with the robot, and they are flattened out a little bit just because robot motion adds uncertainty. Again, it's a really simple operation. You shift those to the right and you smooth them out a little bit to account for the control noise in the robot's actuators. And now we get to the point where the robot senses again, and it senses a door again. And see what happens. It multiplies.
It's now a nonuniform prior over here with the same measurement probability as before, but now we get a distribution that's peaked over here and has smaller bumps at various other places. The reason is the only place where my prior has high probability and my measurement probability is also high is the second door, and as a result our distribution over here assumes a much larger value. If you look at that picture, that is really easy to implement, and that's what we did all along when we talked about rain and sun and so on. It's really a very simple algorithm. Measurements are multiplications, and motions become essentially convolutions, which are shifts with added noise.
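The measurement-multiplication and motion-convolution picture can be sketched in a few lines. The following Python toy example uses a hypothetical 5-cell cyclic corridor with doors at cells 0 and 1 and made-up sensor probabilities; it is a sketch of the idea, not the lecture's actual code:

```python
def sense(p, likelihood):
    # Measurement update: multiply each cell by P(z | cell), then normalize.
    q = [pi * li for pi, li in zip(p, likelihood)]
    s = sum(q)
    return [qi / s for qi in q]

def move_right(p, noise=0.1):
    # Prediction: shift the belief one cell right (cyclic world), leaking
    # probability to the neighbors to model noisy actuators.
    n = len(p)
    q = [0.0] * n
    for i in range(n):
        q[(i + 1) % n] += (1 - 2 * noise) * p[i]  # intended motion
        q[i] += noise * p[i]                      # undershoot
        q[(i + 2) % n] += noise * p[i]            # overshoot
    return q

# Hypothetical 5-cell cyclic corridor with doors at cells 0 and 1.
p_door = [0.9, 0.9, 0.2, 0.2, 0.2]  # assumed P(see door | cell)
p = [0.2] * 5                       # uniform prior: globally uncertain
p = sense(p, p_door)                # robot sees a door
p = move_right(p)                   # robot moves one cell right
p = sense(p, p_door)                # sees a door again
print(p.index(max(p)))              # cell 1: the second door
```

Seeing a door, moving right, and seeing a door again singles out the cell just right of the first door, exactly the peaked posterior described above.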
[Thrun] This is a great segue to one of the most successful algorithms in artificial intelligence and robotics, called particle filters. Again, the topic here is robot localization, and here we're dealing with a real robot with actual sensor data. The robot is lost in this building. You can see different rooms, and you can see corridors, and the robot is equipped with range sensors. These are sonar sensors--sound sensors that measure the range to nearby obstacles. Its task is to figure out where it is. The robot will move along the black line over here, but it doesn't know this. It has no clue where it is. It has to figure out where it is. The key thing in particle filters is the representation of the belief. Whereas before we had discrete worlds like our sun and rain example, or we had a histogram approach where we cut the space into small bins, particle filters have a very different representation. They represent the space by a collection of points, or particles. Each of these small dots over here is a hypothesis of where the robot might be. It's a concrete value of its X location and its Y location and its heading direction in this environment. So it's a vector of 3 values. The set of all those vectors together forms the belief. So particle filters approximate a posterior by many, many, many guesses, and the density of those guesses represents the posterior probability of being at a certain location. To illustrate this, let me run the video. You can see in a very short amount of time the range sensors, even though they're very noisy, force the particles to collect in the corridor. There are 2 symmetric particle clouds--this one over here and this one over here--that come from the fact that the corridor itself is symmetric. But as the robot moves into the office, the symmetry is broken. This office looks very different from this office over here, and those particles die out. What's happening here?
Intuitively speaking, each particle is a representation of a possible state, and the more consistent the particle is with the measurement--the more the sonar measurement fits into the place where the particle says the robot is--the more likely it is to survive. This is the essence of particle filters. Particle filters use many particles to represent a belief, and they will let those particles survive in proportion to the measurement probability. And the measurement probability here is nothing else but the consistency of the sonar range measurements with the map of the environment given the particle's pose. Let me play this again. Here's the maze. The robot is lost in space. Again, you can see how within very few steps the particles consistent with the range measurements all accumulate in the corridor. As the robot hits the end of the corridor, only 2 particle clouds survive due to the symmetry of the corridor, and then the incorrect particles finally die out. This algorithm is beautiful, and you can implement it in less than 10 lines of program code. So given all the difficulty of talking about probabilities and Bayes networks and hidden Markov models, you will now find a way to implement one of the most amazing algorithms for filtering and state estimation in less than 10 lines of C code. Isn't that amazing?
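The "consistency with the measurement" idea can be made concrete with a tiny sketch. Here the Gaussian sensor-noise model and all the numbers are illustrative assumptions: the weight is just how probable the sonar reading is, given the range the particle's pose would predict.

```python
import math

def importance_weight(measured_range, predicted_range, sigma=0.5):
    """P(z | particle) under an assumed Gaussian sensor-noise model:
    the closer the particle's predicted range is to the sonar reading,
    the larger the weight."""
    diff = measured_range - predicted_range
    return math.exp(-diff * diff / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

# A particle that predicts roughly what the sonar actually measured gets
# a large weight and tends to survive; an inconsistent one dies out.
w_consistent   = importance_weight(2.0, 2.1)  # hypothesis that fits the reading
w_inconsistent = importance_weight(2.0, 5.0)  # hypothesis that contradicts it
```

In a real localizer the predicted range would come from ray-casting the particle's pose into the map; that part is omitted here.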
[Thrun] Here is our 1-dimensional localization example again, this time with particle filters. You can see the particles initially spread out uniformly. We're going to use this 1-dimensional space of robot locations as an example to explain every single step of particle filters. In the very first step, the robot senses a door. Here are its initial particles before sensing the door. It now copies these particles over verbatim but gives them what's called a weight. We call this weight the importance weight, and the importance weight is nothing else but the measurement probability. It's more likely to see a door over here than over here. The red curve over here is the measurement probability, and the particles over here are the same as up here, but they now have attached an importance weight, where the height of the particle illustrates the weight. So you can see the place over here, the place over here, and the place over here carry the most weight because they're the most likely ones. This robot moves, and it moves by using its previous particles to create a new random particle set that represents the posterior probability of being at a new location. The key thing here is called resampling. The algorithm works as follows. Pick a particle from the set over here, and pick it in proportion to the importance weight. Once you've picked one--and sure enough, you pick those more frequently than those over here--add the motion to it plus a little bit of noise to create a new particle. Repeat this procedure for each particle. Pick them with replacement. You're allowed to pick a particle twice or 3 or 4 times. Sure enough, you pick these more frequently. These are being forward moved to over here, these to over here. You see a higher density of particles over here and over here than you see, for example, over here. That's your forward prediction step in particle filters. It's really easy to implement.
The next step is another measurement step, and here I'm illustrating to you that indeed this nonuniform set of particles leads to a reasonable posterior in this space. We now have a particle set that is nonuniform. We have increased density over here, over here, and over here. You can see how multiplying these particles with the importance weight, which is copying them over verbatim but attaching a vertical importance weight in proportion to the measurement probability, yields a lot of particles over here with big weights, some over here with big weights, lots of particles over here with low weights. They got copied over, but the measurement probability here is low, and so on and so on. And if you look at this set of particles, you already understand why the majority of the importance weight resides in the correct location, given that we had a measurement of a door and motion to the right and another measurement of the door. The nice thing here is that particle filters work in continuous spaces, and, what's often underappreciated, they use your computational resources in proportion to how likely something is. You can see that almost all the computation now resides over here, almost all the memory resides over here, and that's the place that's likely. Stuff over here requires less memory, less computation, and guess what? It's much less likely. So particle filters make use of your computational resources in an intelligent way. They're really nice to implement on something with low compute power. Let me move on to explain the next motion. Here you see our robot moving to the right again, and now the same resampling takes place. We pick, with replacement, particles from over here. Sure enough, these are the ones we pick the most often. And then we add the motion command plus some random noise. If you look at this particle set over here, almost all the particles sit over here.
It doesn't really show it very well on this computer screen, but the density of particles over here is significantly higher than anywhere else. There are a few occurrences over here and over here that correspond to these guys over here and over here, and some over here that correspond to this guy over here, but the vast majority of probability mass sits over here. So let's dive into how complicated this algorithm really is.
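The resampling-plus-motion step just walked through can be sketched as follows, for the 1-D example. Particles here are scalar positions, and the motion command and noise level are illustrative assumptions.

```python
import random

def resample_and_move(particles, weights, motion, noise=0.1):
    """Pick particles with replacement, in proportion to their importance
    weights, then forward-predict each pick by adding the motion command
    plus a little Gaussian noise."""
    picks = random.choices(particles, weights=weights, k=len(particles))
    return [p + motion + random.gauss(0.0, noise) for p in picks]

particles = [0.0, 1.0, 2.0, 3.0]
weights   = [0.7, 0.1, 0.1, 0.1]   # most of the weight on the first hypothesis
new_particles = resample_and_move(particles, weights, motion=1.0)
```

Because picking is with replacement, the heavily weighted hypothesis gets duplicated several times, and after the shift the new particle density is highest one motion step to its right, which is exactly the forward prediction step described above.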
[Thrun] So here is our algorithm, particle filter. It takes as input a set of particles with associated importance weights, a control, and a measurement vector, and it constructs a new particle set, S prime, and in doing so it also has an auxiliary variable, eta. Here is the algorithm. Initially we go through all new particles, of which there are n, and we sample an index j according to the distribution defined by the importance weights associated with the particle set over here. Put differently, we have a set of particles over here, and we have associated importance weights, which we will construct a little bit later on, and now we pick one of these particles with replacement, where the probability of picking this particle is exactly the importance weight, w. For this particle we now sample a possible successor state according to the state transition probability, using our controls and that specific particle as an input. We call it sj over here. We also compute an importance weight, which is the measurement probability for that specific particle over here. This gives us a new particle, and this gives us a new non-normalized importance weight. For now we just add them into our new particle set S prime, and we reiterate. The only thing missing now is at the very end we have to normalize all the weights. For this we keep a running normalizer, eta, and we have a For loop in which we take all the weights in the set over here and just normalize them accordingly. This is the entire algorithm. We feed in over here particles with associated importance weights and a control and a measurement, and then we construct the new set of particles by picking particles from our previous set at random, with replacement, but in accordance with the importance weights, so important particles are picked more frequently. We guess for each picked particle what its successor state will be.
We guess what a new state might be by just sampling it, and we attach to it an importance weight, which we later normalize, that is proportional to the measurement probability for that particle over here. So you're going to upweight the particles that look consistent with the measurements and downweight the ones that are inconsistent. We add all of these back to our particle set and reiterate. I promised you it would be an easy algorithm. You can look at this, and you could actually implement it really easily. Just remember how much difficulty we introduced with talking about Bayes networks and hidden Markov models and all that stuff. This is all there is to implementing particle filters.
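The pseudocode above maps almost line for line onto Python. The 1-D door world and the specific motion and measurement models below are illustrative assumptions standing in for the state transition probability P(s' | u, s) and the measurement probability P(z | s); the loop structure, the sampled index j, and the running normalizer eta follow the algorithm as stated.

```python
import random

def sample_transition(s, u, noise=0.1):
    """Sample a successor state from an assumed P(s' | u, s):
    move by the control u, plus Gaussian noise."""
    return s + u + random.gauss(0.0, noise)

def measurement_prob(z_saw_door, s, doors=(2.0, 5.0, 9.0)):
    """Assumed P(z | s): seeing a door is likely near a door, unlikely elsewhere."""
    near_door = any(abs(s - d) < 0.5 for d in doors)
    if z_saw_door:
        return 0.6 if near_door else 0.2
    return 0.4 if near_door else 0.8

def particle_filter(particles, weights, u, z):
    """One step of the particle filter: resample, predict, weight, normalize."""
    new_particles, new_weights = [], []
    eta = 0.0                                   # running normalizer
    for _ in range(len(particles)):
        # sample an index j with replacement, in proportion to the weights
        j = random.choices(range(len(particles)), weights=weights)[0]
        s_new = sample_transition(particles[j], u)   # sample successor state
        w_new = measurement_prob(z, s_new)           # non-normalized weight
        eta += w_new
        new_particles.append(s_new)
        new_weights.append(w_new)
    return new_particles, [w / eta for w in new_weights]  # normalize by eta

# Start maximally uncertain, then move right by 1 and sense a door.
particles = [random.uniform(0.0, 10.0) for _ in range(500)]
weights = [1.0 / len(particles)] * len(particles)
particles, weights = particle_filter(particles, weights, u=1.0, z=True)
```

Repeating this step as controls and measurements arrive makes the particles concentrate near states consistent with the whole history, which is the filtering behavior shown in the robot videos.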
[Thrun] Particle filters are really easy to implement, but they have some deficiencies. They don't really scale to high-dimensional spaces. That's been recognized, because the number of particles you need to fill a high-dimensional space tends to grow exponentially with the dimensionality of the space. So for 100 dimensions they're hard to make work. But there are extensions--they go under really fancy names like Rao-Blackwellized particle filters--that can actually do this, but I won't talk about them in any detail here. They also have problems with degenerate conditions. For example, they don't work well if you only have 1 particle or 2 particles. They tend not to work well if you have no noise in your measurement model or no noise in your controls. You kind of need noise to remix things a little bit. If there is very little noise, you have to deviate from the basic paradigm. But the good news is they work really well in many, many applications. For example, our self-driving cars use particle filters for localization and for mapping and for a number of other things. And the reason why they work so well is that they're really easy to implement, they're computationally efficient in the sense that they really put the computational resources where they are needed the most, and they can deal with highly multimodal, very complex posterior distributions that have many peaks. And that's important. Many other filters can't. So particle filters are often the method of choice when it comes to quickly building an estimation method for problems where the posterior is complex.
[Thrun] Wow! You learned a lot about hidden Markov models and particle filters. The particle filter is the most used algorithm in robotics today when it comes to interpreting sensor data, but these algorithms are applicable in a wide array of applications such as finance, medicine, behavioral studies, time series analysis, speech and language technologies--anything involving time and sensors or uncertainty. And now you know how to use them. You can apply them to all these problems if you listened carefully. It's been a pleasure teaching you this class. As I told you in the beginning, this is a topic very close to my heart, and I hope it's going to empower you to do better stuff in any domain involving time series and uncertainty.