Bellman equation

thumb|Bellman flow chart

A Bellman equation, named after Richard E. Bellman, is a technique in dynamic programming which breaks an optimization problem into a sequence of simpler subproblems, as Bellman's "principle of optimality" prescribes. It is a necessary condition for optimality. The "value" of a decision problem at a certain point in time is written in terms of the payoff from some initial choices and the "value" of the remaining decision problem that results from those initial choices. The equation applies to algebraic structures with a total ordering; for algebraic structures with a partial ordering, the generic Bellman's equation can be used.

The Bellman equation was first applied to engineering control theory and to other topics in applied mathematics, and subsequently became an important tool in economic theory; though the basic concepts of dynamic programming are prefigured in John von Neumann and Oskar Morgenstern's Theory of Games and Economic Behavior and Abraham Wald's sequential analysis. The term "Bellman equation" usually refers to the dynamic programming equation (DPE) associated with discrete-time optimization problems. In continuous-time optimization problems, the analogous equation is a partial differential equation called the Hamilton–Jacobi–Bellman equation.

In discrete time any multi-stage optimization problem can be solved by analyzing the appropriate Bellman equation. The appropriate Bellman equation can be found by introducing new state variables (state augmentation). However, the resulting augmented-state multi-stage optimization problem has a higher dimensional state space than the original multi-stage optimization problem - an issue that can potentially render the augmented problem intractable due to the "curse of dimensionality". Alternatively, it has been shown that if the cost function of the multi-stage optimization problem satisfies a "backward separable" structure, then the appropriate Bellman equation can be found without state augmentation.

Analytical concepts in dynamic programming

To understand the Bellman equation, several underlying concepts must be introduced. First, any optimization problem has some objective: minimizing travel time, minimizing cost, maximizing profits, maximizing utility, etc. The mathematical function that describes this objective is called the objective function.

Dynamic programming breaks a multi-period planning problem into simpler steps at different points in time. Therefore, it requires keeping track of how the decision situation is evolving over time. The information about the current situation that is needed to make a correct decision is called the "state". For example, to decide how much to consume and spend at each point in time, people would need to know (among other things) their initial wealth. Therefore, wealth <math>(W)</math> would be one of their state variables, but there would probably be others.

The variables chosen at any given point in time are often called the control variables. For instance, given their current wealth, people might decide how much to consume now. Choosing the control variables now may be equivalent to choosing the next state; more generally, the next state is affected by other factors in addition to the current control. For example, in the simplest case, today's wealth (the state) and consumption (the control) might exactly determine tomorrow's wealth (the new state), though typically other factors will affect tomorrow's wealth too.

The dynamic programming approach describes the optimal plan by finding a rule that tells what the controls should be, given any possible value of the state. For example, if consumption (c) depends only on wealth (W), we would seek a rule <math>c(W)</math> that gives consumption as a function of wealth. Such a rule, determining the controls as a function of the states, is called a policy function.</blockquote>

In computer science, a problem that can be broken apart like this is said to have optimal substructure. In the context of dynamic game theory, this principle is analogous to the concept of subgame perfect equilibrium, although what constitutes an optimal policy in this case is conditioned on the decision-maker's opponents choosing similarly optimal policies from their points of view.

As suggested by the principle of optimality, we will consider the first decision separately, setting aside all future decisions (we will start afresh from time 1 with the new state <math>x_1 </math>). Collecting the future decisions in brackets on the right, the above infinite-horizon decision problem is equivalent to:

:<math> \max_{ a_0 } \left \{ F(x_0,a_0)

+ \beta \left[ \max_{ \left \{ a_{t} \right \}_{t=1}^{\infty} }

\sum_{t=1}^{\infty} \beta^{t-1} F(x_t,a_{t}):

a_{t} \in \Gamma (x_t), \; x_{t+1}=T(x_t,a_t), \; \forall t \geq 1 \right] \right \}</math>

subject to the constraints

:<math> a_0 \in \Gamma (x_0), \; x_1=T(x_0,a_0). </math>

Here we are choosing <math>a_0</math>, knowing that our choice will cause the time 1 state to be <math>x_1=T(x_0,a_0)</math>. That new state will then affect the decision problem from time 1 on. The whole future decision problem appears inside the square brackets on the right.

The Bellman equation

So far it seems we have only made the problem uglier by separating today's decision from future decisions. But we can simplify by noticing that what is inside the square brackets on the right is the value of the time 1 decision problem, starting from state <math>x_1=T(x_0,a_0)</math>.

Therefore, the problem can be rewritten as a recursive definition of the value function:

:<math>V(x_0) = \max_{ a_0 } \{ F(x_0,a_0) + \beta V(x_1) \} </math>, subject to the constraints: <math> a_0 \in \Gamma (x_0), \; x_1=T(x_0,a_0). </math>

This is the Bellman equation. It may be simplified even further if the time subscripts are dropped and the value of the next state is plugged in:

:<math>V(x) = \max_{a \in \Gamma (x) } \{ F(x,a) + \beta V(T(x,a)) \}.</math>

The Bellman equation is classified as a functional equation, because solving it means finding the unknown function <math>V</math>, which is the value function. Recall that the value function describes the best possible value of the objective, as a function of the state <math>x</math>. By calculating the value function, we will also find the function <math>a(x)</math> that describes the optimal action as a function of the state; this is called the policy function.

In a stochastic problem

In the deterministic setting, other techniques besides dynamic programming can be used to tackle the above optimal control problem. However, the Bellman Equation is often the most convenient method of solving stochastic optimal control problems.

For a specific example from economics, consider an infinitely-lived consumer with initial wealth endowment <math>{\color{Red}a_0}</math> at period <math>0</math>. They have an instantaneous utility function <math>u(c)</math> where <math>c</math> denotes consumption and discounts the next period utility at a rate of <math>0< \beta<1 </math>. Assume that what is not consumed in period <math>t</math> carries over to the next period with interest rate <math>r</math>. Then the consumer's utility maximization problem is to choose a consumption plan <math>\ = (1 + r) ({\color{Red}a_t} - {\color{OliveGreen}c_t}), \; {\color{OliveGreen}c_t} \geq 0,</math>

and

:<math>\lim_{t \rightarrow \infty} {\color{Red}a_t} \geq 0.</math>

The first constraint is the capital accumulation/law of motion specified by the problem, while the second constraint is a transversality condition that the consumer does not carry debt at the end of their life. The Bellman equation is

:<math>V(a) = \max_{ 0 \leq c \leq a } \{ u(c) + \beta V((1+r) (a - c)) \},</math>

Alternatively, one can treat the sequence problem directly using, for example, the Hamiltonian equations.

Now, if the interest rate varies from period to period, the consumer is faced with a stochastic optimization problem. Let the interest r follow a Markov process with probability transition function <math>Q(r, d\mu_r)</math> where <math>d\mu_r</math> denotes the probability measure governing the distribution of interest rate next period if current interest rate is <math>r</math>. In this model the consumer decides their current period consumption after the current period interest rate is announced.

Rather than simply choosing a single sequence <math>\