Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program. It forms the foundation for a wide variety of compiler optimizations and program verification techniques. A program's control-flow graph (CFG) is used to determine those parts of a program to which a particular value assigned to a variable might propagate. A canonical example of a data-flow analysis is reaching definitions. Other commonly used data-flow analyses include live variable analysis, available expressions, constant propagation, and very busy expressions, each serving a distinct purpose in compiler optimization passes.

A simple way to perform data-flow analysis of programs is to set up data-flow equations for each node of the control-flow graph and solve them by repeatedly calculating the output from the input locally at each node until the whole system stabilizes, i.e., it reaches a fixpoint. The efficiency and precision of this process are significantly influenced by the design of the data-flow framework, including the direction of analysis (forward or backward), the domain of values, and the join operation used to merge information from multiple control paths. This general approach, also known as Kildall's method, was developed by Gary Kildall while teaching at the Naval Postgraduate School.

Basic principles

Data-flow analysis is the process of collecting information about the way the variables are defined and used in the program. It attempts to obtain particular information at each point in a procedure. Usually, it is enough to obtain this information at the boundaries of basic blocks, since from that it is easy to compute the information at points in the basic block. In forward flow analysis, the exit state of a block is a function of the block's entry state. This function is the composition of the effects of the statements in the block. The entry state of a block is a function of the exit states of its predecessors. This yields a set of data-flow equations:

For each block b:

: <math> out_b = trans_b (in_b) </math>

: <math> in_b = join_{p \in pred_b}(out_p) </math>

In this, <math> trans_b </math> is the transfer function of the block <math>b</math>. It works on the entry state <math>in_b</math>, yielding the exit state <math>out_b</math>. The join operation <math>join</math> combines the exit states of the predecessors <math>p \in pred_b</math> of <math>b</math>, yielding the entry state of <math>b</math>.
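As a minimal sketch of these two equations, the following Python fragment instantiates them for reaching definitions. The gen/kill form of the transfer function, the use of set union as the join, and the definition labels `d1`, `d2`, `d3` are assumptions chosen for illustration; other analyses use different transfer functions and joins.

```python
from functools import reduce

def join(states):
    """Join for a 'may' analysis such as reaching definitions: set union."""
    return reduce(frozenset.union, states, frozenset())

def make_transfer(gen, kill):
    """Build trans_b from hypothetical gen/kill sets: out = gen | (in - kill)."""
    gen, kill = frozenset(gen), frozenset(kill)
    return lambda in_state: gen | (in_state - kill)

# Hypothetical block b that creates definition d3 of a variable
# and thereby kills the earlier definition d1 of the same variable.
trans_b = make_transfer(gen={"d3"}, kill={"d1"})

# Exit states of the two (hypothetical) predecessors of b.
out_p1 = frozenset({"d1", "d2"})
out_p2 = frozenset({"d2"})

in_b = join([out_p1, out_p2])   # in_b  = join over predecessors
out_b = trans_b(in_b)           # out_b = trans_b(in_b)
```

Here `in_b` contains `d1` and `d2`, and `out_b` contains `d2` and `d3`, because `d1` is killed and `d3` is generated inside the block.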

After solving this set of equations, the entry and/or exit states of the blocks can be used to derive properties of the program at the block boundaries. The transfer function of each statement separately can be applied to get information at a point inside a basic block.

Each particular type of data-flow analysis has its own specific transfer function and join operation. Some data-flow problems require backward flow analysis. This follows the same plan, except that the transfer function is applied to the exit state yielding the entry state, and the join operation works on the entry states of the successors to yield the exit state.
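A backward analysis can be sketched in the same style. The example below uses live variable analysis on a hypothetical three-block, straight-line CFG; the block names, the `use`/`def` sets, and the transfer function `in = use | (out - def)` are illustrative assumptions. Because the graph has no cycles, a single pass from the last block to the first is enough.

```python
def live_in(use, defs, live_out):
    """Backward transfer function for liveness: in = use | (out - def)."""
    return frozenset(use) | (frozenset(live_out) - frozenset(defs))

# Hypothetical straight-line CFG:
#   B1: a = 1      B2: b = a + 1      B3: return b
blocks = {
    "B1": {"use": set(),  "def": {"a"}, "succ": ["B2"]},
    "B2": {"use": {"a"},  "def": {"b"}, "succ": ["B3"]},
    "B3": {"use": {"b"},  "def": set(), "succ": []},
}

live_in_of, live_out_of = {}, {}
# Visit blocks against the direction of control flow.
for name in ["B3", "B2", "B1"]:
    info = blocks[name]
    # Join (union) over the entry states of the successors gives the exit state.
    live_out_of[name] = frozenset().union(*(live_in_of[s] for s in info["succ"]))
    # The transfer function maps the exit state to the entry state.
    live_in_of[name] = live_in(info["use"], info["def"], live_out_of[name])
```

In this sketch `a` is live on entry to `B2`, `b` is live on entry to `B3`, and nothing is live on entry to `B1`.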

The entry point (in forward flow) plays an important role: since it has no predecessors, its entry state is well defined at the start of the analysis. For instance, the set of local variables with known values is empty. If the control-flow graph does not contain cycles (there are no explicit or implicit loops in the procedure), solving the equations is straightforward. The control-flow graph can then be topologically sorted; running in the order of this sort, the entry state can be computed at the start of each block, since all predecessors of that block have already been processed, so their exit states are available. If the control-flow graph does contain cycles, a more advanced algorithm is required.
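The acyclic case can be sketched with Python's standard `graphlib` module (available from Python 3.9). The diamond-shaped CFG, the block names, and the gen/kill sets below are hypothetical, and reaching definitions with set union as the join is assumed as the analysis.

```python
from graphlib import TopologicalSorter

# Hypothetical diamond CFG: entry -> then, entry -> else, then -> exit, else -> exit.
# The mapping is block -> list of predecessors, which is the form TopologicalSorter expects.
preds = {"entry": [], "then": ["entry"], "else": ["entry"], "exit": ["then", "else"]}

# Hypothetical gen/kill sets: x0 and x1 are two definitions of the same variable.
gen = {"entry": {"x0"}, "then": {"x1"}, "else": set(), "exit": set()}
kill = {"entry": {"x1"}, "then": {"x0"}, "else": set(), "exit": set()}

in_state, out_state = {}, {}
# static_order() yields every block after all of its predecessors,
# so each out_state[p] is already available when it is needed.
for b in TopologicalSorter(preds).static_order():
    in_state[b] = set().union(*(out_state[p] for p in preds[b]))
    out_state[b] = gen[b] | (in_state[b] - kill[b])
```

Each block is visited exactly once. Both definitions reach `exit`, one along each branch.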

An iterative algorithm

The most common way of solving the data-flow equations is by using an iterative algorithm. It starts with an approximation of the in-state of each block. The out-states are then computed by applying the transfer functions to the in-states. From these, the in-states are updated by applying the join operations. The latter two steps are repeated until a fixpoint is reached: the situation in which the in-states (and, in consequence, the out-states) no longer change.

A basic algorithm for solving data-flow equations is the round-robin iterative algorithm:

:for i ← 1 to N

::initialize node i

:while (sets are still changing)

::for i ← 1 to N

:::recompute sets at node i
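The round-robin scheme above can be sketched in Python as follows. This is one possible rendering, assuming a forward reaching-definitions problem in gen/kill form with the empty set as the initial approximation; the function name `round_robin`, the four-block CFG with a loop, and the definition labels are hypothetical.

```python
def round_robin(nodes, preds, gen, kill):
    """Sweep over all nodes until no in-state or out-state changes."""
    # Initialize every node with the empty set.
    in_state = {n: frozenset() for n in nodes}
    out_state = {n: frozenset() for n in nodes}
    changed = True
    while changed:                      # while (sets are still changing)
        changed = False
        for n in nodes:                 # recompute sets at each node
            new_in = frozenset().union(*(out_state[p] for p in preds[n]))
            new_out = gen[n] | (new_in - kill[n])
            if new_in != in_state[n] or new_out != out_state[n]:
                in_state[n], out_state[n] = new_in, new_out
                changed = True
    return in_state, out_state

# Hypothetical CFG with a loop: B1 -> B2, B2 -> B3, B3 -> B2, B2 -> B4.
#   B1: d1: i = 0      B3: d2: i = i + 1
nodes = ["B1", "B2", "B3", "B4"]
preds = {"B1": [], "B2": ["B1", "B3"], "B3": ["B2"], "B4": ["B2"]}
empty = frozenset()
gen = {"B1": frozenset({"d1"}), "B2": empty, "B3": frozenset({"d2"}), "B4": empty}
kill = {"B1": frozenset({"d2"}), "B2": empty, "B3": frozenset({"d1"}), "B4": empty}

in_state, out_state = round_robin(nodes, preds, gen, kill)
```

On this graph, both definitions of `i` reach the loop header `B2` and the exit block `B4`, while only `d2` leaves the loop body `B3`.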

Convergence

To be usable, the iterative approach should actually reach a fixpoint. This can be guaranteed by imposing constraints on the combination of the value domain of the states, the transfer functions and the join operation.

The value domain should be a partial order with finite height (i.e., there are no infinite ascending chains <math>x_1 < x_2 < \dots</math>). The combination of the transfer function and the join operation should be monotonic with respect to this partial order. Monotonicity ensures that on each iteration the value will either stay the same or grow larger, while finite height ensures that it cannot grow indefinitely. Thus, writing T for the combined transfer-and-join step, the iteration ultimately reaches a state x with T(x) = x, which is the fixpoint.
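The convergence argument can be illustrated with a small sketch. The value domain here is assumed to be the subsets of {0, 1, 2, 3, 4} ordered by inclusion, which has finite height, and `step` is a hypothetical monotonic function standing in for the combined transfer-and-join step. The helper `kleene_fixpoint` is likewise an illustrative name, not part of any library.

```python
def kleene_fixpoint(f, bottom):
    """Iterate f from bottom and return the ascending chain of values visited."""
    chain = [bottom]
    while True:
        nxt = f(chain[-1])
        if nxt == chain[-1]:
            return chain                # f(x) == x: fixpoint reached
        if not chain[-1] < nxt:         # '<' on frozensets is proper-subset
            raise ValueError("value did not grow; f is not behaving monotonically")
        chain.append(nxt)

def step(s):
    """Hypothetical monotonic step: always add 0, and add x + 1 for each x below 4."""
    return s | {0} | {x + 1 for x in s if x + 1 < 5}

chain = kleene_fixpoint(step, frozenset())
fixpoint = chain[-1]
```

Each value in `chain` is a strict superset of the previous one, so the chain cannot be longer than the height of the domain allows; here it ends at the full set {0, 1, 2, 3, 4}.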

The work list approach

It is easy to improve on the algorithm above by noticing that the in-state of a block will not change if the out-states of its predecessors do not change. Therefore, we introduce a work list: a list of blocks that still need to be processed. In each iteration, a block is removed from the work list and its out-state is computed. If the out-state changed, the block's successors are added to the work list. For efficiency, a block should not be in the work list more than once.

The algorithm is started by putting information-generating blocks in the work list. It terminates when the work list is empty.
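A work list solver along these lines might look as follows. This is a sketch under stated assumptions: a forward reaching-definitions problem in gen/kill form, a FIFO queue as the work list, and, as a simplification, an initial work list containing every block rather than only the information-generating ones. The function name `worklist_solve` and the example CFG are hypothetical.

```python
from collections import deque

def worklist_solve(nodes, preds, succs, gen, kill):
    """Process blocks from a work list until it is empty."""
    in_state = {n: frozenset() for n in nodes}
    out_state = {n: frozenset() for n in nodes}
    work = deque(nodes)     # simplification: seed with all blocks
    queued = set(nodes)     # ensures a block is in the work list at most once
    while work:
        n = work.popleft()
        queued.discard(n)
        in_state[n] = frozenset().union(*(out_state[p] for p in preds[n]))
        new_out = gen[n] | (in_state[n] - kill[n])
        if new_out != out_state[n]:
            out_state[n] = new_out
            # Only successors can be affected by the changed out-state.
            for s in succs[n]:
                if s not in queued:
                    work.append(s)
                    queued.add(s)
    return in_state, out_state

# Hypothetical CFG with a loop: B1 -> B2, B2 -> B3, B3 -> B2, B2 -> B4.
nodes = ["B1", "B2", "B3", "B4"]
preds = {"B1": [], "B2": ["B1", "B3"], "B3": ["B2"], "B4": ["B2"]}
succs = {"B1": ["B2"], "B2": ["B3", "B4"], "B3": ["B2"], "B4": []}
empty = frozenset()
gen = {"B1": frozenset({"d1"}), "B2": empty, "B3": frozenset({"d2"}), "B4": empty}
kill = {"B1": frozenset({"d2"}), "B2": empty, "B3": frozenset({"d1"}), "B4": empty}

in_state, out_state = worklist_solve(nodes, preds, succs, gen, kill)
```

The `queued` set mirrors the requirement that a block should not appear in the work list more than once. Only blocks whose predecessors actually changed are revisited, instead of sweeping over every block in each pass.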

Ordering

The efficiency of iteratively solving data-flow equations is influenced by the order in which the nodes are visited.
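As one illustration, an order commonly used for forward problems is reverse postorder, in which a block is visited before its successors whenever possible (back edges excepted). The sketch below computes such an order by depth-first search; the function name `reverse_postorder` and the example CFG are hypothetical, and the recursive formulation assumes a graph small enough not to exceed Python's recursion limit.

```python
def reverse_postorder(succs, entry):
    """Return the blocks reachable from entry in reverse postorder."""
    seen, post = set(), []

    def dfs(n):
        seen.add(n)
        for s in succs[n]:
            if s not in seen:
                dfs(s)
        post.append(n)      # a block is appended after all blocks reached from it

    dfs(entry)
    return post[::-1]

# Hypothetical CFG with a loop: B1 -> B2, B2 -> B3, B3 -> B2, B2 -> B4.
succs = {"B1": ["B2"], "B2": ["B3", "B4"], "B3": ["B2"], "B4": []}
order = reverse_postorder(succs, "B1")
```

For a backward problem, the analogous choice would be to visit the blocks in postorder instead.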

In 2002, Markus Mohnen described a new method of data-flow analysis that does not require the explicit construction of a data-flow graph,

</references>

Further reading