Matrix calculus

In mathematics, matrix calculus is a specialized notation for doing multivariable calculus, especially over spaces of matrices. It collects the various partial derivatives of a single function with respect to many variables, and/or of a multivariate function with respect to a single variable, into vectors and matrices that can be treated as single entities. This greatly simplifies operations such as finding the maximum or minimum of a multivariate function and solving systems of differential equations. The notation used here is commonly used in statistics and engineering, while the tensor index notation is preferred in physics.

Two competing notational conventions split the field of matrix calculus into two separate groups. The two groups can be distinguished by whether they write the derivative of a scalar with respect to a vector as a column vector or a row vector. Both of these conventions are possible even when the common assumption is made that vectors should be treated as column vectors when combined with matrices (rather than row vectors). A single convention can be somewhat standard throughout a single field that commonly uses matrix calculus (e.g. econometrics, statistics, estimation theory and machine learning). However, even within a given field different authors can be found using competing conventions. Authors of both groups often write as though their specific conventions were standard. Serious mistakes can result when combining results from different authors without carefully verifying that compatible notations have been used. Definitions of these two conventions and comparisons between them are collected in the layout conventions section.

Scope

Matrix calculus refers to a number of different notations that use matrices and vectors to collect the derivative of each component of the dependent variable with respect to each component of the independent variable. In general, the independent variable can be a scalar, a vector, or a matrix while the dependent variable can be any of these as well. Each different situation will lead to a different set of rules, or a separate calculus, using the broader sense of the term. Matrix notation serves as a convenient way to collect the many derivatives in an organized way.

As a first example, consider the gradient from vector calculus. For a scalar function of three independent variables, <math>f(x_1, x_2, x_3)</math>, the gradient is given by the vector equation

:<math>\nabla f = \frac{\partial f}{\partial x_1} \hat{x}_1 + \frac{\partial f}{\partial x_2} \hat{x}_2 + \frac{\partial f}{\partial x_3} \hat{x}_3 ,</math>

where <math>\hat{x}_i</math> represents a unit vector in the <math>x_i</math> direction for <math>1\le i \le 3</math>. This type of generalized derivative can be seen as the derivative of a scalar, <math>f</math>, with respect to a vector, <math>\mathbf{x}</math>, and its result can be easily collected in vector form.

:<math>\nabla f = \left( \frac{\partial f}{\partial \mathbf{x}} \right)^{\mathsf{T}} =
\begin{bmatrix}
\dfrac{\partial f}{\partial x_1} &
\dfrac{\partial f}{\partial x_2} &
\dfrac{\partial f}{\partial x_3}
\end{bmatrix}^\textsf{T}.
</math>
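As a rough numerical illustration of collecting partial derivatives into a gradient vector, the following sketch compares an analytic gradient with a central-difference estimate. The function `f`, the evaluation point `x0`, and the step size `h` are hypothetical choices made only for this example.

```python
import numpy as np

def f(x):
    # Hypothetical scalar function of three variables, chosen for illustration.
    return x[0] ** 2 + 3.0 * x[0] * x[1] + np.sin(x[2])

def grad_f(x):
    # Analytic gradient: the three partial derivatives collected in one vector.
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0], np.cos(x[2])])

def numerical_gradient(func, x, h=1e-6):
    # Central-difference estimate of each partial derivative.
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        g[i] = (func(x + step) - func(x - step)) / (2.0 * h)
    return g

x0 = np.array([1.0, -2.0, 0.5])
analytic = grad_f(x0)
numeric = numerical_gradient(f, x0)
max_error = float(np.max(np.abs(analytic - numeric)))
```

Here `max_error` should be small, which suggests that the vector of partial derivatives behaves as a single object that can be checked component by component.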

More complicated examples include the derivative of a scalar function with respect to a matrix, known as the gradient matrix, which collects the derivative with respect to each matrix element in the corresponding position in the resulting matrix. In that case the scalar must be a function of each of the independent variables in the matrix. As another example, if we have an <math>m</math>-vector of dependent variables, or functions, of <math>n</math> independent variables we might consider the derivative of the dependent vector with respect to the independent vector. The result could be collected in an <math>m \times n</math> matrix consisting of all of the possible derivative combinations.

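There are a total of nine possibilities using scalars, vectors, and matrices. Notice that as we consider higher numbers of components in each of the independent and dependent variables we can be left with a very large number of possibilities. The six kinds of derivatives that can be most neatly organized in matrix form are collected in the following table.

:{|class="wikitable" style="text-align: center;"
|+ Types of matrix derivative (columns: dependent variable; rows: independent variable)
! scope="col" | Types
! scope="col" | Scalar
! scope="col" | Vector
! scope="col" | Matrix
|-
! scope="row" | Scalar
| <math>\frac{\partial y}{\partial x}</math> || <math>\frac{\partial \mathbf{y}}{\partial x}</math> || <math>\frac{\partial \mathbf{Y}}{\partial x}</math>
|-
! scope="row" | Vector
| <math>\frac{\partial y}{\partial \mathbf{x}}</math> || <math>\frac{\partial \mathbf{y}}{\partial \mathbf{x}}</math> ||
|-
! scope="row" | Matrix
| <math>\frac{\partial y}{\partial \mathbf{X}}</math> || ||
|}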

Using numerator-layout notation, we have:

:<math>\begin{align}
\frac{\partial y}{\partial \mathbf{x}} &= \begin{bmatrix}
\frac{\partial y}{\partial x_1} &
\frac{\partial y}{\partial x_2} &
\cdots &
\frac{\partial y}{\partial x_n}
\end{bmatrix}. \\
\frac{\partial \mathbf{y}}{\partial x} &= \begin{bmatrix}
\frac{\partial y_1}{\partial x} \\
\frac{\partial y_2}{\partial x} \\
\vdots \\
\frac{\partial y_m}{\partial x} \\
\end{bmatrix}. \\
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} &= \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \\
\end{bmatrix}. \\
\frac{\partial y}{\partial \mathbf{X}} &= \begin{bmatrix}
\frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{21}} & \cdots & \frac{\partial y}{\partial x_{p1}} \\
\frac{\partial y}{\partial x_{12}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{p2}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y}{\partial x_{1q}} & \frac{\partial y}{\partial x_{2q}} & \cdots & \frac{\partial y}{\partial x_{pq}} \\
\end{bmatrix}.
\end{align}</math>
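The numerator-layout arrangement of a vector-by-vector derivative can be sketched numerically as follows. The function `y`, with two outputs and three inputs, and the helper `jacobian_numerator_layout` are hypothetical names introduced only for this illustration; the helper uses central differences rather than any symbolic method.

```python
import numpy as np

def y(x):
    # Hypothetical vector function with m = 2 outputs and n = 3 inputs.
    return np.array([x[0] * x[1], x[1] + x[2] ** 2])

def jacobian_numerator_layout(func, x, h=1e-6):
    # Row i, column j holds d y_i / d x_j, so the result is m-by-n.
    m = func(x).size
    J = np.zeros((m, x.size))
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        J[:, j] = (func(x + step) - func(x - step)) / (2.0 * h)
    return J

x0 = np.array([1.0, 2.0, 3.0])
J = jacobian_numerator_layout(y, x0)

# Hand-computed partial derivatives at x0, arranged in numerator layout.
expected = np.array([[2.0, 1.0, 0.0],
                     [0.0, 1.0, 6.0]])
```

The rows of `J` follow the components of the dependent vector and the columns follow the components of the independent vector, matching the third equation in the block above.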

The following definitions are only provided in numerator-layout notation:

:<math>\begin{align}
\frac{\partial \mathbf{Y}}{\partial x} &= \begin{bmatrix}
\frac{\partial y_{11}}{\partial x} & \frac{\partial y_{12}}{\partial x} & \cdots & \frac{\partial y_{1n}}{\partial x} \\
\frac{\partial y_{21}}{\partial x} & \frac{\partial y_{22}}{\partial x} & \cdots & \frac{\partial y_{2n}}{\partial x} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_{m1}}{\partial x} & \frac{\partial y_{m2}}{\partial x} & \cdots & \frac{\partial y_{mn}}{\partial x} \\
\end{bmatrix}. \\
d\mathbf{X} &= \begin{bmatrix}
dx_{11} & dx_{12} & \cdots & dx_{1n} \\
dx_{21} & dx_{22} & \cdots & dx_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
dx_{m1} & dx_{m2} & \cdots & dx_{mn} \\
\end{bmatrix}.
\end{align}</math>

Denominator-layout notation

Using denominator-layout notation, we have:

:<math>\begin{align}
\frac{\partial y}{\partial \mathbf{x}} &= \begin{bmatrix}
\frac{\partial y}{\partial x_1}\\
\frac{\partial y}{\partial x_2}\\
\vdots\\
\frac{\partial y}{\partial x_n}\\
\end{bmatrix}. \\
\frac{\partial \mathbf{y}}{\partial x} &= \begin{bmatrix}
\frac{\partial y_1}{\partial x} &
\frac{\partial y_2}{\partial x} &
\cdots &
\frac{\partial y_m}{\partial x}
\end{bmatrix}. \\
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} &= \begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_2}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_1} \\
\frac{\partial y_1}{\partial x_2} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_1}{\partial x_n} & \frac{\partial y_2}{\partial x_n} & \cdots & \frac{\partial y_m}{\partial x_n}\\
\end{bmatrix}. \\
\frac{\partial y}{\partial \mathbf{X}} &= \begin{bmatrix}
\frac{\partial y}{\partial x_{11}} & \frac{\partial y}{\partial x_{12}} & \cdots & \frac{\partial y}{\partial x_{1q}}\\
\frac{\partial y}{\partial x_{21}} & \frac{\partial y}{\partial x_{22}} & \cdots & \frac{\partial y}{\partial x_{2q}}\\
\vdots & \vdots & \ddots & \vdots\\
\frac{\partial y}{\partial x_{p1}} & \frac{\partial y}{\partial x_{p2}} & \cdots & \frac{\partial y}{\partial x_{pq}}\\
\end{bmatrix}.
\end{align}</math>
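To make the difference between the two layouts concrete for a scalar-by-matrix derivative, the following sketch differentiates a hypothetical scalar function <math>y = \mathbf{a}^\top \mathbf{X} \mathbf{b}</math> of a <math>p \times q</math> matrix entry by entry. The vectors `a` and `b`, the sizes `p` and `q`, and the helper name are assumptions made for illustration only.

```python
import numpy as np

p, q = 2, 3
a = np.array([1.0, -2.0])          # length p
b = np.array([0.5, 1.5, -1.0])     # length q

def y(X):
    # Hypothetical scalar function of a p-by-q matrix: y = a^T X b.
    return a @ X @ b

def derivative_denominator_layout(func, X, h=1e-6):
    # Entry (i, j) holds d y / d x_ij, so the result has the shape of X.
    D = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            step = np.zeros_like(X)
            step[i, j] = h
            D[i, j] = (func(X + step) - func(X - step)) / (2.0 * h)
    return D

X0 = np.arange(6, dtype=float).reshape(p, q)
denominator = derivative_denominator_layout(y, X0)  # p-by-q
numerator = denominator.T                            # q-by-p
```

In this sketch the denominator-layout result has the shape of <math>\mathbf{X}</math> and equals the outer product of `a` and `b`, while the numerator-layout result is its transpose.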

Identities

As noted above, in general, the results of operations will be transposed when switching between numerator-layout and denominator-layout notation.

To help make sense of all the identities below, keep in mind the most important rules: the chain rule, product rule and sum rule. The sum rule applies universally, and the product rule applies in most of the cases below, provided that the order of matrix products is maintained, since matrix products are not commutative. The chain rule applies in some of the cases, but unfortunately does not apply in matrix-by-scalar derivatives or scalar-by-matrix derivatives (in the latter case, mostly involving the trace operator applied to matrices). For scalar-by-matrix derivatives the product rule cannot be applied directly either, but the equivalent can be done with a bit more work using the differential identities.

The following identities adopt the following conventions:

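the scalars <math>a</math>, <math>b</math>, <math>c</math>, <math>d</math>, and <math>e</math> are constant in respect of, and the scalars <math>u</math> and <math>v</math> are functions of, one of <math>x</math>, <math>\mathbf{x}</math>, or <math>\mathbf{X}</math>;
the vectors <math>\mathbf{a}</math>, <math>\mathbf{b}</math>, <math>\mathbf{c}</math>, <math>\mathbf{d}</math>, and <math>\mathbf{e}</math> are constant in respect of, and the vectors <math>\mathbf{u}</math> and <math>\mathbf{v}</math> are functions of, one of <math>x</math>, <math>\mathbf{x}</math>, or <math>\mathbf{X}</math>;
the matrices <math>\mathbf{A}</math>, <math>\mathbf{B}</math>, <math>\mathbf{C}</math>, <math>\mathbf{D}</math>, and <math>\mathbf{E}</math> are constant in respect of, and the matrices <math>\mathbf{U}</math> and <math>\mathbf{V}</math> are functions of, one of <math>x</math>, <math>\mathbf{x}</math>, or <math>\mathbf{X}</math>.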

Vector-by-vector identities

This is presented first because all of the operations that apply to vector-by-vector differentiation apply directly to vector-by-scalar or scalar-by-vector differentiation simply by reducing the appropriate vector in the numerator or denominator to a scalar.

:{|class="wikitable" style="text-align: center;"
|+ Identities: vector-by-vector <math>\frac{\partial \mathbf{y}}{\partial \mathbf{x}}</math>
! scope="col" width="150" | Condition
! scope="col" width="10" | Expression
! scope="col" width="100" | Numerator layout, i.e. by <math>\mathbf{y}</math> and <math>\mathbf{x}^\top</math>
! scope="col" width="100" | Denominator layout, i.e. by <math>\mathbf{y}^\top</math> and <math>\mathbf{x}</math>
|-
| <math>\mathbf{a}</math> is not a function of <math>\mathbf{x}</math> || <math>\frac{\partial \mathbf{a}}{\partial \mathbf{x}} =</math> ||colspan=2| <math>\mathbf{0}</math>
|-
| || <math>\frac{\partial \mathbf{x}}{\partial \mathbf{x}} =</math> || colspan=2|<math>\mathbf{I}</math>
|-
| <math>\mathbf{A}</math> is not a function of <math>\mathbf{x}</math> || <math>\frac{\partial \mathbf{A}\mathbf{x}}{\partial \mathbf{x}} =</math> || <math>\mathbf{A}</math> || <math>\mathbf{A}^\top</math>
|-
| <math>\mathbf{A}</math> is not a function of <math>\mathbf{x}</math> || <math>\frac{\partial \mathbf{x}^\top \mathbf{A}}{\partial \mathbf{x}} =</math> || <math>\mathbf{A}^\top</math> || <math>\mathbf{A}</math>
|-
| <math>a</math> is not a function of <math>\mathbf{x}</math>, <math>\mathbf{u} = \mathbf{u}(\mathbf{x})</math> || <math>\frac{\partial a\mathbf{u}}{\partial\, \mathbf{x}} =</math>
| colspan=2|<math>a\frac{\partial \mathbf{u}}{\partial \mathbf{x}}</math>
|-
| <math>v = v(\mathbf{x})</math>, <math>\mathbf{a}</math> is not a function of <math>\mathbf{x}</math> || <math>\frac{\partial v\mathbf{a}}{\partial \mathbf{x}} =</math> || <math>\mathbf{a}\frac{\partial v}{\partial \mathbf{x}}</math> || <math>\frac{\partial v}{\partial \mathbf{x}} \mathbf{a}^\top</math>
|-
| <math>v = v(\mathbf{x})</math>, <math>\mathbf{u} = \mathbf{u}(\mathbf{x})</math> || <math>\frac{\partial v\mathbf{u}}{\partial \mathbf{x}} =</math> || <math>v \frac{\partial \mathbf{u}}{\partial \mathbf{x}} + \mathbf{u}\frac{\partial v}{\partial \mathbf{x}}</math> || <math>v\frac{\partial \mathbf{u}}{\partial \mathbf{x}} + \frac{\partial v}{\partial \mathbf{x}} \mathbf{u}^\top</math>
|-
| <math>\mathbf{A}</math> is not a function of <math>\mathbf{x}</math>, <math>\mathbf{u} = \mathbf{u}(\mathbf{x})</math> || <math>\frac{\partial \mathbf{A}\mathbf{u}}{\partial \mathbf{x}} =</math> || <math>\mathbf{A}\frac{\partial \mathbf{u}}{\partial \mathbf{x}}</math> || <math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}}\mathbf{A}^\top</math>
|-
| <math>\mathbf{u} = \mathbf{u}(\mathbf{x})</math>, <math>\mathbf{v} = \mathbf{v}(\mathbf{x})</math> || <math>\frac{\partial (\mathbf{u} + \mathbf{v})}{\partial \mathbf{x}} =</math>
| colspan=2|<math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}} + \frac{\partial \mathbf{v}}{\partial \mathbf{x}}</math>
|-
| <math>\mathbf{u} = \mathbf{u}(\mathbf{x})</math> || <math>\frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{x}} =</math> || <math>\frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}</math> || <math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}}</math>
|-
| <math>\mathbf{u} = \mathbf{u}(\mathbf{x})</math> || <math>\frac{\partial \mathbf{f}(\mathbf{g}(\mathbf{u}))}{\partial \mathbf{x}} =</math> || <math>\frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{u}}{\partial \mathbf{x}}</math> || <math>\frac{\partial \mathbf{u}}{\partial \mathbf{x}} \frac{\partial \mathbf{g}(\mathbf{u})}{\partial \mathbf{u}} \frac{\partial \mathbf{f}(\mathbf{g})}{\partial \mathbf{g}}</math>
|}
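Two of the vector-by-vector identities can be spot-checked numerically in numerator layout, as in the sketch below: the derivative of <math>\mathbf{A}\mathbf{x}</math> and the product rule for a scalar function times a vector function. The matrix `A`, the functions `u` and `v`, and the helper `jacobian` are hypothetical choices for this example, and the check relies on central differences, not on a symbolic proof.

```python
import numpy as np

def jacobian(func, x, h=1e-6):
    # Numerator layout: entry (i, j) is d func_i / d x_j.
    cols = []
    for j in range(x.size):
        step = np.zeros_like(x)
        step[j] = h
        cols.append((func(x + step) - func(x - step)) / (2.0 * h))
    return np.stack(cols, axis=1)

A = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, -1.0]])

def u(x):
    # Hypothetical vector function u(x) = A x.
    return A @ x

def v(x):
    # Hypothetical scalar function v(x) = x . x, returned as a length-1 array
    # so that the same jacobian helper yields a 1-by-n row vector.
    return np.array([x @ x])

x0 = np.array([1.0, 2.0, -1.0])

# d(Ax)/dx should equal A in numerator layout.
J_u = jacobian(u, x0)

# Product rule: d(v u)/dx = v du/dx + u dv/dx (column u times row dv/dx).
lhs = jacobian(lambda x: v(x)[0] * u(x), x0)
rhs = v(x0)[0] * J_u + np.outer(u(x0), jacobian(v, x0)[0])
```

Note that the second term is an outer product, which is why the order of the factors matters; in denominator layout both sides would be transposed.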

Scalar-by-vector identities

The fundamental identities are placed above the thick black line.

:{|class="wikitable" style="text-align: center;"
|+ Identities: scalar-by-vector <math>\frac{\partial y}{\partial \mathbf{x}} = \nabla_\mathbf{x} y</math>
! scope="col" width="150" | Condition
! scope="col" width="200" | Expression
! scope="col" width="200" | Numerator layout, i.e. by <math>\mathbf{x}^\top</math>; result is row vector
! scope="col" width="200" | Denominator layout, i.e. by <math>\mathbf{x}</math>; result is column vector
|-
| <math>a</math> is not a function of <math>\mathbf{x}</math> || <math>\frac{\partial a}{\partial \mathbf{x}} =</math> || <math>\mathbf{0}^\top</math> || <math>\mathbf{0}</math>
|}

:{|class="wikitable" style="text-align: center;"
|+ Identities: scalar-by-matrix <math>\frac{\partial y}{\partial \mathbf{X}}</math>
! scope="col" width="175" | Condition
! scope="col" width="100" | Expression
! scope="col" width="100" | Numerator layout, i.e. by <math>\mathbf{X}^\top</math>
! scope="col" width="100" | Denominator layout, i.e. by <math>\mathbf{X}</math>
|-
| <math>\mathbf{A}</math> is not a function of <math>\mathbf{X}</math> || <math>\frac{\partial \operatorname{tr}(\mathbf{AX})}{\partial \mathbf{X}} = \frac{\partial \operatorname{tr}(\mathbf{XA})}{\partial \mathbf{X}} =</math> || <math>\mathbf{A}</math> || <math>\mathbf{A}^\top</math>
|-
| || <math>\frac{\partial |\mathbf{X}|}{\partial \mathbf{X}} =</math> || <math>\operatorname{cofactor}(\mathbf{X})^\top = |\mathbf{X}|\mathbf{X}^{-1}</math> || <math>\operatorname{cofactor}(\mathbf{X}) = |\mathbf{X}|\left(\mathbf{X}^{-1}\right)^\top</math>
|-
| <math>a</math> is not a function of <math>\mathbf{X}</math> || <math>\frac{\partial \ln |a\mathbf{X}|}{\partial \mathbf{X}} =</math> || <math>\mathbf{X}^{-1}</math> || <math>\left(\mathbf{X}^{-1}\right)^\top</math>
|-
| <math>\mathbf{A}</math>, <math>\mathbf{B}</math> are not functions of <math>\mathbf{X}</math> || || ||
|}

:{|class="wikitable" style="text-align: center;"
|+ Identities: scalar-by-scalar, with matrices involved
! scope="col" width="175" | Condition
! scope="col" width="100" | Expression
! scope="col" width="100" | Consistent numerator layout, i.e. by <math>\mathbf{Y}</math> and <math>\mathbf{X}^\top</math>
! scope="col" width="100" | Mixed layout, i.e. by <math>\mathbf{Y}</math> and <math>\mathbf{X}</math>
|-
| <math>\mathbf{U} = \mathbf{U}(x)</math> || <math>\frac{\partial |\mathbf{U}|}{\partial x} =</math> || colspan=2|<math>|\mathbf{U}|\operatorname{tr}\left(\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right)</math>
|-
| <math>\mathbf{U} = \mathbf{U}(x)</math> || <math>\frac{\partial \ln|\mathbf{U}|}{\partial x} =</math> || colspan=2|<math>\operatorname{tr}\left(\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right)</math>
|-
| <math>\mathbf{U} = \mathbf{U}(x)</math> || <math>\frac{\partial^2 |\mathbf{U}|}{\partial x^2} =</math>
| colspan=2 | <math>\left|\mathbf{U}\right| \left[
\operatorname{tr}\left(\mathbf{U}^{-1}\frac{\partial^2 \mathbf{U}}{\partial x^2}\right) +
\operatorname{tr}^2\left(\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right) -
\operatorname{tr}\left(\left(\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right)^2\right)
\right]</math>
|-
| <math>\mathbf{U} = \mathbf{U}(x)</math>
| <math>\frac{\partial g(\mathbf{U})}{\partial x} =</math>
| <math>\operatorname{tr}\left( \frac{\partial g(\mathbf{U})}{\partial \mathbf{U}} \frac{\partial \mathbf{U}}{\partial x}\right)</math>
| <math>\operatorname{tr}\left( \left(\frac{\partial g(\mathbf{U})}{\partial \mathbf{U}}\right)^\top \frac{\partial \mathbf{U}}{\partial x}\right)</math>
|-
| <math>\mathbf{A}</math> is not a function of <math>x</math>, <math>\mathbf{g}(\mathbf{X})</math> is any polynomial with scalar coefficients, or any matrix function defined by an infinite polynomial series (e.g. <math>e^\mathbf{X}</math>, <math>\sin(\mathbf{X})</math>, <math>\cos(\mathbf{X})</math>, <math>\ln(\mathbf{X})</math>, etc.); <math>g(x)</math> is the equivalent scalar function, <math>g'(x)</math> is its derivative, and <math>\mathbf{g}'(\mathbf{X})</math> is the corresponding matrix function. || <math>\frac{\partial \operatorname{tr}(\mathbf{g}(x\mathbf{A}))}{\partial x} =</math> || colspan=2|<math>\operatorname{tr}\left(\mathbf{A}\mathbf{g}'(x\mathbf{A})\right)</math>
|-
| <math>\mathbf{A}</math> is not a function of <math>x</math> || <math>\frac{\partial \operatorname{tr}\left(e^{x\mathbf{A}}\right)}{\partial x} =</math> || colspan=2|<math>\operatorname{tr}\left(\mathbf{A}e^{x\mathbf{A}}\right)</math>
|}
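The determinant identity <math>\frac{\partial |\mathbf{U}|}{\partial x} = |\mathbf{U}|\operatorname{tr}\left(\mathbf{U}^{-1}\frac{\partial \mathbf{U}}{\partial x}\right)</math> can be checked numerically for a small case. In the sketch below, the matrix function `U`, its elementwise derivative `dU`, and the point `x0` are hypothetical and chosen only so that the matrix stays invertible.

```python
import numpy as np

def U(x):
    # Hypothetical matrix-valued function of a scalar x.
    return np.array([[2.0 + x, x ** 2],
                     [np.sin(x), 3.0]])

def dU(x):
    # Elementwise derivative of U with respect to x.
    return np.array([[1.0, 2.0 * x],
                     [np.cos(x), 0.0]])

x0, h = 0.7, 1e-6

# Left-hand side: central difference of the determinant.
lhs = (np.linalg.det(U(x0 + h)) - np.linalg.det(U(x0 - h))) / (2.0 * h)

# Right-hand side: |U| tr(U^{-1} dU/dx), using a linear solve instead of
# forming the inverse explicitly.
U0 = U(x0)
rhs = np.linalg.det(U0) * np.trace(np.linalg.solve(U0, dU(x0)))
```

The two values `lhs` and `rhs` should agree to within the finite-difference error, which gives some confidence in the identity for this particular matrix function.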

Identities in differential form

It is often easier to work in differential form and then convert back to normal derivatives. This only works well using the numerator layout. In these rules, <math>a</math> is a scalar.

:{|class="wikitable" style="text-align: center;"
|+ Differential identities: matrix
! Condition !! Expression !! Result (numerator layout)
|-
| <math>\mathbf{A}</math> is not a function of <math>\mathbf{X}</math> || <math>d(\mathbf{A}) =</math> || <math>0</math>
|-
| <math>a</math> is not a function of <math>\mathbf{X}</math> || <math>d(a\mathbf{X}) =</math> || <math>a\,d\mathbf{X}</math>
|-
| || <math>d(\mathbf{X} + \mathbf{Y}) =</math> || <math>d\mathbf{X} + d\mathbf{Y}</math>
|-
| || <math>d(\mathbf{X}\mathbf{Y}) =</math> || <math>(d\mathbf{X})\mathbf{Y} + \mathbf{X}(d\mathbf{Y})</math>
|-
| (Kronecker product) || <math>d(\mathbf{X} \otimes \mathbf{Y}) =</math> || <math>(d\mathbf{X})\otimes\mathbf{Y} + \mathbf{X}\otimes(d\mathbf{Y})</math>
|-
| (Hadamard product) || <math>d(\mathbf{X} \circ \mathbf{Y}) =</math> || <math>(d\mathbf{X}) \circ \mathbf{Y} + \mathbf{X} \circ (d\mathbf{Y})</math>
|-
| || <math>d\left(\mathbf{X}^\top\right) =</math> || <math>(d\mathbf{X})^\top</math>
|-
| || <math>d\left(\mathbf{X}^{-1}\right) =</math> || <math>-\mathbf{X}^{-1}\left(d\mathbf{X}\right)\mathbf{X}^{-1}</math>
|-
| (conjugate transpose) || <math>d\left(\mathbf{X}^\mathrm{H}\right) =</math> || <math>(d\mathbf{X})^\mathrm{H}</math>
|-
| <math>n</math> is a positive integer || <math>d\left(\mathbf{X}^n\right) =</math> || <math>\sum_{i=0}^{n-1} \mathbf{X}^i (d\mathbf{X})\mathbf{X}^{n-i-1}</math>
|-
| || <math>d \left(e^\mathbf{X}\right) =</math> || <math>\int_0^1 e^{a\mathbf{X}} (d\mathbf{X}) e^{(1-a)\mathbf{X}} \, da</math>
|-
| || <math>d \left(\log \mathbf{X}\right) =</math> || <math>\int_0^\infty (\mathbf{X}+z \, \mathbf{I})^{-1} (d\mathbf{X}) (\mathbf{X}+z \, \mathbf{I})^{-1} \, dz</math>
|-
| <math>\mathbf{X} = \sum_i \lambda_i \mathbf{P}_i</math> is diagonalizable

<math>\mathbf{P}_i \mathbf{P}_j = \delta_{ij} \mathbf{P}_i</math>

<math>f</math> is differentiable at every eigenvalue <math>\lambda_i</math>
| <math>d \left(f(\mathbf{X})\right) =</math>
| <math>\sum_{ij} \mathbf{P}_i (d\mathbf{X}) \mathbf{P}_j \begin{cases}
f'(\lambda_i) & \lambda_i = \lambda_j \\
\frac{f(\lambda_i) - f(\lambda_j)}{\lambda_i - \lambda_j} & \lambda_i \neq \lambda_j
\end{cases}</math>
|}

In the last row, <math>\delta_{ij}</math> is the Kronecker delta and <math>(\mathbf{P}_k)_{ij} = (\mathbf{Q})_{ik} (\mathbf{Q}^{-1})_{kj}</math> is the set of orthogonal projection operators that project onto the <math>k</math>-th eigenvector of <math>\mathbf{X}</math>.

<math>\mathbf{Q}</math> is the matrix of eigenvectors of <math>\mathbf{X} = \mathbf{Q} \boldsymbol{\Lambda} \mathbf{Q}^{-1}</math>, and <math>(\boldsymbol{\Lambda})_{ii} = \lambda_i</math> are the eigenvalues.

The matrix function <math>f(\mathbf{X})</math> is defined in terms of the scalar function <math>f(x)</math> for diagonalizable matrices by <math display="inline">f(\mathbf{X}) = \sum_i f(\lambda_i) \mathbf{P}_i</math> where <math display="inline">\mathbf{X} = \sum_i \lambda_i \mathbf{P}_i</math> with <math display="inline">\mathbf{P}_i \mathbf{P}_j = \delta_{ij} \mathbf{P}_i</math>.
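The differential identities for the matrix inverse and for an integer matrix power can be compared against actual small perturbations, as in the following sketch. The matrix `X`, the perturbation `dX`, and the exponent `n` are hypothetical values picked for illustration; the comparison is first-order only, so a small residual of second order in `dX` is expected.

```python
import numpy as np

# Hypothetical invertible matrix and a small perturbation of it.
X = np.array([[4.0, 1.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
dX = 1e-6 * np.array([[1.0, 2.0, 0.0],
                      [0.0, -1.0, 1.0],
                      [3.0, 0.0, 1.0]])

# d(X^{-1}) = -X^{-1} (dX) X^{-1}
X_inv = np.linalg.inv(X)
actual_inv = np.linalg.inv(X + dX) - X_inv
predicted_inv = -X_inv @ dX @ X_inv

# d(X^n) = sum_{i=0}^{n-1} X^i (dX) X^{n-i-1}
n = 3
actual_pow = np.linalg.matrix_power(X + dX, n) - np.linalg.matrix_power(X, n)
predicted_pow = sum(
    np.linalg.matrix_power(X, i) @ dX @ np.linalg.matrix_power(X, n - i - 1)
    for i in range(n)
)
```

Because matrix products do not commute, the power rule keeps `dX` in each possible position rather than collapsing to a single term as in the scalar case.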

To convert to normal derivative form, first convert it to one of the following canonical forms, and then use these identities:

:{|class="wikitable" style="text-align: center;"
|+ Conversion from differential to derivative form
|}
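As one hedged illustration of this conversion, assume the canonical form <math>dy = \operatorname{tr}(\mathbf{A}\,d\mathbf{X})</math> corresponds to <math>\frac{\partial y}{\partial \mathbf{X}} = \mathbf{A}</math> in numerator layout, which is consistent with the numerator-layout result for <math>\operatorname{tr}(\mathbf{AX})</math>. For <math>y = \ln|\mathbf{X}|</math> the differential is <math>dy = \operatorname{tr}(\mathbf{X}^{-1}d\mathbf{X})</math>, so the derivative would be <math>\mathbf{X}^{-1}</math>. The sketch below checks this numerically; the matrix `X` and the helper `y` are hypothetical.

```python
import numpy as np

# Hypothetical non-symmetric matrix with positive determinant.
X = np.array([[4.0, 1.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

def y(M):
    # y = ln|M|; slogdet returns (sign, log of the absolute determinant).
    sign, logabsdet = np.linalg.slogdet(M)
    return logabsdet

# Differential form: dy = tr(X^{-1} dX), so A = X^{-1} in dy = tr(A dX).
A = np.linalg.inv(X)

# Entrywise finite differences give the denominator layout (shape of X).
h = 1e-6
D = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        step = np.zeros_like(X)
        step[i, j] = h
        D[i, j] = (y(X + step) - y(X - step)) / (2.0 * h)

# Transposing yields the numerator layout, which should match A.
numerator = D.T
```

A non-symmetric `X` is used on purpose, since for a symmetric matrix the two layouts would coincide and the transpose would go unnoticed.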

Matrix calculus is used in regression analysis to compute, for example, the ordinary least squares regression formula for the case of multiple explanatory variables.

It is also used in the study of random matrices, statistical moments, local sensitivity and statistical diagnostics.
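The ordinary least squares case can be sketched as follows. Differentiating the squared error <math>\|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|^2</math> with respect to <math>\boldsymbol\beta</math> gives, in denominator layout, <math>-2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol\beta)</math>, and setting this to zero yields the normal equations <math>\mathbf{X}^\top\mathbf{X}\boldsymbol\beta = \mathbf{X}^\top\mathbf{y}</math>. The data below are synthetic and the variable names are assumptions for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))          # hypothetical design matrix
beta_true = np.array([1.0, -2.0, 0.5])    # hypothetical coefficients
y = X @ beta_true + 0.1 * rng.standard_normal(50)

def gradient(beta):
    # Denominator-layout gradient of ||y - X beta||^2 with respect to beta.
    return -2.0 * X.T @ (y - X @ beta)

# Setting the gradient to zero gives the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
grad_norm = float(np.linalg.norm(gradient(beta_hat)))
```

The gradient at `beta_hat` should vanish up to rounding error, and the result should match `np.linalg.lstsq`, which in practice is usually preferred over forming <math>\mathbf{X}^\top\mathbf{X}</math> explicitly for numerical stability.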

External links

Software

MatrixCalculus.org, a website for evaluating matrix calculus expressions symbolically
NCAlgebra, an open-source Mathematica package that has some matrix calculus functionality
SymPy supports symbolic matrix derivatives in its matrix expression module, as well as symbolic tensor derivatives in its array expression module.
Tensorgrad, an open-source python package for matrix calculus. Supports general symbolic tensor derivatives using Penrose graphical notation.

Information

Matrix Reference Manual, Mike Brookes, Imperial College London.
Matrix Differentiation (and some other stuff), Randal J. Barnes, Department of Civil Engineering, University of Minnesota.
Notes on Matrix Calculus, Paul L. Fackler, North Carolina State University.
Matrix Differential Calculus (slide presentation), Zhang Le, University of Edinburgh.
Introduction to Vector and Matrix Differentiation (notes on matrix differentiation, in the context of Econometrics), Heino Bohn Nielsen.
A note on differentiating matrices (notes on matrix differentiation), Pawel Koval, from Munich Personal RePEc Archive.
Vector/Matrix Calculus More notes on matrix differentiation.
Matrix Identities (notes on matrix differentiation), Sam Roweis.
Tensor Cookbook Matrix Calculus using Tensor Diagrams.
