thumb|400px|Example graph of a logistic regression curve fitted to data. The curve shows the estimated probability of passing an exam (binary dependent variable) versus hours studying (scalar independent variable). See the Example section for worked details.

In statistics, a logistic model (or logit model) is a statistical model that models the log-odds of an event as a linear combination of one or more independent variables. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (the coefficients in the linear or non-linear combinations). In binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling.

Many medical scales used to assess the severity of a patient's condition have been developed using logistic regression. Logistic regression may be used to predict the risk of developing a given disease (e.g. diabetes or coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of various blood tests, etc.). Another example might be to predict whether a Nepalese voter will vote for the Nepali Congress, the Communist Party of Nepal, or any other party, based on age, income, sex, race, state of residence, votes in previous elections, etc. It is also used in marketing applications, such as predicting a customer's propensity to purchase a product or cancel a subscription. In economics, it can be used to predict the likelihood of a person ending up in the labor force, and a business application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional random fields, an extension of logistic regression to sequential data, are used in natural language processing. Disaster planners and engineers rely on these models to predict decisions taken by householders or building occupants in small-scale and large-scale evacuations, such as building fires, wildfires, and hurricanes, among others. These models help in the development of reliable disaster management plans and safer design for the built environment.

Supervised machine learning

Logistic regression is a supervised machine learning algorithm widely used for binary classification tasks, such as identifying whether an email is spam or not and diagnosing diseases by assessing the presence or absence of specific conditions based on patient test results. This approach utilizes the logistic (or sigmoid) function to transform a linear combination of input features into a probability value ranging between 0 and 1. This probability indicates the likelihood that a given input corresponds to one of two predefined categories. The essential mechanism of logistic regression is grounded in the logistic function's ability to model the probability of binary outcomes accurately. With its distinctive S-shaped curve, the logistic function effectively maps any real-valued number to a value within the 0 to 1 interval. This feature renders it particularly suitable for binary classification tasks, such as sorting emails into "spam" or "not spam". By calculating the probability that the dependent variable will be categorized into a specific group, logistic regression provides a probabilistic framework that supports informed decision-making.

Example

Problem

As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question:

<blockquote>

A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?

</blockquote>

The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not cardinal numbers. If the problem were changed so that pass/fail was replaced with the grade 0–100 (cardinal numbers), then simple regression analysis could be used.

The table shows the number of hours each student spent studying, and whether they passed (1) or failed (0).

{| class="wikitable"

|-

! Hours (x<sub>k</sub>)

| 0.50|| 0.75|| 1.00|| 1.25|| 1.50|| 1.75|| 1.75|| 2.00|| 2.25|| 2.50|| 2.75|| 3.00|| 3.25|| 3.50|| 4.00|| 4.25|| 4.50|| 4.75|| 5.00 || 5.50

|-

! Pass (y<sub>k</sub>)

| 0|| 0|| 0|| 0|| 0|| 0|| 1|| 0|| 1|| 0|| 1|| 0|| 1|| 0|| 1|| 1|| 1|| 1|| 1|| 1

|}

We wish to fit a logistic function to the data consisting of the hours studied (x<sub>k</sub>) and the outcome of the test (y<sub>k</sub>&nbsp;=1 for pass, 0 for fail). The data points are indexed by the subscript k which runs from <math>k=1</math> to <math>k=K=20</math>. The x variable is called the "explanatory variable", and the y variable is called the "categorical variable" consisting of two categories: "pass" or "fail" corresponding to the categorical values 1 and 0 respectively.

Model

thumb|400px|Graph of a logistic regression curve fitted to the (x<sub>k</sub>,y<sub>k</sub>) data. The curve shows the probability of passing an exam versus hours studying.

The logistic function is of the form:

:<math>p(x)=\frac{1}{1+e^{-(x-\mu)/s}}</math>

where μ is a location parameter (the midpoint of the curve, where <math>p(\mu)=1/2</math>) and s is a scale parameter. This expression may be rewritten as:

:<math>p(x)=\frac{1}{1+e^{-(\beta_0+\beta_1 x)}}</math>

where <math>\beta_0 = -\mu/s</math> and is known as the intercept (it is the vertical intercept or y-intercept of the line <math>y = \beta_0+\beta_1 x</math>), and <math>\beta_1= 1/s</math> (inverse scale parameter or rate parameter): these are the y-intercept and slope of the log-odds as a function of x. Conversely, <math>\mu=-\beta_0/\beta_1</math> and <math>s=1/\beta_1</math>.

Note that this model is actually an oversimplification, as it implies that every student will pass if they study indefinitely (limit = 1).
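The equivalence of the two parameterizations can be checked numerically. The following is a minimal Python sketch; the function names (<code>logistic_mu_s</code>, <code>logistic_beta</code>, <code>to_beta</code>, <code>to_mu_s</code>) and the sample values of μ and s are illustrative assumptions, not part of any library or of the example data.

```python
import math

def logistic_mu_s(x, mu, s):
    """Logistic function with location mu (midpoint) and scale s."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def logistic_beta(x, beta0, beta1):
    """The same curve written with intercept beta0 and slope beta1."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def to_beta(mu, s):
    """Convert (mu, s) to (beta0, beta1) = (-mu/s, 1/s)."""
    return -mu / s, 1.0 / s

def to_mu_s(beta0, beta1):
    """Convert (beta0, beta1) back to (mu, s) = (-beta0/beta1, 1/beta1)."""
    return -beta0 / beta1, 1.0 / beta1

mu, s = 2.0, 0.5                   # arbitrary illustrative values
beta0, beta1 = to_beta(mu, s)      # (-4.0, 2.0)
print(logistic_mu_s(mu, mu, s))    # 0.5 at the midpoint x = mu
# Both forms describe the same curve:
print(abs(logistic_mu_s(3.0, mu, s) - logistic_beta(3.0, beta0, beta1)) < 1e-12)
```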

Fit

The usual measure of goodness of fit for a logistic regression uses logistic loss (or log loss), the negative log-likelihood. For a given x<sub>k</sub> and y<sub>k</sub>, write <math>p_k=p(x_k)</math>. The <math>p_k</math> are the probabilities that the corresponding <math>y_k</math> will equal one, and the <math>1-p_k</math> are the probabilities that they will be zero (see Bernoulli distribution). We wish to find the values of <math>\beta_0</math> and <math>\beta_1</math> which give the "best fit" to the data. In linear regression, by comparison, the sum of the squared deviations of the fit from the data points (y<sub>k</sub>), the squared error loss, is taken as a measure of the goodness of fit, and the best fit is obtained when this loss is minimized.

The log loss for the k-th point is:

:<math>\ell_k = \begin{cases}

-\ln p_k & \text{ if } y_k = 1, \\

-\ln (1 - p_k) & \text{ if } y_k = 0.

\end{cases}</math>

The log loss can be interpreted as the "surprisal" of the actual outcome <math>y_k</math> relative to the prediction <math>p_k</math>, and is a measure of information content. Log loss is always greater than or equal to 0, equals 0 only in case of a perfect prediction (i.e., when <math>p_k = 1</math> and <math>y_k = 1</math>, or <math>p_k = 0</math> and <math>y_k = 0</math>), and approaches infinity as the prediction gets worse (i.e., when <math>y_k = 1</math> and <math>p_k \to 0</math> or <math>y_k = 0</math> and <math>p_k \to 1</math>), meaning the actual outcome is "more surprising". Since the value of the logistic function is always strictly between zero and one, the log loss is always greater than zero and less than infinity. Unlike in a linear regression, where the model can have zero loss at a point by passing through a data point (and zero loss overall if all points are on a line), in a logistic regression it is not possible to have zero loss at any point, since <math>y_k</math> is either 0 or 1, but <math>0 < p_k < 1</math>.

These can be combined into a single expression:

:<math>\ell_k = -y_k\ln p_k - (1 - y_k)\ln (1 - p_k).</math>

This expression is more formally known as the cross-entropy of the predicted distribution <math>\big(p_k, (1-p_k)\big)</math> from the actual distribution <math>\big(y_k, (1-y_k)\big)</math>, as probability distributions on the two-element space of (pass, fail).

The sum of these, the total loss, is the overall negative log-likelihood <math>-\ell</math>, and the best fit is obtained for those choices of <math>\beta_0</math> and <math>\beta_1</math> for which <math>-\ell</math> is minimized.

Alternatively, instead of minimizing the loss, one can maximize its negative, the (positive) log-likelihood:

:<math>\ell = \sum_{k:y_k=1}\ln(p_k) + \sum_{k:y_k=0}\ln(1-p_k) = \sum_{k=1}^K \left(\,y_k \ln(p_k)+(1-y_k)\ln(1-p_k)\right)</math>

or equivalently maximize the likelihood function itself, which is the probability that the given data set is produced by a particular logistic function:

:<math>L = \prod_{k:y_k=1}p_k\,\prod_{k:y_k=0}(1-p_k)</math>

This method is known as maximum likelihood estimation.
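As an illustration, the log loss and the log-likelihood can be evaluated directly on the study-hours data from the table above. This is a sketch in plain Python; the names <code>HOURS</code>, <code>PASSED</code>, <code>predict</code>, <code>log_loss</code> and <code>log_likelihood</code> are hypothetical, and the coefficient pairs passed in are simply trial values.

```python
import math

# Data from the table above: hours studied and pass (1) / fail (0).
HOURS = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
PASSED = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def predict(x, beta0, beta1):
    """Probability p(x) that y = 1 under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

def log_loss(y, p):
    """Log loss of one point: -ln p if y = 1, -ln(1 - p) if y = 0."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def log_likelihood(beta0, beta1):
    """Log-likelihood: the negative of the total log loss."""
    return -sum(log_loss(y, predict(x, beta0, beta1))
                for x, y in zip(HOURS, PASSED))

print(log_likelihood(0.0, 0.0))   # every p_k = 0.5, so this is 20 * ln(0.5)
print(log_likelihood(-4.1, 1.5))  # a better fit gives a larger (less negative) value
```

Comparing the two printed values shows why the likelihood is maximized rather than minimized: a curve that tracks the data assigns higher probability to the observed outcomes.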

Parameter estimation

Since ℓ is nonlinear in <math>\beta_0</math> and <math>\beta_1</math>, determining their optimum values will require numerical methods. One method of maximizing ℓ is to require the derivatives of ℓ with respect to <math>\beta_0</math> and <math>\beta_1</math> to be zero:

:<math>0 = \frac{\partial \ell}{\partial \beta_0} = \sum_{k=1}^K(y_k-p_k)</math>

:<math>0 = \frac{\partial \ell}{\partial \beta_1} = \sum_{k=1}^K(y_k-p_k)x_k</math>

and the maximization procedure can be accomplished by solving the above two equations for <math>\beta_0</math> and <math>\beta_1</math>, which, again, will generally require the use of numerical methods.

The values of <math>\beta_0</math> and <math>\beta_1</math> which maximize ℓ and L using the above data are found to be:

:<math>\beta_0 \approx -4.1</math>

:<math>\beta_1 \approx 1.5</math>

which yield values for μ and s of:

:<math>\mu = -\beta_0/\beta_1 \approx 2.7</math>

:<math>s = 1/\beta_1 \approx 0.67</math>
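One common numerical approach to the two derivative equations is Newton's method. The text does not say which method produced the quoted values, so the choice here is an assumption, and the function name <code>fit_logistic</code> is hypothetical. The sketch below uses only the Python standard library and the data from the example table.

```python
import math

HOURS = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
PASSED = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def fit_logistic(xs, ys, iterations=25):
    """Maximize the log-likelihood for one explanatory variable
    with Newton's method, starting from beta0 = beta1 = 0."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
        # Gradient: the two derivative expressions from the text.
        g0 = sum(y - p for y, p in zip(ys, ps))
        g1 = sum((y - p) * x for x, y, p in zip(xs, ys, ps))
        # Negative Hessian, built from the weights p_k (1 - p_k).
        ws = [p * (1.0 - p) for p in ps]
        h00 = sum(ws)
        h01 = sum(w * x for w, x in zip(ws, xs))
        h11 = sum(w * x * x for w, x in zip(ws, xs))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 linear system H * delta = g.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

beta0, beta1 = fit_logistic(HOURS, PASSED)
print(round(beta0, 1), round(beta1, 1))  # -4.1 1.5
print(round(-beta0 / beta1, 1))          # 2.7 (the midpoint mu)
```

A fixed iteration count keeps the sketch short; a practical implementation would normally stop once the step size falls below a tolerance.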

Predictions

The <math>\beta_0</math> and <math>\beta_1</math> coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam.

For example, for a student who studies 2 hours, entering the value <math>x = 2</math> into the equation gives the estimated probability of passing the exam of 0.25:

: <math>

t = \beta_0+2\beta_1 \approx - 4.1 + 2 \cdot 1.5 = -1.1

</math>

: <math>

p = \frac{1}{1 + e^{-t} } \approx 0.25 = \text{Probability of passing exam}

</math>

Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is 0.87:

: <math>t = \beta_0+4\beta_1 \approx - 4.1 + 4 \cdot 1.5 = 1.9</math>

: <math>p = \frac{1}{1 + e^{-t} } \approx 0.87 = \text{Probability of passing exam} </math>
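The two worked calculations above can be reproduced in a few lines of Python. This sketch uses the rounded coefficients −4.1 and 1.5 quoted in the text; the function name <code>pass_probability</code> is hypothetical.

```python
import math

BETA0, BETA1 = -4.1, 1.5  # rounded estimates quoted in the text

def pass_probability(hours):
    """Estimated probability of passing after the given hours of study."""
    t = BETA0 + BETA1 * hours          # log-odds
    return 1.0 / (1.0 + math.exp(-t))  # logistic function of the log-odds

for hours in (2, 4):
    t = BETA0 + BETA1 * hours
    print(hours, round(t, 1), round(pass_probability(hours), 2))
# 2 -1.1 0.25
# 4 1.9 0.87
```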

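This table shows the estimated probability of passing the exam for several values of hours studying. Its entries are computed from the unrounded coefficient estimates, so they can differ in the last digit from calculations that use the rounded values (for example, 0.26 rather than 0.25 at 2 hours).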

{| class="wikitable"

|-

! rowspan="2" | Hours<br />of study<br />(x)

! colspan="3" | Passing exam

|-

! Log-odds (t) !! Odds (e<sup>t</sup>) !! Probability (p)

|- style="text-align: right;"

| 1|| −2.57 || 0.076 ≈ 1:13.1 || 0.07

|- style="text-align: right;"

| 2|| −1.07 || 0.34 ≈ 1:2.91 || 0.26

|- style="text-align: right;"

| <math>\mu \approx 2.7</math> || 0 || 1 || 1/2 = 0.50

|- style="text-align: right;"

| 3|| 0.44 || 1.55 || 0.61

|- style="text-align: right;"

| 4|| 1.94 || 6.96 || 0.87

|- style="text-align: right;"

| 5|| 3.45 || 31.4 || 0.97

|}

Model evaluation

The logistic regression analysis gives the following output.

{| class="wikitable"

|-

! !! Coefficient!! Std. Error !! z-value !! p-value (Wald)

|- style="text-align:right;"

! Intercept (β<sub>0</sub>)

| −4.1 || 1.8 || −2.3 || 0.021

|- style="text-align:right;"

! Hours (β<sub>1</sub>)

| 1.5 || 0.6 || 2.4 || 0.017

|}

By the Wald test, the output indicates that hours studying is significantly associated with the probability of passing the exam (<math>p = 0.017</math>). Rather than the Wald method, the recommended method to calculate the p-value for logistic regression is the likelihood-ratio test (LRT), which for these data gives <math>p \approx 0.00064</math> (see below).
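The likelihood-ratio p-value can be approximated with a short calculation. The sketch below assumes the usual form of the test, which this section does not spell out: the deviance D = 2(ℓ<sub>fit</sub> − ℓ<sub>null</sub>) is compared with a chi-squared distribution with one degree of freedom, where the null model contains an intercept only. All names are hypothetical, and the rounded coefficients −4.1 and 1.5 are used, so the result is approximate.

```python
import math

HOURS = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
         2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
PASSED = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]

def log_likelihood(beta0, beta1):
    """Log-likelihood of the example data for the given coefficients."""
    total = 0.0
    for x, y in zip(HOURS, PASSED):
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def lrt_p_value(ll_fit, ll_null):
    """p-value of the deviance under a chi-squared law with 1 degree of freedom."""
    deviance = 2.0 * (ll_fit - ll_null)
    # For 1 degree of freedom, P(chi2 > D) = erfc(sqrt(D / 2)).
    return math.erfc(math.sqrt(deviance / 2.0))

# Null model: intercept only. Half the students passed, so p = 0.5 for all,
# which corresponds to beta0 = 0 and beta1 = 0.
ll_null = log_likelihood(0.0, 0.0)
ll_fit = log_likelihood(-4.1, 1.5)   # rounded estimates from the text
print(lrt_p_value(ll_fit, ll_null))  # roughly 0.00064
```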

Generalizations

This simple model is an example of binary logistic regression, and has one explanatory variable and a binary categorical variable which can assume one of two categorical values. Multinomial logistic regression is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories.

Background

thumb|320px|right|Figure 1. The standard logistic function <math>\sigma (t)</math>; <math>\sigma (t) \in (0,1)</math> for all <math>t</math>.

Definition of the logistic function

An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input <math>t</math>, and outputs a value between zero and one.

For a binary independent variable the odds ratio is defined as <math>\frac{ad}{bc}</math> where a, b, c and d are cells in a 2×2 contingency table.

Multiple explanatory variables

If there are multiple explanatory variables, the above expression <math>\beta_0+\beta_1x</math> can be revised to <math>\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m = \beta_0+ \sum_{i=1}^m \beta_ix_i</math>. Then when this is used in the equation relating the log odds of a success to the values of the predictors, the linear regression will be a multiple regression with m explanators; the parameters <math>\beta_i</math> for all <math>i = 0, 1, 2, \dots, m</math> are all estimated.

Again, the more traditional equations are:

:<math>\log_b \frac{p}{1-p} = \beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_mx_m</math>

and

:<math>p = \frac{1}{1+b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_mx_m )}}</math>

where usually <math>b=e</math>.

Definition

A dataset contains N points. Each point i consists of a set of m input variables x<sub>1,i</sub> ... x<sub>m,i</sub> (also called independent variables, explanatory variables, predictor variables, features, or attributes), and a binary outcome variable Y<sub>i</sub> (also known as a dependent variable, response variable, output variable, or class), i.e. it can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success"). The goal of logistic regression is to use the dataset to create a predictive model of the outcome variable.

As in linear regression, the outcome variables Y<sub>i</sub> are assumed to depend on the explanatory variables x<sub>1,i</sub> ... x<sub>m,i</sub>.

; Explanatory variables

The explanatory variables may be of any type: real-valued, binary, categorical, etc. The main distinction is between continuous variables and discrete variables.

(Discrete variables referring to more than two possible choices are typically coded using dummy variables (or indicator variables), that is, separate explanatory variables taking the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning "variable does have the given value" and a 0 meaning "variable does not have that value".)

;Outcome variables

Formally, the outcomes Y<sub>i</sub> are described as being Bernoulli-distributed data, where each outcome is determined by an unobserved probability p<sub>i</sub> that is specific to the outcome at hand, but related to the explanatory variables. This can be expressed in any of the following equivalent forms:

::<math>

\begin{align}

Y_i\mid x_{1,i},\ldots,x_{m,i} \ & \sim \operatorname{Bernoulli}(p_i) \\[5pt]

\operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}] &= p_i \\[5pt]

\Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &=

\begin{cases}

p_i & \text{if }y=1 \\

1-p_i & \text{if }y=0

\end{cases}

\\[5pt]

\Pr(Y_i=y\mid x_{1,i},\ldots,x_{m,i}) &= p_i^y (1-p_i)^{(1-y)}

\end{align}

</math>

The meanings of these four lines are:

  1. The first line expresses the probability distribution of each Y<sub>i</sub> : conditioned on the explanatory variables, it follows a Bernoulli distribution with parameter p<sub>i</sub>, the probability of the outcome of 1 for trial i. As noted above, each separate trial has its own probability of success, just as each trial has its own explanatory variables. The probability of success p<sub>i</sub> is not observed, only the outcome of an individual Bernoulli trial using that probability.
  2. The second line expresses the fact that the expected value of each Y<sub>i</sub> is equal to the probability of success p<sub>i</sub>, which is a general property of the Bernoulli distribution. In other words, if we run a large number of Bernoulli trials using the same probability of success p<sub>i</sub>, then take the average of all the 1 and 0 outcomes, then the result would be close to p<sub>i</sub>. This is because doing an average this way simply computes the proportion of successes seen, which we expect to converge to the underlying probability of success.
  3. The third line writes out the probability mass function of the Bernoulli distribution, specifying the probability of seeing each of the two possible outcomes.
  4. The fourth line is another way of writing the probability mass function, which avoids having to write separate cases and is more convenient for certain types of calculations. This relies on the fact that Y<sub>i</sub> can take only the value 0 or 1. In each case, one of the exponents will be 1, "choosing" the value under it, while the other is 0, "canceling out" the value under it. Hence, the outcome is either p<sub>i</sub> or 1&nbsp;−&nbsp;p<sub>i</sub>, as in the previous line.

; Linear predictor function

The basic idea of logistic regression is to use the mechanism already developed for linear regression by modeling the probability p<sub>i</sub> using a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all trials. The linear predictor function <math>f(i)</math> for a particular data point i is written as:

:<math>f(i) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i},</math>

where <math>\beta_0, \ldots, \beta_m</math> are regression coefficients indicating the relative effect of a particular explanatory variable on the outcome.

The model is usually put into a more compact form as follows:

  • The regression coefficients β<sub>0</sub>, β<sub>1</sub>, ..., β<sub>m</sub> are grouped into a single vector β of size m&nbsp;+&nbsp;1.
  • For each data point i, an additional explanatory pseudo-variable x<sub>0,i</sub> is added, with a fixed value of 1, corresponding to the intercept coefficient β<sub>0</sub>.
  • The resulting explanatory variables x<sub>0,i</sub>, x<sub>1,i</sub>, ..., x<sub>m,i</sub> are then grouped into a single vector X<sub>i</sub> of size m&nbsp;+&nbsp;1.

This makes it possible to write the linear predictor function as follows:

:<math>f(i)= \boldsymbol\beta \cdot \mathbf{X}_i,</math>

using the notation for a dot product between two vectors.
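The compact form can be sketched in Python as follows. The function name <code>linear_predictor</code> and the coefficient values are illustrative assumptions; the point is only that prepending the pseudo-variable x<sub>0</sub> = 1 turns the intercept into an ordinary term of the dot product.

```python
def linear_predictor(beta, x):
    """Return beta . X, where X is x with the pseudo-variable x0 = 1 prepended."""
    X = [1.0] + list(x)
    if len(beta) != len(X):
        raise ValueError("beta must have one more entry than x")
    return sum(b * v for b, v in zip(beta, X))

beta = [0.5, 2.0, -1.0]   # illustrative values: beta0, beta1, beta2
x = [3.0, 4.0]            # explanatory variables x1, x2 for one data point
print(linear_predictor(beta, x))  # 0.5 + 2.0*3.0 - 1.0*4.0 = 2.5
```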

thumb|356x356px|This is an example of an SPSS output for a logistic regression model using three explanatory variables (coffee use per week, energy drink use per week, and soda use per week) and two categories (male and female).

Many explanatory variables, two categories

The above example of binary logistic regression on one explanatory variable can be generalized to any number of explanatory variables x<sub>1</sub>, x<sub>2</sub>, ... and, in the multinomial case, to any number of categorical values <math>y=0,1,2,\dots</math>.

To begin with, we may consider a logistic model with M explanatory variables, x<sub>1</sub>, x<sub>2</sub> ... x<sub>M</sub> and, as in the example above, two categorical values (y = 0 and 1). For the simple binary logistic regression model, we assumed a linear relationship between the predictor variable and the log-odds (also called logit) of the event that <math>y=1</math>. This linear relationship may be extended to the case of M explanatory variables:

:<math>t = \log_b \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2+ \cdots +\beta_M x_M </math>

where t is the log-odds and <math>\beta_i</math> are parameters of the model. An additional generalization has been introduced in which the base of the model (b) is not restricted to Euler's number e. In most applications, the base <math>b</math> of the logarithm is taken to be e. However, in some cases it can be easier to communicate results by working in base 2 or base 10.

For a more compact notation, we will specify the explanatory variables and the β coefficients as <math>(M+1)</math>-dimensional vectors:

:<math>\boldsymbol{x}=\{x_0,x_1,x_2,\dots,x_M\}</math>

:<math>\boldsymbol{\beta}=\{\beta_0,\beta_1,\beta_2,\dots,\beta_M\}</math>

with an added explanatory variable x<sub>0</sub> =1. The logit may now be written as:

:<math>t =\sum_{m=0}^{M} \beta_m x_m = \boldsymbol{\beta} \cdot \boldsymbol{x}</math>

Solving for the probability p that <math>y=1</math> yields:

:<math>p(\boldsymbol{x}) = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}= \frac{1}{1+b^{-\boldsymbol{\beta} \cdot \boldsymbol{x}}}=S_b(t)</math>,

where <math>S_b</math> is the sigmoid function with base <math>b</math>. The above formula shows that once the <math>\beta_m</math> are fixed, we can easily compute either the log-odds that <math>y=1</math> for a given observation, or the probability that <math>y=1</math> for a given observation. The main use-case of a logistic model is to be given an observation <math>\boldsymbol{x}</math>, and estimate the probability <math>p(\boldsymbol{x})</math> that <math>y=1</math>. The optimum beta coefficients may again be found by maximizing the log-likelihood. For K measurements, defining <math>\boldsymbol{x}_k</math> as the explanatory vector of the k-th measurement, and <math>y_k</math> as the categorical outcome of that measurement, the log likelihood may be written in a form very similar to the simple <math>M=1</math> case above:

:<math>\ell = \sum_{k=1}^K y_k \log_b(p(\boldsymbol{x}_k))+\sum_{k=1}^K (1-y_k) \log_b(1-p(\boldsymbol{x}_k))</math>

As in the simple example above, finding the optimum β parameters will require numerical methods. One useful technique is to equate the derivatives of the log likelihood with respect to each of the β parameters to zero yielding a set of equations which will hold at the maximum of the log likelihood:

:<math>\frac{\partial \ell}{\partial \beta_m} = 0 = \sum_{k=1}^K y_k x_{mk} - \sum_{k=1}^K p(\boldsymbol{x}_k)x_{mk}</math>

where x<sub>mk</sub> is the value of the x<sub>m</sub> explanatory variable from the k-th measurement.

Consider an example with <math>M=2</math> explanatory variables, <math>b=10</math>, and coefficients <math>\beta_0=-3</math>, <math>\beta_1=1</math>, and <math>\beta_2=2</math> which have been determined by the above method. To be concrete, the model is:

:<math>t=\log_{10}\frac{p}{1 - p} = -3 + x_1 + 2 x_2</math>

:<math>p = \frac{b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}}{1+b^{\boldsymbol{\beta} \cdot \boldsymbol{x}}} = \frac{b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}}{1+b^{\beta_0 + \beta_1 x_1 + \beta_2 x_2}} = \frac{1}{1 + b^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}</math>,

where p is the probability of the event that <math>y=1</math>. This can be interpreted as follows:

  • <math>\beta_0 = -3</math> is the y-intercept. It is the log-odds of the event that <math>y=1</math>, when the predictors <math>x_1=x_2=0</math>. By exponentiating, we can see that when <math>x_1=x_2=0</math> the odds of the event that <math>y=1</math> are 1-to-1000, or <math>10^{-3}</math>. Similarly, the probability of the event that <math>y=1</math> when <math>x_1=x_2=0</math> can be computed as <math> 1/(1000 + 1) = 1/1001.</math>
  • <math>\beta_1 = 1</math> means that increasing <math>x_1</math> by 1 increases the log-odds by <math>1</math>. So if <math>x_1</math> increases by 1, the odds that <math>y=1</math> increase by a factor of <math>10^1</math>. The probability of <math>y=1</math> has also increased, but it has not increased by as much as the odds have increased.
  • <math>\beta_2 = 2</math> means that increasing <math>x_2</math> by 1 increases the log-odds by <math>2</math>. So if <math>x_2</math> increases by 1, the odds that <math>y=1</math> increase by a factor of <math>10^2.</math> Note how the effect of <math>x_2</math> on the log-odds is twice as great as the effect of <math>x_1</math>, but the effect on the odds is 10 times greater. But the effect on the probability of <math>y=1</math> is not as much as 10 times greater, it's only the effect on the odds that is 10 times greater.
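The interpretation of the three coefficients can be checked numerically. The following Python sketch hard-codes the base and coefficients of this example; the names <code>log_odds</code>, <code>odds</code> and <code>probability</code> are hypothetical.

```python
B = 10.0                  # base of the logarithm in this example
BETA = (-3.0, 1.0, 2.0)   # beta0, beta1, beta2 from the text

def log_odds(x1, x2):
    """Log-odds (base 10) that y = 1."""
    return BETA[0] + BETA[1] * x1 + BETA[2] * x2

def odds(x1, x2):
    """Odds that y = 1, obtained by exponentiating the log-odds."""
    return B ** log_odds(x1, x2)

def probability(x1, x2):
    """Probability that y = 1, computed from the odds."""
    o = odds(x1, x2)
    return o / (1.0 + o)

print(odds(0, 0), probability(0, 0))          # odds of 1-to-1000, probability 1/1001
print(odds(1, 0) / odds(0, 0))                # about 10: effect of one unit of x1
print(odds(0, 1) / odds(0, 0))                # about 100: effect of one unit of x2
print(probability(1, 0) / probability(0, 0))  # less than 10
```

The last line illustrates the remark that the probability grows by a smaller factor than the odds.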

Multinomial logistic regression: Many explanatory variables and many categories

In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probabilities: The probability that the outcome was in category 1 was given by <math>p(\boldsymbol{x})</math> and the probability that the outcome was in category 0 was given by <math>1-p(\boldsymbol{x})</math>. The sum of these probabilities equals 1, which must be true, since "0" and "1" are the only possible categories in this setup.

In general, if we have <math>M+1</math> explanatory variables (including x<sub>0</sub>) and <math>N+1</math> categories, we will need <math>N+1</math> separate probabilities, one for each category, indexed by n, which describe the probability that the categorical outcome y will be in category y=n, conditional on the vector of covariates x. The sum of these probabilities over all categories must equal 1. Using the mathematically convenient base e, these probabilities are:

:<math>p_n(\boldsymbol{x}) = \frac{e^{\boldsymbol{\beta}_n\cdot \boldsymbol{x}}}{1+\sum_{u=1}^N e^{\boldsymbol{\beta}_u\cdot \boldsymbol{x}}}</math> for <math>n=1,2,\dots,N</math>

:<math>p_0(\boldsymbol{x}) = 1-\sum_{n=1}^N p_n(\boldsymbol{x})=\frac{1}{1+\sum_{u=1}^N e^{\boldsymbol{\beta}_u\cdot \boldsymbol{x}}}</math>

Each of the probabilities except <math>p_0(\boldsymbol{x})</math> will have its own set of regression coefficients <math>\boldsymbol{\beta}_n</math>. It can be seen that, as required, the sum of the <math>p_n(\boldsymbol{x})</math> over all categories n is 1. The selection of <math>p_0(\boldsymbol{x})</math> to be defined in terms of the other probabilities is artificial. Any of the probabilities could have been selected to be so defined. This special value of n is termed the "pivot index", and the log-odds (t<sub>n</sub>) are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables:

:<math>t_n = \ln\left(\frac{p_n(\boldsymbol{x})}{p_0(\boldsymbol{x})}\right) = \boldsymbol{\beta}_n \cdot \boldsymbol{x}</math>

Note also that for the simple case of <math>N=1</math>, the two-category case is recovered, with <math>p(\boldsymbol{x})=p_1(\boldsymbol{x})</math> and <math>p_0(\boldsymbol{x})=1-p_1(\boldsymbol{x})</math>.

The log-likelihood that a particular set of K measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by k, let the k-th set of measured explanatory variables be denoted by <math>\boldsymbol{x}_k</math> and their categorical outcomes be denoted by <math>y_k</math> which can be equal to any integer in [0,N]. The log-likelihood is then:

:<math>\ell = \sum_{k=1}^K \sum_{n=0}^N \Delta(n,y_k)\,\ln(p_n(\boldsymbol{x}_k))</math>

where <math>\Delta(n,y_k)</math> is an indicator function which equals 1 if y<sub>k</sub> = n and zero otherwise. In the case of two categories, this indicator function was defined as y<sub>k</sub> when n = 1 and 1-y<sub>k</sub> when n = 0. This was convenient, but not necessary. Again, the optimum beta coefficients may be found by maximizing the log-likelihood function generally using numerical methods. A possible method of solution is to set the derivatives of the log-likelihood with respect to each beta coefficient equal to zero and solve for the beta coefficients:

:<math>\frac{\partial \ell}{\partial \beta_{nm}} = 0 = \sum_{k=1}^K \Delta(n,y_k)x_{mk} - \sum_{k=1}^K p_n(\boldsymbol{x}_k)x_{mk}</math>

where <math>\beta_{nm}</math> is the m-th coefficient of the <math>\boldsymbol{\beta}_n</math> vector and <math>x_{mk}</math> is the m-th explanatory variable of the k-th measurement. Once the beta coefficients have been estimated from the data, we will be able to estimate the probability that any subsequent set of explanatory variables will result in any of the possible outcome categories.
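Once coefficient vectors are available, the category probabilities with category 0 as the pivot can be computed as in the following Python sketch. The function name <code>multinomial_probabilities</code> and the coefficient values are illustrative assumptions, not estimates from any data set.

```python
import math

def multinomial_probabilities(betas, x):
    """Return [p_0, p_1, ..., p_N] for coefficient vectors beta_1 .. beta_N.

    x must include the pseudo-variable x0 = 1 as its first entry;
    category 0 is the pivot and has no coefficient vector of its own.
    """
    scores = [math.exp(sum(b * v for b, v in zip(beta, x))) for beta in betas]
    denom = 1.0 + sum(scores)
    return [1.0 / denom] + [s / denom for s in scores]

# Illustrative coefficients for N = 2 (three categories) and M = 1.
betas = [[0.5, -1.0], [-0.2, 0.3]]
probs = multinomial_probabilities(betas, [1.0, 2.0])
print(probs, sum(probs))  # the probabilities sum to 1 (up to rounding)

# With N = 1 the two-category model is recovered.
print(multinomial_probabilities([[-4.1, 1.5]], [1.0, 2.0])[1])  # about 0.25
```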

Interpretations

There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models, and allow different generalizations.

As a generalized linear model

The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function:

:<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid x_{1,i},\ldots,x_{m,i}]) = \operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}</math>

Written using the more compact notation described above, this is:

:<math>\operatorname{logit}(\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i]) = \operatorname{logit}(p_i)=\ln\left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i</math>

This formulation expresses logistic regression as a type of generalized linear model, which predicts variables with various types of probability distributions by fitting a linear predictor function of the above form to some sort of arbitrary transformation of the expected value of the variable.

The intuition for transforming using the logit function (the natural log of the odds) was explained above. It also has the practical effect of converting the probability (which is bounded to be between 0 and 1) to a variable that ranges over <math>(-\infty,+\infty)</math> — thereby matching the potential range of the linear prediction function on the right side of the equation.

Both the probabilities p<sub>i</sub> and the regression coefficients are unobserved, and the means of determining them is not part of the model itself. They are typically determined by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds values that best fit the observed data (i.e. that give the most accurate predictions for the data already observed), usually subject to regularization conditions that seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients. The use of a regularization condition is equivalent to doing maximum a posteriori (MAP) estimation, an extension of maximum likelihood. (Regularization is most commonly done using a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the coefficients, but other regularizers are also possible.) Whether or not regularization is used, it is usually not possible to find a closed-form solution; instead, an iterative numerical method must be used, such as iteratively reweighted least squares (IRLS) or, more commonly these days, a quasi-Newton method such as the L-BFGS method.

The interpretation of the β<sub>j</sub> parameter estimates is as the additive effect on the log of the odds for a unit change in the j-th explanatory variable. In the case of a dichotomous explanatory variable, for instance gender, <math>e^\beta</math> is the estimate of the odds ratio of having the outcome for, say, males compared with females.

An equivalent formula uses the inverse of the logit function, which is the logistic function, i.e.:

:<math>\operatorname{\mathbb E}[Y_i\mid \mathbf{X}_i] = p_i = \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) = \frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}</math>

The formula can also be written as a probability distribution (specifically, using a probability mass function):

: <math>\Pr(Y_i=y\mid \mathbf{X}_i) = {p_i}^y(1-p_i)^{1-y} =\left(\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{y} \left(1-\frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i}}{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{1-y} = \frac{e^{\boldsymbol\beta \cdot \mathbf{X}_i \cdot y} }{1+e^{\boldsymbol\beta \cdot \mathbf{X}_i}}</math>
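The two formulas above can be checked numerically. The following is a minimal sketch in Python; the function names and the coefficient values are illustrative assumptions, not taken from any particular library or data set.

```python
import math

def logistic(t):
    """Inverse of the logit function: maps a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def bernoulli_pmf(y, beta, x):
    """Pr(Y = y | x) for y in {0, 1}, written as p^y (1 - p)^(1 - y)."""
    t = sum(b * xi for b, xi in zip(beta, x))  # linear predictor beta . x
    p = logistic(t)
    return p ** y * (1.0 - p) ** (1 - y)

def bernoulli_pmf_compact(y, beta, x):
    """The same probability in the compact form e^(t*y) / (1 + e^t)."""
    t = sum(b * xi for b, xi in zip(beta, x))
    return math.exp(t * y) / (1.0 + math.exp(t))

# Hypothetical coefficients and regressors (the leading 1.0 is the intercept term).
beta = [-1.5, 0.8]
x = [1.0, 2.0]
for y in (0, 1):
    print(y, bernoulli_pmf(y, beta, x), bernoulli_pmf_compact(y, beta, x))
```

Both forms of the probability mass function should agree for each value of y, and the two probabilities should sum to 1.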

As a latent-variable model

The logistic model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related probit model.

Imagine that, for each trial i, there is a continuous latent variable Y<sub>i</sub><sup>*</sup> (i.e. an unobserved random variable) that is distributed as follows:

: <math> Y_i^\ast = \boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i \, </math>

where

: <math>\varepsilon_i \sim \operatorname{Logistic}(0,1) \, </math>

i.e. the latent variable can be written directly in terms of the linear predictor function and an additive random error variable that is distributed according to a standard logistic distribution.

Then Y<sub>i</sub> can be viewed as an indicator for whether this latent variable is positive:

: <math> Y_i = \begin{cases} 1 & \text{if }Y_i^\ast > 0 \ \text{ i.e. } {- \varepsilon_i} < \boldsymbol\beta \cdot \mathbf{X}_i, \\

0 &\text{otherwise.} \end{cases} </math>

The choice of modeling the error variable specifically with a standard logistic distribution, rather than a general logistic distribution with the location and scale set to arbitrary values, seems restrictive, but in fact, it is not. It must be kept in mind that we can choose the regression coefficients ourselves, and very often can use them to offset changes in the parameters of the error variable's distribution. For example, a logistic error-variable distribution with a non-zero location parameter μ (which sets the mean) is equivalent to a distribution with a zero location parameter, where μ has been added to the intercept coefficient. Both situations produce the same value for Y<sub>i</sub><sup>*</sup> regardless of settings of explanatory variables. Similarly, an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then dividing all regression coefficients by s. In the latter case, the resulting value of Y<sub>i</sub><sup>*</sup> will be smaller by a factor of s than in the former case, for all sets of explanatory variables — but critically, it will always remain on the same side of 0, and hence lead to the same Y<sub>i</sub> choice.

(This predicts that the irrelevancy of the scale parameter may not carry over into more complex models where more than two choices are available.)

It turns out that this formulation is exactly equivalent to the preceding one, phrased in terms of the generalized linear model and without any latent variables. This can be shown as follows, using the fact that the cumulative distribution function (CDF) of the standard logistic distribution is the logistic function, which is the inverse of the logit function, i.e.

:<math>\Pr(\varepsilon_i < x) = \operatorname{logit}^{-1}(x)</math>

Then:

:<math>

\begin{align}

\Pr(Y_i=1\mid\mathbf{X}_i) &= \Pr(Y_i^\ast > 0\mid\mathbf{X}_i) \\[5pt]

&= \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon_i > 0) \\[5pt]

&= \Pr(\varepsilon_i > -\boldsymbol\beta \cdot \mathbf{X}_i) \\[5pt]

&= \Pr(\varepsilon_i < \boldsymbol\beta \cdot \mathbf{X}_i) & & \text{(because the logistic distribution is symmetric)} \\[5pt]

&= \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) & \\[5pt]

&= p_i & & \text{(see above)}

\end{align}

</math>

This formulation—which is standard in discrete choice models—makes clear the relationship between logistic regression (the "logit model") and the probit model, which uses an error variable distributed according to a standard normal distribution instead of a standard logistic distribution. Both the logistic and normal distributions are symmetric with a basic unimodal, "bell curve" shape. The only difference is that the logistic distribution has somewhat heavier tails, which means that it is less sensitive to outlying data (and hence somewhat more robust to model mis-specifications or erroneous data).
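The equivalence between the latent-variable formulation and the logistic function can also be illustrated by simulation. The sketch below draws standard logistic errors by inverting the CDF and compares the fraction of positive latent values with the logistic function of the linear predictor; the coefficients, regressors, sample size and seed are all hypothetical choices.

```python
import math
import random

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def sample_standard_logistic(rng):
    """Draw from Logistic(0, 1) by inverting its CDF (the logistic function)."""
    u = rng.random()
    while u == 0.0:          # avoid log(0); random() lies in [0, 1)
        u = rng.random()
    return math.log(u / (1.0 - u))

def simulate_latent(beta, x, n, seed=0):
    """Fraction of n trials in which Y* = beta . x + eps is positive."""
    rng = random.Random(seed)
    t = sum(b * xi for b, xi in zip(beta, x))
    hits = sum(1 for _ in range(n) if t + sample_standard_logistic(rng) > 0)
    return hits / n

beta = [-0.5, 1.2]   # hypothetical coefficients
x = [1.0, 1.0]       # hypothetical regressors; the first entry is the intercept term
t = sum(b * xi for b, xi in zip(beta, x))
estimate = simulate_latent(beta, x, n=200_000)
print(estimate, logistic(t))
```

With a large number of trials the simulated fraction should be close to the value of the logistic function, up to sampling noise.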

Two-way latent-variable model

Yet another formulation uses two separate latent variables:

: <math>

\begin{align}

Y_i^{0\ast} &= \boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \, \\

Y_i^{1\ast} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 \,

\end{align}

</math>

where

: <math>

\begin{align}

\varepsilon_0 & \sim \operatorname{EV}_1(0,1) \\

\varepsilon_1 & \sim \operatorname{EV}_1(0,1)

\end{align}

</math>

where EV<sub>1</sub>(0,1) is a standard type-1 extreme value distribution: i.e.

:<math>\Pr(\varepsilon_0=x) = \Pr(\varepsilon_1=x) = e^{-x} e^{-e^{-x}}</math>

Then

: <math> Y_i = \begin{cases} 1 & \text{if }Y_i^{1\ast} > Y_i^{0\ast}, \\

0 &\text{otherwise.} \end{cases} </math>

This model has a separate latent variable and a separate set of regression coefficients for each possible outcome of the dependent variable. The reason for this separation is that it makes it easy to extend logistic regression to multi-outcome categorical variables, as in the multinomial logit model. In such a model, it is natural to model each possible outcome using a different set of regression coefficients. It is also possible to motivate each of the separate latent variables as the theoretical utility associated with making the associated choice, and thus motivate logistic regression in terms of utility theory. (In terms of utility theory, a rational actor always chooses the choice with the greatest associated utility.) This is the approach taken by economists when formulating discrete choice models, because it both provides a theoretically strong foundation and facilitates intuitions about the model, which in turn makes it easy to consider various sorts of extensions. (See the example below.)

The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics work out, and it may be possible to justify its use through rational choice theory.

It turns out that this model is equivalent to the previous model, although this seems non-obvious, since there are now two sets of regression coefficients and error variables, and the error variables have a different distribution. In fact, this model reduces directly to the previous one with the following substitutions:

:<math>\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0</math>

:<math>\varepsilon = \varepsilon_1 - \varepsilon_0</math>

An intuition for this comes from the fact that, since we choose based on the maximum of two values, only their difference matters, not the exact values, and this effectively removes one degree of freedom. Another critical fact is that the difference of two type-1 extreme-value-distributed variables follows a logistic distribution, i.e. <math>\varepsilon = \varepsilon_1 - \varepsilon_0 \sim \operatorname{Logistic}(0,1) .</math> We can demonstrate the equivalence as follows:

:<math>\begin{align}

\Pr(Y_i=1\mid\mathbf{X}_i) = {} & \Pr \left (Y_i^{1\ast} > Y_i^{0\ast}\mid\mathbf{X}_i \right ) & \\[5pt]

= {} & \Pr \left (Y_i^{1\ast} - Y_i^{0\ast} > 0\mid\mathbf{X}_i \right ) & \\[5pt]

= {} & \Pr \left (\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - \left (\boldsymbol\beta_0 \cdot \mathbf{X}_i + \varepsilon_0 \right ) > 0 \right ) & \\[5pt]

= {} & \Pr \left ((\boldsymbol\beta_1 \cdot \mathbf{X}_i - \boldsymbol\beta_0 \cdot \mathbf{X}_i) + (\varepsilon_1 - \varepsilon_0) > 0 \right ) & \\[5pt]

= {} & \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + (\varepsilon_1 - \varepsilon_0) > 0) & \\[5pt]

= {} & \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_0) \cdot \mathbf{X}_i + \varepsilon > 0) & & \text{(substitute } \varepsilon\text{ as above)} \\[5pt]

= {} & \Pr(\boldsymbol\beta \cdot \mathbf{X}_i + \varepsilon > 0) & & \text{(substitute }\boldsymbol\beta\text{ as above)} \\[5pt]

= {} & \Pr(\varepsilon > -\boldsymbol\beta \cdot \mathbf{X}_i) & & \text{(now, same as above model)}\\[5pt]

= {} & \Pr(\varepsilon < \boldsymbol\beta \cdot \mathbf{X}_i) & \\[5pt]

= {} & \operatorname{logit}^{-1}(\boldsymbol\beta \cdot \mathbf{X}_i) \\[5pt]

= {} & p_i

\end{align}</math>
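The key step of the derivation above, that the difference of two type-1 extreme value errors behaves like a standard logistic error, can be illustrated by simulation. This is a minimal sketch; the two coefficient vectors, the regressors, the sample size and the seed are hypothetical.

```python
import math
import random

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def sample_gumbel(rng):
    """Draw from the standard type-1 extreme value distribution.

    Its CDF is exp(-exp(-x)), so inverting gives x = -log(-log(u)).
    """
    u = rng.random()
    while u == 0.0:          # avoid log(0); random() lies in [0, 1)
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_two_way(t0, t1, n, seed=0):
    """Fraction of n trials in which the latent value for outcome 1 is larger."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        y0 = t0 + sample_gumbel(rng)
        y1 = t1 + sample_gumbel(rng)
        if y1 > y0:
            hits += 1
    return hits / n

beta0 = [0.3, -0.4]  # hypothetical coefficients for outcome 0
beta1 = [0.1, 0.5]   # hypothetical coefficients for outcome 1
x = [1.0, 2.0]
t0 = sum(b * xi for b, xi in zip(beta0, x))
t1 = sum(b * xi for b, xi in zip(beta1, x))
estimate = simulate_two_way(t0, t1, n=200_000)
print(estimate, logistic(t1 - t0))
```

Only the difference of the two linear predictors enters the comparison, which mirrors the substitution of a single coefficient vector for the difference of the two.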

As a "log-linear" model

Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the multinomial logit.

Here, instead of writing the logit of the probabilities p<sub>i</sub> as a linear predictor, we separate the linear predictor into two, one for each of the two outcomes:

: <math>

\begin{align}

\ln \Pr(Y_i=0) &= \boldsymbol\beta_0 \cdot \mathbf{X}_i - \ln Z \\

\ln \Pr(Y_i=1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z

\end{align}

</math>

Two separate sets of regression coefficients have been introduced, just as in the two-way latent variable model, and the two equations appear in a form that writes the logarithm of the associated probability as a linear predictor, with an extra term <math>- \ln Z</math> at the end. This term, as it turns out, serves as the normalizing factor ensuring that the result is a distribution. This can be seen by exponentiating both sides:

: <math>

\begin{align}

\Pr(Y_i=0) &= \frac{1}{Z} e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} \\[5pt]

\Pr(Y_i=1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}

\end{align}

</math>

In this form it is clear that the purpose of Z is to ensure that the resulting distribution over Y<sub>i</sub> is in fact a probability distribution, i.e. it sums to 1. This means that Z is simply the sum of all un-normalized probabilities, and by dividing each probability by Z, the probabilities become "normalized". That is:

:<math> Z = e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}</math>

and the resulting equations are

:<math>

\begin{align}

\Pr(Y_i=0) &= \frac{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} \\[5pt]

\Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}.

\end{align}

</math>

Or generally:

:<math>\Pr(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_h e^{\boldsymbol\beta_h \cdot \mathbf{X}_i}}</math>

This shows clearly how to generalize this formulation to more than two outcomes, as in multinomial logit.

This general formulation is exactly the softmax function as in

:<math>\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_0 \cdot \mathbf{X}_i, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \dots) .</math>

To prove that this is equivalent to the previous model, we start by recognizing the above model is overspecified, in that <math>\Pr(Y_i=0)</math> and <math>\Pr(Y_i=1)</math> cannot be independently specified: rather <math>\Pr(Y_i=0) + \Pr(Y_i=1) = 1</math> so knowing one automatically determines the other. As a result, the model is nonidentifiable, in that multiple combinations of <math>\boldsymbol\beta_{0}</math> and <math>\boldsymbol\beta_{1}</math> will produce the same probabilities for all possible explanatory variables. In fact, it can be seen that adding any constant vector to both of them will produce the same probabilities:

:<math>

\begin{align}

\Pr(Y_i=1) &= \frac{e^{(\boldsymbol\beta_1 +\mathbf{C}) \cdot \mathbf{X}_i}}{e^{(\boldsymbol\beta_0 +\mathbf{C})\cdot \mathbf{X}_i} + e^{(\boldsymbol\beta_1 +\mathbf{C}) \cdot \mathbf{X}_i}} \\[5pt]

&= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} e^{\mathbf{C} \cdot \mathbf{X}_i}} \\[5pt]

&= \frac{e^{\mathbf{C} \cdot \mathbf{X}_i}e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\mathbf{C} \cdot \mathbf{X}_i}(e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i})} \\[5pt]

&= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}.

\end{align}

</math>

As a result, we can simplify matters, and restore identifiability, by picking an arbitrary value for one of the two vectors. We choose to set <math>\boldsymbol\beta_0 = \mathbf{0} .</math> Then,

:<math>e^{\boldsymbol\beta_0 \cdot \mathbf{X}_i} = e^{\mathbf{0} \cdot \mathbf{X}_i} = 1</math>

and so

:<math>

\Pr(Y_i=1) = \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = \frac{1}{1+e^{-\boldsymbol\beta_1 \cdot \mathbf{X}_i}} = p_i</math>

which shows that this formulation is indeed equivalent to the previous formulation. (As in the two-way latent variable formulation, any settings where <math>\boldsymbol\beta = \boldsymbol\beta_1 - \boldsymbol\beta_0</math> will produce equivalent results.)
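The non-identifiability and the reduction to the logistic function described above can be verified numerically. The sketch below is illustrative only; the helper names, coefficient vectors and the constant vector are hypothetical.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def softmax_probs(betas, x):
    """Pr(Y = c) = exp(beta_c . x) / sum_h exp(beta_h . x) for each outcome c."""
    scores = [dot(b, x) for b in betas]
    m = max(scores)  # subtracting the maximum avoids overflow and leaves the ratios unchanged
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)    # normalizing factor Z (after rescaling)
    return [e / z for e in exps]

beta0 = [0.2, -0.3]   # hypothetical coefficients for outcome 0
beta1 = [1.0, 0.5]    # hypothetical coefficients for outcome 1
x = [1.0, 2.0]
c = [5.0, -7.0]       # arbitrary constant vector added to both coefficient vectors

p = softmax_probs([beta0, beta1], x)
p_shifted = softmax_probs([add(beta0, c), add(beta1, c)], x)
p_logistic = logistic(dot(beta1, x) - dot(beta0, x))
print(p[1], p_shifted[1], p_logistic)
```

All three printed values should coincide up to floating-point rounding, since only the difference between the two coefficient vectors affects the probabilities.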

Most treatments of the multinomial logit model start out either by extending the "log-linear" formulation presented here or the two-way latent variable formulation presented above, since both clearly show the way that the model could be extended to multi-way outcomes. In general, the presentation with latent variables is more common in econometrics and political science, where discrete choice models and utility theory reign, while the "log-linear" formulation here is more common in computer science, e.g. machine learning and natural language processing.

As a single-layer perceptron

The model has an equivalent formulation

:<math>p_i = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}. \, </math>

This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function. The derivative of p<sub>i</sub> with respect to X&nbsp;=&nbsp;(x<sub>1</sub>, ..., x<sub>k</sub>) is computed from the general form:

: <math>y = \frac{1}{1+e^{-f(X)}}</math>

where f(X) is an analytic function in X. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

: <math>\frac{\mathrm{d}y}{\mathrm{d}X} = y(1-y)\frac{\mathrm{d}f}{\mathrm{d}X}. \, </math>
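The derivative formula can be compared against a finite-difference approximation. The sketch below assumes a linear f(X), so that the partial derivative of f with respect to x<sub>j</sub> is β<sub>j</sub>; the function names and numeric values are hypothetical.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(beta, x):
    """y = 1 / (1 + exp(-f(x))) with linear f(x) = beta[0] + sum_j beta[j] * x[j-1]."""
    f = beta[0] + sum(b * xj for b, xj in zip(beta[1:], x))
    return logistic(f)

def analytic_gradient(beta, x):
    """dy/dx_j = y (1 - y) df/dx_j, where df/dx_j = beta[j] for a linear f."""
    y = predict(beta, x)
    return [y * (1.0 - y) * b for b in beta[1:]]

def numeric_gradient(beta, x, h=1e-6):
    """Central finite differences, used only as an independent check."""
    grad = []
    for j in range(len(x)):
        up = x[:j] + [x[j] + h] + x[j + 1:]
        down = x[:j] + [x[j] - h] + x[j + 1:]
        grad.append((predict(beta, up) - predict(beta, down)) / (2.0 * h))
    return grad

beta = [0.5, -1.0, 2.0]   # hypothetical coefficients, intercept first
x = [0.3, 0.8]            # hypothetical input
print(analytic_gradient(beta, x))
print(numeric_gradient(beta, x))
```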

In terms of binomial data

A closely related model assumes that each i is associated not with a single Bernoulli trial but with n<sub>i</sub> independent identically distributed trials, where the observation Y<sub>i</sub> is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a binomial distribution:

:<math>Y_i \,\sim \operatorname{Bin}(n_i,p_i),\text{ for }i = 1, \dots , n</math>

An example of this distribution is the fraction of seeds (p<sub>i</sub>) that germinate after n<sub>i</sub> are planted.

In terms of expected values, this model is expressed as follows:

:<math>p_i = \operatorname{\mathbb E}\left[\left.\frac{Y_i}{n_i}\,\right|\,\mathbf{X}_i \right]\,, </math>

so that

:<math>\operatorname{logit}\left(\operatorname{\mathbb E}\left[\left.\frac{Y_i}{n_i}\,\right|\,\mathbf{X}_i \right]\right) = \operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \boldsymbol\beta \cdot \mathbf{X}_i\,,</math>

Or equivalently:

:<math>\Pr(Y_i=y\mid \mathbf{X}_i) = {n_i \choose y} p_i^y(1-p_i)^{n_i-y} ={n_i \choose y} \left(\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^y \left(1-\frac{1}{1+e^{-\boldsymbol\beta \cdot \mathbf{X}_i}}\right)^{n_i-y}\,.</math>

This model can be fit using the same sorts of methods as the above more basic model.
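As an illustration of the binomial form, the sketch below evaluates the probability mass function for a hypothetical batch of 20 planted seeds and checks that the expected fraction of successes equals p<sub>i</sub>. The coefficients and regressors are assumptions chosen for the example.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def binomial_logit_pmf(y, n_i, beta, x):
    """Pr(Y_i = y | x) when Y_i ~ Bin(n_i, p_i) and logit(p_i) = beta . x."""
    p = logistic(sum(b * xj for b, xj in zip(beta, x)))
    return math.comb(n_i, y) * p ** y * (1.0 - p) ** (n_i - y)

beta = [-1.0, 0.5]   # hypothetical coefficients
x = [1.0, 3.0]       # hypothetical regressors; the first entry is the intercept term
n_i = 20             # number of trials, e.g. seeds planted

p_i = logistic(sum(b * xj for b, xj in zip(beta, x)))
pmf = [binomial_logit_pmf(y, n_i, beta, x) for y in range(n_i + 1)]
expected_fraction = sum(y * q for y, q in enumerate(pmf)) / n_i
print(p_i, expected_fraction)
```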

Model fitting

Maximum likelihood estimation (MLE)

The regression coefficients are usually estimated using maximum likelihood estimation. Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process, for example Newton's method, must be used instead. This process begins with a tentative solution, revises it slightly to see if it can be improved, and repeats this revision until no more improvement is made, at which point the process is said to have converged.

Iteratively reweighted least squares (IRLS)

Binary logistic regression (<math>y=0</math> or <math> y=1</math>) can, for example, be calculated using iteratively reweighted least squares (IRLS), which is equivalent to maximizing the log-likelihood of a Bernoulli distributed process using Newton's method. If the problem is written in vector matrix form, with parameters <math>\mathbf{w}^T=[\beta_0,\beta_1,\beta_2, \ldots]</math>, explanatory variables <math>\mathbf{x}(i)=[1, x_1(i), x_2(i), \ldots]^T</math> and expected value of the Bernoulli distribution <math>\mu(i)=\frac{1}{1+e^{-\mathbf{w}^T\mathbf{x}(i)}}</math>, the parameters <math>\mathbf{w}</math> can be found using the following iterative algorithm:

:<math>\mathbf{w}_{k+1} = \left(\mathbf{X}^T\mathbf{S}_k\mathbf{X}\right)^{-1}\mathbf{X}^T \left(\mathbf{S}_k \mathbf{X} \mathbf{w}_k + \mathbf{y} - \mathbf{\boldsymbol\mu}_k\right) </math>

where <math>\mathbf{S}=\operatorname{diag}(\mu(i)(1-\mu(i)))</math> is a diagonal weighting matrix, <math>\boldsymbol\mu=[\mu(1), \mu(2),\ldots]</math> the vector of expected values,

:<math>\mathbf{X}=\begin{bmatrix}

1 & x_1(1) & x_2(1) & \ldots\\

1 & x_1(2) & x_2(2) & \ldots\\

\vdots & \vdots & \vdots

\end{bmatrix}</math>

the regressor matrix and <math>\mathbf{y}(i)=[y(1),y(2),\ldots]^T</math> the vector of response variables. More details can be found in the literature.
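The update rule above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the data set is hypothetical (it is not the data behind the figure), the number of iterations is fixed instead of testing for convergence, and the small linear solver is included only to keep the example self-contained.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - s) / aug[r][r]
    return z

def irls(X, y, iterations=25):
    """Iterate w <- (X^T S X)^{-1} X^T (S X w + y - mu), starting from w = 0."""
    k = len(X[0])
    rows = range(len(X))
    w = [0.0] * k
    for _ in range(iterations):
        eta = [sum(wj * xj for wj, xj in zip(w, row)) for row in X]   # X w
        mu = [logistic(e) for e in eta]
        s = [m * (1.0 - m) for m in mu]                                # diagonal of S
        v = [si * ei + yi - mi for si, ei, yi, mi in zip(s, eta, y, mu)]
        xtsx = [[sum(s[i] * X[i][a] * X[i][b] for i in rows) for b in range(k)]
                for a in range(k)]
        xtv = [sum(X[i][a] * v[i] for i in rows) for a in range(k)]
        w = solve(xtsx, xtv)
    return w

# Hypothetical data: one explanatory variable and a binary response.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 1, 0, 1, 0, 1, 1]
X = [[1.0, h] for h in hours]      # regressor matrix with a leading column of ones
w = irls(X, passed)
print(w)
```

At convergence the score equations hold, i.e. the residuals y − μ are orthogonal to every column of the regressor matrix.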

Bayesian

right|300px|thumb|Comparison of the logistic function with a scaled inverse probit function (i.e. the CDF of the normal distribution), comparing <math>\sigma(x)</math> vs. <math display="inline">\Phi(\sqrt{\frac{\pi}{8}}x)</math>, which makes the slopes the same at the origin. This shows the heavier tails of the logistic distribution.

In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients, for example in the form of Gaussian distributions. There is no conjugate prior of the likelihood function in logistic regression. When Bayesian inference was performed analytically, this made the posterior distribution difficult to calculate except in very low dimensions. Now, though, automatic software such as OpenBUGS, JAGS, PyMC, Stan or Turing.jl allows these posteriors to be computed using simulation, so lack of conjugacy is not a concern. However, when the sample size or the number of parameters is large, full Bayesian simulation can be slow, and people often use approximate methods such as variational Bayesian methods and expectation propagation.

"Rule of ten"

A widely used rule of thumb, the "one in ten rule", states that logistic regression models give stable coefficient estimates for the explanatory variables if based on a minimum of about 10 events per explanatory variable (EPV), where "event" denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use <math>k</math> explanatory variables for an event (e.g. myocardial infarction) expected to occur in a proportion <math>p</math> of participants in the study will require a total of <math>10k/p</math> participants. However, there is considerable debate about the reliability of this rule, which is based on simulation studies and lacks a secure theoretical underpinning. According to some authors the rule is overly conservative in some circumstances; they state, "If we (somewhat subjectively) regard confidence interval coverage less than 93 percent, type I error greater than 7 percent, or relative bias greater than 15 percent as problematic, our results indicate that problems are fairly frequent with 2–4 EPV, uncommon with 5–9 EPV, and still observed with 10–16 EPV. The worst instances of each problem were not severe with 5–9 EPV and usually comparable to those with 10–16 EPV".
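The sample-size arithmetic implied by the rule is simple enough to state as a short function. This is only a sketch of the 10k/p calculation; the function name and the option to change the number of events per variable are assumptions added for illustration.

```python
import math

def required_participants(k, p, events_per_variable=10):
    """Total participants so that the expected event count is events_per_variable * k.

    k: number of explanatory variables
    p: expected proportion of participants experiencing the event
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.ceil(events_per_variable * k / p)

# Hypothetical study: 4 explanatory variables, event expected in 25% of participants.
print(required_participants(4, 0.25))
```

Passing a different value for `events_per_variable` shows how the required sample grows under a stricter criterion.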

Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required.

Deviance and likelihood ratio tests

In linear regression analysis, one is concerned with partitioning variance via sum of squares calculations: variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, deviance is used in lieu of sum of squares calculations.

Four of the most commonly used indices and one less commonly used one are examined on this page:

  • Likelihood ratio R<sup>2</sup>
  • Cox and Snell R<sup>2</sup>
  • Nagelkerke R<sup>2</sup>
  • McFadden R<sup>2</sup>
  • Tjur R<sup>2</sup>

Hosmer–Lemeshow test

The Hosmer–Lemeshow test uses a test statistic that asymptotically follows a <math>\chi^2</math> distribution to assess whether or not the observed event rates match expected event rates in subgroups of the model population. This test is considered to be obsolete by some statisticians because of its dependence on arbitrary binning of predicted probabilities and its relatively low power.

Coefficient significance

After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor.

Wald statistic

Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the Wald statistic. The Wald statistic, analogous to the t-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-square distribution.
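A minimal sketch of the Wald statistic for a single coefficient follows. It assumes the chi-square reference distribution has one degree of freedom, in which case the upper-tail probability can be written with the complementary error function; the function name and the numeric inputs are hypothetical.

```python
import math

def wald_test(coefficient, standard_error):
    """Return the Wald statistic (coefficient / SE)^2 and its asymptotic p-value.

    For a chi-square variable with one degree of freedom,
    Pr(chi2 > w) = Pr(|Z| > sqrt(w)) = erfc(sqrt(w / 2)) with Z standard normal.
    """
    w = (coefficient / standard_error) ** 2
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, p_value

# Hypothetical estimate and standard error.
statistic, p_value = wald_test(1.96, 1.0)
print(statistic, p_value)
```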

Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome. That is to say, if we form a logistic model from such data, if the model is correct in the general population, the <math>\beta_j</math> parameters are all correct except for <math>\beta_0</math>. We can correct <math>\beta_0</math> if we know the true prevalence as follows:
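One commonly cited form of this intercept correction, stated here as an assumption rather than taken from the text above, adds the log-odds of the true prevalence to the fitted intercept and subtracts the log-odds of the prevalence in the sample. The sketch below uses hypothetical names and values.

```python
import math

def log_odds(p):
    return math.log(p / (1.0 - p))

def corrected_intercept(beta0_hat, true_prevalence, sample_prevalence):
    """Adjust an intercept fitted on unbalanced data (assumed form of the correction).

    beta0_hat: intercept estimated from the sample
    true_prevalence: proportion of positive cases in the population
    sample_prevalence: proportion of positive cases in the sample
    """
    return beta0_hat + log_odds(true_prevalence) - log_odds(sample_prevalence)

# Hypothetical case: a balanced sample drawn from a population with 10% prevalence.
print(corrected_intercept(0.0, 0.1, 0.5))
```

When the sample prevalence equals the true prevalence, the correction leaves the intercept unchanged.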

Consider a generalized linear model function parameterized by <math>\theta</math>,

:<math>

h_\theta(X) = \frac{1}{1 + e^{-\theta^TX}} = \Pr(Y=1 \mid X; \theta)

</math>

Therefore,

:<math>

\Pr(Y=0 \mid X; \theta) = 1 - h_\theta(X)

</math>

and since <math> Y \in \{0,1\}</math>, we see that <math> \Pr(y\mid X;\theta) </math> is given by <math> \Pr(y \mid X; \theta) = h_\theta(X)^y(1 - h_\theta(X))^{(1-y)}. </math> We now calculate the likelihood function assuming that all the observations in the sample are independently Bernoulli distributed,

:<math>\begin{align}

L(\theta \mid y; x) &= \Pr(Y \mid X; \theta) \\

&= \prod_i \Pr(y_i \mid x_i; \theta) \\

&= \prod_i h_\theta(x_i)^{y_i}(1 - h_\theta(x_i))^{(1-y_i)}

\end{align}</math>

Typically, the log likelihood is maximized,

:<math>

N^{-1} \log L(\theta \mid y; x) = N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta)

</math>

which is maximized using optimization techniques such as gradient ascent (equivalently, gradient descent on the negative log-likelihood).
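As a minimal sketch of this optimization, the code below performs plain gradient ascent on the mean log-likelihood. The data set, step size and iteration count are hypothetical choices that happen to work for this small example; practical implementations usually rely on more robust optimizers.

```python
import math

def logistic(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def mean_log_likelihood(theta, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        h = logistic(sum(t * xj for t, xj in zip(theta, x)))
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return total / len(xs)

def gradient(theta, xs, ys):
    """Gradient of the mean log-likelihood: (1/N) sum_i (y_i - h(x_i)) x_i."""
    g = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        err = y - logistic(sum(t * xj for t, xj in zip(theta, x)))
        for j, xj in enumerate(x):
            g[j] += err * xj
    return [gj / len(xs) for gj in g]

def gradient_ascent(xs, ys, step=0.5, iterations=5000):
    theta = [0.0] * len(xs[0])
    for _ in range(iterations):
        g = gradient(theta, xs, ys)
        theta = [t + step * gj for t, gj in zip(theta, g)]
    return theta

# Hypothetical data; each x has a leading 1.0 for the intercept.
xs = [[1.0, v] for v in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
theta = gradient_ascent(xs, ys)
print(theta, mean_log_likelihood(theta, xs, ys))
```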

Assuming the <math>(x, y)</math> pairs are drawn independently from the underlying distribution, then in the limit of large&nbsp;N,

:<math>\begin{align}

& \lim \limits_{N \rightarrow +\infty} N^{-1} \sum_{i=1}^N \log \Pr(y_i \mid x_i; \theta)

= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \log \Pr(Y=y \mid X=x; \theta) \\[6pt]

= {} & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \Pr(X=x, Y=y) \left( - \log\frac{\Pr(Y=y \mid X=x)}{\Pr(Y=y \mid X=x; \theta)} + \log \Pr(Y=y \mid X=x) \right) \\[6pt]

= {} & - D_\text{KL}( Y \parallel Y_\theta ) - H(Y \mid X)

\end{align}</math>

where <math>H(Y\mid X)</math> is the conditional entropy and <math>D_\text{KL}</math> is the Kullback–Leibler divergence. This leads to the intuition that by maximizing the log-likelihood of a model, you are minimizing the KL divergence of your model from the maximal entropy distribution. Intuitively, this amounts to searching for the model that makes the fewest assumptions in its parameters.
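The decomposition of the expected log-likelihood into a divergence term and an entropy term can be checked on a small discrete example. The joint distribution and the parameter vector below are hypothetical, and the function names are illustrative.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical joint distribution Pr(X = x, Y = y) over x in {0, 1} and y in {0, 1}.
joint = {(0, 0): 0.35, (0, 1): 0.15, (1, 0): 0.10, (1, 1): 0.40}

def true_conditional(y, x):
    """Pr(Y = y | X = x) under the joint distribution."""
    return joint[(x, y)] / (joint[(x, 0)] + joint[(x, 1)])

def model_conditional(y, x, theta):
    """Pr(Y = y | X = x; theta) under a logistic model with intercept and slope."""
    p1 = logistic(theta[0] + theta[1] * x)
    return p1 if y == 1 else 1.0 - p1

def expected_log_likelihood(theta):
    return sum(p * math.log(model_conditional(y, x, theta))
               for (x, y), p in joint.items())

def kl_divergence(theta):
    return sum(p * math.log(true_conditional(y, x) / model_conditional(y, x, theta))
               for (x, y), p in joint.items())

def conditional_entropy():
    return -sum(p * math.log(true_conditional(y, x)) for (x, y), p in joint.items())

theta = [-0.2, 0.9]   # hypothetical parameters
lhs = expected_log_likelihood(theta)
rhs = -kl_divergence(theta) - conditional_entropy()
print(lhs, rhs)
```

Because the entropy term does not depend on the parameters, increasing the expected log-likelihood is the same as decreasing the divergence.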

Comparison with linear regression

Logistic regression can be seen as a special case of the generalized linear model and thus analogous to linear regression. The model of logistic regression, however, is based on quite different assumptions (about the relationship between the dependent and independent variables) from those of linear regression. In particular, the key differences between these two models can be seen in the following two features of logistic regression. First, the conditional distribution <math>y \mid x</math> is a Bernoulli distribution rather than a Gaussian distribution, because the dependent variable is binary. Second, the predicted values are probabilities and are therefore restricted to (0,1) through the logistic distribution function because logistic regression predicts the probability of particular outcomes rather than the outcomes themselves.

Alternatives

A common alternative to the logistic model (logit model) is the probit model, as the related names suggest. From the perspective of generalized linear models, these differ in the choice of link function: the logistic model uses the logit function (inverse logistic function), while the probit model uses the probit function (inverse error function). Equivalently, in the latent variable interpretations of these two methods, the first assumes a standard logistic distribution of errors and the second a standard normal distribution of errors. Other sigmoid functions or error distributions can be used instead.
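The two link choices can be compared numerically. The sketch below evaluates the logistic function against the normal CDF rescaled by the factor √(π/8) mentioned in the figure caption, which matches the slopes at the origin; the function names are illustrative.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

SCALE = math.sqrt(math.pi / 8.0)   # makes both curves have slope 1/4 at the origin

def scaled_normal_cdf(x):
    return normal_cdf(SCALE * x)

def slope_at_zero(f, h=1e-5):
    """Central finite-difference estimate of f'(0)."""
    return (f(h) - f(-h)) / (2.0 * h)

print(slope_at_zero(logistic), slope_at_zero(scaled_normal_cdf))
for x in (-6.0, -3.0, 0.0, 3.0, 6.0):
    print(x, logistic(x), scaled_normal_cdf(x))
```

Far from the origin the logistic curve approaches 0 and 1 more slowly than the rescaled normal CDF, reflecting the heavier tails of the logistic distribution.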

Logistic regression is an alternative to Fisher's 1936 method, linear discriminant analysis. If the assumptions of linear discriminant analysis hold, the conditioning can be reversed to produce logistic regression. The converse is not true, however, because logistic regression does not require the multivariate normal assumption of discriminant analysis.

The assumption of linear predictor effects can easily be relaxed using techniques such as spline functions.

In his more detailed paper (1845), Verhulst determined the three parameters of the model by making the curve pass through three observed points, which yielded poor predictions.

The logistic function was independently developed in chemistry as a model of autocatalysis (Wilhelm Ostwald, 1883). An autocatalytic reaction is one in which one of the products is itself a catalyst for the same reaction, while the supply of one of the reactants is fixed. This naturally gives rise to the logistic equation for the same reason as population growth: the reaction is self-reinforcing but constrained.

The logistic function was independently rediscovered as a model of population growth in 1920 by Raymond Pearl and Lowell Reed, which led to its use in modern statistics. They were initially unaware of Verhulst's work and presumably learned about it from L. Gustave du Pasquier, but they gave him little credit and did not adopt his terminology. Verhulst's priority was acknowledged and the term "logistic" revived by Udny Yule in 1925 and has been followed since. Pearl and Reed first applied the model to the population of the United States, and also initially fitted the curve by making it pass through three points; as with Verhulst, this again yielded poor results.

In the 1930s, the probit model was developed and systematized by Chester Ittner Bliss, who coined the term "probit", and by John Gaddum, and the model was fit by maximum likelihood estimation by Ronald A. Fisher, as an addendum to Bliss's work. The probit model was principally used in bioassay, and had been preceded by earlier work dating to 1860. The probit model influenced the subsequent development of the logit model and these models competed with each other.

The logistic model was likely first used as an alternative to the probit model in bioassay by Edwin Bidwell Wilson and his student Jane Worcester. However, the development of the logistic model as a general alternative to the probit model was principally due to the work of Joseph Berkson over many decades, beginning with the work in which he coined "logit", by analogy with "probit", and continuing in the following years. The logit model was initially dismissed as inferior to the probit model, but "gradually achieved an equal footing with the probit", particularly between 1960 and 1970. By 1970, the logit model achieved parity with the probit model in use in statistics journals and thereafter surpassed it. This relative popularity was due to the adoption of the logit outside of bioassay, rather than displacing the probit within bioassay, and its informal use in practice; the logit's popularity is credited to the logit model's computational simplicity, mathematical properties, and generality, allowing its use in varied fields.

Various refinements occurred during that time, notably by David Cox.

The multinomial logit model was introduced independently in two separate works, which greatly increased the scope of application and the popularity of the logit model. In 1973 Daniel McFadden linked the multinomial logit to the theory of discrete choice, specifically Luce's choice axiom, showing that the multinomial logit followed from the assumption of independence of irrelevant alternatives and interpreting odds of alternatives as relative preferences; this gave a theoretical foundation for the logistic regression.

Extensions

There are large numbers of extensions:

  • Multinomial logistic regression (or multinomial logit) handles the case of a multi-way categorical dependent variable (with unordered values, also called "classification"). The general case of having dependent variables with more than two values is termed polytomous regression.
  • Ordered logistic regression (or ordered logit) handles ordinal dependent variables (ordered values).
  • Mixed logit is an extension of multinomial logit that allows for correlations among the choices of the dependent variable.
  • An extension of the logistic model to sets of interdependent variables is the conditional random field.
  • Conditional logistic regression handles matched or stratified data when the strata are small. It is mostly used in the analysis of observational studies.

See also

  • Logistic function
  • Discrete choice
  • Jarrow–Turnbull model
  • Limited dependent variable
  • Multinomial logit model
  • Ordered logit
  • Hosmer–Lemeshow test
  • Brier score
  • mlpack - contains a C++ implementation of logistic regression
  • Local case-control sampling
  • Logistic model tree
