Mean time between failures

Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of a mechanical or electronic system during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems while mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system.

The definition of MTBF depends on what is considered a failure. For complex, repairable systems, failures are those out-of-design conditions that place the system out of service and into a state for repair. Conditions that can be left unrepaired, and that do not place the system out of service, are not considered failures under this definition. Likewise, units taken down for routine scheduled maintenance or inventory control are not counted as failed. The higher the MTBF, the longer a system is likely to work before failing.

Overview

Mean time between failures (MTBF) describes the expected time between two failures for a repairable system. For example, three identical systems start functioning properly at time 0 and run until all of them have failed. The first system fails after 100 hours, the second after 120 hours and the third after 130 hours. The MTBF of the systems is the average of the three failure times, (100 + 120 + 130) / 3 ≈ 116.667 hours. If the systems were non-repairable, then their MTTF would be 116.667 hours.

In general, MTBF is the "up-time" between two failure states of a repairable system during operation as outlined here:

File:Time between failures.svg

For each observation, the "down time" is the instant at which the system went down, which is later than (i.e. greater than) the "up time", the instant at which it came up. The difference ("down time" minus "up time") is the length of time the system was operating between these two events.

By referring to the figure above, the MTBF of a component is the sum of the lengths of the operational periods divided by the number of observed failures:

:<math>

\text{MTBF} = \frac{\sum{(\text{start of downtime} - \text{start of uptime})}}{\text{number of failures}}.

</math>

In a similar manner, mean down time (MDT) can be defined as

:<math>

\text{MDT} = \frac{\sum{(\text{start of uptime} - \text{start of downtime})}}{\text{number of failures}}.

</math>
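
The two definitions above can be sketched in Python. This is only an illustration: the function name <code>mtbf_and_mdt</code> and the event log are hypothetical, and the log is assumed to begin and end with a start of uptime so that every failure has a completed repair.

```python
def mtbf_and_mdt(up_starts, down_starts):
    """Return (MTBF, MDT) from alternating event times.

    up_starts[i]   -- time at which the i-th operating period began
    down_starts[i] -- time at which that operating period ended in a failure
    The repair after failure i ends at up_starts[i + 1], so up_starts is
    assumed to have exactly one more entry than down_starts.
    """
    n_failures = len(down_starts)
    if n_failures == 0 or len(up_starts) != n_failures + 1:
        raise ValueError("need n >= 1 failures and n + 1 uptime starts")
    # Sum of (start of downtime - start of uptime) over all operating periods.
    total_up = sum(d - u for u, d in zip(up_starts, down_starts))
    # Sum of (start of uptime - start of downtime) over all repair periods.
    total_down = sum(u_next - d for d, u_next in zip(down_starts, up_starts[1:]))
    return total_up / n_failures, total_down / n_failures


# Hypothetical log, in hours: up at 0, down at 100, up at 104,
# down at 224, up at 230.
up_starts = [0, 104, 230]
down_starts = [100, 224]
mtbf, mdt = mtbf_and_mdt(up_starts, down_starts)
print(mtbf, mdt)  # 110.0 5.0
```

With this log the operating periods last 100 and 120 hours and the repairs last 4 and 6 hours, which gives an MTBF of 110 hours and an MDT of 5 hours.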

Mathematical description

The MTBF is the expected value of the random variable <math>T</math> indicating the time until failure. Thus, it can be written as

: <math>\text{MTBF} = \mathbb{E}\{T\} = \int_0^\infty tf_T(t)\, dt</math>

where <math>f_T(t)</math> is the probability density function of <math>T</math>. Equivalently, the MTBF can be expressed in terms of the reliability function <math>R_T(t)</math> as

: <math>\text{MTBF} = \int_0^\infty R_T(t)\, dt </math>.
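
The equivalence of the two integrals can be seen by integration by parts. The sketch below uses <math>f_T(t) = -R_T'(t)</math> and assumes that <math>t\,R_T(t) \to 0</math> as <math>t \to \infty</math>, which holds whenever the MTBF is finite:

: <math>\int_0^\infty t f_T(t)\, dt = \Big[-t\,R_T(t)\Big]_0^\infty + \int_0^\infty R_T(t)\, dt = \int_0^\infty R_T(t)\, dt</math>.

For example, for a constant failure rate <math>\lambda</math> the reliability function is <math>R_T(t) = e^{-\lambda t}</math>, and the integral evaluates to <math>\text{MTBF} = 1/\lambda</math>.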

The MTBF and <math>T</math> have units of time (e.g., hours).

Any practically-relevant calculation of the MTBF assumes that the system is working within its "useful life period", which is characterized by a relatively constant failure rate (the middle part of the "bathtub curve") when only random failures are occurring.

Application

The MTBF value can be used as a system reliability parameter or to compare different systems or designs. This value should only be understood conditionally as the “mean lifetime” (an average value), and not as a quantitative identity between working and failed units.
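
Why the MTBF is a mean and not a guaranteed lifetime can be illustrated numerically. The sketch below assumes a constant failure rate (an exponential lifetime, as in the "useful life period" described above); the function name and the MTBF value of 1000 hours are hypothetical.

```python
import math


def survival_probability(t, mtbf):
    """R(t) = exp(-t / MTBF), assuming a constant failure rate of 1 / MTBF."""
    return math.exp(-t / mtbf)


mtbf_hours = 1000.0  # hypothetical value, for illustration only
p_survive = survival_probability(mtbf_hours, mtbf_hours)
# Under the same assumption, half of the units have failed by MTBF * ln 2.
median_life = mtbf_hours * math.log(2)

print(f"P(survive to MTBF) = {p_survive:.3f}")
print(f"median lifetime    = {median_life:.1f} h")
```

Under this assumption only about 37% of units are still working at <math>t = \text{MTBF}</math>, and half of them have already failed after roughly 69% of the MTBF.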

Within total productive maintenance (TPM), MTBF supports a more proactive maintenance approach. Tracking it helps identify patterns and potential failures before they occur, which enables preventive maintenance and reduces unplanned downtime. MTBF thus serves as a key performance indicator (KPI) within TPM, guiding decisions on maintenance schedules and spare parts inventory, and helping to extend the lifespan and efficiency of machinery. Used this way, it can improve overall production efficiency, reduce the costs associated with breakdowns, and contribute to the continuous improvement of manufacturing processes.

MTBF and MDT for networks of components

Two components <math>c_1,c_2</math> (for instance hard drives, servers, etc.) may be arranged in a network, in series or in parallel. The terminology is here used by close analogy to electrical circuits, but has a slightly different meaning. We say that the two components are in series if the failure of either causes the failure of the network, and that they are in parallel if only the failure of both causes the network to fail. The MTBF of the resulting two-component network with repairable components can be computed according to the following formulae, assuming that the MTBF of both individual components is known:

:<math>\text{mtbf}(c_1 ; c_2) = \frac{1}{\frac{1}{\text{mtbf}(c_1)} + \frac{1}{\text{mtbf}(c_2)}} = \frac{\text{mtbf}(c_1)\times \text{mtbf}(c_2)} {\text{mtbf}(c_1) + \text{mtbf}(c_2)}\;,</math>

where <math>c_1 ; c_2</math> is the network in which the components are arranged in series.

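For a network of parallel repairable components, finding the MTBF of the whole system requires the components' MDTs in addition to their MTBFs. Assuming that the MDTs are negligible compared to the MTBFs (which usually holds in practice), the MTBF of a system consisting of two parallel repairable components can be written as

:<math>\text{mtbf}(c_1 \parallel c_2) = \frac{\text{mtbf}(c_1)\times \text{mtbf}(c_2)}{\text{mdt}(c_1) + \text{mdt}(c_2)}\;,</math>

where <math>c_1 \parallel c_2</math> is the network in which the components are arranged in parallel. By successive application, the mean down time of <math>n</math> components in parallel is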

:<math>\text{mdt}(c_1\parallel\dots\parallel c_n) = \left(\sum_{k=1}^n \frac 1{\text{mdt}(c_k)}\right)^{-1}\;,</math>

since the formula for the mdt of two components in parallel is identical to that of the mtbf for two components in series.
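
These network formulas can be sketched in Python as follows. The function names and numbers are hypothetical. The two-component parallel MTBF uses the approximation mtbf(c1) × mtbf(c2) / (mdt(c1) + mdt(c2)), which is assumed here and is only reasonable when the MDTs are much smaller than the MTBFs.

```python
def series_mtbf(mtbfs):
    """MTBF of components in series: reciprocal of the summed failure rates."""
    return 1.0 / sum(1.0 / m for m in mtbfs)


def parallel_mdt(mdts):
    """MDT of components in parallel: reciprocal of the summed 1 / MDT terms."""
    return 1.0 / sum(1.0 / d for d in mdts)


def parallel_mtbf_two(mtbf1, mdt1, mtbf2, mdt2):
    """Approximate MTBF of two parallel repairable components.

    Assumption: MDT << MTBF for both components, so the network fails only
    when one component fails during the short repair window of the other.
    """
    return mtbf1 * mtbf2 / (mdt1 + mdt2)


# Hypothetical components: MTBFs of 1000 h and 2000 h, MDTs of 4 h and 12 h.
print(series_mtbf([1000.0, 2000.0]))                    # about 666.7 h
print(parallel_mdt([4.0, 12.0]))                        # about 3 h
print(parallel_mtbf_two(1000.0, 4.0, 2000.0, 12.0))     # 125000.0 h
```

In this example the series arrangement fails more often than either component alone, while the parallel arrangement fails far less often because both components must be down at the same time.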

Variations of MTBF

There are many variations of MTBF, such as mean time between system aborts (MTBSA), mean time between critical failures (MTBCF) or mean time between unscheduled removal (MTBUR). Such nomenclature is used when it is desirable to differentiate among types of failures, such as critical and non-critical failures. For example, in an automobile, the failure of the FM radio does not prevent the primary operation of the vehicle.

It is recommended to use mean time to failure (MTTF) instead of MTBF in cases where a system is replaced after a failure ("non-repairable system"), since MTBF denotes the time between failures in a system that can be repaired.

MTBF considering censoring

Computing the MTBF from observed failures alone, while some systems are still operating and have not yet failed, underestimates the MTBF because it leaves out the partial lifetimes of the systems that have not yet failed. For such lifetimes, all that is known is that the time to failure exceeds the time the system has been running. This is called censoring. With a parametric model of the lifetime, the likelihood for the data observed up to any given day is as follows:

:<math>L = \prod_i \lambda(u_i)^{\delta_i} S(u_i)</math>,

where

:<math>u_i</math> is the failure time for failures and the censoring time for units that have not yet failed,

:<math>\delta_i</math> = 1 for failures and 0 for censoring times,

:<math>S(u_i)</math> = the probability that the lifetime exceeds <math>u_i</math>, called the survival function, and

:<math>\lambda(u_i) = f(u_i)/S(u_i)</math> is called the hazard function, the instantaneous force of mortality (where <math>f(u_i)</math> = the probability density function of the distribution).

For an exponential distribution, the hazard, <math>\lambda</math>, is constant. In this case, the MTBF is

:MTBF = <math>1 / \hat\lambda = \sum u_i / k</math>,

where <math>\hat\lambda</math> is the maximum likelihood estimate of <math>\lambda</math>, obtained by maximizing the likelihood given above, and <math>k = \sum \delta_i</math> is the number of uncensored observations.

We see that the difference between the MTBF considering only failures and the MTBF including censored observations is that the censoring times add to the numerator but not the denominator in computing the MTBF.
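
The effect of censoring on the estimate can be sketched in Python. The function name and the data are hypothetical, and an exponential lifetime is assumed so that the maximum likelihood estimate is the total observed time divided by the number of failures.

```python
def censored_mtbf(times, failed):
    """Estimate MTBF as sum(u_i) / k under an exponential lifetime model.

    times  -- u_i: failure time for failed units, running time so far for
              units that have not yet failed (censored)
    failed -- delta_i: True for a failure, False for a censored observation
    """
    if len(times) != len(failed):
        raise ValueError("times and failed must have the same length")
    k = sum(1 for f in failed if f)  # number of uncensored observations
    if k == 0:
        raise ValueError("at least one observed failure is required")
    return sum(times) / k


# Hypothetical test: three units failed at 100, 120 and 130 hours; two more
# were still running when observation stopped at 150 hours.
times = [100.0, 120.0, 130.0, 150.0, 150.0]
failed = [True, True, True, False, False]

failures_only = sum(t for t, f in zip(times, failed) if f) / 3
with_censoring = censored_mtbf(times, failed)
print(failures_only, with_censoring)
```

Here the failures-only average is about 116.7 hours, while the estimate that includes the censored running times is about 216.7 hours, because the two surviving units add 300 hours to the numerator without adding to the failure count.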