A One-Parameter Exponential Family of Distributions Has a Unique MLE
The exponential family is a mathematical abstraction that unifies common parametric probability distributions. Exponential families play a prominent role in GLMs and graphical models, two methods often employed in parametric statistical genomics. In this post we define exponential families and review their basic properties. We take a fairly conceptual approach, omitting proofs for the most part. This is the first post in a three-part mini-series on exponential families, information matrices, and GLMs.
One-parameter exponential family
A parametric model is a family of probability distributions indexed by a finite set of parameters. A one-parameter exponential family is a special type of parametric model indexed by a single scalar parameter.
Definition
Let \(X = (X_1, \dots, X_d)\) be a random vector with distribution \(P_\theta,\) where \(\theta \in \Theta \subset \mathbb{R}\). Assume the support of \(X\) is \(S^d \subset \mathbb{R}^d\). We say \(\{P_{\theta} : \theta \in \Theta \}\) belongs to the one-parameter exponential family if the density function \(f\) of \(X\) can be written as \[ \begin{equation}\label{def} f(x | \theta) = e^{ \eta(\theta) T(x) - \psi(\theta) } h(x), \end{equation}\] where \(\psi, \eta : \Theta \to \mathbb{R}\) and \(T, h : S^d \to \mathbb{R}\) are functions.
Note 1: The functions \(\psi, \eta, T,\) and \(h\) are non-unique.
Note 2: Technically, we also must specify an integrator \(\alpha\) with respect to which we integrate the density \(f\). That is, we must specify an \(\alpha\) such that \[ P(X \in A) = \int_{A} f \, d\alpha.\] When \(d = 1\), \(\alpha(x) = x\) for continuous distributions and \(\alpha(x) = \textrm{floor}(x)\) for discrete distributions. The integrator \(\alpha\) typically is clear from the context, so we do not state it explicitly.
The exponential family encompasses the distributions most commonly used in statistical modeling, including the normal, exponential, gamma, beta, Bernoulli, Poisson, binomial (assuming a fixed number of trials), and negative binomial (assuming a fixed number of failures) distributions.
Examples
- Poisson distribution. The density function of a Poisson distribution is \[ f(x|\theta) = \frac{\theta^x e^{-\theta}}{x!}.\] We can write this density as \[ f(x|\theta) = e^{x \log(\theta) -\theta} \frac{1}{x!}.\] Written in this way, it is clear that \[ \begin{cases} \eta(\theta) = \log(\theta) \\ T(x) = x \\ \psi(\theta) = \theta \\ h(x) = \frac{1}{x!}. \end{cases} \] Therefore, the Poisson distribution belongs to the one-parameter exponential family. (A numerical check of this factorization appears after the examples.)
- Poisson product distribution. Let \(X = (X_1, \dots, X_n)\), where \(X_1, \dots, X_n \sim \textrm{Pois}(\theta)\) independently. The density function of \(X\) is \[ f(x|\theta) = \prod_{i=1}^n \frac{\theta^{x_i}e^{-\theta}}{x_i!} = \frac{ \theta^{\sum_{i=1}^n x_i}e^{-n\theta}}{\prod_{i=1}^n x_i!}.\] Similar to above, we can write this function as \[ f(x|\theta) = e^{\log(\theta) \left(\sum_{i=1}^n x_i\right) - n\theta} \frac{1}{\prod_{i=1}^n x_i!}.\] We have \[ \begin{cases} \eta(\theta) = \log(\theta) \\ T(x) = \sum_{i=1}^n x_i \\ \psi(\theta) = n\theta \\ h(x) = \frac{1}{\prod_{i=1}^n x_i!}. \end{cases} \] Therefore, the Poisson product distribution, like the Poisson distribution, is a member of the one-parameter exponential family (of course, with different constituent functions).
- Negative binomial distribution. Recall that the density function of a negative binomial distribution with parameters \(r, \theta\) is \[ f(x| r, \theta) = \binom{x + r - 1}{x} \theta^{x}(1-\theta)^r.\] Assume that \(r\) is a known, fixed parameter. We can express the density function as \[ f(x|r, \theta) = e^{ \log(\theta) x + r\log(1 - \theta)} \binom{x + r - 1}{x}.\] Writing \[ \begin{cases} \eta(\theta) = \log(\theta) \\ T(x) = x \\ \psi(\theta) = - r\log(1 - \theta) \\ h(x) = \binom{x + r - 1}{x}, \end{cases} \] we see that the negative binomial distribution (with fixed \(r\)) is an exponential family. The \(n\)-fold negative binomial product distribution also is a one-parameter exponential family. (The sketch below checks this factorization and the Poisson one numerically.)
Properties
We list several important properties of the one-parameter exponential family. The properties we state relate to sufficiency, reparameterization of the density function, convexity of the likelihood function, and moments of the distribution.
- Sufficiency
Sufficiency mathematically formalizes the notion of "no loss of information." As far as I can tell, sufficiency once played a central role in mathematical statistics but has since fallen out of favor to some extent. Still, the concept of sufficiency is important to understand in the context of the exponential family.
Let \((X_1, \dots, X_d)\) be a random vector with distribution \(P_\theta,\) \(\theta \in \Theta \subset \mathbb{R}\). Let \(T(X)\) be a statistic. We call \(T(X)\) sufficient for \(\theta\) if \(T(X)\) preserves all information about \(\theta\) contained in \((X_1, \dots, X_d)\). More precisely, \(T(X)\) is sufficient for \(\theta\) if the distribution of \(X\) given \(T(X)\) is constant in (i.e., does not depend on) \(\theta\).
Theorem. Let \(X = (X_1, \dots, X_d)\) be distributed according to the one-parameter exponential family \(\{P_\theta\}\). Then the statistic \(T(X)\) is a sufficient statistic for \(\theta\).
The proof of this fact follows easily from the Fisher–Neyman factorization theorem. Recall that \(T(X) = \sum_{i=1}^n X_i\) is a sufficient statistic of the \(n\)-fold Poisson product distribution (see the previous section). Also recall that the MLE of the Poisson distribution is \((1/n)\sum_{i=1}^n X_i\). Intuitively, it makes sense that the sufficient statistic would coincide with the MLE (up to a constant). We see similar patterns for other members of the exponential family.
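To make this concrete, here is a quick simulation; it is a sketch of ours, not from the original post, and assumes NumPy is available.

```python
import numpy as np

# n i.i.d. Poisson draws; the sufficient statistic T(X) = sum(X_i)
# determines the MLE (1/n) * sum(X_i) of theta.
rng = np.random.default_rng(0)
theta, n = 3.5, 100_000
x = rng.poisson(theta, size=n)

T = x.sum()   # sufficient statistic of the Poisson product distribution
print(T / n)  # MLE of theta; close to 3.5
```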
- Reparameterization
A common strategy in statistical analysis is to reparameterize a probability distribution. Suppose a family of probability distributions \(\{ P_{\theta} \}\) is parameterized by \(\theta \in \mathbb{R}\). Suppose we set \(\gamma = f(\theta)\), where \(f: \mathbb{R} \to \mathbb{R}\) is an invertible function. Then we can write \(\theta = f^{-1}(\gamma)\) and parameterize the family of probability distributions in terms of \(\gamma\) instead of \(\theta\). There is no loss of information in this reparameterization.
Consider a family of distributions \(\{ P_{\theta} : \theta \in \Theta \subset \mathbb{R} \}\) that is a member of the one-parameter exponential family, i.e. that has density \(f(x|\theta) = e^{\eta(\theta)T(x) - \psi(\theta)}h(x).\) Typically, the function \(\eta: \Theta \to \mathbb{R}\) is invertible. Therefore, we can reparameterize the family of distributions in terms of the function \(\eta\).
Set \(\eta^* = \eta(\theta)\). Then \(\theta = \eta^{-1}(\eta^*)\), so we can write the density \(f\) as \[f(x|\theta) = e^{\eta^* T(x) - \psi( \eta^{-1}(\eta^*)) }h(x).\] Setting \(\psi^* = \psi \circ \eta^{-1},\) we derive the reparameterization \[ f(x|\eta^*) = e^{ \eta^* T(x) - \psi^*(\eta^*)}h(x),\] which is parameterized in terms of \(\eta^*\) rather than \(\theta\). Under this new parameterization, the parameter space \(\mathcal{T}\) consists of the set of values for which the density function \(f\) integrates to unity, i.e. \[ \mathcal{T} = \Big\{ \eta^* \in \mathbb{R} : \int_{\mathbb{R}} e^{ \eta^* T(x) - \psi^*(\eta^*)}h(x) \, d\alpha(x) = 1 \Big\}.\] To ease notation, we drop the asterisk (*) from \(\eta\) and \(\psi\). A family of probability distributions is said to be in canonical one-parameter exponential family form if its density function can be written as
\[f(x|\eta) = e^{ \eta T(x) - \psi(\eta)}h(x), \quad \eta \in \mathcal{T}.\] The set \(\mathcal{T}\) is sometimes called the natural parameter space. The canonical form of an exponential family is easy to work with mathematically. Thus, most theorems about exponential families are expressed in canonical form.
Written in canonical form, the terms \(\eta, T(x), \psi(\eta),\) and \(h(x)\) have special names:
- \(\eta\) is called the canonical (or natural) parameter,
- \(T(x)\) is called the sufficient statistic,
- \(\psi(\eta)\) is called the cumulant-generating function, and
- \(h(x)\) is called the carrying density.
Vocabulary can be a bit tedious, but in the case of exponential families it facilitates discussion. We look at a couple of examples of exponential families in canonical form.
Example: Poisson canonical form. Recall that we can write the Poisson density in exponential family form as \[f(x | \theta) = e^{x \log(\theta) - \theta}\frac{1}{x!}.\] Set \(\eta^* = \log(\theta)\). Then \(\theta = e^{\eta^*},\) and we can re-express \(f\) as \[f(x| \eta^*) = e^{x \eta^* - e^{\eta^*}} \frac{1}{x!}.\] Dropping the asterisk from \(\eta\) to ease notation, we end up with the canonical parameterization \[f(x| \eta) = e^{x \eta - e^{\eta}} \frac{1}{x!},\] where \(\eta \in \mathcal{T} = \mathbb{R}.\) The canonical parameter is \(\eta\), the sufficient statistic is \(x\), the cumulant-generating function is \(e^\eta\), and the carrying density is \(1/x!\).
Example: Negative binomial canonical form. We expressed the negative binomial density in exponential family form as \[ f(x|\theta) = e^{ \log (\theta)x + r \log(1 - \theta)} \binom{x + r - 1}{x}.\] Setting \(\eta = \log(\theta)\), we can re-write this density in canonical form as \[ f(x|\eta) = e^{x \eta + r \log(1-e^\eta)}\binom{x+r-1}{x}.\] The canonical parameter is \(\eta\), the sufficient statistic is \(x\), the cumulant-generating function is \(-r\log(1 - e^\eta)\), and the carrying density is \(\binom{x + r - 1}{x}.\)
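One detail worth making explicit (an observation of ours, not spelled out in the original post): the natural parameter space here is \(\mathcal{T} = (-\infty, 0)\), since \(\theta \in (0,1)\) forces \(\eta = \log(\theta) < 0\), and the cumulant-generating function \(-r\log(1 - e^\eta)\) is finite precisely when \(e^\eta < 1\).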
- Convexity
The exponential family enjoys some useful convexity properties.
Theorem: Consider a canonical exponential family with density \[f(x|\eta) = e^{\eta T(x) - \psi(\eta)}h(x)\] and natural parameter space \(\mathcal{T}\). The set \(\mathcal{T}\) is convex, and the cumulant-generating function \(\psi\) is convex on \(\mathcal{T}\).
The proof of this theorem is simple and involves an application of Hölder's inequality. This theorem has an important corollary.
Corollary: Let \(X = (X_1, \dots, X_d)\) be a random vector distributed according to the exponential family \(\{ P_\eta : \eta \in \mathcal{T}\}.\) The log-likelihood \[\mathcal{L}(\eta; x) = \log\left(f(x|\eta)\right)\] is a concave function defined on a convex set.
The proof of this corollary is simple. Because \(\psi\) is convex, \(-\psi\) is concave. The log-likelihood \[\mathcal{L}(\eta;x) = \eta T(x) - \psi(\eta) + \log\left( h(x) \right)\] is the sum of concave functions and is therefore concave.
This corollary has important implications: the MLE for \(\eta\) exists and is easily computable (through convex optimization). When \(\psi\) is strictly convex (which generally is the case), the MLE is unique.
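As an illustration, here is a minimal sketch (not from the original post; SciPy's generic scalar minimizer stands in for a purpose-built solver) that computes the Poisson MLE by maximizing the concave log-likelihood in the canonical parameter \(\eta\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.poisson(3.5, size=1_000)

def neg_log_lik(eta):
    # Canonical Poisson log-likelihood, up to a constant:
    # L(eta; x) = eta * sum(x) - n * exp(eta). Concave in eta, so its
    # negation is convex and has a unique minimizer.
    return -(eta * x.sum() - x.size * np.exp(eta))

res = minimize_scalar(neg_log_lik)
print(np.exp(res.x), x.mean())  # the two estimates of theta agree
```

The closed-form solution \(\hat\eta = \log(\bar{x})\) confirms the numerical answer.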
- Moments
We can easily compute the moments of an exponential family.
Theorem. Let \(X = (X_1, \dots, X_d)\) be distributed according to the canonical exponential family \(\{ P_\eta:\eta \in \mathcal{T}\}.\) Then \[ \begin{cases} \mathbb{E}_\eta [T(X)] = \psi'(\eta) \\ \mathbb{V}_\eta [T(X)] = \psi''(\eta). \end{cases} \]
We provide a proof sketch. The density function of the exponential family integrates to unity, i.e. \[ \int_{\mathbb{R}^d} e^{\eta T(x) - \psi(\eta)}h(x) \, d\alpha(x) = 1.\] Therefore, \[ e^{\psi(\eta)} = \int_{ \mathbb{R}^d } e^{\eta T(x)}h(x) \, d\alpha(x).\] Differentiating with respect to \(\eta\), we find \[ e^{\psi(\eta)} \psi'(\eta) = \int_{\mathbb{R}^d} T(x)e^{\eta T(x)} h(x) \, d\alpha(x),\] so \[ \psi'(\eta) = \int_{\mathbb{R}^d} T(x) e^{\eta T(x) - \psi(\eta)}h(x) \, d\alpha(x) = \int_{\mathbb{R}^d} T(x) f(x|\eta) \, d\alpha(x) = \mathbb{E}_\eta[T(X)].\] Differentiating with respect to \(\eta\) again and rearranging, we find \(\mathbb{V}_\eta [T(X)] = \psi''(\eta)\).
Example: Recall that the Poisson distribution can be written in canonical form as \[ f(x|\eta) = e^{x \eta - e^{\eta}} \frac{1}{x!}.\] We have \(\psi(\eta) = e^{\eta}\) and \(T(X) = X\). Therefore, \(\mathbb{E}_\eta[X] = \psi'(\eta) = e^{\eta}\) and \(\mathbb{V}_\eta[X] = \psi''(\eta) = e^{\eta}.\) Recalling that \(\eta = \log(\theta)\), we recover \(\mathbb{E}[X] = \mathbb{V}[X] = \theta\).
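A quick simulation (again a sketch of ours, not from the original post) checks that the mean and variance of \(T(X) = X\) both equal \(\psi'(\eta) = \psi''(\eta) = e^\eta\):

```python
import numpy as np

rng = np.random.default_rng(2)
eta = np.log(3.5)  # canonical parameter corresponding to theta = 3.5
x = rng.poisson(np.exp(eta), size=200_000)

print(x.mean(), x.var())  # both approximately e^eta = 3.5
```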
Our theorem about the moments of \(T(X)\) has several intriguing corollaries.
Corollary 1. The second derivative of \(\psi\) is equal to the variance of \(T(X)\). Because variance is non-negative, the second derivative of \(\psi\) is non-negative. This implies that \(\psi\) is convex. This is an alternate demonstration of the convexity of \(\psi\).
Corollary 2. Assume the variance of \(T(X)\) is nonzero. Then \(\psi\) is strictly convex, implying \(\eta \to \mathbb{E}_\eta[T(X)]\) is injective. Thus, we can reparameterize the exponential family in terms of \(\mathbb{E}_\eta[T(X)].\) Because \(T(X) = X\) for the Poisson and negative binomial distributions in particular, we can parameterize the Poisson and negative binomial densities in terms of their means.
Multiparameter exponential family
We extend the definition of the exponential family to multiparameter distributions. Results that hold for one-parameter exponential families hold analogously for multiparameter exponential families.
Definition
Let the random vector \(X = (X_1, \dots, X_d)\) have distribution \(P_\theta,\) where \(\theta \in \Theta \subset \mathbb{R}^k\). The family \(\{P_\theta \}\) belongs to the \(k\)-parameter exponential family if its density can be written as \[f(x|\theta) = e^{ \sum_{i=1}^k \eta_i(\theta)T_i(x) - \psi(\theta)}h(x),\] where \[ \begin{cases} \eta_1, \dots, \eta_k : \mathbb{R}^k \to \mathbb{R} \\ T_1, \dots, T_k : \mathbb{R}^d \to \mathbb{R} \\ \psi : \mathbb{R}^k \to \mathbb{R} \\ h : \mathbb{R}^d \to \mathbb{R} \end{cases} \] are functions, and \(\theta \in \mathbb{R}^k\). We also require that the dimension of \(\theta = (\theta_1, \dots, \theta_k)\) equal the dimension of \((\eta_1(\theta), \dots, \eta_k(\theta))\). (If the dimension of the latter exceeds that of the former, the distribution is said to belong to the curved exponential family.)
Examples
Examples of the multiparameter exponential family include the normal distribution with unknown mean and variance and generalized linear models (GLMs). We will provide more detailed examples of multiparameter exponential families in the upcoming post on GLMs.
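As a quick sketch in the meantime (a standard derivation, though not worked out in the original post), the normal density with unknown mean and variance factors as \[ f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} = e^{\frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \left( \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2) \right)}, \] so that \(\eta_1(\theta) = \mu/\sigma^2,\) \(\eta_2(\theta) = -1/(2\sigma^2),\) \(T_1(x) = x,\) \(T_2(x) = x^2,\) \(\psi(\theta) = \mu^2/(2\sigma^2) + \frac{1}{2}\log(2\pi\sigma^2),\) and \(h(x) = 1\): a two-parameter exponential family.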
Properties
We briefly list some properties of the multiparameter exponential family related to sufficiency, reparameterization, convexity, and moments.
- Sufficiency
The vector \([T_1(X), \dots, T_k(X)]^T\) is a sufficient statistic for the parameter \(\theta\). This follows from the Fisher–Neyman factorization theorem.
- Reparameterization
We can reparameterize multiparameter distributions as well. Similar to the one-parameter case, let \[ \begin{cases} \eta_1^* = \eta_1(\theta_1, \dots, \theta_k) \\ \vdots \\ \eta_k^* = \eta_k(\theta_1, \dots, \theta_k). \end{cases} \] Typically, we can invert this vector-valued function, i.e., we can express \(\theta_1, \dots, \theta_k\) in terms of \(\eta_1^*, \dots, \eta_k^*\). In this case, we can write the multiparameter exponential family in canonical form: \[f(x|\eta) = e^{ \sum_{i=1}^k \eta_i T_i(x) - \psi(\eta)}h(x),\] where \(\eta \in \mathbb{R}^k,\) \(T_1, \dots, T_k: \mathbb{R}^d \to \mathbb{R},\) \(\psi: \mathbb{R}^k \to \mathbb{R},\) and \(h:\mathbb{R}^d \to \mathbb{R}\). The set \(\mathcal{T}\) over which the natural parameters vary is called the natural parameter space.
- Convexity
The natural parameter space \(\mathcal{T}\) is convex, and the function \(\psi: \mathbb{R}^k \to \mathbb{R}\) is convex over \(\mathcal{T}\). Thus, the log-likelihood for the natural parameter \(\eta\) of a \(k\)-parameter exponential family is concave. The proof of this assertion in multiparameter families also leverages Hölder's inequality.
- Moments
Let \(X = (X_1, \dots, X_d)\) have distribution \(P_\eta\) belonging to the canonical \(k\)-parameter exponential family. Then \[ \nabla \psi(\eta) = \mathbb{E}_\eta[ T_1(X), \dots, T_k(X) ] \] and \[ \nabla^2 \psi(\eta) = \textrm{Cov}_\eta[T_1(X), \dots, T_k(X)],\] where \(\nabla \psi\) is the gradient of \(\psi\) and \(\nabla^2\psi\) is the Hessian of \(\psi\). In words, the gradient of the cumulant-generating function \(\psi\) is the expected value of the vector of sufficient statistics \([T_1(X), \dots, T_k(X)]\), and the Hessian of \(\psi\) is the variance-covariance matrix of the vector of sufficient statistics. Because variance-covariance matrices are positive semi-definite, the function \(\psi\) is convex. This is an alternate demonstration of the convexity of \(\psi\). The Hessian of \(\psi\) evaluated at \(\eta\) sometimes is called the Fisher information matrix (evaluated at \(\eta\)).
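To make the moment identities concrete, here is a small numerical check (a sketch of ours, not from the original post; it uses the natural parameterization of the normal distribution from the example above and finite differences in place of symbolic derivatives):

```python
import numpy as np

# Normal in natural parameters: eta1 = mu/sigma^2, eta2 = -1/(2 sigma^2),
# T(x) = (x, x^2), psi(eta) = -eta1^2/(4 eta2) + 0.5 * log(-pi/eta2).
mu, sigma2 = 1.0, 2.0
eta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def psi(e):
    return -e[0] ** 2 / (4.0 * e[1]) + 0.5 * np.log(-np.pi / e[1])

# Finite-difference gradient of psi at eta.
h = 1e-5
grad = np.array([
    (psi(eta + h * np.eye(2)[i]) - psi(eta - h * np.eye(2)[i])) / (2 * h)
    for i in range(2)
])
print(grad)  # approx [E[X], E[X^2]] = [mu, mu^2 + sigma2] = [1.0, 3.0]

# Monte Carlo moments of the sufficient statistics, for comparison.
rng = np.random.default_rng(3)
x = rng.normal(mu, np.sqrt(sigma2), size=500_000)
print(np.mean(x), np.mean(x ** 2))  # approx 1.0 and 3.0
```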
Conclusion
The exponential family is a mathematical abstraction that unifies common parametric probability distributions. In this post we defined exponential families and explored some of their basic properties. In the remaining two posts of this mini-series, we will explore the connection between exponential families, information matrices, and GLMs.
References
- Lecture notes provided by Professor Anirban DasGupta.