Information theoretic approach in Parameter Estimation
Sandeep Kumar^{1}, Parmil Kumar^{2}, Mamta Khajuria^{3}, Ameena Rajput^{4}
Department of Statistics, University of Jammu, Jammu, Jammu and Kashmir, 180006, INDIA.
Abstract: Let f(x, θ) be the probability density function of a random variable X, where the functional form of the pdf is known except for the parameter θ. This parameter can be a scalar or a vector. One of the most important tasks in statistical inference is estimating θ on the basis of a random sample drawn from the population. The traditional methods of parameter estimation are the method of moments, least squares, minimum chi-square, maximum likelihood, minimum distance and, more recently, the method of probability weighted moments due to Greenwood et al. [4]. Among these, Fisher's [3] method of maximum likelihood is widely accepted and is considered one of the best methods for parameter estimation. Akaike's [1] work paved the way for the information theoretic approach in parameter estimation. The method of Lind and Solana [8] is based on the principle of least information. Kapur [6] compared Gauss's method of estimation with a method based on the principle of maximum entropy. In the present communication we examine parameter estimation methods based on entropy optimization principles and compare them with classical methods such as the method of moments and the method of maximum likelihood. The basic principle is that, subject to the information available, we should choose θ in such a way that the entropy is as large as possible, or the distribution is as nearly uniform as possible. We also derive some parameter estimation methods from entropy optimization principles and discuss their relation to other methods of parameter estimation. Further, the asymptotic behaviour of the estimator is studied for the exponential and geometric distributions.
1. Introduction
Let f(x, θ) be the probability density function of a random variable X, where the functional form of the probability density function is known except for the parameter θ. This parameter θ can be a scalar or a vector quantity. One of the most important tasks in statistical inference is estimating the parameter θ on the basis of a random sample (x_{1}, x_{2},..., x_{n}) drawn from the population. The most commonly used traditional methods of parameter estimation are the method of moments, least squares, minimum chi-square, maximum likelihood, minimum distance and, more recently, the method of probability weighted moments due to Greenwood et al. [4]. Among all these methods, Fisher's [3] method of maximum likelihood is widely accepted, often used and considered one of the best methods for parameter estimation. With the growth of information theoretic methods in statistics, however, researchers have made efforts to use information theory in parameter estimation and related problems.
Akaike's [1] work paved the way for the information theoretic approach in parameter estimation. It gave researchers a direction not only for estimating parameters but also for model building. Further development took place for estimation when the information is incomplete. The method of Lind and Solana [8] is based on the principle of least information. Kapur [6] compared Gauss's method of estimation with a method based on the principle of maximum entropy. In this paper, we present a critical appraisal of parameter estimation methods using entropy optimization principles and compare these with classical methods such as the method of moments and the method of maximum likelihood. The basic principle is that, subject to the information available, we should choose θ in such a way that the entropy is as large as possible, or the distribution as nearly uniform as possible. In section 2, we discuss the problem of parameter estimation using the maximum entropy principle. In section 3, we derive some parameter estimation methods from entropy optimization principles, while the relations among methods of parameter estimation are discussed in section 4. In section 5, we discuss parameter estimation using the entropy optimization principle when population proportions are given, and the asymptotic behaviour of the estimator is studied for the exponential and geometric distributions.
2. Maximum Entropy Principle in Parameter Estimation
In this section, we discuss the problem of parameter estimation using the entropy optimization principle when, along with the known form of the density function, a random sample from the population is also given. Let f(x, θ) be the given functional form of the probability density function; we have to estimate the parameter θ from a given random sample x_{1}, x_{2},..., x_{n} from the population. Fisher [3] suggested the method of maximum likelihood, i.e. θ should be chosen such that it maximizes the likelihood function
L(x, θ) = \prod_{i=1}^{n} f(x_{i}, θ) (2.1)
or log L(x, θ) = \sum_{i=1}^{n} log f(x_{i}, θ) (2.2)
Now a probability distribution can be formed such that
p_{i} = \frac{f(x_{i}, θ)}{\sum_{j=1}^{n} f(x_{j}, θ)}, i = 1, 2,..., n (2.3)
where f(x_{i}, θ) is the value of the pdf at X = x_{i}. To make the p_{i}'s as equal as possible, we choose the parameter θ such that it maximizes Burg's [2] measure of entropy for this distribution. However, it may be noted that any measure of uncertainty could be used. Burg's entropy measure for a probability distribution (p_{1}, p_{2},..., p_{n}; p_{i} > 0; \sum_{i=1}^{n} p_{i} = 1) is given by
H(P) = \sum_{i=1}^{n} log p_{i} (2.4)
Substituting (2.3) in (2.4), we have
H(P) = \sum_{i=1}^{n} log f(x_{i}, θ) - n log \sum_{j=1}^{n} f(x_{j}, θ) (2.5)
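To maximize (2.5) w.r.t. θ, we put the first derivative of (2.5) w.r.t. θ equal to zero and thus we get
\sum_{i=1}^{n} \frac{1}{f(x_{i}, θ)} \frac{\partial f(x_{i}, θ)}{\partial θ} - n \frac{\sum_{j=1}^{n} \partial f(x_{j}, θ)/\partial θ}{\sum_{j=1}^{n} f(x_{j}, θ)} = 0 (2.6)
But Fisher's method of maximum likelihood requires solving
\sum_{i=1}^{n} \frac{1}{f(x_{i}, θ)} \frac{\partial f(x_{i}, θ)}{\partial θ} = 0
Since \sum_{j=1}^{n} f(x_{j}, θ) is not independent of θ, (2.6) and the likelihood equation will in general give different estimates of θ.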
It is worth mentioning here that f(x_{1}, θ), f(x_{2}, θ),..., f(x_{n}, θ) are not probabilities. They are the values of the pdf at x_{1}, x_{2},..., x_{n}. Their sum is not necessarily unity, nor independent of θ, since x_{1}, x_{2},..., x_{n} represent only a random sample and not all the values which the variate X can take.
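The difference between the two estimates can be checked numerically. The sketch below is an illustration only: it assumes a normal density with unknown mean and unit variance, a small hypothetical sample, and a plain grid search, none of which come from the paper. It maximizes the log-likelihood of (2.2) and the Burg entropy of the normalized density values separately.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x (assumed model for this illustration)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood(sample, mu):
    """Sum of log f(x_i, mu), as in (2.2)."""
    return sum(math.log(normal_pdf(x, mu)) for x in sample)

def burg_entropy(sample, mu):
    """Burg entropy of p_i = f(x_i) / sum_j f(x_j):
    sum_i log f(x_i) - n * log(sum_j f(x_j))."""
    dens = [normal_pdf(x, mu) for x in sample]
    return sum(math.log(d) for d in dens) - len(sample) * math.log(sum(dens))

def argmax_on_grid(objective, lo, hi, steps):
    """Return the grid point in [lo, hi] with the largest objective value."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=objective)

sample = [0.2, 0.5, 1.1, 1.9, 4.0]  # hypothetical data, not from the paper

mu_mle = argmax_on_grid(lambda m: log_likelihood(sample, m), -2.0, 6.0, 8000)
mu_burg = argmax_on_grid(lambda m: burg_entropy(sample, m), -2.0, 6.0, 8000)

print(f"maximum likelihood estimate of mu: {mu_mle:.3f}")
print(f"Burg-entropy estimate of mu:       {mu_burg:.3f}")
```

For this skewed sample the likelihood is maximized at the sample mean, while the Burg-entropy criterion settles on a noticeably larger value, which is consistent with the remark that the two conditions are not equivalent.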
3. Principles of Entropy Optimization, Maximum Likelihood and Minimum Chi-Square
In this section, we discuss the conventional estimation methods vis-à-vis the entropy optimization principle.
Principle of Maximum Likelihood:
Let x_{1}, x_{2},..., x_{n} be a random sample from a population with pdf f(x, θ). We choose or estimate the parameter θ in terms of the sample values such that it maximizes the likelihood function. According to the Maximum Entropy Principle, however, we choose the value of θ such that the uncertainty that remains after the sample values are known is as large as possible. Equivalently, the entropy of the sample itself has to be a minimum. The sample entropy is given by
H_{S} = -\frac{1}{n} \sum_{i=1}^{n} log f(x_{i}, θ) (3.1)
= -\frac{1}{n} log \prod_{i=1}^{n} f(x_{i}, θ)
= -\frac{1}{n} log L(x, θ) (3.2)
where L(x, θ) is the likelihood function given by (2.1).
Thus, we choose θ such that it minimizes the entropy of the sample or, equivalently, maximizes the likelihood function. It implies that the maximum entropy principle leads to the principle of maximum likelihood.
Now let us consider f(x, θ) as the density of the second distribution in the case of the minimum cross entropy principle. We shall choose θ such that, for this value of θ, f(x, θ) is as close as possible to the distribution determined by the random sample x_{1}, x_{2},..., x_{n}. Write F' for this sample distribution, which assigns mass 1/n to each sample point.
Thus, the Minimum Discrimination Information Statistic based on the Kullback-Leibler [7] measure is
D(F': f) = \sum_{i=1}^{n} \frac{1}{n} log \frac{1/n}{f(x_{i}, θ)}
= -log n - \frac{1}{n} \sum_{i=1}^{n} log f(x_{i}, θ) (3.3)
Equation (3.3) attains its minimum when its second part, \frac{1}{n} \sum_{i=1}^{n} log f(x_{i}, θ), is maximum. It means that we choose θ to maximize
\sum_{i=1}^{n} log f(x_{i}, θ) = log L(x, θ) (3.4)
Hence, we choose θ so as to maximize L(x, θ). Thus, both the Maximum Entropy and Minimum Cross Entropy Principles lead to the Maximum Likelihood Principle.
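As a worked instance of this conclusion, consider the exponential density f(x, θ) = θ e^{-θx}, x ≥ 0. This choice of density is an assumption made here for illustration; the argument above does not depend on it. Maximizing log L(x, θ) then gives the familiar estimate:

```latex
\begin{align*}
\log L(x,\theta) &= \sum_{i=1}^{n} \log\bigl(\theta e^{-\theta x_i}\bigr)
                  = n\log\theta - \theta\sum_{i=1}^{n} x_i, \\
\frac{\partial \log L(x,\theta)}{\partial\theta} &= \frac{n}{\theta} - \sum_{i=1}^{n} x_i = 0
\quad\Longrightarrow\quad
\hat{\theta} = \frac{n}{\sum_{i=1}^{n} x_i} = \frac{1}{\bar{x}}.
\end{align*}
```

Since the second derivative, -n/θ^{2}, is negative, this stationary point is a maximum of the likelihood.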
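Principle of Minimum Chi-square:
Let us consider that there are n classes and that Np_{1}, Np_{2},..., Np_{n} are the expected frequencies in these classes on the basis of the parameter θ, where N is the total frequency. Further, let Nq_{1}, Nq_{2},..., Nq_{n} be the observed frequencies in these n classes. Then we choose θ so as to minimize the divergence measure D(P:Q) or D(Q:P).
Let q_{i} = p_{i} + ε_{i}, where ε_{i} is very small.
Then \sum_{i=1}^{n} ε_{i} = 0, since \sum_{i=1}^{n} p_{i} = \sum_{i=1}^{n} q_{i} = 1.
We have, D(P:Q) = \sum_{i=1}^{n} p_{i} log \frac{p_{i}}{q_{i}} (3.5)
≈ \frac{1}{2} \sum_{i=1}^{n} \frac{(p_{i} - q_{i})^{2}}{q_{i}} (3.6)
Next, similarly we have
D(Q:P) = \sum_{i=1}^{n} q_{i} log \frac{q_{i}}{p_{i}} ≈ \frac{1}{2} \sum_{i=1}^{n} \frac{(q_{i} - p_{i})^{2}}{p_{i}} (3.7)
where terms of third and higher order in ε_{i} are neglected.
It may be pointed out here that (3.6) corresponds to the modified chi-square while (3.7) is the chi-square statistic. Thus, from (3.6) and (3.7) we can infer that θ is chosen to minimize either \sum_{i=1}^{n} (O_{i} - E_{i})^{2}/O_{i} or \sum_{i=1}^{n} (O_{i} - E_{i})^{2}/E_{i}, where O_{i} = Nq_{i} and E_{i} = Np_{i} are the observed and expected frequencies in the ith class and the E_{i}'s are functions of θ.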
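The quadratic approximation behind the chi-square forms can be checked numerically. The following sketch uses hypothetical proportions p and small perturbations ε that sum to zero (both invented for illustration) and compares each directed divergence with a quadratic form of the type (1/2) Σ (difference)² / denominator. The pairing of each divergence with a particular denominator is an assumption of this sketch; to second order in ε either denominator gives the same value.

```python
import math

def kl(a, b):
    """Directed divergence D(A:B) = sum_i a_i * log(a_i / b_i)."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def half_chi_square(a, b):
    """Quadratic form (1/2) * sum_i (a_i - b_i)^2 / b_i."""
    return 0.5 * sum((ai - bi) ** 2 / bi for ai, bi in zip(a, b))

p = [0.2, 0.3, 0.5]                       # hypothetical expected proportions
eps = [0.002, -0.001, -0.001]             # small perturbations summing to zero
q = [pi + ei for pi, ei in zip(p, eps)]   # hypothetical observed proportions

d_pq, approx_pq = kl(p, q), half_chi_square(p, q)   # denominators q_i
d_qp, approx_qp = kl(q, p), half_chi_square(q, p)   # denominators p_i

print(f"D(P:Q) = {d_pq:.6e}, quadratic form = {approx_pq:.6e}")
print(f"D(Q:P) = {d_qp:.6e}, quadratic form = {approx_qp:.6e}")
```

The agreement improves as the perturbations shrink, because the neglected terms are of third order in ε.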
Fisher’s Measure of Information (FMI)
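Let f(x, θ) = f and f(x, θ+∆θ) = g be two density functions; then the divergence measure of f from g is given by
D(f:g) = \int f log \frac{f}{g} dx
Since g = f + ∆θ \frac{\partial f}{\partial θ} + \frac{(∆θ)^{2}}{2} \frac{\partial^{2} f}{\partial θ^{2}} + ..., we have
D(f:g) ≈ -∆θ \int \frac{\partial f}{\partial θ} dx - \frac{(∆θ)^{2}}{2} \int \frac{\partial^{2} f}{\partial θ^{2}} dx + \frac{(∆θ)^{2}}{2} \int \frac{1}{f} \left(\frac{\partial f}{\partial θ}\right)^{2} dx (3.8)
Since \int f(x, θ) dx = 1, therefore
\int \frac{\partial f}{\partial θ} dx = 0 and \int \frac{\partial^{2} f}{\partial θ^{2}} dx = 0 (3.9)
(3.8) and (3.9) together give
D(f:g) ≈ \frac{(∆θ)^{2}}{2} \int \frac{1}{f} \left(\frac{\partial f}{\partial θ}\right)^{2} dx (3.10)
The integral \int \frac{1}{f} \left(\frac{\partial f}{\partial θ}\right)^{2} dx in (3.10) is called Fisher's information measure. It can be noted that Fisher's Measure of Information measures the power of discrimination or divergence between the two density functions f(x, θ) and f(x, θ+∆θ). Thus, the greater the value of FMI, the greater is the power of discrimination, or it can be said that it gives us more information about θ.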
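The local relation between directed divergence and Fisher's information can be illustrated with a case where both sides have closed forms. The sketch below assumes the exponential density f(x, θ) = θ e^{-θx}, for which the directed divergence between rates θ and θ' is log(θ/θ') + θ'/θ - 1 and Fisher's information is 1/θ². The density and the numerical values of θ and ∆θ are illustrative choices, not taken from the paper.

```python
import math

def kl_exponential(theta, theta_shifted):
    """Closed-form D(f:g) for exponential densities with rates theta and theta_shifted."""
    return math.log(theta / theta_shifted) + theta_shifted / theta - 1.0

def fisher_information_exponential(theta):
    """Fisher information of f(x, theta) = theta * exp(-theta * x), equal to 1 / theta^2."""
    return 1.0 / theta ** 2

theta, delta = 0.05, 0.001   # illustrative parameter value and small shift

exact = kl_exponential(theta, theta + delta)
approx = 0.5 * delta ** 2 * fisher_information_exponential(theta)

print(f"exact divergence D(f:g):       {exact:.6e}")
print(f"(1/2) * delta^2 * Fisher info: {approx:.6e}")
```

The two values agree to leading order, and the gap closes as ∆θ is made smaller.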
Fisher's Measure of Information differs in many respects from Shannon's [9] measure of information and the Kullback-Leibler measure of divergence. Shannon's measure of information gives us information about the probability density functions, while FMI gives information about the estimators of population parameters. When the interval is finite, FMI measures the directed divergence of f(x, θ) from f(x, θ+∆θ), while Shannon's measure gives the directed divergence of f(x, θ) from the uniform density function.
Fisher's Measure of Information gives the directed divergence of f(x, θ) from a density function depending on both f and θ, while Shannon's measure gives the directed divergence of f(x, θ) from a density function which is independent of both f and θ.
The Kullback-Leibler measure of directed divergence can discriminate between any two density functions f(x, θ) and g(x, θ), while FMI discriminates between f(x, θ) and f(x, θ+∆θ) only. Thus, these measures serve different purposes, and difficulty arises in deciding the relative merits of information measures when problems of discrimination are viewed in isolation. In a generalized model, these measures are considered in relation to the probability distribution and its moments.
4. Equivalence of classical and information theoretic methods of parameter estimation
In this section, we study the relations between traditional and information theoretic methods of parameter estimation and observe that in most cases these are equivalent.
Entropy optimization Principle and Laplace’s principle of insufficient reasoning
If the constraints are absent in Jaynes' Maximum Entropy Principle (MEP), then maximization of uncertainty gives the uniform distribution. Thus, the Laplace principle is a special case of MEP. However, Hadjisavvas [5] has shown that the MEP and the Minimum Discrimination Information (MDI) principles can be deduced from the principle of insufficient reasoning; thus MEP and MDI can be regarded as special cases of Laplace's principle, while Laplace's principle can be regarded as a particular case of the MDI principle when there are no constraints and the prior distribution is uniform.
Minimum discrimination Information and Maximum Likelihood principle
In section 3, a correspondence between the MDI and Fisher's maximum likelihood principle was established. Suppose we are given g(x); then we find f(x) which minimizes
\int f(x) log \frac{f(x)}{g(x)} dx = \int f(x) log f(x) dx - \int f(x) log g(x) dx (4.1)
and satisfies the given constraints. Alternatively, we may be given f(x) and have to find g(x), in which case we have to maximize
\int (log g(x)) f(x) dx = \int log g(x) dF(x), (4.2)
where F(x) is the cumulative distribution function of X. As shown in section 3, maximization of (4.2) corresponds to maximization of the likelihood function. Thus, the Maximum Likelihood Principle can be regarded as a special case of the MDI principle.
Entropy Optimization principle and Guiasu’s principle of Minimum Interdependence (PMI)
If the probability distributions of the individual random variables are included in the set of constraints, as the marginal probability distributions of the joint probability distribution, the PMI is equivalent to the MEP. PMI is also a particular case of Kullback's MDI principle if the a priori joint probability density function is the product of the densities of the n individual variables, i.e. the variables are a priori independent.
5. Estimation of parameter when interval proportions are given
In this section, we discuss the problem of parameter estimation in case proportions in different intervals are given.
Let us consider a random variate X over the interval [a,b] and let the random sample be arranged in order as
a = x_{0} < x_{1} < x_{2} < …< x_{i} < x_{i+1} < …< x_{n} < x_{n+1} = b (5.1)
so that the interval [a, b] is divided into (n + 1) subintervals, and Q_{0}, Q_{1},..., Q_{n} are the given proportions of the population in these (n + 1) subintervals.
Let us define a probability function over the subinterval (x_{i}, x_{i+1}) as
P_{i} = \int_{x_{i}}^{x_{i+1}} f(x, θ) dx, i = 0, 1, 2,..., n (5.2)
where θ is the population parameter.
Thus, (P_{0}, P_{1},..., P_{n}) gives us a probability distribution depending on θ. Now, we have to choose the parameter θ such that P_{0}, P_{1},..., P_{n} are as close as possible to the given Q_{0}, Q_{1},..., Q_{n}. This can be achieved by minimizing a measure of cross entropy or directed divergence. We can make use of any measure of cross entropy that gives rise to a convex function of θ. Here, we minimize the Kullback-Leibler measure of cross entropy,
D(Q:P) = \sum_{i=0}^{n} Q_{i} log \frac{Q_{i}}{P_{i}} = \sum_{i=0}^{n} Q_{i} log Q_{i} - \sum_{i=0}^{n} Q_{i} log P_{i} (5.3)
Minimization of (5.3) is the same as maximization of \sum_{i=0}^{n} Q_{i} log P_{i}. So, we have to maximize
\sum_{i=0}^{n} Q_{i} log P_{i} = \sum_{i=0}^{n} Q_{i} log \int_{x_{i}}^{x_{i+1}} f(x, θ) dx (5.4)
This principle has wide applications in estimating parameters when interval proportions are given to us, e.g. proportions of students in different intervals of marks obtained, proportions of failed equipment in different intervals of time, etc.
Let us consider the case when the distribution is exponential with unknown parameter θ, i.e. f(x, θ) = θ e^{-θx}, x ≥ 0. Then (5.4) reduces to maximizing
\sum_{i=0}^{n} Q_{i} log (e^{-θ x_{i}} - e^{-θ x_{i+1}}) (5.5)
The above principle is illustrated in the following example having randomly generated population data. We have simulated the results for different sizes of the random samples.
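Example: Let us consider a randomly generated population of size 50 (from an exponential distribution with mean = 20) with interval proportions as
Intervals:            0-10   10-20   20-30   30-40   40-50   >50
Frequency:            19     13      4       4       7       3
Q_{i} = Proportion:   0.38   0.26    0.08    0.08    0.14    0.06
Here
x_{0} = 0, x_{1} = 10, x_{2} = 20, x_{3} = 30, x_{4} = 40, x_{5} = 50, x_{6} = ∞,
and we choose θ which maximizes (5.5), i.e.
\sum_{i=0}^{5} Q_{i} log P_{i} = 0.38 log (1 - e^{-10θ}) + 0.26 log (e^{-10θ} - e^{-20θ}) + 0.08 log (e^{-20θ} - e^{-30θ}) + 0.08 log (e^{-30θ} - e^{-40θ}) + 0.14 log (e^{-40θ} - e^{-50θ}) + 0.06 log e^{-50θ}
= -15.2θ + 0.94 log (1 - e^{-10θ}) (5.6)
To maximize (5.6), we differentiate it w.r.t. θ and set the result equal to zero, which gives
-15.2 + \frac{9.4 e^{-10θ}}{1 - e^{-10θ}} = 0, i.e. θ = \frac{1}{10} log (1 + 9.4/15.2) ≈ 0.0481,
corresponding to an estimated mean 1/θ ≈ 20.8.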
The estimated value of the parameter is quite close to the population parameter value i.e. we have small bias.
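The maximization in this example can also be carried out numerically. The sketch below is not the authors' code. It assumes the exponential density θ e^{-θx}, interval edges 0, 10, 20, 30, 40, 50, ∞ (the edges implied by the coefficient -15.2 in (5.6)), and a golden-section search over a hypothetical range of θ; the function names are likewise invented for illustration.

```python
import math

def interval_probs(theta, edges):
    """P_i = exp(-theta * x_i) - exp(-theta * x_{i+1}) for an exponential density with rate theta."""
    return [math.exp(-theta * a) - math.exp(-theta * b)
            for a, b in zip(edges[:-1], edges[1:])]

def objective(theta, edges, props):
    """sum_i Q_i * log P_i, the quantity to be maximized over theta."""
    return sum(q * math.log(p) for q, p in zip(props, interval_probs(theta, edges)))

def golden_section_max(func, lo, hi, tol=1e-10):
    """Maximize a unimodal function on [lo, hi] by golden-section search."""
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if func(c) > func(d):
            hi = d      # the maximum lies in [lo, d]
        else:
            lo = c      # the maximum lies in [c, hi]
    return 0.5 * (lo + hi)

# Assumed interval edges and the proportions Q_i from the example.
edges = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, math.inf]
props = [0.38, 0.26, 0.08, 0.08, 0.14, 0.06]

# Search range for theta is an assumption (means between 1 and 10000).
theta_hat = golden_section_max(lambda t: objective(t, edges, props), 1e-4, 1.0)
print(f"estimated rate theta: {theta_hat:.5f}")
print(f"estimated mean 1/theta: {1.0 / theta_hat:.3f}")
```

Under these assumptions the numerical maximizer agrees with the closed form θ = (1/10) log (1 + 9.4/15.2) obtained by differentiating -15.2θ + 0.94 log (1 - e^{-10θ}). The same functions apply unchanged to other interval edges or proportions.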
Further, we can study the asymptotic behaviour of the estimator.
Table 1: Exponential Distribution (Mean=20)
Sample size   MLE      Estimate obtained by MEP   Absolute bias of MEP estimate
30            21       20.23                      0.23
200           21.55    21.414                     1.414
1000          21.8     20.18                      0.18
10000         21.492   19.84                      0.16
Table 2: Geometric Distribution with p=0.2
Sample size   MLE      Estimate obtained by MEP   Absolute bias of MEP estimate
30            0.1668   0.1955                     0.0045
200           0.1765   0.1933                     0.0067
1000          0.1677   0.2177                     0.0177
10000         0.1690   0.2207                     0.0207
Fig. 1 and Fig. 2 show the estimates obtained by MLE and MEP, and the bias, plotted against the sample size for the geometric and exponential distributions respectively.
Figure 1
Figure 2
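A simulation of the kind summarized in Table 1 can be sketched as follows. This is not the authors' simulation: the seed, the interval edges 0, 10,..., 50, ∞, the search range for θ and the helper name are all assumptions, so the printed numbers will not reproduce the table. For each sample size, the sketch draws an exponential sample with mean 20, computes the maximum likelihood estimate of the mean (the sample mean), and computes the estimate obtained by maximizing Σ Q_i log P_i from the observed interval proportions.

```python
import math
import random

def grouped_exponential_rate(sample, edges):
    """Rate theta maximizing sum_i Q_i log P_i, with Q_i the observed interval proportions."""
    n = len(sample)
    props = [sum(1 for x in sample if a <= x < b) / n
             for a, b in zip(edges[:-1], edges[1:])]

    def objective(theta):
        total = 0.0
        for q, a, b in zip(props, edges[:-1], edges[1:]):
            if q > 0.0:  # empty intervals contribute nothing to the sum
                total += q * math.log(math.exp(-theta * a) - math.exp(-theta * b))
        return total

    # Golden-section search; the objective is concave in theta.
    lo, hi = 1e-4, 1.0
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > 1e-10:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if objective(c) > objective(d):
            hi = d
        else:
            lo = c
    return 0.5 * (lo + hi)

random.seed(0)  # fixed seed so that the sketch is repeatable
edges = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, math.inf]
true_mean = 20.0

results = {}
for n in (30, 200, 1000, 10000):
    sample = [random.expovariate(1.0 / true_mean) for _ in range(n)]
    mle_mean = sum(sample) / n
    mep_mean = 1.0 / grouped_exponential_rate(sample, edges)
    results[n] = (mle_mean, mep_mean)
    print(f"n={n:5d}  MLE={mle_mean:7.3f}  MEP={mep_mean:7.3f}  "
          f"|bias|={abs(mep_mean - true_mean):.3f}")
```

Repeating the loop over many seeds and averaging would give a steadier picture of how the bias changes with sample size than a single run does.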
References
[1] H. Akaike, Information-theoretical considerations on estimation problems. Information and Control, 19(3) (1971), 181-194.
[2] J.P. Burg, The relationship between maximum entropy spectra and maximum likelihood spectra. In D.G. Childers, Editor, Modern Spectral Analysis. (1972), 130-131.
[3] R.A. Fisher, On the mathematical foundations of theoretical Statistics. Phil. Trans. Roy. Soc. 222(A), (1922), 309-368.
[4] A. J. Greenwood, N. C. Matalas, J.R. Wallis, Probability weighted moments: Definition and relation to parameters of several distributions expressible in inverse form. Water Resources Research. 15(5), (1979), 1049-1055.
[5] N. Hadjisavvas, The maximum entropy principle as a consequence of the principle of Laplace. J. Stat. Phys. 26, (1981), 807-815.
[6] J.N. Kapur, Maximum Entropy Models in Science and Engineering. Wiley Eastern, New Delhi. 1989.
[7] S. Kullback, R.A. Leibler, On Information and Sufficiency. Ann. Math. Stat. 22, (1951), 79-86.
[8] N.C. Lind, Solana, Cross Entropy Estimation of Random Variables with Fractile constraints. Paper no. 11, (1988), Institute for Risk Research, University of Waterloo, Canada.
[9] C.E. Shannon, A Mathematical Theory of Communication. Bell System Tech. J. 27, (1948), 379-423.