Home

Department of Statistics, University of Jammu, Jammu, Jammu and Kashmir, 180006, INDIA.

Abstract: Let be the probability density function of a random variable X, where functional form of pdf is known except for the parameter. This parameter can be a scalar or a vector. One of the most important tasks in statistical inference is of estimating on basis of a random sample drawn from the population. The traditional methods of parameter estimation are methods of moments, least squares, minimum chi-square, maximum likelihood, minimum distance and recent one called method of probability weighted moment due to Greenwood et al [4]. Amongst all methods, Fisher [3] method of maximum likelihood is widely accepted and is considered as one of the best method for parameter estimation. Akaike [1] work paved the way for the information theoretic approach in parameter estimation. Lind and Solana [8] method is based on the principle of least information. Kapur [6] compared the Gauss’ method of estimation with a method based on the principle of maximum entropy. In the present communication we have used Parameter estimation methods using entropy optimization principles and compare these with classical methods such as method of moments and method of m.l.e. The basic principle is that, subject to the information available we should choose in such a way that the entropy is as large as possible or the distribution as nearly uniform as possible. We have also derived some parameter estimation methods from entropy optimization principles, while their relation among methods of parameter estimation is also discussed. Further, the asymptotic behaviour of the estimator is also studied for exponential and geometric distribution.

Let f(x, θ) be the probability density function of a random variable X, where functional form of probability density function is known except for the parameter θ. This parameter θ can be a scalar or a vector quantity. One of the most important task in statistical inference is of estimating the parameter θ on basis of a random sample (x₁, x₂,…,x_n) drawn from the population. The most commonly used traditional methods of parameter estimation are: methods of moments, least squares, minimum chi-square, maximum likelihood, minimum distance and recent one called method of probability weighted moment due to Greenwood et al [4]. Amongst all these methods, Fisher [3] method of maximum likelihood is widely accepted, often used and is considered as one of the best method for parameter estimation. But with the growth of information theoretic methods in Statistics, efforts were made by researchers in using the information theory in estimating the parameters and other problems.

Akaike [1] work paved the way for the information theoretic approach in parameter estimation. This paper gave the direction to researchers not only to estimate parameters but also of the model building. Further development took place for estimation when the information is not complete. Lind and Solana [8] method is based on the principle of least information. Kapur [6] compared the Gauss’ method of estimation with a method based on the principle of maximum entropy. In this paper, we present a critical appraisal of parameter estimation methods using entropy optimization principles and compare these with classical methods such as method of moments and method of maximum likelihood. The basic principle is that, subject to the information available we should choose θ in such a way that the entropy is as large as possible or the distribution as nearly uniform as possible. In section 2, we discuss the problem of parameter estimation using maximum entropy principle. In section 3, we derive some parameter estimation methods from entropy optimization principles, while their relation among methods of parameter estimation is discussed in section 4. In section 5, we discuss method of parameter estimation using entropy optimization principle when population proportions are given and the asymptotic behaviour of the estimator is also studied for exponential and geometric distribution.

2. Maximum Entropy Principle in Parameter Estimation

In this section, we shall discuss the problem of parameter estimation using entropy optimization principle when along with the known form of density function, a random sample from the population is also given. Let us consider f(x, θ) as the given functional form of probability density estimation and we have to estimate the parameter θ for a given random sample x₁,x₂,..., x_n from the population. Fisher [3] suggested the method of maximum likelihood i.e. θ should be chosen such that it maximizes the likelihood function

Where f (x_i, θ) is the value of pdf at X = x_i. For making p_i's as equal as possible, we choose parameter θ such that it maximizes Burg’s[2] measure of entropy for this distribution. However, it may be noted that we can use any measure of uncertainty. Burg’s entropy measure for probability distribution (p₁, p₂......., p_n; p_i > 0; =1) is given by

For maximizing (2.5) w.r.t. θ, we put the first derivative of (2.4) w.r.t.θ equal to zero and thus we get

Since f (x_i,θ) is not independent of θ, therefore (2.5) and (2.6) will give different estimates of θ.

It is worth mentioning here that f(x₁,θ), f(x₂,θ),..., f(x_n,θ) are not probabilities. Actually, these are the values of pdf at . Their sum is not necessarily unity or independent of θ as x₁, x₂,..., x_n represents only a random sample and not all the values which the variate X can take.

3. Principles of Entropy Optimization, Maximum Likelihood and Minimum Chi-Square

In this section, we discuss the conventional estimation methods vis-a-vis entropy optimization principle.

Let be a random sample from a population with pdf f(x,θ). We choose
or estimate parameter θ in terms of the sample values such that it maximizes likelihood function. But according to Maximum Entropy Principle, we choose the value of θ such that the uncertainty that remains after the sample values are known is as large as possible. Or, we can say that the entropy of the sample itself has to be a minimum. Thus, the sample entropy is given by

Thus, we choose such that it minimizes the entropy of the sample or maximizes the likelihood function. It implies that maximum entropy principle leads to the principle of maximum likelihood.

Now let us consider as the cumulative density function of the second distribution in case of minimum cross entropy principle. We shall choose such that for this value of the distribution function f(x, ) is as close as possible to the distribution function determined by the random sample x₁, x₂,...., x_n.

Thus, Minimum Discrimination Information Statistic based on Kullback Leibler [7] measure is

Equation (3.3.) attains minimum when its second part is maximum. It means, we choose θ which can maximize

Hence, we choose θ such as to maximize L(x, θ). Thus, both Maximum Entropy and Minimum Cross Entropy Principles lead to Maximum Likelihood Principle.

Let us consider that there are n classes and be the expected frequencies on the basis of parameter θ in these classes, where N is total frequency. Further, we consider that are the observed frequencies in these n classes. Then we choose θ so as to minimize divergence measure D (P:Q) or D (Q:P).

It may be pointed here that (3.6) corresponds to modified chi-square while (3.7) is chi-square statistic. Thus, from (3.6) and (3.7) we can infer that θ is chosen to minimize either or , where O_i and E_i are observed and expected frequencies in the ith class and E_i’s are function of θ.

Let f(x, θ) = f and f (x, θ+∆θ) = g, be the two density functions, then divergence measure of f from g is given by

in (3.10) is called Fisher's information measure. It can be noted Fisher’s Measure of Information measures the power of discrimination or divergence between two density functions and .Thus, greater the value of FMI, greater is the power of discrimination or it can be said that it gives us more information about θ.

Fisher’s Measure of Information is different in many aspects from Shannon’s measure of information and Kullback-Leibler’s measure of divergence. Shannon’s measure of information gives us information about the probability density functions while FMI gives information about the estimators of population parameters. When interval is finite FMI measures the directed divergence of from ,while Shannon’s measure gives the directed divergence of f (x, θ) from uniform density function.

Fisher’s Measure of Information gives directed divergence of f (x, θ) from density function depending on both f and q, while Shannon's measure gives the directed divergence of f(x, θ) from a density function which is independent of both f and q.

The Kullback-Leibler measure of directed divergence can discriminate between any two density functions f (x, θ) and g(x, θ) while FMI discriminate between f (x, θ) and only. Thus, these measures have different purposes, while deciding the relative merits of information measures difficulty arises when the problems of discriminate are viewed in isolation. In generalized model, these measures are considered in relation with the probability distribution and their moment.

4. Equivalence of classical and information theoretic methods of parameter estimation

In this section, we have studied the relations between traditional and information theoretic methods of parameter estimation and observe that in most of cases these are equivalent.

Entropy optimization Principle and Laplace’s principle of insufficient reasoning

If the constraints are absent in Jaynes', Maximum Entropy Principle (MEP), and then maximization of uncertainty gives the uniform distribution. Thus, the Laplace principle is a special case of MEP. However, Hadgiswas [5] has shown that the MEP and the MDI principles can be deduced from the principle of insufficient reasoning and thus, MEP and MDI can be regarded as the special case of Laplace’s principle, while Laplace’s principle can be regarded as a particular case of MDI principle when there are no constraints and the prior distribution is uniform.

In section 4.3, a correspondence between the MDI and Fisher’s maximum likelihood principle has been established. Suppose we are given g(x) then we find f(x) which minimizes

and satisfies the given constraints or we may be given f(x) and have to find g(x) so that we have to maximize

(log g(x)) f(x) dx = log g(x) dF(x), (4.2) where F(x) is the cumulative distribution function of X. In section 3 we have shown that maximization of (4.2) correspond to maximization of the likelihood function. Thus, Maximum Likelihood Principle can be regarded as a special case of MDI principle.

Entropy Optimization principle and Guiasu’s principle of Minimum Interdependence (PMI)

If the probability distributions of the individual random variables are included in the set of constraints, as the marginal probability distributions of the joint probability distribution, the PMI is equivalent to the MEP. PMI is also a particular case of Kullback’s MDI principle if a priori joint probability density function is the independent product density of n individual variables.

In this section, we discuss the problem of parameter estimation in case proportions in different intervals are given.

Let us consider a random variate X over the interval [a,b] and let the random sample be arranged in order as

So that the interval [a, b] is divided into (n +1) subintervals and Q₀, Q_1,... ,Q_n are the given proportions of the population in these (n + 1) subintervals.

Thus, (P₀, P₁, ...., P_n) gives us a probability distribution depending on θ. Now, we have to choose parameter θ such that P₀, P₁,..., P_n are as close as possible to given Q₀, Q₁,..., Q_n. This can be achieved by minimizing the measure of cross entropy or directed divergence. We can make use of any measure of cross entropy that gives rise to a convex function of θ. But here, we minimize the Kullback Leibler measure of cross entropy,

This principle have wide applications in estimating parameters when interval propertions are given to us, e.g. proportions of students in different intervals of marks obtained, proportion of failed equipments in different intervals of time etc.

Let us consider the case when f(x_i,θ), functional form of distribution is exponentially distributed with unknown parameter θ. Then, (4.5.4) reduces to maximize

The above principle is illustrated in the following example having randomly generated population data. We have simulated the results for different sizes of the random samples.

Example:Let us consider a randomly generated population of size 50 (from exponential distributed with mean = 20) with interval proportions as

To maximize (5.6), differentiate it w.r.t. θ and put the resultant form equal to zero, we get

The estimated value of the parameter is quite close to the population parameter value i.e. we have small bias.

Sample size	MLE	Estimates obtained by MEP	Bias
30	21	20.23	-0.23
200	21.55	21.414	-1.414
1000	21.8	20.18	-0.18
10000	21.492	19.84	0.16

Fig.1 and Fig.2 shows the graph between the sample size and the estimates obtained by MLE, MEP and bias for geometric and exponential distribution respectively.

Information theoretic approach in Parameter Estimation

2. Maximum Entropy Principle in Parameter Estimation