Lindley-Shannon Information for Comparison of Priors Under Paired Comparisons Model

One of the main differences between classical statistics and Bayesian statistics is that the latter can utilize prior information in a formal way. This information is quantified as a probability distribution, known as the prior distribution. If no relevant prior information is available, there are ways to derive a 'non-informative' prior distribution. In this study, the informative and non-informative prior distributions for the parameters of the Bradley-Terry model are compared through the Lindley-Shannon information.


Introduction
A Bayesian analysis uses prior information expressed as a probability distribution. If no relevant prior information is available, there are ways to derive a 'non-informative' prior distribution. In Section 2, the Bradley-Terry model for paired comparisons is defined. Section 3 presents the informative prior distribution for the parameters. The non-informative prior distribution from the Jeffreys rule is derived in Section 4. The comparison of the non-informative and informative priors through the Lindley-Shannon information (Lindley, 1956; Shannon, 1948) is discussed in Section 5. Conclusions are drawn in Section 6. Graphs of the non-informative priors are also drawn in this study.

The Bradley-Terry Model for Paired Comparison
It may be difficult for a panelist to rank or compare more than two objects or treatments (m>2) at the same time, especially when the differences between objects are small or the criteria are rather subtle. For this reason, paired comparison data (m=2) are sometimes regarded as more reliable and can be obtained more readily from panelists. In a paired comparison trial, a panelist is given a pair of objects and asked to say which is preferable with respect to a given attribute. This is repeated for several pairs. The method is widely used in industry for assessing customer preference and for designing products using trained panelists. David (1988) gives a detailed survey of the literature and references concerning the method of paired comparisons. Further literature can be found in Gilani and Abbas (2008), Abbas (2010) and Altaf et al. (2011).
Consider a paired comparison trial with m treatments or objects $T_1, T_2, \ldots, T_m$. There are $m(m-1)/2$ possible pairs of the m treatments. Each pair $(T_i, T_j)$ is ranked $r_{ij}$ times independently. Let $\theta_1, \theta_2, \ldots, \theta_m$ be the 'true' ratings (or preferences) of the m treatments on a subjective continuum. A model can be based on the idea that when a panelist is confronted with treatment $T_i$, she responds with an 'unconscious' or 'latent' variable $X_i$. The assumed mechanism is that she prefers treatment $T_i$ to $T_j$ if $X_i > X_j$. Bradley and Terry (1952) propose a model of paired comparisons for the trial described above. The Bradley-Terry model implies that the difference between two latent variables $(X_i - X_j)$ has a logistic (squared hyperbolic secant) density with location parameter $(\ln\theta_i - \ln\theta_j)$ when the treatments $T_i$ and $T_j$ are compared.
The Bradley-Terry model is defined by
$$P(T_i \text{ is preferred to } T_j) = \frac{\theta_i}{\theta_i + \theta_j}, \qquad i \neq j. \tag{1}$$
The following notation is used in the comparisons: $n_{ijk}$ = 1 or 0 according as treatment $T_i$ is preferred to treatment $T_j$ or not in the k'th repetition ($k = 1, 2, \ldots, r_{ij}$) of the comparison; $r_{ij}$ = the number of times treatment $T_i$ is compared with treatment $T_j$. We constrain the parameters of the model to be positive and to sum to unity, $\theta_i > 0$ and $\sum_{i=1}^{m} \theta_i = 1$; these conditions ensure that the parameters are well defined, i.e. identifiable.
The probability of the observed result in the k'th repetition of the pair $(T_i, T_j)$ is
$$\left(\frac{\theta_i}{\theta_i + \theta_j}\right)^{n_{ijk}} \left(\frac{\theta_j}{\theta_i + \theta_j}\right)^{1 - n_{ijk}} = \frac{\theta_i^{\,n_{ijk}}\, \theta_j^{\,1 - n_{ijk}}}{\theta_i + \theta_j}. \tag{2}$$
Hence, the likelihood function of the observed outcome x, which represents the data of the trial, is equal to
$$l(x; \theta) = \prod_{i<j} \prod_{k=1}^{r_{ij}} \frac{\theta_i^{\,n_{ijk}}\, \theta_j^{\,1 - n_{ijk}}}{\theta_i + \theta_j}. \tag{3}$$
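As an illustration, the log of the Bradley-Terry likelihood can be evaluated directly from a table of preference counts. The sketch below is not taken from the original study: the function name `bt_log_likelihood` and the layout of the `wins` matrix (entry `wins[i][j]` is the number of times treatment i was preferred to treatment j) are assumptions made for the example.

```python
import math


def bt_log_likelihood(theta, wins):
    """Log-likelihood of the Bradley-Terry model.

    theta : list of positive ratings, one per treatment.
    wins  : square matrix (hypothetical layout); wins[i][j] is the number
            of times treatment i was preferred to treatment j.
    """
    m = len(theta)
    log_lik = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            n_ij = wins[i][j]          # i preferred to j
            n_ji = wins[j][i]          # j preferred to i
            r_ij = n_ij + n_ji         # number of comparisons of the pair
            log_lik += (n_ij * math.log(theta[i])
                        + n_ji * math.log(theta[j])
                        - r_ij * math.log(theta[i] + theta[j]))
    return log_lik


# Two treatments: T1 preferred three times, T2 once.
example = bt_log_likelihood([0.6, 0.4], [[0, 3], [1, 0]])
```

Working on the log scale avoids underflow when the number of comparisons is large; exponentiating the result gives the likelihood itself.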

The Choice of an Informative Prior Distribution
The appropriate model of paired comparison having been selected and the likelihood thereby determined, it is necessary to choose a prior distribution for the parameters of the model. Typically this is supposed to belong to a parametric family, so that the problem reduces to assessing the parameters of that family. To distinguish them from the parameters in the likelihood, they are termed hyperparameters. We consider the Bradley-Terry model with likelihood (3). The prior is supposed to be a member of the Dirichlet family with density
$$p(\theta) = \frac{1}{B(a_1, a_2, \ldots, a_m)} \prod_{i=1}^{m} \theta_i^{\,a_i - 1}, \qquad \theta_i > 0,\ \sum_{i=1}^{m} \theta_i = 1, \tag{4}$$
where $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$, $B(a_1, a_2, \ldots, a_m)$ stands for a generalization of the beta function and $a_i$ $(i = 1, 2, \ldots, m)$ are the hyperparameters.
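The Dirichlet density can be evaluated with the standard identity for the generalized beta function, $B(a_1, \ldots, a_m) = \prod_i \Gamma(a_i) / \Gamma(\sum_i a_i)$. The following is a minimal sketch under that identity; the function name `dirichlet_log_density` is an assumption introduced for the example and does not come from the original study.

```python
import math


def dirichlet_log_density(theta, a):
    """Log-density of a Dirichlet distribution with hyperparameters a at theta."""
    if len(theta) != len(a):
        raise ValueError("theta and a must have the same length")
    if any(t <= 0.0 for t in theta) or abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("theta must be positive and sum to one")
    # log of the generalized beta function B(a_1, ..., a_m)
    log_b = sum(math.lgamma(a_i) for a_i in a) - math.lgamma(sum(a))
    kernel = sum((a_i - 1.0) * math.log(t) for a_i, t in zip(a, theta))
    return kernel - log_b


# With all hyperparameters equal to one the density is flat on the simplex.
flat = math.exp(dirichlet_log_density([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))
```

For three treatments with all hyperparameters equal to one the density is constant and equal to 2, the reciprocal of the area of the region $\{\theta_1, \theta_2 > 0,\ \theta_1 + \theta_2 < 1\}$.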

The Choice of a Non-informative Prior Distribution
When prior elicitation is difficult or little prior information is available, one may consider an analysis with conventionally chosen priors that reflect little prior information. These priors are known as non-informative, indifference, vague or reference priors. Berger (1985) argues that Bayesian analysis using non-informative priors is the single most powerful method of statistical analysis, in the sense of being the ad hoc method most likely to yield a sensible answer. This topic has an extensive literature, e.g. Jeffreys (1946), Bernardo (1979), Ghosh and Mukerjee (1992), Kass (1990), Clarke and Wasserman (1993) and Aslam (1995).

The Non-informative Prior from 'Jeffreys Rule'
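This is defined as the density of the parameters proportional to the square root of the determinant of the Fisher information matrix. Symbolically, let $\theta = (\theta_1, \ldots, \theta_m)^t$ be a vector of parameters; the Jeffreys prior distribution $p_J(\theta)$ is defined as
$$p_J(\theta) \propto \sqrt{\det\{I(\theta)\}}, \tag{5}$$
where 'det' denotes the determinant and $I(\theta)$ is the $(m \times m)$ Fisher information matrix, whose $(i, j)$th element is obtained as
$$I_{ij}(\theta) = -E\left[\frac{\partial^2 \ln l(x; \theta)}{\partial\theta_i\, \partial\theta_j}\right], \tag{6}$$
where E denotes expectation over the data and i and j index the rows and columns of the matrix, respectively.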

Properties of the Non-informative Jeffreys Prior
The Jeffreys prior has several properties that make it an attractive non-informative prior. It is invariant with respect to a one-to-one transformation of the parameters, in the sense that we get consistent answers in any parameterization. The Jeffreys prior works well in one-dimensional problems but needs ad hoc modification for the multi-dimensional case, as suggested by Jeffreys (1946) himself. Bernardo (1979) shows that the Jeffreys prior is an appropriate reference prior if there are no parameters which are regarded as nuisance parameters and if the (joint) posterior distribution of all the parameters is asymptotically normal. Kass (1990) explains that a main feature of the Jeffreys prior is that it is a uniform measure in an information metric, which can be regarded as the natural metric for statistical inference. Another important aspect of the Jeffreys prior is that it is not affected by a restriction on the parameter space. Some applications of the Jeffreys prior are given in Ibrahim and Laud (1991) and Poirier (1994).

The Non-informative Jeffreys Prior for the Parameters of the Model
The appropriateness of the Jeffreys prior requires that there should be no parameters which are regarded as nuisance parameters and that the posterior distribution of all the parameters should be asymptotically normal. The likelihood function (3) of the model can be represented in the following form: (7). The likelihood function (7) belongs to the exponential family, and it follows from the regularity conditions of Johnson and Ladalla (1991) that the posterior distribution is asymptotically normal. Here no parameter is regarded as a 'nuisance' parameter, so the Jeffreys prior is the appropriate choice of non-informative prior; hereafter it will be called the NJP (non-informative Jeffreys prior) distribution.
The NJPs of the parameters for the case of two, three and four treatments in the model are derived.Some transformations and techniques are applied to understand and to find a simple symmetric closed form for the case of three and four treatments.

Case (i) When the number of treatments m=2:
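In this case, the treatment parameters are $\theta_1$ and $\theta_2$. Consider $\theta_1 = \theta$ and $\theta_2 = 1 - \theta$. Now the likelihood function for the observed outcomes x from (3) is
$$l(x; \theta) = \theta^{\,n_{12}} (1 - \theta)^{\,r_{12} - n_{12}}, \qquad 0 < \theta < 1, \tag{8}$$
where $n_{12} = \sum_{k=1}^{r_{12}} n_{12k}$ is the number of times $T_1$ is preferred to $T_2$. This is the binomial model $n_{12} \sim \mathrm{Bin}(r_{12}, \theta)$.
The Fisher information for the likelihood function (8) is obtained as
$$I(\theta) = \frac{r_{12}}{\theta(1 - \theta)}. \tag{9}$$
The NJP $p_J(\theta)$ for the parameter $\theta$ of the model is
$$p_J(\theta) = \frac{1}{\pi\sqrt{\theta(1 - \theta)}}, \qquad 0 < \theta < 1. \tag{10}$$
The graph of the NJP (10) is presented in Figure 1.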
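The steps from the binomial likelihood to the NJP for two treatments can be written out as follows. This is a standard derivation, given here as a worked sketch; $n_{12}$ denotes the number of times $T_1$ is preferred to $T_2$ in $r_{12}$ comparisons, and $\theta$ is the rating of $T_1$ with $1 - \theta$ the rating of $T_2$.

```latex
\begin{align*}
\ln l(x;\theta) &= n_{12}\ln\theta + (r_{12}-n_{12})\ln(1-\theta), \\
\frac{\partial^2 \ln l(x;\theta)}{\partial\theta^2}
  &= -\frac{n_{12}}{\theta^2} - \frac{r_{12}-n_{12}}{(1-\theta)^2}, \\
I(\theta) &= -E\!\left[\frac{\partial^2 \ln l(x;\theta)}{\partial\theta^2}\right]
  = \frac{r_{12}\theta}{\theta^2} + \frac{r_{12}(1-\theta)}{(1-\theta)^2}
  = \frac{r_{12}}{\theta(1-\theta)}, \\
p_J(\theta) &\propto \sqrt{I(\theta)} \propto \theta^{-1/2}(1-\theta)^{-1/2},
\qquad
\int_0^1 \theta^{-1/2}(1-\theta)^{-1/2}\,d\theta = B\!\left(\tfrac12,\tfrac12\right) = \pi .
\end{align*}
```

The expectation uses $E(n_{12}) = r_{12}\theta$. The number of comparisons $r_{12}$ is absorbed into the normalizing constant, so the NJP is the beta distribution with both hyperparameters equal to 1/2.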
Case (ii) When the number of treatments m=3:
Here the treatment parameters are $\theta_1$, $\theta_2$ and $\theta_3$. Using the constraint $\theta_1 + \theta_2 + \theta_3 = 1$, the NJP is derived for the parameters $\theta_1$ and $\theta_2$. A program is designed in the MAPLE package to obtain the NJP for an unbalanced design, in which the numbers of comparisons for the pairs are not equal. The resulting form of the NJP is equation (12). If a balanced design is considered, i.e. $r_{ij} = r$ $(i < j = 1, 2, 3)$, then r can be absorbed into the normalizing constant, giving the NJP (13), where C = 0.052798 is evaluated by numerical integration. It is noted that the parameters $\theta_1$ and $\theta_2$ are exchangeable, which is clear algebraically in (13).
The closed forms for the marginal NJPs of $\theta_1$ and $\theta_2$ are intractable, so we evaluate them numerically. Thus the marginal NJP of $\theta_1$ is
$$p_J(\theta_1) = \int_0^{1 - \theta_1} p_J(\theta_1, \theta_2)\, d\theta_2. \tag{14}$$
Similarly, the marginal Jeffreys prior of $\theta_2$ can be obtained. The graph of the marginal NJP of $\theta_1$ is given in Figure 2.
We have found the marginal density of the NJP out of curiosity and to see whether its shape is believable as a consequence of the NJP. For the overall analysis of the data, the calculation of these marginals is not necessary. Some transformations and techniques are used to look for a simple closed form for the marginal NJP of $\theta_1$ (see Appendix 1). We conclude that although the graph of the marginal density of $\theta_1$ has a simple appearance, it does not appear to be expressible in closed form or to have any reasonably elegant or simple characterization.
Case (iii) When the number of treatments m=4:
The likelihood for the case of four treatments can be written from (3). The treatment parameters are $\theta_1$, $\theta_2$, $\theta_3$ and $\theta_4$, and the NJP for the parameters $\theta_1$, $\theta_2$ and $\theta_3$ is derived with the identifiability constraint $\theta_1 + \theta_2 + \theta_3 + \theta_4 = 1$. A program similar to that for the case of three treatments is written in the MAPLE package to obtain the NJP for a balanced design. The derived NJP (given in Appendix 3) has a very complicated form and does not seem easy to apply.
However, we can determine the NJP numerically via numerical partial differentiation of the log-likelihood function. A program for computing the NJP numerically is written in the SAS package and can be seen in Aslam (1995).
The graph of the marginal NJP (without a constant of proportionality) for the parameter $\theta_1$, obtained through the program in the SAS package, is given in Figure 3.
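The unnormalized NJP can also be sketched in a few lines without symbolic algebra or numerical differentiation. The example below is an illustration under stated assumptions, not the MAPLE or SAS programs referred to above: it assumes a balanced design with r comparisons per pair, uses the fact that the number of preferences within a pair is binomial to write the expected information of each pair analytically, and then eliminates the last parameter through the sum-to-one constraint, so the information matrix has dimension m-1. The function names are hypothetical.

```python
import math


def full_information(theta, r=1.0):
    """Expected information for all m ratings, ignoring the constraint."""
    m = len(theta)
    g = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            s2 = (theta[i] + theta[j]) ** 2
            g[i][i] += r * theta[j] / (theta[i] * s2)
            g[j][j] += r * theta[i] / (theta[j] * s2)
            g[i][j] -= r / s2
            g[j][i] -= r / s2
    return g


def constrained_information(theta_free, r=1.0):
    """Information for the first m-1 ratings; the last is 1 - sum(theta_free)."""
    theta = list(theta_free) + [1.0 - sum(theta_free)]
    g = full_information(theta, r)
    last = len(theta) - 1
    return [[g[a][b] - g[a][last] - g[last][b] + g[last][last]
             for b in range(last)] for a in range(last)]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


def jeffreys_unnormalized(theta_free, r=1.0):
    """Square root of the determinant of the constrained information matrix."""
    return math.sqrt(determinant(constrained_information(theta_free, r)))


# Three treatments: swapping theta_1 and theta_2 leaves the value unchanged.
value = jeffreys_unnormalized([0.2, 0.5])
swapped = jeffreys_unnormalized([0.5, 0.2])
```

For two treatments the function reduces to $1/\sqrt{\theta(1-\theta)}$, and for three treatments it is symmetric in $\theta_1$ and $\theta_2$, in line with the exchangeability noted for the balanced design. The normalizing constant is not computed here; it would still require numerical integration over the simplex.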

Figure 1. NJP for the parameter $\theta$ of the model when m=2 (vertical axis: $p_J(\theta)$).

Comparison of the NJP and the Dirichlet Prior
The NJP and the informative Dirichlet prior distribution are compared through the Lindley-Shannon information (LSI) (Lindley, 1956; Shannon, 1948); see Appendix 2 for the definition of the LSI. Our choice of informative prior for the parameters belongs to the Dirichlet family, so for comparison with the NJP we choose the Dirichlet distribution. The LSI of the Dirichlet distribution is computed for values of the hyperparameters from 0 to 5. We consider this range because the range of Lindley-Shannon information generated includes the LSI of the NJP in each case (m=2, 3, 4). The LSI is used because it is a fundamental measure of information with an axiomatic justification for its choice.

(a) The number of treatments m=2
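The LSI for the NJP (10) has the following form:
$$I_J = \int_0^1 p_J(\theta) \ln p_J(\theta)\, d\theta = -\int_0^1 \frac{\ln\{\pi\sqrt{\theta(1 - \theta)}\}}{\pi\sqrt{\theta(1 - \theta)}}\, d\theta. \tag{15}$$
Direct numerical integration of (15) converges very slowly because the integrand is unbounded at both end points. We use the transformation $\theta = (1 + e^{-z})^{-1}$ for fast convergence; the LSI is then
$$I_J = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{e^{-z/2}}{1 + e^{-z}} \ln\left\{\frac{1 + e^{-z}}{\pi e^{-z/2}}\right\} dz. \tag{16}$$
The integral (16) is evaluated numerically, giving $I_J$ = 0.24156; this agrees with $\ln(4/\pi) \approx 0.24156$, the value implied by the known entropy of the B(0.5, 0.5) distribution. The LSI for the informative prior beta distribution $B(\alpha, \beta)$ with equal hyperparameters ($\alpha = \beta = c$) from 0.1 to 5 is given in the table below. The LSI of the NJP corresponds to c = 0.5 in B(c, c). We note that although the NJP satisfies an invariance property, the uniform distribution B(1, 1) has a smaller LSI. Generally, we can conclude that a B(c, c) prior with c between 0.5 and 1.5 represents very little prior information. This indicates that the NJP is a reasonable choice from the standpoint of the LSI and is compatible with other 'similar' contenders.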
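The numerical evaluation of the LSI for a symmetric beta prior can be sketched as follows. This is an illustrative reimplementation, not the authors' program: it assumes the LSI is the integral of $p(\theta)\ln p(\theta)$ with natural logarithms, applies the logistic substitution $\theta = 1/(1 + e^{-z})$ described above, and uses Simpson's rule on a truncated range. The function name, the truncation rule and the number of steps are assumptions chosen for the example.

```python
import math


def lsi_beta_symmetric(c, steps=20000):
    """LSI of the beta distribution B(c, c), via the logistic substitution.

    steps must be even (Simpson's rule).
    """
    log_beta = 2.0 * math.lgamma(c) - math.lgamma(2.0 * c)
    # The transformed integrand decays like exp(-c * |z|); truncate accordingly.
    z_max = min(60.0 / c, 700.0)

    def integrand(z):
        # q = ln(theta * (1 - theta)) with theta = 1 / (1 + exp(-z))
        q = -(math.log1p(math.exp(-z)) + math.log1p(math.exp(z)))
        log_p = (c - 1.0) * q - log_beta
        # p(theta) * ln p(theta) * dtheta/dz, where dtheta/dz = theta * (1 - theta)
        return math.exp(c * q - log_beta) * log_p

    h = 2.0 * z_max / steps
    total = integrand(-z_max) + integrand(z_max)
    for k in range(1, steps):
        weight = 4.0 if k % 2 else 2.0
        total += weight * integrand(-z_max + k * h)
    return total * h / 3.0


lsi_njp = lsi_beta_symmetric(0.5)      # NJP for m=2, i.e. B(0.5, 0.5)
lsi_uniform = lsi_beta_symmetric(1.0)  # uniform prior B(1, 1)
```

On the z scale the integrand is bounded and decays exponentially, which is why the substitution converges quickly where direct integration over (0, 1) does not. Looping the function over a grid of c values gives the kind of table of LSI values described in the text.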

(b) The number of treatments m=3
The LSI of the NJP (13) is compared with the LSI of the prior Dirichlet distribution $D(a_1, a_2, a_3)$ for equal hyperparameters ($a_1 = a_2 = a_3 = c$). It is found that the LSI is equal to 1.46817 for the NJP, whereas the LSI for the prior Dirichlet distribution is given in the table below. Comparison of the LSI for a D(c, c, c) distribution with the value 1.46817 of the NJP indicates that the Dirichlet distribution is a reasonable choice when there is little prior information for values of c between 0.5 and 4.

(c) The number of treatments m=4
Now consider the Dirichlet distribution D(c, c, c, c) with equal hyperparameters for the case m=4. The LSI for values of c from 0.1 to 5 is given in the table. The LSI of the NJP is found to be 2.53346. This shows that the Dirichlet distribution D(c, c, c, c) is a sensible choice for a non-informative prior when c < 2.5.

Conclusion
For the two-parameter case, it is concluded that a prior beta distribution B(c, c) with hyperparameter c between 0.5 and 1.5 represents very little prior information. This indicates that the NJP is a reasonable choice from the standpoint of the LSI and is compatible with other 'similar' contenders. When three parameters are considered, the LSI for the prior Dirichlet distribution D(c, c, c) indicates that the Dirichlet distribution is a reasonable choice for values of c between 0.5 and 4. For the four-parameter case, the LSI shows that the prior Dirichlet distribution D(c, c, c, c) is a sensible choice when c < 2.5.

Appendix 1
(a) In order to characterize the marginal distribution of $\theta_1$ (14), we consider the implied distributions of three transformations (i), (ii) and (iii) of $\theta_1$, where $\Phi$ denotes the cumulative distribution function of the standard normal distribution. It is observed that the distributions implied by the functions (ii) and (iii) are approximately normal.
(b) The marginal NJP $p_J(\theta_1)$ is compared with the beta distribution. It is observed that the marginal Jeffreys prior of $\theta_1$ approximately follows the beta distribution.
(c) The graphs of the logarithm and the reciprocal of the marginal NJP of $\theta_1$ are drawn, but these do not have a standard (well-known) functional shape.

Appendix 2: Lindley-Shannon Information
Shannon (1948) proposed the theory of information in the context of communication engineering. He introduces the idea that information is a statistical concept. Measures of information are closely related to ideas of uncertainty and probability. Lindley (1956) suggests the use of information as a statistical criterion in the design of experiments. He explains that one of the purposes of experimentation is to gain knowledge about the state of nature (that is, about the parameters) without having specific actions in mind. The knowledge is measured by the amount of information.
Let $p(\theta)$ be a prior distribution. The amount of information $I\{p(\theta)\}$ is defined as
$$I\{p(\theta)\} = \int p(\theta) \ln p(\theta)\, d\theta. \tag{2.1}$$
An alternative but equivalent form of (2.1) is
$$I\{p(\theta)\} = E_\theta[\ln p(\theta)], \tag{2.2}$$
where $E_\theta$ denotes the expectation operator with respect to $\theta$.