Evaluation of Biomarker using Two Parameter Bi-exponential ROC Curve

The Receiver Operating Characteristic (ROC) curve is used for assessing the ability of a biomarker or screening test to discriminate between non-diseased and diseased subjects. In this paper, the parametric ROC curve is studied by assuming a two-parameter exponential distribution for the biomarker values. The ROC model developed under this assumption is called the bi-exponential ROC (EROC) model. The research interest is to determine how well the biomarker distinguishes diseased from non-diseased subjects, when a gold standard is available, using the parametric EROC curve and the Area Under the EROC Curve (AUC). The standard error is used as an estimate of the precision of the accuracy measure AUC. The properties that explain the behavior of the EROC curve are also discussed. The AUC, along with its asymptotic variance and confidence interval, is derived.


Diagnostic Accuracy
In medical diagnosis, a subject is categorized into either the non-diseased or the diseased group (a binary classification) by comparing some clinical measurement with a selected cut-off t. If the clinical measurement is greater than or equal to t, the subject is labeled as diseased; if the measurement is less than t, the subject is labeled as non-diseased. The clinical measurements are often called test results, test scores or biomarkers. The accuracy of a biomarker is defined as "its ability to distinguish the diseased group from non-diseased group". The purpose of evaluating the potential of a biomarker in diagnosing a disease is to separate patients into "high risk" and "no risk" groups during the initial stage of the medical diagnosis or screening process, because it is not necessary for every patient to undergo a gold standard test (e.g. endoscopy), which is costly, time consuming and invasive.

ROC curve and Diagnostic Accuracy
The accuracy of a binary classification can be visualized as well as quantified by a well-known statistical technique, the Receiver Operating Characteristic (ROC) curve. It is a plot of two probabilities, Sensitivity versus 1-Specificity, over the range of thresholds t, where sensitivity is the probability of classifying a diseased subject correctly and specificity is the probability of classifying a non-diseased subject correctly. The measure of accuracy summarized by the plotted ROC curve is the Area Under the Curve (AUC). Let the biomarker values of diseased subjects be represented by the random variable Y with PDF $g_Y(y)$ and CDF $G_Y(y)$. Similarly, let the biomarker values of non-diseased subjects be represented by the random variable X with PDF $f_X(x)$ and CDF $F_X(x)$.
Assume that X and Y are independent and continuous, and that Y tends to exceed X. This reflects the fact that higher values of the biomarker indicate disease in an individual, which in turn implies that the mean of Y is greater than the mean of X for better discrimination of subjects.
Sensitivity of the biomarker can be evaluated using $1 - G_Y(t) = P(Y > t)$, which is the probability of correctly categorizing a diseased subject when a cut-off t for the classification is given. It is also known as the "True Positive Rate" (TPR). Similarly, the Specificity of the biomarker can be evaluated using $F_X(t) = P(X \le t)$, which is the probability of correctly categorizing a non-diseased subject for a given t; it is also known as the "True Negative Rate" (TNR). 1-Specificity is called the "False Positive Rate" (FPR).
The ROC curve is then defined as a plot of TPR, $1 - G_Y(t)$, against FPR, $1 - F_X(t)$, as the threshold t varies. For an appropriate diagnostic test, the ROC curve should lie very close to the upper left corner of the unit square.
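The definitions of sensitivity, specificity and the ROC curve can be illustrated with a minimal empirical sketch. The function names and the toy scores are hypothetical and are not taken from the paper; the rule "score greater than or equal to t means diseased" follows the classification rule stated earlier.

```python
def sensitivity_specificity(diseased, non_diseased, t):
    """Empirical sensitivity and specificity at cut-off t.

    A subject is labelled diseased when its score is >= t.
    """
    tp = sum(1 for y in diseased if y >= t)      # diseased correctly flagged
    tn = sum(1 for x in non_diseased if x < t)   # non-diseased correctly cleared
    return tp / len(diseased), tn / len(non_diseased)


def empirical_roc(diseased, non_diseased):
    """Return (FPR, TPR) points obtained by sweeping t over all observed scores."""
    thresholds = sorted(set(diseased) | set(non_diseased), reverse=True)
    points = [(0.0, 0.0)]  # a threshold above every score flags nobody
    for t in thresholds:
        sens, spec = sensitivity_specificity(diseased, non_diseased, t)
        points.append((1.0 - spec, sens))
    return points


if __name__ == "__main__":
    # Toy scores, for illustration only
    diseased_scores = [3, 4, 5, 6]
    non_diseased_scores = [1, 2, 3, 4]
    print(sensitivity_specificity(diseased_scores, non_diseased_scores, 4))
    print(empirical_roc(diseased_scores, non_diseased_scores))
```

Each threshold contributes one operating point, and joining the points from (0, 0) to (1, 1) gives the empirical counterpart of the curve defined above.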

Properties
Once the ROC curve is plotted, it is important to study its properties in order to draw the key insights from the plot. A typical parametric ROC curve must satisfy three basic properties: monotonicity, invariance to monotone increasing transformations, and a slope defined at a particular threshold t (Krzanowski and Hand, 2002). Recently, Hughes and Bhattacharya (2013) described an interesting asymmetry property of the ROC curve for a few ROC models, namely Bi-Exponential, Bi-Normal and Bi-Gamma. In this section, we discuss more generally the properties satisfied by a parametric ROC curve.
3. The slope of the ROC curve at any operating point is equal to the ratio of the PDF of the diseased group to the PDF of the non-diseased group at the cut-off point t (Krzanowski and Hand, 2002), and is given by $\frac{d\,\mathrm{TPR}}{d\,\mathrm{FPR}} = \frac{g(t)}{f(t)}$.
4. Let f(x) and g(y) denote the continuous PDFs of the non-diseased and diseased groups respectively. Let $KL(f, g)$ denote the Kullback-Leibler (KL) divergence between the distributions of the non-diseased and diseased groups, with f(x) as the comparison distribution and g(y) as the reference distribution (Hughes and Bhattacharya, 2013): $KL(f, g) = \int f(x) \ln \frac{f(x)}{g(x)}\,dx$. Similarly, let $KL(g, f)$ denote the KL divergence between the distributions of the diseased and non-diseased groups, with g(y) as the comparison distribution and f(x) as the reference distribution: $KL(g, f) = \int g(x) \ln \frac{g(x)}{f(x)}\,dx$. The symmetry or asymmetry of the ROC curve is then characterized by comparing $KL(f, g)$ with $KL(g, f)$.
It is to be noted that a concave ROC curve is also proper.
Proof: Consider any two points x and y, where $0 \le x, y \le 1$, on the FPR axis.

Fig. 1 A proper ROC curve in the interval [x, y]
By the definition of concavity, the line segment joining the points of the ROC curve at x and y never lies above the curve. If we take the extreme points, i.e. x = 0 and y = 1, this segment becomes the chance line, which therefore never lies above the curve. Hence we have proved that a concave ROC curve is also proper.
The area under the ROC curve is the most frequently used measure for quantifying the performance of a biomarker or diagnostic test. It is defined as the probability that, in a randomly selected pair of non-diseased and diseased subjects, the biomarker value of the diseased subject is higher than that of the non-diseased subject. The analytical expression for AUC is given by $AUC = P(Y > X) = \int_0^1 ROC(u)\,du$, where u denotes the FPR.
The evaluation of AUC and its inference are the crucial part of ROC curve analysis. In this paper, the bi-exponential ROC curve is studied by assuming a two parameter exponential distribution, along with its properties and the asymptotic variance and confidence interval for AUC.
This paper is organized as follows. In Section 2, the Bi-exponential ROC model, its properties and the Maximum Likelihood Estimation of its parameters are discussed. Section 3 provides the estimation of AUC, its asymptotic distribution and the confidence interval for AUC. In Section 4, the proposed theory is validated using Monte Carlo simulation.
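The probabilistic definition of AUC can be checked directly on data by counting concordant pairs. The following is a minimal nonparametric sketch, not the parametric estimator studied in this paper; counting ties as one half is an assumption made here for discrete scores, and the function name is hypothetical.

```python
def empirical_auc(diseased, non_diseased):
    """Proportion of (diseased, non-diseased) pairs in which the diseased score is higher.

    Ties are counted as one half (an assumption of this sketch).
    """
    wins = 0.0
    for y in diseased:
        for x in non_diseased:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(diseased) * len(non_diseased))


if __name__ == "__main__":
    # Toy scores, for illustration only
    print(empirical_auc([3, 4, 5, 6], [1, 2, 3, 4]))
```

A value of 0.5 corresponds to the chance line and a value of 1 to perfect separation of the two groups.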

Two Parameter Bi-Exponential ROC Model
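Let Z be a random variable that follows the two parameter exponential distribution with parameter λ (which enters the density as a rate, i.e. the reciprocal of the scale) and location parameter γ, denoted by Z ~ exp(λ, γ). Its PDF is
$f(z; \lambda, \gamma) = \lambda\, e^{-\lambda (z - \gamma)}, \quad z \ge \gamma,\ \lambda > 0$   (2.1)
The CDF of Z is given by
$F(z; \lambda, \gamma) = 1 - e^{-\lambda (z - \gamma)}, \quad z \ge \gamma$   (2.2)
Let us assume that X and Y are independent and exponentially distributed with different parameter values $(\lambda_x, \gamma_x)$ and $(\lambda_y, \gamma_y)$ respectively, i.e. X ~ exp($\lambda_x$, $\gamma_x$) and Y ~ exp($\lambda_y$, $\gamma_y$), with the constraint $\gamma_y > \gamma_x$, $\lambda_x > \lambda_y$. This restriction ensures that the biomarker values of the diseased subjects tend to be higher. The ROC model developed under this assumption is represented by "EROC".
The FPR of the EROC curve at the threshold t is found to be
$FPR(t) = 1 - F_X(t) = e^{-\lambda_x (t - \gamma_x)}, \quad t \ge \gamma_x$   (2.3)
The TPR of the EROC curve at the threshold t is found to be
$TPR(t) = 1 - G_Y(t) = e^{-\lambda_y (t - \gamma_y)}, \quad t \ge \gamma_y$   (2.4)
where the parameters can be estimated from the sample data by any of the standard estimation procedures. If $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are two independent random samples of sizes m and n respectively, with PDF given in (2.1), the MLEs of the parameters (Johnson, Kotz and Balakrishnan, 2004) are determined as follows:
$\hat{\gamma}_x = X_{(1)}, \quad \hat{\lambda}_x = \frac{m}{\sum_{i=1}^{m} (X_i - X_{(1)})}, \quad \hat{\gamma}_y = Y_{(1)}, \quad \hat{\lambda}_y = \frac{n}{\sum_{j=1}^{n} (Y_j - Y_{(1)})}$   (2.5)
where $X_{(1)}$ and $Y_{(1)}$ are the smallest observations in the non-diseased and diseased samples respectively.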
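A minimal sketch of the estimation step is given below. It assumes the density $\lambda e^{-\lambda (z - \gamma)}$ for $z \ge \gamma$, so that λ acts as a rate; under that assumption the MLE of the location is the sample minimum and the MLE of λ is the reciprocal of the mean excess over the minimum. The function names are hypothetical.

```python
import math


def fit_two_param_exponential(sample):
    """MLE for the density lam * exp(-lam * (z - gamma)), z >= gamma.

    Returns (lam_hat, gamma_hat): gamma_hat is the sample minimum and
    lam_hat is the reciprocal of the mean excess over that minimum.
    """
    gamma_hat = min(sample)
    excess = sum(z - gamma_hat for z in sample)
    if excess <= 0:
        raise ValueError("sample must contain at least two distinct values")
    return len(sample) / excess, gamma_hat


def survival(t, lam, gamma):
    """P(Z > t) for the two parameter exponential; equals 1 at or below the location."""
    return 1.0 if t <= gamma else math.exp(-lam * (t - gamma))


if __name__ == "__main__":
    # Toy sample, for illustration only
    lam_hat, gamma_hat = fit_two_param_exponential([1.0, 2.0, 3.0])
    print(lam_hat, gamma_hat)
    # FPR or TPR at a threshold is the survival function of the relevant group
    print(survival(2.0, lam_hat, gamma_hat))
```

Applying `survival` with the non-diseased estimates gives the FPR at a threshold, and applying it with the diseased estimates gives the TPR.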

Aliter:
One can also obtain an analytical form of the EROC curve as follows. From (2.3), the threshold t can be expressed as
$t = \gamma_x - \frac{1}{\lambda_x} \ln\left[1 - F_X(t)\right]$   (2.6)
Since the ROC model expresses TPR as a function of FPR, substituting (2.6) in (2.4) gives the two parameter Bi-exponential ROC model
$EROC(t) = e^{\lambda_y (\gamma_y - \gamma_x)} \left[1 - F_X(t)\right]^{\lambda_y / \lambda_x}$
Plotting EROC(t) along the y-axis against $1 - F_X(t)$ along the x-axis for different values of t gives an estimate of the EROC curve. We now discuss some of the properties of the two parameter exponential ROC curve.
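The closed form can be evaluated directly, as in the following sketch. It assumes the rate form of the density, so that TPR as a function of FPR is $e^{\lambda_y(\gamma_y - \gamma_x)} \mathrm{FPR}^{\lambda_y/\lambda_x}$. Capping the value at 1 is an assumption added here: for thresholds between $\gamma_x$ and $\gamma_y$ every diseased subject already exceeds the threshold, so the TPR cannot grow further. The function name is hypothetical.

```python
import math


def eroc(fpr, lam_x, gamma_x, lam_y, gamma_y):
    """TPR of the bi-exponential ROC model at a given FPR in [0, 1]."""
    if not 0.0 <= fpr <= 1.0:
        raise ValueError("fpr must lie in [0, 1]")
    if fpr == 0.0:
        return 0.0
    value = math.exp(lam_y * (gamma_y - gamma_x)) * fpr ** (lam_y / lam_x)
    return min(1.0, value)  # a probability cannot exceed 1


if __name__ == "__main__":
    # Illustrative parameter values, not taken from the paper
    for u in (0.0, 0.05, 0.1, 0.25, 0.5, 1.0):
        print(u, eroc(u, lam_x=2.0, gamma_x=0.0, lam_y=1.0, gamma_y=1.0))
```

Evaluating the function on a grid of FPR values and joining the points gives the plotted curve.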

Properties and Characteristics
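1. The EROC curve is monotonically increasing under the constraint $\gamma_y > \gamma_x$, $\lambda_x > \lambda_y$.
Proof: Writing $u = 1 - F_X(t)$ for the FPR, the first derivative of the EROC curve with respect to u is $\frac{d\,EROC}{du} = \frac{\lambda_y}{\lambda_x}\, e^{\lambda_y (\gamma_y - \gamma_x)}\, u^{\lambda_y/\lambda_x - 1} > 0$, since every factor is positive. Hence the EROC curve is monotonically increasing.
2. The EROC curve is concave.
Proof: The second derivative of the EROC curve with respect to u is $\frac{d^2 EROC}{du^2} = \frac{\lambda_y}{\lambda_x}\left(\frac{\lambda_y}{\lambda_x} - 1\right) e^{\lambda_y (\gamma_y - \gamma_x)}\, u^{\lambda_y/\lambda_x - 2}$. The ratio $\lambda_y/\lambda_x$ will always be less than one since we assumed that $\lambda_x > \lambda_y$, and hence the term $\left(\frac{\lambda_y}{\lambda_x} - 1\right)$ is negative. On the whole, we get $\frac{d^2 EROC}{du^2} < 0$.
Though the normal distribution is thought to fit many real world datasets (Hanley, 1988), the Bi-Normal ROC curve is not necessarily concave on [0, 1], i.e. it may lie below the chance line, which in turn reduces the AUC. Therefore, we prefer a model whose estimated ROC curve for the biomarker is concave.
3. The slope of the EROC curve at the threshold t is found as $\frac{g(t)}{f(t)} = \frac{\lambda_y}{\lambda_x} \exp\{\lambda_x (t - \gamma_x) - \lambda_y (t - \gamma_y)\}$.
4. The EROC curve is TNR asymmetric.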

Proof:
The KL divergence between the distributions of the non-diseased and diseased groups, $KL(f, g)$, with f(x) as the comparison distribution and g(x) as the reference distribution, and the KL divergence $KL(g, f)$, with g(x) as the comparison distribution and f(x) as the reference distribution, have been derived for the two parameter exponential model. It is found that $KL(f, g)$ and $KL(g, f)$ differ. A numerical check for this condition is also presented in the simulation studies. These two divergence measures would be zero if the non-diseased and diseased groups were identical. Hence, we have proved that the EROC curve is TNR asymmetric.
5. The EROC curve is invariant with respect to monotonically increasing transformations.

AUC of EROC Curve and Its Asymptotic Confidence Interval
Let X and Y be two independent and continuous random variables representing the non-diseased and diseased groups respectively, each following a two parameter exponential distribution with respective parameters $(\lambda_x, \gamma_x)$ and $(\lambda_y, \gamma_y)$. The MLE of AUC can be obtained numerically by substituting the estimates from the sample data given in (2.5), by the invariance property of the MLE (Casella and Berger, 2002).
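As an illustration of the plug-in idea, the sketch below evaluates $P(Y > X)$ in closed form and checks it by simulation. The closed form is derived here from the probability definition of AUC under the assumed rate-form density with $\gamma_y \ge \gamma_x$, which gives $AUC = 1 - \frac{\lambda_y}{\lambda_x + \lambda_y} e^{-\lambda_x (\gamma_y - \gamma_x)}$; it is not claimed to be the expression used in the paper. Function names and parameter values are hypothetical.

```python
import math
import random


def auc_eroc(lam_x, gamma_x, lam_y, gamma_y):
    """P(Y > X) for X ~ gamma_x + Exp(rate lam_x), Y ~ gamma_y + Exp(rate lam_y).

    Derived under the assumption gamma_y >= gamma_x.
    """
    if gamma_y < gamma_x:
        raise ValueError("this sketch assumes gamma_y >= gamma_x")
    return 1.0 - lam_y / (lam_x + lam_y) * math.exp(-lam_x * (gamma_y - gamma_x))


def auc_monte_carlo(lam_x, gamma_x, lam_y, gamma_y, n_pairs=100_000, seed=0):
    """Simulation check: share of simulated pairs in which Y exceeds X."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_pairs):
        x = gamma_x + rng.expovariate(lam_x)  # expovariate takes a rate
        y = gamma_y + rng.expovariate(lam_y)
        wins += y > x
    return wins / n_pairs


if __name__ == "__main__":
    # Illustrative parameter values, not taken from the paper
    print(auc_eroc(2.0, 0.0, 1.0, 1.0))
    print(auc_monte_carlo(2.0, 0.0, 1.0, 1.0))
```

Replacing the true parameters with their MLEs in `auc_eroc` gives the plug-in estimate of AUC. When the two locations coincide, the expression reduces to $\lambda_x / (\lambda_x + \lambda_y)$.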

$I(\theta)$ is the Fisher Information matrix for the parameter vector $\theta = (\lambda_x, \gamma_x, \lambda_y, \gamma_y)$. We adopt the delta method for finding the approximate variance $V(\widehat{AUC})$. The estimate of the variance is obtained by substituting the estimates of the parameters $(\lambda_x, \gamma_x)$ and $(\lambda_y, \gamma_y)$. Hence, $\widehat{AUC}$ is asymptotically normally distributed about AUC, where AUC is the true area under the ROC curve; that is, $\widehat{AUC} \sim N(AUC, \sigma^2_{AUC})$. This result can be visualized from the numerical example presented in the simulation studies. The $100(1-\alpha)\%$ asymptotic confidence interval for AUC is $\widehat{AUC} \pm Z_{\alpha/2} \sqrt{\hat{V}(\widehat{AUC})}$, where α is the level of significance and $Z_{\alpha/2}$ is the critical value.
The confidence intervals for $\widehat{AUC}$ of the EROC curve obtained using asymptotic MLE and Monte Carlo methods are presented in Table 1. In Table 1, the first column gives the assumed parametric values. The second column gives the MLEs of the parameters, with their standard errors in parentheses, along with the 95% asymptotic confidence interval for $\widehat{AUC}$. The third column gives the Monte Carlo estimates of the parameters, with their standard errors in parentheses, along with the 95% confidence interval for $\widehat{AUC}$. The final column gives the difference between the mean of the diseased sample $\bar{Y}$ and the mean of the non-diseased sample $\bar{X}$, i.e. $(\bar{Y} - \bar{X})$. As far as discrimination is concerned, the measure $(\bar{Y} - \bar{X})$ indirectly tells us the degree of separation between the two groups.
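The asymptotic interval can be computed from an estimate and its standard error as in the sketch below. It assumes a symmetric normal-theory (Wald) interval; truncating the limits to [0, 1] is an assumption added here because AUC is a probability. The function name and the numbers in the usage lines are illustrative only.

```python
from statistics import NormalDist


def wald_interval(auc_hat, se, alpha=0.05):
    """Normal-theory confidence interval auc_hat +/- z * se, truncated to [0, 1]."""
    if se < 0 or not 0.0 < alpha < 1.0:
        raise ValueError("se must be non-negative and alpha must lie in (0, 1)")
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value Z_{alpha/2}
    lower = max(0.0, auc_hat - z * se)
    upper = min(1.0, auc_hat + z * se)
    return lower, upper


if __name__ == "__main__":
    # Illustrative inputs, not results from the paper
    print(wald_interval(0.90, 0.05))
    print(wald_interval(0.95, 0.05, alpha=0.10))
```

The interval narrows as the standard error decreases, which is the behaviour one would expect as the sample size grows.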
We also observe that, as the measure $(\bar{Y} - \bar{X})$, or the deviation between the estimated parameters of the non-diseased and diseased groups, increases, the AUC tends to increase, which in turn decreases the standard error of AUC. From Table 1, we also notice that there is no major difference between the estimates of the parameters and of $\widehat{AUC}$ obtained by the asymptotic and Monte Carlo methods. The EROC curves for different parametric values are plotted in Figure 1.
In Table 2, the data have been generated by assuming different parametric values for the sample sizes {(5, 5), (10, 10), (30, 30), (50, 50), (80, 80), (100, 100)}. The first element in each row represents the accuracy, the second element is the standard error of $\widehat{AUC}$, and the third and fourth values are the lower and upper confidence limits. The asymptotic estimate of the standard error holds good for large sample sizes. Moving across the columns, the standard error tends to decrease and the confidence intervals become narrower as the sample size increases.
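The kind of simulation described above can be sketched as follows. This is an illustrative Monte Carlo loop under stated assumptions and not the authors' code: samples are drawn from shifted exponentials in rate form, parameters are estimated by the sample minimum and the reciprocal mean excess, and AUC is computed as $P(Y > X)$ from the fitted parameters. The parameter values are hypothetical; only the sample sizes follow those listed for Table 2.

```python
import math
import random
import statistics


def auc_exponential(lam_x, gamma_x, lam_y, gamma_y):
    """P(Y > X) for shifted exponentials with rates lam_* and locations gamma_*."""
    total = lam_x + lam_y
    if gamma_y >= gamma_x:
        return 1.0 - lam_y / total * math.exp(-lam_x * (gamma_y - gamma_x))
    # Fitted locations can cross in small samples, so handle that case too
    return lam_x / total * math.exp(-lam_y * (gamma_x - gamma_y))


def fit(sample):
    """MLE (rate, location) for a two parameter exponential sample."""
    gamma_hat = min(sample)
    return len(sample) / sum(z - gamma_hat for z in sample), gamma_hat


def simulate_auc(lam_x, gamma_x, lam_y, gamma_y, n, reps=500, seed=1):
    """Mean and standard deviation of the plug-in AUC over repeated samples of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        xs = [gamma_x + rng.expovariate(lam_x) for _ in range(n)]
        ys = [gamma_y + rng.expovariate(lam_y) for _ in range(n)]
        lam_x_hat, gamma_x_hat = fit(xs)
        lam_y_hat, gamma_y_hat = fit(ys)
        estimates.append(
            auc_exponential(lam_x_hat, gamma_x_hat, lam_y_hat, gamma_y_hat)
        )
    return statistics.mean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    for n in (5, 10, 30, 50, 80, 100):
        mean_auc, sd_auc = simulate_auc(2.0, 0.0, 1.0, 1.0, n)
        print(n, round(mean_auc, 4), round(sd_auc, 4))
```

The standard deviation of the simulated estimates plays the role of the Monte Carlo standard error and can be compared with the asymptotic one at each sample size.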

(ii) Real life example
The proposed method is applied to prostate cancer marker (PSA) data. PSA is a biomarker that is significant in detecting prostate cancer. The data consisted of 50 randomly chosen individuals who were affected by prostate cancer and 50 non-diseased individuals who participated in a lung cancer prevention trial (Etzioni, 1999). Two correlated prostate cancer biomarkers were considered, namely total serum PSA (tPSA) and the ratio of free to total PSA in percent (fPSA). Of these biomarkers, tPSA has the higher AUC and is therefore preferred for assessing the accuracy of diagnosis of prostate cancer.
The goodness of fit of the two parameter exponential distribution to tPSA has been evaluated using the Kolmogorov-Smirnov, Anderson-Darling and Chi-Square tests. The results, obtained from the software "Easy Fit", are reported in Table 3 for significance levels (α) of 20, 10, 5, 2 and 1%. The EROC curve for tPSA is presented in Figure 2. The EROC model showed that the marker "tPSA" is able to identify prostate cancer individuals with an accuracy of 91%, with an estimated standard error of 0.055 and a confidence interval of [0.8079, 1.000]. The sensitivity and specificity of tPSA using EROC are 83% and 76% respectively at the threshold 2.865 ng/ml.
From the sensitivity and specificity, we can infer that 83% of individuals with prostate cancer have a "tPSA" value of at least 2.865 ng/ml and are therefore correctly detected. Similarly, 76% of individuals without prostate cancer have a marker value below 2.865 ng/ml and are correctly classified as free of prostate cancer.

Conclusion
In this paper, we have extended the one parameter Bi-Exponential ROC curve analysis to the two parameter exponential ROC curve analysis. The properties of the two parameter bi-exponential ROC curve have been studied. It is found that the EROC curve is monotonically increasing, concave (an important property for an ROC curve to be proper) and TNR asymmetric, which is justified theoretically as well as graphically. The 95% asymptotic confidence interval for $\widehat{AUC}$ has been derived. For the prostate cancer data, the EROC model showed that the biomarker "tPSA" is able to identify prostate cancer individuals with an accuracy of 91%, with an estimated standard error of 0.055 and a confidence interval of [0.8079, 1.000]. The sensitivity and specificity are 83% and 76% respectively at the threshold 2.865 ng/ml. The proposed EROC curve analysis can be adopted for assessing the accuracy of classification made by a particular biomarker, provided the goodness of fit of the two parameter exponential distribution has been verified.