Model-Assisted Estimation of Population Mean in Two-Stage Cluster Sampling

Estimation of finite population parameters has been an area of concern to statisticians for decades. This paper presents the estimation of a finite population mean under a model-assisted approach. Dorfman (1992), Breidt and Opsomer (2000) and Ouma et al. (2010) estimated the finite population total on the assumption that the sample size is large and the sampling distribution is approximately normal. Unlike those studies, this paper considers the case where the sample size is small. A model-assisted regression estimator is considered for the case where the cluster sizes are known only for the sampled clusters, and it is used to predict the unobserved part of the population mean. Under mild assumptions, the proposed estimator is asymptotically unbiased and its conditional error variance tends to zero. Simulation studies show that model-assisted estimation performs better than model-based estimation of a finite population mean when the sample size is small.


Introduction
Sample surveys are concerned with obtaining desired information from a population. In sample survey methods, some portion of the population called a sample is used to make inferences about the entire population.
Every nation in the world uses surveys to estimate its rate of unemployment, the prevalence of immunization against diseases, opinions about the central government, voting intentions in upcoming elections, and people's satisfaction with the goods and services they purchase, among other applications. Surveys aid in tracking global economic trends, the rate of inflation in prices, and investments in new economic enterprises (Shewhart, 2004).
Sample survey theory is concerned with the development of sampling strategies that yield a sample that best represents the entire population. It provides the procedures used to make statistical inferences about the survey variable, and the criteria for comparing competing strategies when seeking optimal results from a sample survey. Generally, four approaches are used in sample surveys, namely the model-based, model-assisted, design-based and design-assisted approaches (Ouma et al., 2010). In this paper we use a model-assisted approach to estimate the population mean in two-stage cluster sampling.

Background of the Problem
Model-based methods for estimating population parameters assume that there is an underlying model that generates the survey units. However, if the assumed model is incorrect, the resulting estimators of the population parameters can be badly biased. To address this problem, we use a nonparametric approach to estimating the population mean and obtain a model that is used to generate the survey values of a finite population in two-stage cluster sampling. Model-assisted estimation of the population total has been considered by Breidt and Opsomer (2000) under the assumption that the sample size is large and the sampling distribution is approximately normal. However, this is not always the case. In this study we consider the situation where the sample size is small and develop an estimator of the population mean.
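In particular, the problem is to estimate the population mean defined as

$$\bar{Y} = \frac{1}{M}\sum_{i=1}^{N}\sum_{j=1}^{M_i} y_{ij}, \qquad M = \sum_{i=1}^{N} M_i, \tag{1.1}$$

where $N$ is the number of clusters, $M_i$ is the number of secondary units in cluster $i$, and $y_{ij}$ is the survey value of unit $j$ in cluster $i$.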

Summary of the paper
The rest of the paper is organized as follows. Section 2 gives a summary of two-stage cluster sampling, which we use throughout. In Section 3 we introduce the estimator of the finite population mean in two-stage cluster sampling. In Section 4 we examine the asymptotic properties of the proposed estimator. In Section 5 we give the conclusion of the paper.

Review of Two Stage Cluster Sampling
Consider a finite population of $N$ primary sampling units (PSUs) or clusters labelled $i = 1, 2, \ldots, N$, where $N$ is a known number. Let $M_i$, $i = 1, \ldots, N$, be the number of secondary sampling units (SSUs) in the $i$-th PSU.
Let $y_{ij}$ be the value of the survey variable for the $j$-th SSU belonging to the $i$-th PSU. We further assume that the auxiliary data $x_{ij}$ are known for all the selected clusters and the population elements. The survey values of the non-sampled units are not known, and we therefore use the auxiliary information together with the sampled values of the survey variable to estimate the non-sampled part of the population. We propose to estimate the population mean defined by equation (1.1). In this paper, the survey values in the $i$-th cluster are generated by the model

$$y_{ij} = m(x_{ij}) + e_{ij}, \tag{2.1}$$

where the $e_{ij}$ are independent random variables with mean zero and variance $\sigma^2(x_{ij})$. Further, $m(x_{ij})$ is a smooth function of $x_{ij}$, and $\sigma^2(x_{ij})$, which is smooth and strictly positive, is a function of the auxiliary variables.
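As an illustration of the setup in this section, the following minimal Python sketch generates a two-stage cluster population from a model of this kind (a smooth mean function of an auxiliary variable plus independent zero-mean errors), computes its population mean, and draws a two-stage sample. The mean function `m`, the scale function `sigma`, the uniform auxiliary variable, the normal errors, and simple random sampling without replacement at both stages are all assumptions made for the example; they are not taken from the paper's empirical study.

```python
import random


def simulate_population(n_clusters, min_size, max_size, seed=0):
    """Return a list of clusters; each cluster is a list of (x, y) pairs.

    Survey values follow y = m(x) + sigma(x) * e with e ~ N(0, 1), so the
    error term has mean zero and variance sigma(x)**2.
    """
    rng = random.Random(seed)

    def m(x):
        # Hypothetical smooth mean function (an assumption for illustration).
        return 1.0 + 2.0 * (x - 0.5) ** 2

    def sigma(x):
        # Hypothetical smooth, strictly positive scale function.
        return 0.1 + 0.2 * x

    population = []
    for _ in range(n_clusters):
        size = rng.randint(min_size, max_size)  # cluster size, inclusive bounds
        cluster = []
        for _ in range(size):
            x = rng.random()                    # auxiliary value in [0, 1)
            e = rng.gauss(0.0, 1.0)             # independent standard normal error
            cluster.append((x, m(x) + sigma(x) * e))
        population.append(cluster)
    return population


def population_mean(population):
    """Total of y over all secondary units divided by the number of units."""
    total = sum(y for cluster in population for _, y in cluster)
    count = sum(len(cluster) for cluster in population)
    return total / count


def two_stage_sample(population, n_psu, n_ssu, seed=0):
    """Select n_psu clusters, then up to n_ssu units within each selected cluster.

    Both stages use simple random sampling without replacement (an assumption).
    Returns a dict mapping the cluster index to its sampled (x, y) pairs.
    """
    rng = random.Random(seed)
    chosen = rng.sample(range(len(population)), n_psu)
    return {
        i: rng.sample(population[i], min(n_ssu, len(population[i])))
        for i in chosen
    }


pop = simulate_population(n_clusters=40, min_size=10, max_size=30, seed=1)
sample = two_stage_sample(pop, n_psu=5, n_ssu=4, seed=2)
print("population mean:", population_mean(pop))
print("sampled clusters:", sorted(sample))
```

With 5 clusters and 4 units per cluster, the sample contains 20 observations, which falls in the small-sample regime the paper is concerned with.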

Proposed Estimator of the Population Mean
The population mean to be estimated is given by equation (1.1). The survey values of the non-sampled units are unknown. We therefore use the sampled units together with the model defined in equation (2.1) to estimate the non-sampled part of the population mean. The proposed estimator is given by [...].

Asymptotic Unbiasedness of the Estimator
The estimator of the population mean is given by [...]. Denote the initial sampling weights by [...], each equal to the inverse of the corresponding selection probability, i.e. the inverse of the probability of including unit $j$ from cluster $i$ in the sample. We then have [...]. But as shown by Ouma et al. (2010), [...]; therefore the proposed estimator of the population mean becomes [...].

Let the sampling weights of the primary sampling units be given by [...]. Rupert (2003) applied [...] as a basis and suggested that the sampling weights in this kind of survey can be computed using [...], where [...] is the number of times the $i$-th primary sampling unit is chosen. Using this procedure, it follows that [...]. But since we sample with replacement we have [...]. Rao and Wu (1998) observed that there is considerable benefit and little loss in choosing [...]; we therefore put [...], so that [...].

Let the initial sampling weights defined in [...] be given by the kernel weights [...], and let [...] be the kernel estimator of [...]; then [...]. This consequently gives us [...]. We make the following substitutions: [...]. Using the substitutions in (4.19), from equation (4.20) we have [...], where [...] is the conditional mean of the estimator given the auxiliary values.
We let $K_b(u) = \frac{1}{b} K\left(\frac{u}{b}\right)$, where $K$ is a kernel and $b$ is a chosen bandwidth. In addition, $K$ is a symmetric density function such that [...]. Since the relationship between the conditional mean and the selected bandwidth is too complex to establish directly, we use the following theorem, whose proof is given in Dorfman (1992): [...]. As $b \to 0$, and for $n$ small,

$$E\left(\hat{\bar{Y}}\right) \to \bar{Y}.$$
And therefore the proposed estimator is asymptotically unbiased.
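To make the role of the kernel and the bandwidth concrete, the sketch below computes normalized kernel weights and the resulting kernel (Nadaraya-Watson type) estimate of the regression function at a point. The Gaussian kernel, the function names and the example data are assumptions for illustration; the paper's own kernel and weighting scheme may differ.

```python
import math


def gaussian_kernel(u):
    """Standard normal density: a symmetric density that integrates to one."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def scaled_kernel(u, b):
    """Bandwidth-scaled kernel K_b(u) = K(u / b) / b for bandwidth b > 0."""
    return gaussian_kernel(u / b) / b


def kernel_weights(x0, xs, b):
    """Kernel weights of the sampled points xs at the target point x0.

    The weights are normalized to sum to one.
    """
    raw = [scaled_kernel(x0 - x, b) for x in xs]
    total = sum(raw)
    if total == 0.0:
        # Can only happen through floating-point underflow far from the data.
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return [r / total for r in raw]


def kernel_estimate(x0, xs, ys, b):
    """Weighted average of ys with kernel weights centred at x0."""
    return sum(w * y for w, y in zip(kernel_weights(x0, xs, b), ys))


# Hypothetical sampled auxiliary values and survey values.
xs = [0.1, 0.3, 0.5, 0.7, 0.9]
ys = [1.2, 1.1, 1.0, 1.1, 1.3]
for b in (0.05, 0.2, 1.0):
    print(b, kernel_estimate(0.4, xs, ys, b))
```

A small bandwidth makes the estimate follow the nearest observations closely, while a large bandwidth pulls it toward the overall sample mean, which is why results are reported at several bandwidths.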

The Bias of the Proposed Estimator
The bias of the proposed estimator is given by [...]. But the inclusion probability in two-stage cluster sampling is given by [...]. As $n$ becomes small the bias asymptotically tends to zero, i.e. $E\left(\hat{\bar{Y}}\right) - \bar{Y} \to 0$.

From Table 1 above we observe that, at the different bandwidths considered, the mean squared errors obtained in estimating the population mean using the model-assisted estimator are significantly smaller than those incurred using the model-based estimator when the sample size is small, i.e. when $n < 30$ as used in the empirical study.

From Table 2 above we observe that the bias associated with the model-assisted estimator increases insignificantly while that of the model-based estimator fluctuates. However, the bias of the model-assisted estimator is comparatively low relative to that of the model-based estimator at the different bandwidths considered.
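Bias and mean squared error comparisons of this kind can be computed from repeated samples along the following lines. This is a generic sketch: the function names, the toy population and the use of the plain sample mean as the estimator are assumptions for illustration, not the paper's simulation design.

```python
import random


def empirical_bias_mse(estimates, true_value):
    """Return (bias, MSE) of replicate estimates around the true value."""
    r = len(estimates)
    bias = sum(estimates) / r - true_value
    mse = sum((est - true_value) ** 2 for est in estimates) / r
    return bias, mse


def replicate_sample_means(values, n, replications, seed=0):
    """Draw repeated simple random samples of size n and return their means."""
    rng = random.Random(seed)
    return [sum(rng.sample(values, n)) / n for _ in range(replications)]


# Toy finite population (hypothetical values, not the paper's data).
gen = random.Random(1)
toy_population = [gen.gauss(10.0, 2.0) for _ in range(500)]
true_mean = sum(toy_population) / len(toy_population)

# Small-sample setting: n below 30.
estimates = replicate_sample_means(toy_population, n=20, replications=1000, seed=2)
bias, mse = empirical_bias_mse(estimates, true_mean)
print("bias:", bias, "MSE:", mse)
```

To compare two estimators, the same replicate samples would be passed to each estimator and the resulting bias and MSE values tabulated side by side.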

Asymptotic Error Variance of the Proposed Estimator
It can also be noted that there is a trade-off between these two approaches, as illustrated by the graphs. Neither approach is better overall, but the two can be compared at specific bandwidths.

Conclusion
The main aim of the paper was to estimate a finite population mean using a model-assisted approach in two-stage cluster sampling when the sample size is small, i.e. $n < 30$. The empirical study compared the performance of a model-based estimator of the finite