文献3下载_在线阅读_29

is_568446

暂无简介

文献3 Environment and Planning A 2013, volume 45, pages 2515 – 2534 doi:10.1068/a44710 Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface Jin-Feng Wang LREIS, Institute of Geographic Sciences and Nature Resources Research, Chinese Aca...

Environment and Planning A 2013, volume 45, pages 2515 – 2534 doi:10.1068/a44710 Sandwich estimation for multi-unit reporting on a stratified heterogeneous surface Jin-Feng Wang LREIS, Institute of Geographic Sciences and Nature Resources Research, Chinese Academy of Sciences, Beijing 100101, PR China; e-mail: wangjf@Lreis.ac.cn Robert Haining Department of Geography, University of Cambridge, Cambridge CB2 3EN, England; e-mail: rph26@cam.ac.uk Tie-Jun Liu, Lian-Fa Li, Cheng-Sheng Jiang LREIS, Institute of Geographic Sciences and Nature Resources Research, Chinese Academy of Sciences, Beijing 100101, PR China; e-mail: liutj@Lreis.ac.cn, lilf@Lreis.ac.cn, jiangcs@Lreis.ac.cn Received 31 December 2011; in revised form 17 October 2012 Abstract. Spatial sampling is widely used in environmental and social research. In this paper we consider the situation where instead of a single global estimate of the mean of an attribute for an area, estimates are required for each of many geographically defined reporting units (such as counties or grid cells) because their means cannot be assumed to be the same as the global figure. Not only may survey costs greatly increase if sample size has to be a function of the number of reporting units, estimator sampling error tends to be large if the population attribute of each reporting unit can be estimated by using only those samples actually lying inside the unit itself. This study proposes a computationally simple approach to multi-unit reporting by using analysis of variance and incorporating ‘twice-stratified’ statistics. We assume that, although the area is heterogeneous (the mean varies across the area), it can be zoned (or stratified) into homogeneous subareas (the mean is constant within each subarea) and, in addition, that it is possible to acquire prior knowledge about this partition. This zoning of the study area is independent of the reporting units. The zone estimates are transferred to the reporting units. We call the methodology sandwich estimation and we report two contrasting empirical studies to demonstrate the application of the methodology and to compare its performance against some other existing methods for tackling this problem. Our study shows that sandwich estimation performs well against two other frequently used, probabilistic, model-based approaches to multi-unit reporting on stratified heterogeneous surfaces whilst having the advantage of computational simplicity. We suggest those situations where sandwich estimation might be expected to do well. Keywords: sandwich estimation, heterogeneous surface, zoning; kriging estimates, hierarchical Bayesian estimates 1 Introduction The provision of statistical estimates of an attribute across many geographically defined reporting units is often required in environmental and social research. Consider the following four examples that demonstrate the different practical contexts in which this can arise. (i) In order to construct pollution maps and to carry out spatially differentiated risk assessments, estimates of heavy metal content in soil are required for each of a large number of 1 km2 grid cells or, for example, across each of the 2862 counties of China (Zhou, 2006). (ii) Urban land price mapping is required across a metropolitan area partitioned into a large number of urban tracts. But relevant land price data are available at sample sites where transactions 2516 J-F Wang, R Haining, T-J Liu, L-F Li, C-S Jiang have taken place recently (Tsutsumi et al, 2011). (iii) China comprises thirty-four provincial units, 2862 county units, and 41 836 town units, (see http://qhs.mca.gov.cn/article/zlzx/ qhtj/200711/20071100003177.shtml). In order to estimate overall tuberculosis (TB) prevalence rates for the country 146 towns are randomly selected from the town units in China (Advisory Panel and Office of 5th National Sampling Survey of Tuberculosis of China, 2010). Guizhou, one of the thirty-four provincial units in China, wants to use the national survey data to estimate its own TB prevalence rate but only five sampled towns are located in Guizhou which is too small a sample size on which to base a reliable estimate. (iv) Flood damage is surveyed by sampling areas close to rivers as well as other areas, but disaster relief resources have to be distributed by administrative units (Xu, 1996). Losses therefore have to be estimated for administrative units using the sample survey data. The first example involves constructing estimates for a very large number of small geographical areas or reporting units so that overall costs are potentially very high. The second example is a case where only partial data are available on which to base estimates. The third example involves small-sample estimation where, in order to strengthen the estimate for Guizhou (or indeed any other province), there is a need to try to “borrow information” or “borrow strength” from other parts of the sample dataset. The fourth is an example of a problem where statistics collected for one spatial framework have to be transferred to a reporting framework that is quite different. This last problem shares some common ground with the areal interpolation or ‘incompatible areal units’ problem but where the data to be transferred are themselves the outcome of a sampling process (Gotway and Young, 2002). The four examples all have in common the problem of arriving at a good estimate of the mean of an attribute together with an estimate of the sampling error for a set of reporting units which may be very large in number. In all these cases, survey costs will increase greatly if direct sampling has to be employed in which at least two samples have to be taken in each reporting unit, whilst of course the smaller the sample size in a reporting unit the larger the standard error of the estimate of the mean (Cochran, 1977). In the next section we briefly review existing methods for handling the multi-unit reporting problem and introduce the sandwich estimation approach that will be developed in this paper. Later sections model the propagation of information and uncertainty from what we shall term the sample layer (the geographically distributed sample points) to the zoned layer (the geographical areas on the real surface each assumed to have a constant mean) to the reporting layer (the geographically defined areas for which estimates are required). Our previous study (Wang et al, 2010) investigated how zoning can be employed to improve the estimate of a single global mean, also analyzing the consequences of basing the estimation on an imperfect zoning. This paper develops a novel method for multi-unit reporting (obtaining estimates of the mean which may be different from one reporting unit to another) using a small sample and assuming a zoning that corresponds to the spatial variation in the mean. We then demonstrate the use of sandwich estimation using two contrasting empirical studies and compare findings with other, commonly adopted, approaches to the problem. Finally, we draw conclusions and discuss the implications of this methodology. Readers can implement the two cases presented in the paper and apply the sandwich technique to their own data by visiting http://www.sssampling.org/sandwich. 2 Background and review Several approaches exist for the multi-unit reporting problem, where the reporting units are many, often small, geographical areas for each of which an estimate of the mean of an attribute (eg, heavy metal content, TB prevalence, flood damage) at some point in time is required (Cochran, 1977, pages 34–39; Särndal et al, 1992, pages 408–412). However, to separate methods it is important at the outset to distinguish between model-based and design-based Sandwich estimation for multi-unit reporting 2517 approaches to sampling as described in the spatial sampling literature (Brus and de Gruijter, 1997; Christakos, 1992; de Gruijter and ter Braak, 1990; Overton and Stehman, 1995). In design-based approaches the population of values in a region is considered fixed and randomness enters through the process of selecting the locations to sample. The mean value for the region is a fixed but unknown quantity and the sample mean is an estimator of it. Repeated sampling according to a given scheme, such as stratified random sampling, will generate a distribution of estimates of the (regional or population) mean (de Gruijter and ter Braak, 1990). In contrast, the model-based approach assumes the values observed in a region represent one realization of some underlying stochastic model, so observations are treated as random variables with a probability distribution (Cressie, 1993). The model-based approach to sampling is most appropriate for estimating the parameters of the underlying stochastic model such as its mean or a proportion (Christakos, 1992; Wang et al, 2009), for predicting values at unsampled locations (Matheron, 1963), and certain forms of mapping such as the area-level risk of becoming a victim of burglary (Haining 2003, pages 307–320). The design-based approach is most often used for tackling ‘here and now’ and ‘how much’ questions—estimating global properties such as the mean value of an attribute or the proportion of an area under a particular land use (Cochran, 1977). For a review of these issues see, for example, Haining (2003, pages 96–99). Universal kriging is a model-based approach to the multi-unit reporting problem. When its assumptions are met it yields the best linear unbiased predictor for attributes distributed continuously in geographical space (Cressie, 1993). Ordinary kriging is a special case of universal kriging under the assumption of second-order stationarity of the attribute where the spatial correlation between two sample points depends only on the distance between them; this assumption implies intrinsic stationarity although the converse is not true (Goovaerts, 1997). Point-to-area kriging can be applied to transfer point estimates to any specified areal framework (Tan et al, 1997), again under the assumption of second-order stationarity. Kriging requires considerable experience to implement, not least in estimating and modeling the variogram which is central to the application of this statistical methodology. More seriously, however, when the spatially varying mean cannot be modeled by a continuous function, even if the area can be partitioned into homogeneous zones and the means subtracted from the sample data in each zone, unless these ‘residuals’ are second-order stationary across the area, kriging has to be implemented on each zone independently which will severely reduce the sample size for the estimation and modeling of the variograms on which the methodology depends (Wang et al, 1997; 2009; 2010). Another model-based approach to the problem, particularly appropriate to the case of small areas, uses hierarchical Bayesian (HB) modeling (Cressie and Wikle, 2011; Haining, 2003; Rao, 2003; Särndal et al, 1992). This method depends on ‘borrowing information’ (or ‘borrowing strength’) from neighbouring areas. A strategy similar to this has been employed by the US Census Bureau to estimate missing household data and also in the US Medical Expenditure Panel Survey. Estimating area means is based on some specified spatial function that represents the spatial correlation in the observations. There are two potential problems with this approach in the present context: first, there is often a homogeneity assumption—the spatial function is the same across the map, an assumption that may be difficult to sustain across a large study area; second, neighbouring areas are not necessarily the most appropriate areas from which to borrow information. These model-based approaches assume an underlying probability model generating the observations, so that what is observed is but one realization of the underlying model. It is precisely the willingness to assume a model for the data that makes it possible to obtain estimates for locations or for areas for which no data exist. By contrast, design-based approaches treat observations as fixed quantities, apart from any measurement error (Cressie and Wikle, 2011). For example, 2518 J-F Wang, R Haining, T-J Liu, L-F Li, C-S Jiang Cochran (1977, pages 142–144) addresses the multi-unit reporting of a zoned surface but in order to calculate means and variances samples are needed in each of the intersected substrata. Särndal et al (1992, chapter 10) also address the problem, but again their solution requires at least one observation in each reporting unit (page 409). Of course, more than one observation is required in each reporting unit so as to provide an estimate of the standard error of the estimate of the mean. In the absence of a model for the observed data, there is no possibility of interpolating to areas where no samples have been taken—hence the need to sample in every reporting unit, preferably taking several samples so as to estimate the standard error but also to improve the estimate of the mean. As the number of reporting units increases, survey costs increase. In this study, we introduce an approach to stratified estimation of the mean in areas from which no, or too few, sample observations have been taken. The conceptual model of spatial variation which is invoked in order to make estimation possible is referred to as the ‘formal’ (or ‘uniform’) regional model of spatial variation. This model has a long tradition in geography [for an extended discussion see Grigg (1967)], predating the probabilistic models of spatial variation used by statisticians and which underpin the previously described ‘model-based’ approaches to spatial sampling (eg, Cressie, 1993; Ripley, 1981). Unlike statistical models, there is no assumption with the geographers’ regional model that the observations are random variables, so in that sense the approach here is similar to the design-based approach in which observations are treated as fixed. However, the assumption of an underlying model (albeit a model quite different from that usually assumed for model-based approaches to sampling) is critical to the methodology. Formal regional models partition space into homogeneous or quasi-homogeneous areas (regional “patches”) which represent a classification of space in terms of attribute similarity and spatial contiguity (Haining, 2003, page 183). This model of spatial variation allows the mean for any area (eg, any reporting unit) to be estimated as a function of the means of the homogeneous zones that overlap it. Versions of this conceptual model of spatial variation arise in a number of areas of spatial analysis, providing a basis for stratified estimation and for some nonparametric solutions to problems including spatial interpolation and areal interpolation (Haining, 2003, pages 131–135; 164–165). The use of this model in the current context allows us to generalize existing spatial estimation theory without the need to assume a probabilistic model of spatial variation of the sort that underpins kriging (which depends on estimating a permissible semivariogram function) or HB estimation (which is based on probability density functions for the data and the prior). We develop what we call sandwich estimation of parameters for multi-unit reporting on heterogeneous surfaces that have been zoned into homogeneous subareas. The procedure consists of two phases: first, the heterogeneous surface is zoned into subareas (partitioned into regions) within which the mean is constant and which provide the framework for spatial sampling; then, two transfer functions are specified. First, the sample data are used to provide estimates of sample means and their variances for each of the zones; next these estimates are transferred onto the multi-unit reporting layer. Of course, the zoning (the model) must provide a good representation of the ‘real’ spatial variability in the mean and there will be bias and a loss of estimator precision in the final estimates when this is not achieved (see Wang et al, 2010). The zoning layer and the reporting layer are two independent partitions of the study area. This model-based strategy allows estimation of the mean and its sampling error even if there is no sample in a reporting unit. We propose this methodology as a simple and direct approach to multi- unit reporting on real surfaces that are heterogeneous in the mean of the attribute but where the mean is constant within each defined zone. The sandwich estimation approach shares common ground with the concept of a layer in a GIS and this model-based assumption, of the presence of regional ‘patches’, is one that has been previously invoked in GIS-based map Sandwich estimation for multi-unit reporting 2519 operations [see, for example, Flowerdew and Green (1989) and Goodchild et al (1993) in the case of areal interpolation]. For this reason we believe it will be easier for environmental and social scientists to apply, relative to probabilistic model-based approaches, particularly in a GIS computing environment. 3 Sandwich estimation 3.1 Framework for sandwich estimation Figure 1 provides a pictorial representation of sandwich estimation, which consists of a reporting layer, a zoning layer, and a sampling layer. It is called sandwich estimation because of the three layers of the structure. First, produce a zoning or surface classification that partitions the research area into subareas that are spatially homogeneous (constant mean). Note that any particular class may occur in the form of several geographically separate zones. The purpose of the partitioning is to create distinct zones that subdivide the study area (Wang et al, 2010). Second, distribute sample units over each zone and estimate the sample means and their sampling errors for each of the zones. We recommend stratified random sampling within each zone, wherever possible, as this method of spatial sampling has a long history of application in the geography and spatial statistics literature (see for example Berry and Baker, 1968; Dunn and Harrison, 1993; Hancock, 1995; Overton, 1987; Overton and Stehman, 1993; Ripley, 1981; Wang et al, 2010). Finally, transfer the values from the zoning onto the reporting layer, which consists of many reporting units. Information flows from the sampling layer to the zoning layer and finally to the reporting layer, and comprises estimates of means and their sampling errors. 3.1.1 Reporting layer {Ψ} The reporting layer consists of spatial units. They could be administrative units of a city, counties, postal zones, or census units of a region. They could be a grid system in a soil, ecological or meteorological survey, or physical units such as watersheds or defined by elevation. Reporting layer (eg, provinces/counties/grids) Zoning layer (eg, land-use zones) Sampling layer (ie, candidate sites for sampling) In fo rm at io n flo w Error flow Object layer (object to be sampled) Figure 1. Conceptual model of sandwich estimation. Note the sampling layer is also a function of the zoning layer which is a product of the object layer. 2520 J-F Wang, R Haining, T-J Liu, L-F Li, C-S Jiang 3.1.2 Zoning layer {R} Prior knowledge about variability on the real surface can be used either to reduce sample size for a given level of estimator precision or to improve estimator precision (Griffith et al, 1994; Ripley, 1981; Wang et al, 2002; 2010). The importance of taking into account spatial variability and spatial structure in the various stages associated with spatial sampling has been well documented (Anselin, 1988; Griffith, 2005; Haining, 1988; Rodriguez-Iturbe and Media, 1974; Tobler and Kennedy, 1985). Spatial heterogeneit

本文档为【文献3】，请使用软件OFFICE或WPS软件打开。作品中的文字与图均可以修改和编辑，图片更改请在作品中右键图片并更换，文字修改请直接点击文字进行修改，也可以新增和删除文档中的内容。

文献3

热门搜索

历史搜索