Group for Research in Decision Analysis

Sampling bias in logistic models

Peter McCullagh

This talk is concerned with regression models for the effect of covariates on correlated binary and correlated polytomous responses. In a generalized linear mixed model (GLMM), correlations are induced by a random effect, additive on the logistic scale, so that the joint distribution \(p_{\bf x}({\bf y})\) obtained by integration depends on the covariate values \({\bf x}\) on the sampled units. The thrust of this talk is that the conventional formulation is inappropriate for most natural sampling schemes in which the sampled units arise from a random process. The conventional analysis incorrectly predicts parameter attenuation due to the random effect, thereby giving a misleading impression of the magnitude of treatment effects. The error in the conventional analysis is a subtle consequence of selection bias that arises from random sampling of units. This talk will describe a non-standard but mathematically natural formulation in which the units are auto-generated by an explicit process and sampled following a well-determined plan. For a quota sample in which the covariate configuration \({\bf x}\) is pre-specified, the model distribution coincides with \(p_{\bf x}({\bf y})\) in the GLMM. However, if the sample units are selected at random, either by sequential recruitment or by simple random sampling from the available population, the conditional distribution \(p({\bf y} \mid {\bf x})\) is different from \(p_{\bf x}({\bf y})\). By contrast with conventional models, conditioning on \({\bf x}\) is not equivalent to stratification by \({\bf x}\). The implications for likelihood computations and estimating equations will be discussed.
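
For orientation, the conventional GLMM distribution \(p_{\bf x}({\bf y})\) mentioned in the abstract can be sketched for the binary case as follows. This is an illustrative form only: the symbols \(\beta\) (regression coefficients), \(\eta\) (random effect), and \(\phi\) (density of \(\eta\)) are assumptions introduced here and do not appear in the abstract, and the talk's own formulation may differ.

\[
\operatorname{logit} \Pr(Y_i = 1 \mid \eta) = {\bf x}_i^{\top}\beta + \eta_i,
\qquad
p_{\bf x}({\bf y}) = \int \prod_{i=1}^{n} \frac{\exp\{y_i({\bf x}_i^{\top}\beta + \eta_i)\}}{1 + \exp({\bf x}_i^{\top}\beta + \eta_i)} \, \phi(\eta) \, d\eta .
\]

Here \({\bf x}\) is treated as fixed, which matches the quota-sample case. When the units are sampled at random, \({\bf x}\) is itself random and the relevant object is the conditional distribution \(p({\bf y} \mid {\bf x}) = p({\bf x}, {\bf y}) / \sum_{{\bf y}'} p({\bf x}, {\bf y}')\) computed from the joint law of covariates and responses, which, according to the abstract, need not equal \(p_{\bf x}({\bf y})\).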