Group for Research in Decision Analysis

# Sampling bias in logistic models

## Peter McCullagh

This talk is concerned with regression models for the effect of covariates on correlated binary and correlated polytomous responses. In a generalized linear mixed model, correlations are induced by a random effect, additive on the logistic scale, so that the joint distribution $$p_{\bf x}({\bf y})$$ obtained by integration depends on the covariate values $$\bf x$$ on the sampled units. The thrust of this talk is that the conventional formulation is inappropriate for most natural sampling schemes in which the sampled units arise from a random process. The conventional analysis incorrectly predicts parameter attenuation due to the random effect, thereby giving a misleading impression of the magnitude of treatment effects. The error in the conventional analysis is a subtle consequence of selection bias that arises from random sampling of units. This talk will describe a non-standard but mathematically natural formulation in which the units are auto-generated by an explicit process and sampled following a well-determined plan. For a quota sample in which the covariate configuration $$\bf x$$ is pre-specified, the model distribution coincides with $$p_{\bf x}({\bf y})$$ in the GLMM. However, if the sample units are selected at random, either by sequential recruitment or by simple random sampling from the available population, the conditional distribution $$p({\bf y} \mid {\bf x})$$ is different from $$p_{\bf x}({\bf y})$$. By contrast with conventional models, conditioning on $$\bf x$$ is not equivalent to stratification by $$\bf x$$. The implications for likelihood computations and estimating equations will be discussed.
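To make the attenuation predicted by the conventional analysis concrete, the following is a minimal sketch, not the speaker's formulation. It assumes a single binary covariate, a normal random effect with standard deviation `sigma` added on the logistic scale, and illustrative values for `alpha`, `beta`, and `sigma`; all of these names and numbers are assumptions chosen for illustration. It integrates out the random effect numerically and compares the resulting marginal log odds ratio with the conditional coefficient `beta` and with the familiar approximation `beta / sqrt(1 + c^2 sigma^2)`, where `c = 16 sqrt(3) / (15 pi)`.

```python
import math


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    """Log odds."""
    return math.log(p / (1.0 - p))


def marginal_prob(eta, sigma, n_grid=4001, width=8.0):
    """P(y = 1) after integrating out u ~ N(0, sigma^2) on the logistic scale.

    Uses the trapezoidal rule on [-width*sigma, width*sigma].
    """
    if sigma == 0.0:
        return expit(eta)
    lo = -width * sigma
    h = 2.0 * width * sigma / (n_grid - 1)
    norm = sigma * math.sqrt(2.0 * math.pi)
    total = 0.0
    for k in range(n_grid):
        u = lo + k * h
        weight = 0.5 if k in (0, n_grid - 1) else 1.0
        density = math.exp(-0.5 * (u / sigma) ** 2) / norm
        total += weight * expit(eta + u) * density * h
    return total


# Illustrative (assumed) values: intercept, conditional effect, random-effect sd.
alpha, beta, sigma = 0.0, 1.0, 2.0

p0 = marginal_prob(alpha, sigma)          # marginal P(y = 1 | x = 0)
p1 = marginal_prob(alpha + beta, sigma)   # marginal P(y = 1 | x = 1)

# Marginal (population-averaged) log odds ratio implied by the GLMM.
marginal_beta = logit(p1) - logit(p0)

# Standard probit-matching approximation to the attenuated coefficient.
c = 16.0 * math.sqrt(3.0) / (15.0 * math.pi)
approx_beta = beta / math.sqrt(1.0 + c * c * sigma * sigma)

print(f"conditional beta    : {beta:.3f}")
print(f"marginal log OR     : {marginal_beta:.3f}")
print(f"approximate formula : {approx_beta:.3f}")
```

Under these assumptions the marginal log odds ratio comes out noticeably smaller than `beta`. This is the attenuation that, according to the abstract, the conventional analysis wrongly predicts when units are sampled at random rather than by quota.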