Proposal of a new term: “DOCO”

I am proposing a new term: DOCO. I will, in spirit, add it to Gelman’s already impressive list of useful terminology. DOCO stands for Data (or Datum) Otherwise Considered Outliers.

That is, if you have an X-assumptive model and you see statistical “outliers”, then you should probably change the model assumptions. These are not “wrong” data; they are data that look notably outlying only because you assumed X.

Thus: DOCOs. If you have data that are outliers according to your statistical model, then your model is most likely incomplete and does not fully describe the data generating process. Have several positive “outliers”? Your data are skewed, and you should use a skew-normal distribution or some other skew-permissive distribution. Have several “outliers” in both tails? Your data have fat tails, and you should use a Student-t distribution. Have coding errors? Then part of the data generating process involves an error-in-coding component; if you’re only interested in the non-error-in-coding generative component, then sure, remove those points. If not, maybe you can model the secondary component simultaneously. Do the data seem to be poorly described by any single generative distribution? You should probably model them with a finite mixture model that could plausibly generate such observations.
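To make the fat-tails case concrete, here is a minimal sketch using only the Python standard library. Everything in it is an illustrative assumption rather than part of the proposal: the simulated data (Student-t with 3 degrees of freedom), the |z| > 3 flagging rule, the fixed degrees of freedom, and the crude grid search over the scale parameter. The helper names (`student_t_sample`, `compare_models`, and so on) are hypothetical. It flags “outliers” under a normal model and then compares the log-likelihood of that model against a Student-t model fitted to the same points, with nothing removed.

```python
import math
import random
import statistics


def student_t_sample(rng, df):
    """Draw one Student-t variate as Z / sqrt(V / df), with V ~ chi-square(df)."""
    z = rng.gauss(0.0, 1.0)
    v = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) = Gamma(shape=df/2, scale=2)
    return z / math.sqrt(v / df)


def normal_loglik(data, mu, sigma):
    """Log-likelihood of the data under Normal(mu, sigma)."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - (x - mu) ** 2 / (2.0 * sigma ** 2) for x in data)


def student_t_loglik(data, mu, scale, df):
    """Log-likelihood of the data under a location-scale Student-t."""
    const = (math.lgamma((df + 1) / 2.0) - math.lgamma(df / 2.0)
             - 0.5 * math.log(df * math.pi) - math.log(scale))
    return sum(const - (df + 1) / 2.0 * math.log1p(((x - mu) / scale) ** 2 / df)
               for x in data)


def compare_models(n=2000, df=3, seed=0):
    rng = random.Random(seed)
    data = [student_t_sample(rng, df) for _ in range(n)]

    # Normal model: maximum-likelihood mean and standard deviation.
    mu_n = statistics.fmean(data)
    sigma_n = statistics.pstdev(data)

    # Points a normal model would label as outliers (assumed rule: |z| > 3).
    n_flagged = sum(1 for x in data if abs(x - mu_n) / sigma_n > 3.0)

    # Student-t model: median as location, df held fixed (an assumption),
    # and a crude grid search over the scale from 0.5 to 3.0.
    mu_t = statistics.median(data)
    scales = [0.5 + 0.05 * i for i in range(51)]
    t_ll = max(student_t_loglik(data, mu_t, s, df) for s in scales)

    return {
        "n_flagged": n_flagged,
        "normal_loglik": normal_loglik(data, mu_n, sigma_n),
        "t_loglik": t_ll,
    }


result = compare_models()
print(result)
```

In this sketch the points flagged under the normal model are never deleted; the Student-t model simply treats them as ordinary draws from a heavier tail, which is the sense in which they are DOCOs rather than outliers. A real analysis would estimate the degrees of freedom as well, and would more likely use a proper fitting routine than a grid.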

The point is: Outliers shouldn’t exist in principle. Something generated those data, and if some of them are considered outliers, then your model isn’t adequately accounting for such observations and needs modification. Such data should instead be called DOCOs, and the model should be revised to account for them.
