Should you use data integration for your distribution model?

Abstract

Data integration (the analysis of two or more observational datasets in a single statistical model) is on the rise in species distribution modelling. Recent papers showcase the usefulness of data integration, but few highlight cases where data integration produces equal or worse outcomes compared to single-dataset modelling. Here, we offer a decision-making framework to assess whether data integration may provide improvements over simpler modelling approaches. We focus on joint likelihood data integration, in which two or more datasets are linked to a single shared process model. We highlight three considerations for analysts deciding whether to use data integration: (1) the practical costs associated with developing and validating an integrated model; (2) the marginal benefits to model performance, which vary depending on data volume and coverage; and (3) the concordance (or compatibility) of the two datasets. Using a simulation study, we illustrate modelling outcomes under a variety of conditions of data volume and bias, showing consistent patterns across three distinct formulations of joint likelihood models. We explore a priori and a posteriori tests of data concordance, but we find that such tests fail to usefully differentiate between cases where joint modelling produces better or worse outcomes. Ultimately, we outline a decision-making workflow and illustrate its application to the joint modelling of real data.

Publication
Journal of Animal Ecology: https://doi.org/10.1111/1365-2656.70210
Jeffrey W. Doser
Assistant Professor