Dataset compilation. The dataset used in this work was compiled from literature records of waterborne outbreaks of hepatitis E available in the PubMed database and in WHO reports. In this work we use the term “waterborne HEV” to refer to outbreaks of hepatitis E caused by either genotype 1 or genotype 2. Literature entries reporting HEV outbreaks were included if (i) the record provided the geographical coordinates of the location where the outbreak was detected, and (ii) the record provided clear evidence that the reported outbreak was related to the consumption or use of contaminated water, or alternatively that the causative agent was proven to be HEV genotype 1 or 2. All outbreaks included in the final dataset (n=59) were geo-referenced using Google Earth. A map showing our dataset entries is available here.

Environmental data. Nineteen layers of environmental data were downloaded from the WorldClim dataset at a spatial resolution of 2.5 arc-min. The bioclimatic variables in this dataset are derived from monthly temperature and rainfall values. Specifically, the variables represent annual trends (such as the mean annual temperature or the annual precipitation), seasonality (such as the annual range in temperature and precipitation) and extreme environmental factors (such as the temperature of the warmest month of the year or the precipitation of the driest quarter of the year). The WorldClim bioclimatic values are generated by interpolating average monthly data recorded at meteorological stations worldwide from 1950 to 2000. In addition to the bioclimatic variables, we also included global data on soil wetness, potential evapotranspiration (http://www.cgiar-csi.org/data) and human population density at 2.5 arc-min resolution. Human population density data for the year 2015 were obtained from the Gridded Population of the World dataset.
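To illustrate how bioclimatic variables of this kind follow from monthly values, the sketch below computes a few of them for a single grid cell. It is a simplified illustration and not the WorldClim procedure itself: the function name `bioclim_subset` and the dictionary keys are hypothetical, the monthly mean temperature is approximated as the midpoint of the monthly minimum and maximum, quarters are taken as rolling three-month windows that wrap around the year, and precipitation seasonality is taken as a plain coefficient of variation (published implementations may differ in detail).

```python
import numpy as np

def bioclim_subset(tmin, tmax, prec):
    """Derive a few bioclimatic summaries from 12 monthly values of one grid cell.

    tmin, tmax: monthly minimum and maximum temperature (12 values each).
    prec: monthly precipitation totals (12 values, not all zero).
    """
    tmin, tmax, prec = (np.asarray(a, dtype=float) for a in (tmin, tmax, prec))
    # Assumption: monthly mean temperature approximated as the min/max midpoint.
    tmean = (tmin + tmax) / 2.0
    # Twelve rolling quarters (month i, i+1, i+2), wrapping around the year.
    quarter_prec = prec + np.roll(prec, -1) + np.roll(prec, -2)
    quarter_temp = (tmean + np.roll(tmean, -1) + np.roll(tmean, -2)) / 3.0
    return {
        "annual_mean_temp": tmean.mean(),                        # annual trend
        "annual_precip": prec.sum(),                             # annual trend
        "mean_diurnal_range": (tmax - tmin).mean(),              # cf. bio2
        "precip_seasonality": 100.0 * prec.std() / prec.mean(),  # cf. bio15 (CV, %)
        "precip_driest_quarter": quarter_prec.min(),             # cf. bio17
        "mean_temp_wettest_quarter": quarter_temp[np.argmax(quarter_prec)],  # cf. bio8
    }

# Made-up monthly values for one cell, for illustration only.
example = bioclim_subset(
    tmin=np.arange(12),
    tmax=np.arange(12) + 10,
    prec=[10, 10, 10, 0, 0, 0, 20, 20, 20, 30, 30, 30],
)
```

Applied to every cell of a monthly climate grid, a function of this kind would yield one raster layer per variable.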

To identify and exclude highly correlated environmental variables before developing the SDM, we calculated the variance inflation factor (VIF) for each variable and excluded those with values higher than 6. VIF quantifies the level of multicollinearity in a least squares regression analysis: for a given variable it equals 1/(1 − R²), where R² is the coefficient of determination obtained by regressing that variable on all the others. Out of the 21 initial variables included in the correlation analyses, we retained those detailed in the table below and used them to obtain the distribution model of HEV.
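The VIF screening can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function names `vif` and `drop_collinear` are hypothetical, and the stepwise strategy (drop the variable with the highest VIF, recompute, repeat) is one common way of applying a threshold that the text does not specify. Only the threshold of 6 comes from the text.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of X (n_samples x n_variables)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least squares regression of variable j on all others, with intercept.
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out

def drop_collinear(X, names, threshold=6.0):
    """Assumed stepwise rule: drop the highest-VIF variable until all VIFs <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names

# Synthetic example: "c" is almost a copy of "a", "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.01 * rng.normal(size=200)
kept = drop_collinear(np.column_stack([a, b, c]), ["a", "b", "c"])
```

In this synthetic case one of the two nearly identical variables is removed and the independent one is kept. In practice the rows of `X` would be the values of the environmental layers sampled at a set of grid cells.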

Species distribution models (SDMs). The compiled dataset was used to build species distribution models for the waterborne HEV genotypes using the MaxEnt algorithm. To obtain more information on the factors influencing the spatial occurrence of HEV outbreaks we processed two different models: (i) a model using population density data, environmental data and our complete dataset of hepatitis E outbreaks (general model, available here), and (ii) a model using environmental data and our complete dataset of hepatitis E outbreaks (environmental model, available here).

List of environmental variables used for the development of HEV distribution models:

Variable (id. WorldClim dataset)                  General model (AUC=0.91)*   Environmental model (AUC=0.90)*
Population density                                80.9                        -
Annual potential evapotranspiration               11.4                        56.9
Precipitation seasonality (bio15)                 3.8                         21.5
Mean diurnal range (bio2)                         1.3                         0.4
Precipitation of warmest quarter (bio18)          1.1                         1.6
Precipitation of the driest quarter (bio17)       0.7                         8.4
Precipitation of the coldest quarter (bio19)      0.5                         5.4
Mean temperature of the wettest quarter (bio8)    0.3                         2.5
Soil topographic index                            0                           2.1
Mean temperature of the driest quarter (bio9)     0                           1.3

*Values correspond to the percent contribution of each variable to the model.
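The AUC reported for each model can be read as the probability that a randomly chosen outbreak location receives a higher predicted suitability than a randomly chosen background location. The sketch below computes it in that rank-based form. It is an illustration only: the function name `auc` and the example scores are hypothetical, and the text does not state how the reported AUC values were evaluated (for instance, on training or on held-out data).

```python
import numpy as np

def auc(presence_scores, background_scores):
    """Rank-based AUC: P(presence score > background score), ties counted as 1/2."""
    pos = np.asarray(presence_scores, dtype=float)
    neg = np.asarray(background_scores, dtype=float)
    # Compare every presence score with every background score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (pos.size * neg.size)

# Made-up suitability scores, for illustration only.
perfect = auc([0.9, 0.8], [0.1, 0.2])   # every presence outranks every background
partial = auc([0.9, 0.3], [0.5, 0.1])   # 3 of 4 pairs ranked correctly
```

A value of 0.5 corresponds to a model that ranks outbreak locations no better than chance, and a value of 1 to a model that ranks every outbreak location above every background location.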

