Intercomparison of six national empirical models for PM2.5 air pollution in the contiguous US

Abstract
Empirical models, previously called land-use regression (LUR), are used to understand and predict spatial


Introduction
Empirical models can be used to understand and predict levels of outdoor air pollution, including at unmeasured locations. The name ("empirical") emphasizes that the models reflect empirical measurements. Such model results have been used, for example, in health risk assessment, environmental epidemiology, and environmental justice analysis.
Generating empirical-model results typically involves three steps: (1) Model building: generating an empirical model to predict measured concentrations (i.e., the dependent variable; the model is calibrated to and attempts to predict these), using several parameters that might correlate with concentrations (i.e., the potential independent variables). (2) Model testing, to quantify parameters such as uncertainty, robustness, error, and bias. If multiple models were built by a research group, the model-testing phase could involve a model-selection process. Hold-out cross-validation typically occurs in this step. (3) Model application, wherein the final selected model(s) is used to estimate concentrations throughout the domain of interest (e.g., at all Census Block centroids in the contiguous US).
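The three steps above can be sketched in code. The following is a minimal illustration using a simple linear model with synthetic data (the covariates, sample sizes, and split fraction are all hypothetical, not taken from any of the models discussed here):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monitor dataset: 200 monitors, three land-use covariates
# (e.g., road density, elevation, population density), and observed PM2.5.
n = 200
X = rng.normal(size=(n, 3))
pm25 = 8.0 + X @ np.array([1.5, -0.8, 0.4]) + rng.normal(scale=0.5, size=n)

# Step 1, model building: fit an ordinary least squares model on a
# random 80% of the monitors.
idx = rng.permutation(n)
train, test = idx[:160], idx[160:]
A_train = np.column_stack([np.ones(train.size), X[train]])
beta, *_ = np.linalg.lstsq(A_train, pm25[train], rcond=None)

# Step 2, model testing: evaluate against the held-out 20% of monitors.
A_test = np.column_stack([np.ones(test.size), X[test]])
rmse = np.sqrt(np.mean((A_test @ beta - pm25[test]) ** 2))

# Step 3, model application: predict at unmonitored locations (here,
# new covariate vectors stand in for, e.g., Census Block centroids).
X_new = rng.normal(size=(5, 3))
pred_new = np.column_stack([np.ones(5), X_new]) @ beta
```

The national models discussed in this paper use far richer covariates and methods (kriging, machine learning, data fusion), but the build/test/apply structure is the same.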
Studies to intercompare empirical models are scarce, especially for large geographies. Some studies have compared empirical models with mechanistic models (e.g., CMAQ) (Marshall et al., 2008; Samoli et al., 2020), satellite-based models (e.g., aerosol optical depth, AOD) (Yu et al., 2018; Cowie et al., 2019), or hybrid models (Michanowicz et al., 2016; Zhang et al., 2021). Other studies have compared results using different methods for model-building (e.g., LUR vs. machine learning vs. kriging vs. hybrid empirical models) (Adam-Poupart et al., 2014; Jain et al., 2021; Dharmalingam et al., 2022). However, most prior comparisons were at the city or region level, and comparisons were generally within a single research team. We identified only one study that compared empirical models nationwide (Lu et al., 2021). This paper adds to the literature by comparing concentration predictions from six annual-average PM2.5 empirical models for the contiguous US. Each model was generated by a different research group; they differ in their approaches. Our analysis compares the predictions from these models at three spatial scales: nationally, regionally, and urban/rural.

General
Our approach is to intercompare a sample of six national empirical models for annual-average ambient PM2.5. We focused on annual averages for fine particles (PM2.5) for several reasons: PM2.5 is an important criteria pollutant, regulated by the US EPA through a health-based National Ambient Air Quality Standard (NAAQS); millions of people in the US live in areas that exceed the NAAQS (US EPA, 2022a); and the health effects associated with annual-average PM2.5 are large. Importantly, multiple national empirical models of annual-average PM2.5 are available for this pollutant.
In general, one way to intercompare models would be for all modelers to pre-agree to a set of model-building and model-testing observations. (Or, if there were a set of measurements that no model included in model-building, e.g., a dataset that was unknown or otherwise unused, then the outcome would be similar: a dataset that could be used to test all of the models.) In this case, it would be possible to compare each method against the held-out cross-validation measurements. However, in the current intercomparison, each research group used their own held-out data, comparison metrics, and approach to investigate model uncertainty. Furthermore, the models incorporate the monitoring data in different ways (e.g., via a kriging component); for that reason, simply comparing the six models against observations (which were used during model-building) may not shed light on model reliability at locations without measurements.
Instead, we directly intercompare the models, without comparing against held-out measurements. We do not have "gold-standard" observations to compare against. Nevertheless, we believe that useful insights can be gained from the intercomparisons conducted.

Input data
We obtained year-2010 predicted PM2.5 concentrations for six empirical models (see Table 1) via data download or direct request from researchers. Three models are "point based" (their predictions apply to a specific spatial location): (1) The CACES EPA ACE center model is based on universal kriging with partial least squares data-reduction (PLS-UK) (Kim et al., 2020). (2) The EPA downscaler provides a Bayesian space-time "fuse" of monitoring data and 12 km CMAQ model outputs (US EPA, 2022b). (3) The MESA-Air models use space-time PLS and expectation-maximization to fill in missing observations (Keller et al., 2015). The other three models are "gridded models" (predictions are for grid locations, reflecting the average concentration in that region [e.g., in a ~1 km² area]): (4) The Harvard/MIT EPA ACE center model employs a generalized additive model to integrate multiple machine-learning algorithms (Di et al., 2019). (5) The SEARCH EPA ACE center model is based on a fusion of WRF-Chem, satellite data (MAIAC AOD), and a kriging of EPA monitor data (Goldberg et al., 2019). (6) The model by van Donkelaar et al. (2019) statistically "fuses" a chemical transport model (GEOS-Chem), satellite observations of aerosol optical depth, and ground-based observations using a geographically weighted regression.
To make direct intercomparisons, we aligned spatiotemporal aspects of the models to be annual-average and by Census Tract (n ~ 74,000). When sub-annual (e.g., monthly) predictions were provided, we calculated annual averages. When sub-tract (e.g., block) predictions were provided, we calculated Tract means. When predictions were gridded, we converted to Census geographies by extracting values at block locations and then population weighting to the tract level.
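For the gridded models, the grid-to-tract conversion described above can be sketched as follows (a minimal illustration; the block-level values, tract IDs, and variable names are hypothetical):

```python
import numpy as np

# Hypothetical block-level records: the tract each block belongs to, the
# block population, and the grid value extracted at the block location.
tract_id = np.array([1, 1, 1, 2, 2])
block_pop = np.array([100, 300, 0, 50, 150])
grid_pm25 = np.array([9.0, 10.0, 12.0, 7.0, 8.0])

def popweighted_tract_means(tract_id, pop, conc):
    """Population-weighted mean concentration for each tract."""
    means = {}
    for t in np.unique(tract_id):
        m = tract_id == t
        w = pop[m].astype(float)
        if w.sum() > 0:
            means[int(t)] = float(conc[m] @ w / w.sum())
        else:
            # Fall back to an unweighted mean for zero-population tracts.
            means[int(t)] = float(conc[m].mean())
    return means

means = popweighted_tract_means(tract_id, block_pop, grid_pm25)
# Tract 1: (9.0*100 + 10.0*300 + 12.0*0) / 400 = 9.75
```

Note that blocks with zero population contribute nothing to the weighted mean, so the tract value reflects where people live within the tract.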
One of the models (the SEARCH model) is only available for the eastern half of the contiguous US (east of 90° W longitude), which includes US cities as far west as Chicago. The other five models are available for the entire contiguous US.

Analysis
We conducted three pairwise comparisons of the model predictions: (1) scatterplot matrices, (2) Pearson's r, and (3) root mean square difference (RMSD) between predictions. We also generated boxplots showing the distribution of predictions, and calculated two values in each tract to indicate the spread of model predictions: the range (i.e., max minus min) and the trimmed range (second-highest value minus second-lowest value).
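These comparison metrics can be sketched as follows (an illustration with made-up predictions for three tracts and four models, not the actual model outputs):

```python
import numpy as np

# Rows are tracts; columns are four hypothetical models' predictions.
preds = np.array([
    [ 8.0,  8.5,  7.9,  9.0],
    [12.0, 11.5, 12.2, 12.8],
    [ 5.0,  6.5,  5.2,  5.4],
])

def pearson_r(a, b):
    """Pearson correlation between two models' tract-level predictions."""
    return float(np.corrcoef(a, b)[0, 1])

def rmsd(a, b):
    """Root mean square difference between two models' predictions."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Example pairwise comparison between models 0 and 1.
r_01 = pearson_r(preds[:, 0], preds[:, 1])
rmsd_01 = rmsd(preds[:, 0], preds[:, 1])

# Per-tract spread across models: range and trimmed range.
s = np.sort(preds, axis=1)
tract_range = s[:, -1] - s[:, 0]      # max minus min
trimmed_range = s[:, -2] - s[:, 1]    # 2nd-highest minus 2nd-lowest
```

The trimmed range is less sensitive than the range to a single outlying model prediction in a tract.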
To assess factors that may modify model agreement, we conducted comparisons for the following geographies: (1) all locations, (2) urban vs. rural (urban defined as all tracts intersecting with Census urbanized areas; all remaining tracts considered rural), (3) by region (using the 9 NOAA climate regions), and (4) stratified by population density (using the 2010 tract-level population density).
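Each stratified comparison amounts to computing the same pairwise metric within subsets of tracts. A minimal sketch of the urban/rural split (the flags and prediction values are hypothetical):

```python
import numpy as np

# Hypothetical per-tract data: an urban flag and two models' predictions.
urban = np.array([True, True, False, False, False])
model_a = np.array([10.0, 11.0, 6.0, 7.0, 5.5])
model_b = np.array([10.4, 10.8, 7.0, 6.2, 6.3])

def rmsd(a, b):
    """Root mean square difference between two models' predictions."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Compute the agreement metric separately within each stratum.
rmsd_urban = rmsd(model_a[urban], model_b[urban])
rmsd_rural = rmsd(model_a[~urban], model_b[~urban])
```

The regional and population-density stratifications work the same way, with tracts grouped by NOAA climate region or by density bin instead of by the urban flag.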
Model-model comparisons by geography (Figure 2) suggest modest differences among most regions, and minor differences between urban and rural locations. Pearson correlation coefficients indicate that model-model agreement is slightly lower in the Midwest and South than in other regions. RMSDs indicate agreement is slightly lower in the West.
Figure 3 shows the prediction variability by concentration and location. Two aspects stand out. First, the models show relative agreement across the range of concentrations (Figure 3D). In locations for which the median predicted concentration is comparatively low (less than 6 µg/m³), EPA predictions tend to be slightly higher than the other models. For the very lowest-concentration locations, with median predicted concentrations less than 3 µg/m³, the Martin2019 predictions also tend to be slightly larger than the other models. The SEARCH model is only available for the eastern half of the contiguous US and therefore excludes the lower-population-density, lower-concentration regions found in the western half of the contiguous US. The CACES and Harvard models tend to agree with each other and to be near the median prediction, for each concentration range (Figure 3D).
Second, the range of model predictions (a measure of between-model disagreement) exhibits a potentially surprising relationship with concentration level (see Figures 3E, 3F). We might have expected that the range of predictions would be wider for higher-concentration locations (e.g., consistent with the models having a certain percent error in their predictions). Instead, the range of model predictions is approximately constant across levels of pollution (Figures 3E, 3F), consistent with an additive rather than multiplicative error model. To the extent that there is a pattern (more so for Figure 3E than Figure 3F), the range of predictions is greater in lower- than in higher-concentration locations. This finding reflects the patterns mentioned in the previous paragraph: below 5 or 6 µg/m³, the EPA predictions (and, below 3 µg/m³, the Martin2019 predictions as well) are larger than the other models' predictions; it suggests that predicting concentrations in low-concentration locations might be more challenging (greater model-model difference) than in medium- or high-concentration locations.
Overall, while some model-model differences are revealed in Figure 3, the main finding is relative agreement. Model-model comparisons can identify the level of model agreement or disagreement, but not accuracy or error. Even where the models agree (or disagree), it is possible that all of the models are incorrect. Thus, a useful step for future research would be to compare against held-out measurements, either via a coordinated effort by the researchers to hold out a consistent set of measurements, or via an independent dataset of concentrations that none of the researchers employed in model-building.
Limitations of this research include the following. (1) We considered one set of spatiotemporal comparisons (annual-average; national/regional/urban-rural) and one set of metrics (RMSD, correlation), but did not investigate all possible comparisons (e.g., seasonal or daily models, or subregional or local/community model results) or metrics. Other metrics or spatiotemporal representations of the models may also be useful for health, environmental justice, or risk analysis. (2) We have not specifically investigated the fitness of these models for specific purposes, including epidemiological studies, environmental justice studies, public outreach, regulatory analysis, or risk assessment. (3) As mentioned above, we did not compare against measurements; this paper presents only a model-model comparison. Model-model agreement is not the same as a model being "correct". (4) We have identified that the empirical models are relatively consistent with each other, but we have not investigated, within the models themselves, why. For example, it may be that the models use the same or similar independent variables; or, the similarities in model predictions may hold despite large differences in the independent variables employed.
Strengths of this research include the following. We intercompared several models, and shed light on similarities and differences nationally, regionally, between urban and rural locations, by pollution level, and by population density. This is, to our knowledge, the first intercomparison of national empirical models. As noted above, we did not compare against monitors; however, that aspect can partially be viewed as a strength, because the monitoring network is not evenly distributed spatially. Comparisons of models at monitor locations may or may not shed light on concentrations at unmonitored locations; the comparisons here are at Census geographies (Tracts) and so reflect locations where people live.
The models employ different techniques for model building. Some are closer to a linear model; others use machine learning or highly complex mathematical relationships that would be difficult for a human to create or understand. They employ a wide variety of independent variables. However, all of the models use EPA monitoring station data as the model-building dataset. Whatever strengths or weaknesses exist in using EPA monitors (and their locations) for empirical models, those likely impact all of the models.
We conducted several sensitivity analyses.First, reflecting that SEARCH results are only available in the eastern half of the US, we generated pairwise scatterplots for only the eastern half of the US (Figure S1).
Next, we generated separate scatterplots for urban-only (Figure S2), urban-only in the eastern half of the US (Figure S3), rural-only (Figure S4), and rural-only in the eastern half of the US (Figure S5). We find, for example, that the maximum RMSD is slightly larger for rural areas than for urban areas, a finding that may differ from expectations but is consistent with results described above (Figure 3E); it may reflect the lower density of monitors in rural areas, or a weaker correlation between concentrations and land use in rural than in urban areas.
We repeated the analyses in Figure 3 but for the eastern half of the US (Figures S6 and S7). The findings are generally consistent with the results above: the models generally agree with each other, and the range of predictions (a measure of model-model disagreement) is greater at lower-concentration locations than at higher-concentration locations.
* Tract centroid predictions may be publicly released at a later date.

Figure 1: Scatterplot matrix for 2010 tract-level PM2.5. Scatterplots in the upper right show pairwise tract-level predictions from each model (µg/m³). Grey dashed line shows the 1:1 line; red solid line shows a linear trendline. Corresponding boxes in the bottom left show Pearson's correlation (r; unitless) and root mean squared difference (RMSD; µg/m³) between model predictions.

Figure 3: Variability by concentration and location. Maps show the median concentration among model predictions within each tract (A) and the within-tract variability of model predictions, calculated as the max minus the min (B) and the 2nd max minus the 2nd min (C). Boxplots show (y-axis) the range of tract-level model predictions (D) and the within-tract variation, calculated as either the max minus the min (E) or the 2nd max minus the 2nd min (F), of model predictions within each tract as a function of the median concentration among model predictions within each tract, binned to 1 µg/m³ bins (x-axis). In the boxplots, the horizontal bar shows the median, the box shows the interquartile range, and the vertical lines show the 5th and 95th percentiles of the variability for tracts within each bin.