Exploring predictors’ importance in binomial logistic regressions

Summary

This tutorial presents a real example where dominance analysis is used to determine predictors’ importance in a binomial logistic regression model (Azen and Traxel, 2009). More specifically, we model the distribution of a tropical native bird species, inhabiting a small oceanic island, using a binomial generalized linear model, and dominance analysis to identify the most important environmental variables.

Introducing a tropical bird
Fitting a logistic regression model
Using dominance analysis
Applying bootstrap analysis

This document explains how to perform a dominance analysis to compare the relative importance of predictors in a binomial logistic regression model, using dominanceanalysis package. It is important to note that it only describes a small part of all dominanceanalysis functions. Nonetheless, it includes the most significant ones, such as dominanceAnalysis(), da.glm.fit(), bootDominanceAnalysis(), and bootAverageDominanceAnalysis().

1. Introducing a tropical bird

Understanding how a species is distributed throughout the landscape is key to create effective conservation actions. By disentangling the limiting factors behind a species distribution, we can start to understand its ecological requirements, which can help support conservation and population management actions.

In this tutorial, we explore the distribution of a tropical native bird species inhabiting a small oceanic island. Since human occupation, the island’s forests have disappeared from the flat lowland areas, located closer to the coastline. Nowadays, these areas are considered anthropogenic areas, which include cities, agricultural areas (e.g., horticultures), savannas, and oil-palm monocultures.

We use binomial generalized linear models (GLMs) to explore the role of several environmental variables in shaping this species distribution. Furthermore, we use dominance analysis to determine the relative importance of every variable in the logistic regression model.

We use the tropicbird dataset, which is a collection of points distributed across the island (Soares, 2017). In each of these points, a 10-minute count was carried out to record the species presence (assuming 1 if the species was present, or 0 if it was absent). The species’ presence/absence is the binary response variable (i.e., dependent variable). Additionally, we characterized all sampled points for each of the following environmental variables (i.e., independent variables, or predictors):

remoteness (rem) is an index that represents the difficulty of movement through the landscape, with the highest values corresponding to the most remote areas;
land use (land) is an index that represents the land-use intensification, with the highest values corresponding to the more humanized areas (e.g., cities, agricultural areas, horticultures, oil-palm monocultures);
altitude (alt) is a continuous variable, with the highest values corresponding to the higher altitude areas;
slope (slo) is a continuous variable, with the highest values corresponding to the steepest areas;
rainfall (rain) is a continuous variable, with the highest values corresponding to the rainy wet areas;
distance to the coast (coast) is the minimum linear distance between each point and the coast line, with the highest values corresponding to the points further away from the coastline.

Please note that in this dataset there are no false negatives, hence the bird was always recorded if present. Also, the dataset has no missing values, so it is already prepared for the analysis.

We start by loading the csv data using the data function.

library(dominanceanalysis)
data("tropicbird")

To see how the dataset is organized we can just write str(tropicbird). The first column represents the ID of each point. From the second to the seventh column we have the six continuous predictors (rem, alt, slo, rain, coast, and land), and finally in the last column we have the binary response variable (pres).

With the str() function, we can conclude the dataset has 2398 points, which we already know are distributed throughout the island.

str(tropicbird)
#> 'data.frame':    2398 obs. of  8 variables:
#>  $ ID   : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ rem  : num  5.77 5.78 5.79 5.8 5.82 ...
#>  $ alt  : int  67 49 32 23 53 243 244 307 356 380 ...
#>  $ slo  : num  9.99 10.28 7.63 7.43 12.27 ...
#>  $ rain : num  6396 6398 6401 6404 6456 ...
#>  $ coast: num  0.00657 0.00471 0.00287 0.00108 0.00231 0.00443 0.00433 0.00445 0.00374 0.00427 ...
#>  $ land : int  2 2 2 2 2 2 2 2 2 2 ...
#>  $ pres : int  0 0 0 0 0 0 0 0 0 0 ...

We randomly select 70% of the points as the training set to develop the binomial generalized linear model (train), and the remaining 30% as the testing set to validate the model (test). However, we have to import caTools package.

library(caTools)
set.seed(101) 
sample <- caTools::sample.split(tropicbird$ID, SplitRatio = .70)
train <- subset(tropicbird, sample == TRUE)
test  <- subset(tropicbird, sample == FALSE)

2. Fitting a logistic regression model

The next step is to fit the model to the data frame train, using the glm() function. Note that distribution of the response variable is defined by the argument family=, which in this case corresponds to the binomial distribution.

modpres <- glm(pres~rem+land+alt+slo+rain+coast, data=train, family=binomial(link='logit'))

The function summary() provides the summarized results of our model.

summary(modpres)
#> 
#> Call:
#> glm(formula = pres ~ rem + land + alt + slo + rain + coast, family = binomial(link = "logit"), 
#>     data = train)
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept) -1.255e+01  1.669e+00  -7.518 5.54e-14 ***
#> rem          6.225e-01  1.998e-01   3.116 0.001833 ** 
#> land         3.009e-01  4.048e-01   0.743 0.457210    
#> alt          1.920e-03  4.835e-04   3.972 7.13e-05 ***
#> slo          1.588e-02  1.567e-02   1.013 0.310834    
#> rain         6.892e-04  1.706e-04   4.040 5.35e-05 ***
#> coast        2.865e+01  8.476e+00   3.380 0.000725 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 530.66  on 1677  degrees of freedom
#> Residual deviance: 352.15  on 1671  degrees of freedom
#> AIC: 366.15
#> 
#> Number of Fisher Scoring iterations: 8

Remember, our goal with this analysis is to describe the occurrence of our tropical bird species in terms of a linear combination of six environmental variables (i.e., predictors) and a constant term (intercept).

First, we can see that remoteness (rem), altitude (alt), rainfall (rain) and distance to coast (coast) are statistically significant predictors (p < 0.05). Altitude has the lower p-value suggesting this predictor has a strong association with the species occurrence. All significant predictors have a positive coefficient, which indicates that our bird is associated to remote, rainy, higher altitude areas, further way from the coast.

If we use the anova() function, we can interpret the table of deviance.

anova(modpres, test="Chisq")
#> Analysis of Deviance Table
#> 
#> Model: binomial, link: logit
#> 
#> Response: pres
#> 
#> Terms added sequentially (first to last)
#> 
#> 
#>       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
#> NULL                   1677     530.66              
#> rem    1  145.298      1676     385.36 < 2.2e-16 ***
#> land   1    1.446      1675     383.92 0.2292467    
#> alt    1   10.647      1674     373.27 0.0011027 ** 
#> slo    1    0.718      1673     372.55 0.3967159    
#> rain   1    8.057      1672     364.49 0.0045317 ** 
#> coast  1   12.346      1671     352.15 0.0004418 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We must look at the difference between the null deviance, which is the deviance of the null model (i.e., model with only the intercept), and the residual deviance. The bigger is difference the best our model is doing against the null model. In our case, we can see that adding remoteness alone reduces the deviance drastically (i.e., from 530.66 to 385.36). The small deviance value of slope indicates this variable does not add much to the model, meaning that almost the same amount of variation is explained when this variable is added.

The McFadden index, R²M, is sometimes referred to as closer equivalent of the coefficient of determination, R², in linear regressions. We can obtain the R²M for our model by using the pR2() function in the pscl package.

library(pscl)
#> Classes and Methods for R originally developed in the
#> Political Science Computational Laboratory
#> Department of Political Science
#> Stanford University (2002-2015),
#> by and under the direction of Simon Jackman.
#> hurdle and zeroinfl functions by Achim Zeileis.
pR2(modpres)
#> fitting null model for pseudo-r2
#>          llh      llhNull           G2     McFadden         r2ML         r2CU 
#> -176.0740171 -265.3300722  178.5121103    0.3363963    0.1009205    0.3722362

After fitting the logistic regression to our dataset, we know how our species responds to the six environmental variables. Still, we are interested in the relative importance of each variable. Please note that we do not seek to identify which one of these predictors must be eliminated to achieve the best model. If that was our goal, we would use one of the several statistical model selection procedures, like the lasso. Instead, we want to assess predictors’ importance through the comparison of their individual contributions in the selected model. In other words, we want to explore the importance of each environmental variable to predict the species occurrence. Furthermore, this information will allow us to recognize the limiting factors constraining our species’ distribution, which if you remember can be used to create effective conservation actions.

3. Using dominance analysis

Initially, dominance analysis was developed for ordinary least squares regressions (Budescu, 1993), but currently it can be used with logistic regression models (Azen and Traxel, 2009) and hierarchical linear models (Luo and Azen, 2013).

This method is used to determine the relative importance of predictors in a regression analysis. Importance is defined as a qualitative comparison between pairs of predictors (Budescu, 1993). With this being said, one predictor is more important than another, if it contributes more to the prediction of the response variable in all possible subset models (i.e., all possible combinations of predictors) (Azen and Budescu, 2003). Nevertheless, we highlight that importance depends upon the set of predictors considered in the analysis.

In dominance analysis, one predictor is said to completely dominate another, if its additional contribution to every possible subset model (that doesn’t include any of the predictors) is greater than that of the other predictor (Azen and Traxel, 2009). However, if this level of dominance cannot be established, but the average additional contribution of that predictor within each model size is greater than that of the other predictor, we can say that the first conditionally dominates the latter. Yet, if this is still not established, but this predictor’s average conditional contribution is greater than that of the other predictor over all model sizes, we can say that the first generally dominates the latter.

For ordinary least squares regressions, the predictors’ additional contribution to a certain subset model is defined as the change in R² when the predictor is added to the model. In logistic regressions, several analogues of R² were proposed as measures of model fit, but only four were considered according to three criteria (Azen and Traxel, 2009). These are included in the dominanceanalysis package: McFadden (r2.m), Cox and Snell (r2.cs), Nagelkerke (r2.n), and Estrella (r2.e). To view these indices, we can use the da.glm.fit() function.

da.glm.fit()("names")
#> [1] "r2.m"  "r2.cs" "r2.n"  "r2.e"

With this very brief introduction to dominance analysis, we can start to analyze our dataset. We can use the glm object to perform dominance analysis, by using the dominanceAnalysis() function.

dapres<-dominanceAnalysis(modpres)

To show all the results, we use the print() function. However, we can explore just the raw values of each fit index by using getFits(). These raw values are organized in tables, one for each fit index. For simplicity, we only consider the McFadden index, R²M, to interpret the results (e.g., $fits$r2.m).

getFits(dapres,"r2.m")
#> 
#> Dominance analysis fit matrices
#> * Fit index:  r2.m 
#>                               rem  land   alt   slo  rain coast
#> 1                           0.274 0.156 0.074 0.038 0.069 0.125
#> rem                               0.003 0.023 0.005 0.000 0.015
#> land                        0.121       0.014 0.006 0.033 0.040
#> alt                         0.222 0.095       0.014 0.157 0.058
#> slo                         0.240 0.123 0.050       0.072 0.105
#> rain                        0.205 0.121 0.163 0.041       0.186
#> coast                       0.164 0.071 0.007 0.018 0.130      
#> rem+land                                0.020 0.004 0.001 0.012
#> rem+alt                           0.000       0.001 0.014 0.005
#> rem+slo                           0.002 0.020       0.001 0.013
#> rem+rain                          0.004 0.037 0.005       0.028
#> rem+coast                         0.000 0.013 0.003 0.014      
#> land+alt                    0.127             0.003 0.093 0.029
#> land+slo                    0.118       0.011       0.038 0.039
#> land+rain                   0.088       0.073 0.010       0.077
#> land+coast                  0.093       0.003 0.005 0.071      
#> alt+slo                     0.209 0.084             0.153 0.058
#> alt+rain                    0.079 0.030       0.010       0.077
#> alt+coast                   0.169 0.067       0.014 0.177      
#> slo+rain                    0.169 0.090 0.132             0.159
#> slo+coast                   0.148 0.057 0.003       0.126      
#> rain+coast                  0.047 0.011 0.054 0.014            
#> rem+land+alt                                  0.001 0.015 0.006
#> rem+land+slo                            0.018       0.002 0.011
#> rem+land+rain                           0.034 0.004       0.025
#> rem+land+coast                          0.014 0.003 0.014      
#> rem+alt+slo                       0.000             0.015 0.004
#> rem+alt+rain                      0.000       0.002       0.023
#> rem+alt+coast                     0.001       0.001 0.032      
#> rem+slo+rain                      0.003 0.034             0.027
#> rem+slo+coast                     0.000 0.011       0.015      
#> rem+rain+coast                    0.000 0.031 0.004            
#> land+alt+slo                0.126                   0.094 0.030
#> land+alt+rain               0.049             0.004       0.050
#> land+alt+coast              0.103             0.003 0.114      
#> land+slo+rain               0.082       0.066             0.076
#> land+slo+coast              0.091       0.002       0.075      
#> land+rain+coast             0.036       0.046 0.010            
#> alt+slo+rain                0.071 0.024                   0.072
#> alt+slo+coast               0.156 0.056             0.167      
#> alt+rain+coast              0.025 0.004       0.005            
#> slo+rain+coast              0.037 0.007 0.045                  
#> rem+land+alt+slo                                    0.015 0.005
#> rem+land+alt+rain                             0.002       0.023
#> rem+land+alt+coast                            0.001 0.032      
#> rem+land+slo+rain                       0.032             0.025
#> rem+land+slo+coast                      0.012       0.015      
#> rem+land+rain+coast                     0.032 0.004            
#> rem+alt+slo+rain                  0.000                   0.022
#> rem+alt+slo+coast                 0.001             0.033      
#> rem+alt+rain+coast                0.001       0.002            
#> rem+slo+rain+coast                0.000 0.029                  
#> land+alt+slo+rain           0.047                         0.050
#> land+alt+slo+coast          0.101                   0.114      
#> land+alt+rain+coast         0.022             0.004            
#> land+slo+rain+coast         0.031       0.040                  
#> alt+slo+rain+coast          0.021 0.002                        
#> rem+land+alt+slo+rain                                     0.023
#> rem+land+alt+slo+coast                              0.033      
#> rem+land+alt+rain+coast                       0.002            
#> rem+land+slo+rain+coast                 0.030                  
#> rem+alt+slo+rain+coast            0.001                        
#> land+alt+slo+rain+coast     0.020                              
#> rem+land+alt+slo+rain+coast

The first row represents the raw values of each univariate model. The following rows show the additional contribution of each predictor added to the subset model (indicated by the first column). For example, the univariate model containing altitude produces R²M = 0.074 (see entry in the first row under the alt column of the table). If distance to coast(coast) is added to this model, R²M increases by 0.058 (see entry in the alt row under the coast column of the table). In contrast, if rainfall (rain) is added to the altitude subset model, the increase in R²M is 0.157 (see entry in the alt row under the rain column of the table). These results indicate that rainfall dominates distance to coast when added to the altitude subset model, because the additional contribution of rainfall to this model (0.157) is larger than the additional contribution of distance to coast to the same model (0.058).

Furthermore, if the additional contributions of a predictor are higher than that of other predictor for every subset model, the first predictor is said to completely dominate the second. This is the case with distance to coast, which completely dominates slope (compare the values of the columns coast and slo in the previous table). The summarized results for complete dominance could be retrieved using dominanceMatrix().

dominanceMatrix(dapres, type="complete",fit.functions = "r2.m", ordered=TRUE)
#>       rem coast alt rain land slo
#> rem   0.5   0.5 0.5  0.5  1.0 1.0
#> coast 0.5   0.5 0.5  0.5  0.5 1.0
#> alt   0.5   0.5 0.5  0.5  0.5 0.5
#> rain  0.5   0.5 0.5  0.5  0.5 0.5
#> land  0.0   0.5 0.5  0.5  0.5 0.5
#> slo   0.0   0.0 0.5  0.5  0.5 0.5

This complete dominance matrix summarizes the relation between each pair of predictors. If the value between two predictors is 1, the predictor under the first column completely dominates the other predictor of the pair. If the value is 0, the predictor under the first column is completely dominated by the other predictor of the pair. Lastly, if the value is 0.5, complete dominance could not be established between the pair. For this reason, as distance to coast dominates slope completely in all submodels, in the matrix the element row:coast-col:slo have a value of 1 (see entry in the coast row under the slo column of the table).

However, if complete dominance cannot be established, we explore conditional dominance by using method contributionByLevel().

contributionByLevel(dapres,fit.functions="r2.m")
#> 
#> Contribution by level
#> * Fit index:  r2.m 
#>  level   rem  land   alt   slo  rain coast
#>      0 0.274 0.156 0.074 0.038 0.069 0.125
#>      1 0.190 0.082 0.051 0.017 0.079 0.081
#>      2 0.125 0.034 0.037 0.007 0.069 0.050
#>      3 0.078 0.009 0.030 0.004 0.054 0.033
#>      4 0.044 0.001 0.029 0.003 0.042 0.025
#>      5 0.020 0.001 0.030 0.002 0.033 0.023

For each model size (see column level of the table), the average additional contribution of each predictor is calculated. For example, the average additional contribution of altitude to models of size 1 is computed as (0.0072582 + 0.0137805 + 0.1629760 + 0.0227860 + 0.0504051)/5 = 0.051. To establish conditional dominance, we compare the average additional contributions across all model sizes. For example, rainfall conditionally dominates slope, because the average additional contribution of rainfall is higher than that of slope across all model sizes (compare the values of the columns rain and slo in the table).

A graph of contribution by levels can be plotted using method plot() with argument which.graph='conditional'.

plot(dapres, which.graph ="conditional",fit.function = "r2.m")

The summarized results for conditional dominance can be retrieved using dominanceMatrix() (see entry in the rain row under the slo column of the table).

dominanceMatrix(dapres, type="conditional",fit.functions = "r2.m", ordered=TRUE)
#>       rem alt rain coast land slo
#> rem   0.5 0.5  0.5   0.5  1.0 1.0
#> alt   0.5 0.5  0.5   0.5  0.5 1.0
#> rain  0.5 0.5  0.5   0.5  0.5 1.0
#> coast 0.5 0.5  0.5   0.5  0.5 1.0
#> land  0.0 0.5  0.5   0.5  0.5 0.5
#> slo   0.0 0.0  0.0   0.0  0.5 0.5

Lastly, if conditional dominance cannot be established, we explore general dominance by using averageContribution() method.

averageContribution(dapres,fit.functions = "r2.m")
#> 
#> Average Contribution by predictor
#>        rem  land   alt   slo  rain coast
#> r2.m 0.122 0.047 0.042 0.012 0.058 0.056

The average contribution could be also plotted using method plot() with argument which.graph='general'

plot(dapres, which.graph ="general",fit.function = "r2.m")

To determine general dominance, we compute the mean of each predictor’s conditional measures. We conclude that remoteness has the highest value (0.122) and generally dominates all other predictors. For this reason, in the general dominance matrix, remoteness assumes a value of 1 with all other predictors (see entry in the rem row under every column of the table).

dominanceMatrix(dapres, type="general",fit.functions = "r2.m", ordered=TRUE)
#>       rem rain coast land alt slo
#> rem   0.5  1.0   1.0  1.0 1.0 1.0
#> rain  0.0  0.5   1.0  1.0 1.0 1.0
#> coast 0.0  0.0   0.5  1.0 1.0 1.0
#> land  0.0  0.0   0.0  0.5 1.0 1.0
#> alt   0.0  0.0   0.0  0.0 0.5 1.0
#> slo   0.0  0.0   0.0  0.0 0.0 0.5

Nevertheless, depending on the goals, we might want to analyze the dominance between all pairs of predictors. If this is the case, we can simply use the dominance matrices explained above, which summarize the relation between pairs of predictors in three values (0, 0.5 and 1). In our example, when looking at complete dominance matrix, we conclude that distance to coast completely dominates slope, and remoteness completely dominates land use and slope. However, when looking at conditional dominance matrix, we realize that altitude conditionally dominates slope, and rainfall conditionally dominates slope. Lastly, when exploring general dominance matrix, we conclude that distance to coast generally dominates altitude and land use, land use generally dominates altitude and slope, rainfall generally dominates altitude, distance to coast and land use, and remoteness generally dominates altitude, distance to coast and rainfall.

Furthermore, if we want to establish a rank of importance among predictors, we can explore the raw values of general dominance. In our example, we conclude that remoteness is clearly the most important predictor to explain bird species occurrence (0.122), followed by rainfall (0.058), distance to coast (0.056), land use (0.047), altitude (0.042) and slope (0.012).

4. Applying bootstrap analysis

To evaluate the robustness of our results, we use bootstrapping analysis by using bootDominanceAnalysis() function. Remember that this method could take a long time to complete.

set.seed(12346)
bootmodpres100 <- bootDominanceAnalysis(modpres, R=10)
summary(bootmodpres100,fit.functions="r2.m")

#> Dominance Analysis
#> ==================
#> Fit index: r2.m 
#>    dominance    i     k Dij  mDij SE.Dij  Pij  Pji Pnoij  Rep
#>     complete  rem  land 1.0 0.995 0.0500 0.99 0.00  0.01 0.99
#>     complete  rem   alt 0.5 0.605 0.2047 0.21 0.00  0.79 0.79
#>     complete  rem   slo 1.0 0.990 0.0704 0.98 0.00  0.02 0.98
#>     complete  rem  rain 0.5 0.615 0.2115 0.23 0.00  0.77 0.77
#>     complete  rem coast 0.5 0.730 0.2505 0.46 0.00  0.54 0.54
#>     complete land   alt 0.5 0.500 0.0000 0.00 0.00  1.00 1.00
#>     complete land   slo 0.5 0.510 0.0704 0.02 0.00  0.98 0.98
#>     complete land  rain 0.5 0.500 0.0000 0.00 0.00  1.00 1.00
#>     complete land coast 0.5 0.480 0.0985 0.00 0.04  0.96 0.96
#>     complete  alt   slo 0.5 0.585 0.1888 0.17 0.00  0.83 0.83
#>     complete  alt  rain 0.5 0.490 0.0704 0.00 0.02  0.98 0.98
#>     complete  alt coast 0.5 0.440 0.1633 0.00 0.12  0.88 0.88
#>     complete  slo  rain 0.5 0.365 0.2231 0.00 0.27  0.73 0.73
#>     complete  slo coast 0.0 0.185 0.2426 0.00 0.63  0.37 0.63
#>     complete rain coast 0.5 0.535 0.1282 0.07 0.00  0.93 0.93
#>  conditional  rem  land 1.0 0.995 0.0500 0.99 0.00  0.01 0.99
#>  conditional  rem   alt 0.5 0.605 0.2047 0.21 0.00  0.79 0.79
#>  conditional  rem   slo 1.0 0.990 0.0704 0.98 0.00  0.02 0.98
#>  conditional  rem  rain 0.5 0.615 0.2115 0.23 0.00  0.77 0.77
#>  conditional  rem coast 0.5 0.730 0.2505 0.46 0.00  0.54 0.54
#>  conditional land   alt 0.5 0.500 0.0000 0.00 0.00  1.00 1.00
#>  conditional land   slo 0.5 0.510 0.0704 0.02 0.00  0.98 0.98
#>  conditional land  rain 0.5 0.500 0.0000 0.00 0.00  1.00 1.00
#>  conditional land coast 0.5 0.480 0.0985 0.00 0.04  0.96 0.96
#>  conditional  alt   slo 0.5 0.585 0.1888 0.17 0.00  0.83 0.83
#>  conditional  alt  rain 0.5 0.490 0.0704 0.00 0.02  0.98 0.98
#>  conditional  alt coast 0.5 0.440 0.1633 0.00 0.12  0.88 0.88
#>  conditional  slo  rain 0.5 0.365 0.2231 0.00 0.27  0.73 0.73
#>  conditional  slo coast 0.0 0.185 0.2426 0.00 0.63  0.37 0.63
#>  conditional rain coast 0.5 0.535 0.1282 0.07 0.00  0.93 0.93
#>      general  rem  land 1.0 0.995 0.0500 0.99 0.00  0.01 0.99
#>      general  rem   alt 0.5 0.605 0.2047 0.21 0.00  0.79 0.79
#>      general  rem   slo 1.0 0.990 0.0704 0.98 0.00  0.02 0.98
#>      general  rem  rain 0.5 0.615 0.2115 0.23 0.00  0.77 0.77
#>      general  rem coast 0.5 0.730 0.2505 0.46 0.00  0.54 0.54
#>      general land   alt 0.5 0.500 0.0000 0.00 0.00  1.00 1.00
#>      general land   slo 0.5 0.510 0.0704 0.02 0.00  0.98 0.98
#>      general land  rain 0.5 0.500 0.0000 0.00 0.00  1.00 1.00
#>      general land coast 0.5 0.480 0.0985 0.00 0.04  0.96 0.96
#>      general  alt   slo 0.5 0.585 0.1888 0.17 0.00  0.83 0.83
#>      general  alt  rain 0.5 0.490 0.0704 0.00 0.02  0.98 0.98
#>      general  alt coast 0.5 0.440 0.1633 0.00 0.12  0.88 0.88
#>      general  slo  rain 0.5 0.365 0.2231 0.00 0.27  0.73 0.73
#>      general  slo coast 0.0 0.185 0.2426 0.00 0.63  0.37 0.63
#>      general rain coast 0.5 0.535 0.1282 0.07 0.00  0.93 0.93

This method provides measures of accuracy for our estimates. More specifically, the bootstrap values of complete, conditional and general dominance are calculated for each pair of predictors (see column Dij that represents the original result, mDij that represents the mean for Dij on bootstrap samples, and SE.Dij that represents standard error). These values of dominance can be interpreted as the expected level of dominance between pairs of predictors, as the degree to which this pattern was found on the resamples. Note that we recommend this analysis is performed with at least 1000 replications (R), with the size of the original sample, for precise results (i.e., bootstrap samples).

Additionally, other parameters are estimated: Pij (proportion of bootstrap samples where i dominates j); Pji (proportion of bootstrap samples where j dominates i); Pnoij (proportion of samples where no dominance can be asserted); and Rep (proportion of samples where original dominance is replicated). When standard error and reproducibility are close or equal to zero and one, respectively, the results are fairly robust. For example, in our analysis we see that the bootstrap complete dominance of remoteness over land use has a standard error of 0.05 and a reproducibility of 0.99.

Moreover, we can obtain bootstrap general dominance values by using bootAverageDominanceAnalysis() function.

bootavemodpres100<-bootAverageDominanceAnalysis(modpres,R=100)
summary(bootavemodpres100,fit.functions=c("r2.m"))

#> Bootstrap Average for Dominance Analysis
#> ========================================
#> Resamples:  100 
#> Fit index: r2.m 
#>     Var original   bs.E      bias   bs.SE
#> 1   rem   0.1219 0.1259  0.003985 0.02136
#> 2  land   0.0473 0.0480  0.000688 0.00718
#> 3   alt   0.0418 0.0417 -0.000176 0.01161
#> 4   slo   0.0117 0.0135  0.001781 0.00736
#> 5  rain   0.0576 0.0601  0.002572 0.01275
#> 6 coast   0.0561 0.0574  0.001377 0.01395

The results are organized per fit index and include the estimated bootstrap values of general dominance for each predictor (bs.E) and its corresponding standard errors (bs.SE), as well as the previously calculated general dominance values (original). The difference between these values and the estimated bootstrap values is represented by Bias. In our example, the bias is small, which is another good indicator of our results’ robustness.

References

Following the citation rules of Biological Conservation

Azen, R., Budescu, D. V., 2003. The Dominance Analysis Approach for Comparing Predictors in Multiple Regression. Psychol. Methods 8, 129-148. https://doi.org/10.1037/1082-989X.8.2.129
Azen, R., Traxel, N., 2009. Using Dominance Analysis to Determine Predictor Importance in Logistic Regression. J. Educ. Behav. Stat. 34, 319-347. https://doi.org/10.3102/1076998609332754
Budescu, D. V., 1993. Dominance analysis: A new approach to the problem of relative importance of predictors in multiple regression. Psychol. Bull. 114, 542-551. https://doi.org/10.1037/0033-2909.114.3.542
Luo, W., Azen, R., 2013. Determining Predictor Importance in Hierarchical Linear Models Using Dominance Analysis. J. Educ. Behav. Stat. 38, 3-31. https://doi.org/10.3102/1076998612458319
Soares, F.C., 2017. Modelling the distribution of Sao Tome bird species: Ecological determinants and conservation prioritization. Faculdade de Ciencias da Universidade de Lisboa.

Author’s e-mail: filipa.mco.soares@gmail.com

Exploring predictors’ importance in binomial logistic regressions

Filipa Coutinho Soares

2024-02-04