ISCA stands for “Inductive Subgroup Comparison Approach”, a data-driven approach to identifying and comparing groups in data, originally proposed in Drouhot (2021). Specifically, ISCA identifies the main subgroups within Group 1 (a majority or reference group) and matches members of Group 2 (a minority group) to different subgroups from the majority based on social similarity. In other words, ISCA allows minority members to be compared to socially similar members of the majority. For instance, when comparing two ethnic groups (a majority and a minority group), ISCA might identify a subgroup comprising men of low social status in the majority group. It will then identify similar members of the minority group (also men of low social status) and match them to the appropriate subgroup within the majority, so that the comparison is more appropriate than it would be if we compared all members of the majority group to all members of the minority group. ISCA is especially useful when the majority and minority groups to be compared are internally heterogeneous, resulting in different baselines within the majority group against which the minority group should be compared (again, to illustrate: for many comparisons it is best to study low-status minority members vis-à-vis low-status majority members rather than majority members in general).
On a more theoretical level, ISCA helps researchers avoid essentializing minority and majority groups as (e.g., ethnically, racially, religiously) homogenous categories in their quantitative analyses, and to recover the internal social structures of social groups. Unlike common regression approaches, which obscure intra-group heterogeneity by yielding a single estimate for each pre-defined, fixed (nominal) group category, ISCA relies on a mixture of fuzzy clustering, Monte Carlo simulation, and regression analysis to assess minority and majority groups within a population as socially comparable subgroups, yielding an estimate for each subgroup. Estimates can then be compared across subgroups, as well as to a grand mean capturing traditional, single estimates at the group level. Together, these comparisons allow researchers to see which empirical subgroups may be driving overall trends observed at the aggregate level, and, conversely, which pointed trends at the subgroup level get lost when only looking at nominal group-level estimates. For an in-depth discussion, see Drouhot (2021: 804-811). If you are already familiar with the logic of ISCA, you can skip the next section and move right on to the first function.
ISCA consists of three distinct methodological steps.
The first step, through fuzzy cluster analysis, a family of data-partitioning techniques from the larger umbrella of unsupervised machine learning (Kaufman and Rousseeuw 2005; Molina and Garip 2019), is to inductively identify the key subgroups within the majority population.
In the second step, minority group members are assigned to one of the subgroups making up the majority population on the basis of (social) similarity.
In the third step, hypotheses are tested and outcomes of interest are modelled within these matched minority-majority subgroups.
Thus, ISCA effectively switches the relevant categories of analysis in intergroup comparisons from nominal (e.g., religious, ethnic, or otherwise categorically defined) groups to data-driven subgroups, and allows for within-group as well as the more traditional between-group comparisons.
From the point of view of the integration/assimilation literature within the social sciences, ISCA allows for the study of distinct assimilation/integration pathways by taking intragroup heterogeneity among both the majority and minority populations into account. It is characterized by three key features.
It is inductive, insofar as subgroups of interest emerge from the data rather than from the researcher’s prenotions. This is a point worth emphasizing: you could decide to compare low-income immigrants to low-income natives, for example. But you cannot, by definition, know in advance whether income is the right dimension along which to organize your comparison. Additionally, it is possible for several dimensions (income, gender, age, urban location, etc.) to consolidate into subgroups (Blau 1974). In other words, what matters for group heterogeneity may well be specific configurations of variables rather than specific variables. ISCA relies on the inherent reflexivity afforded by unsupervised machine-learning approaches: while domain-specific knowledge remains key to interpreting results, researchers need not impose assumptions about the structure of the data or the analytical appropriateness of a given social category in organizing heterogeneity within the data.
ISCA is probabilistic because it does not rely on “hard” cluster assignment (such as that obtained with k-means clustering), where observations can belong to one cluster only, as such an approach would lead to reifying the subgroups themselves and possibly overstating cross-cluster differences. Rather, it relies on fuzzy clustering, where membership in each cluster is uncertain and expressed through a membership score, which is then used to assign cases to groups in a stochastic and thus truly probabilistic manner.
More specifically, each individual observation receives k membership scores, where k is the number of clusters. Membership scores range from 0 to 1, and each individual’s membership scores sum to 1. If all individuals have a very high probability of belonging to only one cluster, then fuzzy and hard clustering do not differ much. However, if membership probabilities are balanced across clusters, hard clustering may result in arbitrary, and possibly misleading, group assignments.1
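To make this concrete, here is a minimal illustration in base R (with made-up membership scores) of how a stochastic assignment works:

# Made-up membership scores for one observation across k = 3 clusters:
w <- c(0.62, 0.25, 0.13)
sum(w)  # membership scores sum to 1
# A stochastic assignment draws the cluster from this distribution; repeating
# the draw shows how balanced scores translate into unstable assignments:
set.seed(1)
replicate(5, sample(1:3, size = 1, prob = w))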
Finally, ISCA is iterative. Because there is significant uncertainty around subgroup boundaries (expressed by balanced membership scores from the fuzzy clustering), results following stochastic assignment may misrepresent the underlying uncertainty about cluster membership when membership is assigned only once. Thus, ISCA relies on Monte Carlo simulation, iterating the assignment and modelling steps multiple times to obtain stable results. Rather than treating it as a statistical or analytical nuisance, the procedure hence regards assignment uncertainty as meaningful, since it reflects the blurry boundaries between the ideal-typical subgroups making up the majority and minority categories of interest.
The following three functions allow researchers to conduct ISCA in R.
Identifying Heterogeneity in the Majority Group through Fuzzy Clustering and Stochastically Assigning Individuals to a Subgroup
This function implements steps 1 and 2 (see Drouhot 2021: 808-10). First, identify and choose variables that are known, based on past research, to be associated with your outcome variable. In the original implementation, ISCA does not divide up the majority group sample directly in terms of the outcome variable but uses (demographic) variables that are known to be related to it. Note, however, that splitting groups directly in terms of the outcome of interest is of course possible, and can be appropriate depending on the research question and related substantive concerns.
To obtain subgroups in the majority sample, the ISCA package relies on the fuzzy c-means clustering algorithm (Bezdek, Ehrlich, and Full 1984), specifically the cmeans() function from the e1071 package (Meyer et al. 2023). It is advisable to transform continuous variables into dummies (e.g., coded 0 for values below the median and 1 for values above) to avoid an arbitrary weighting of attributes due to different scales, which would affect the clustering results in undesirable ways.
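For instance, a minimal sketch of such a median split (using variable names from the simulated dataset introduced below; note that, for simplicity, the illustrative examples later in this vignette nonetheless cluster on the raw variables):

# Dichotomize continuous clustering variables at their medians:
sim_data$income_dummy <- as.integer(sim_data$income > median(sim_data$income, na.rm = TRUE))
sim_data$age_dummy    <- as.integer(sim_data$age    > median(sim_data$age,    na.rm = TRUE))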
You also need to decide on the number of clusters you would like to model. There is a large literature on this question across the social sciences, as well as computer and data science. Hence, a large number of diagnostic tests exist to determine the most appropriate number, each with strengths and weaknesses; for instance, Drouhot (2021: 839-40 in Appendix B) used 6 tests from the fclust package (Ferraro, Giordani, and Serafini 2019). Additionally, users can compare multiple solutions and choose the one in which each subgroup is most easily interpretable. In particular, clustering solutions yielding either one very large group, multiple very small groups, or redundant groups should be approached with caution.
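As an example, here is a hedged sketch of such diagnostics, assuming fclust’s FKM() and Fclust.index() interfaces (verify against your installed version’s documentation):

library(fclust)
# Cluster only the majority group on the chosen clustering variables:
X <- sim_data[sim_data$native == 1, c("female", "age", "education", "income")]
for (k in 2:6) {
  fit <- FKM(X, k = k, m = 1.5)
  cat(k, "clusters, fuzzy silhouette:", Fclust.index(fit, "SIL.F"), "\n")
}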
For transparency, the function uses the following arguments within the cmeans() function: iter.max=100, verbose=TRUE, dist="manhattan", method="cmeans", m=1.5 (the default, which can be adjusted in the ISCA_random_assignments() function), and rate.par=NULL.
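For illustration, this corresponds roughly to the following sketch of e1071::cmeans() with those arguments (an approximation of what ISCA_random_assignments() does internally, not the package’s verbatim code):

library(e1071)
# Cluster the majority group only, on the chosen clustering variables:
X_majority <- as.matrix(sim_data[sim_data$native == 1,
                                 c("female", "age", "education", "income")])
fit <- cmeans(X_majority, centers = 3, iter.max = 100, verbose = TRUE,
              dist = "manhattan", method = "cmeans", m = 1.5, rate.par = NULL)
head(fit$membership)  # one row of k membership scores per observation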
Once you know the number of clusters you want to model and the variables with which the clusters should be created, the first function from the ISCA package can be used.
The ISCA_random_assignments() function requires the following arguments:
data= A dataset containing all relevant variables.
filter= The name of the variable that contains information on majority/minority status.
majority_group= The specific value, within the variable specified in the filter argument, indicating majority status. This can be either a numeric value or a character string.
minority_group= The specific value(s) indicating minority status in the filter variable. This can be either a numeric value or a character string, and either a single minority group or a vector of several minority groups.
fuzzifier= A value larger than 1 determining the extent of overlap between clusters. A value of 1 effectively makes fuzzy c-means equivalent to hard k-means. The default is 1.5.
n_clusters= Number of clusters to be created.
draws= Number of probabilistic draws. If not specified, the default is 500.
cluster_vars= Vector specifying the variables that should be used to create the clusters.
The output is a dataframe with one new column containing the cluster assignment for each probabilistic draw. The output is based on the originally provided dataset, and all variables, including those not used to create the clusters, remain unchanged so that they can be used in later analyses (e.g., for steps 2 and 3). This output is the foundation for the other two functions in the ISCA package. For purposes of illustration, the output below is from a simulated dataset containing 1000 observations. One observation can be interpreted as one individual. The output below only shows the first six individuals in this fictitious dataset.
## female age education income religiosity discrimination native
## 1 1 74 6 1608 2 2 0
## 2 1 20 4 994 4 4 1
## 3 1 69 9 1227 10 4 0
## 4 0 21 6 0 2 5 0
## 5 1 64 6 2132 10 3 0
## 6 1 27 5 1974 1 4 1
ISCA_step1 <- ISCA_random_assignments(data=sim_data, filter=native, majority_group=1, minority_group=c(0), fuzzifier = 1.5,
n_clusters=3, draws=5, cluster_vars= c("female", "age", "education", "income"))
## Iteration: 1, Error: 219.9604189517
## Iteration: 2, Error: 218.2123581526
## Iteration: 3, Error: 217.9846545604
## Iteration: 4, Error: 217.9588491548
## Iteration: 5 converged, Error: 217.9588491548
## female age education income religiosity discrimination native A1 A2 A3 A4 A5
## 1 1 20 4 994 4 4 1 2 3 1 2 2
## 2 1 27 5 1974 1 4 1 1 1 1 1 1
## 3 0 71 5 1042 4 4 1 2 2 2 2 2
## 4 1 53 4 2316 7 5 1 2 1 1 1 1
## 5 0 51 5 1213 1 5 1 2 2 2 2 2
## 6 1 56 5 682 5 6 1 3 3 2 3 3
In this example, four variables (i.e., female, age, education, income) are used to create 3 clusters to which majority and minority subjects are assigned probabilistically across 5 draws. The variable in the dataset that stores information on minority vs. majority status is called ‘native’: the value 1 indicates majority status and the value 0 indicates minority status. In this particular example, the clustering results (as indicated by columns A1 to A5; 5 multinomial assignments based on the distribution formed by the membership scores over the 3 clusters) show substantial stability across assignments A1-A5, indicating that the distribution of membership scores is probably very skewed towards one cluster. Note that for observation #1, the assignment is less stable than for other observations. In this particular example, based on fictitious, random data, the obtained subgroups do not coalesce on all variables; rather, income appears most influential in organizing subgroups (i.e., subgroups may be heterogeneous in terms of age and religiosity, for instance, but homogenous in terms of income). Note that the discrimination and religiosity variables are part of the dataset but were not used to create the clusters, because the clusters focus on social-structural indicators and the inclusion of the outcome variable is not always recommended (see Drouhot 2021).
Note also that the function probabilistically assigns members of both majority and minority groups to clusters. Specifically, the function first probabilistically assigns majority group members to the empirically defined clusters, and then probabilistically assigns each minority group member to the cluster they most closely resemble. In other words, the clusters are created based only on the majority group, and minority group members are subsequently matched to a cluster of reference based on social similarity. This is why, in the head(ISCA_step1) output above, the first six individuals are all natives, indicated by the value 1. When displaying the last 6 rows of ISCA_step1, we can see that the output contains minority members as well:
## female age education income religiosity discrimination native A1 A2 A3 A4
## 995 1 20 3 445 3 3 0 3 3 3 3
## 996 0 53 4 121 5 2 0 3 3 3 3
## 997 0 73 6 871 7 5 0 2 2 3 2
## 998 0 67 5 579 7 4 0 3 3 3 3
## 999 1 30 2 0 3 3 0 3 3 3 3
## 1000 1 59 6 2241 5 3 0 1 1 1 1
## A5
## 995 3
## 996 3
## 997 2
## 998 3
## 999 3
## 1000 1
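Before moving on, it can be useful to quantify how stable the assignments are across draws. A minimal sketch (using the assignment columns A1-A5 produced above) computes, for each observation, the share of draws matching its modal cluster:

assign_cols <- paste0("A", 1:5)
modal_share <- apply(ISCA_step1[, assign_cols], 1,
                     function(a) max(table(a)) / length(a))
summary(modal_share)  # values near 1 indicate stable assignments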
The second function gives an overview of the distribution of the variables of interest per cluster and across all iterations. This includes:
a grand mean (i.e., the mean of the means per variable and cluster, across iterations),
a grand standard deviation (i.e., the mean of the standard deviations per variable and cluster, across iterations),
a cluster error (i.e., the standard deviation of the means per variable and cluster, across iterations) for each variable.
The function also computes a grand mean and cluster error for a count variable for each cluster.
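To illustrate these definitions with made-up numbers (a hypothetical vector of per-draw cluster means for one variable):

# Mean age in cluster 1 across 5 draws (made-up values):
cluster_means <- c(48.1, 48.3, 48.2, 48.4, 48.2)
mean(cluster_means)  # grand mean
sd(cluster_means)    # cluster error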
The substantive goal for the analyst here is to better understand the obtained clusters by interpreting them based on domain-specific and other expert knowledge of the populations and context at hand. Variables of interest will typically include the variables that were used to build the clusters, as this helps to understand the clusters and assign substantive meaning to them. When producing such a descriptive table, remember that observations should first be filtered to include only members of the majority group, since they are the ones determining the social structure obtained in step 1 (minority members are only matched to the clusters of majority members they most closely resemble). Additionally, for purposes of comparison, the table can also include variables that were not part of the cluster building, for example other independent or dependent variables.
Variables with only two values are automatically recognized as categorical, dichotomous variables, and no grand standard deviation is calculated for them. If you work with multinomial factor variables, it is best to include them as dummy variables, as in the sketch below.
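A minimal sketch of such dummy coding, using a hypothetical three-level factor ‘region’ that is not part of the simulated dataset:

# Expand a multinomial factor into one dummy column per level:
sim_data$region <- factor(sample(c("urban", "suburban", "rural"),
                                 nrow(sim_data), replace = TRUE))
region_dummies <- model.matrix(~ region - 1, data = sim_data)
sim_data <- cbind(sim_data, region_dummies)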
The ISCA_clustertable() function requires the following arguments:
data= The dataset including all relevant variables and the random assignments from the first ISCA_random_assignments() function.
cluster_vars= A vector specifying the variables of interest.
draws= Number of probabilistic draws. The number of draws should be equal to the number of draws specified in the first step. If not specified, the default is 500.
majority_only <- ISCA_step1[ISCA_step1["native"] == 1, ]
result_ISCA_clustertable <- ISCA_clustertable(data = majority_only,
cluster_vars = c("native", "female", "age", "education", "income"),
draws = 5)
head(result_ISCA_clustertable, 50)
## variable 1 2 3
## 1 assignment 1.0000 2.0000 3.0000
## 2 grand_mean_native 1.0000 1.0000 1.0000
## 3 cluster_error_native 0.0000 0.0000 0.0000
## 4 grand_sd_native 0.0000 0.0000 0.0000
## 5 grand_mean_female 0.4673 0.5089 0.5655
## 6 cluster_error_female 0.0089 0.0125 0.0168
## 7 grand_mean_age 48.2452 47.5843 48.2351
## 8 cluster_error_age 0.1220 0.4307 0.7271
## 9 grand_sd_age 18.2713 18.2838 18.0426
## 10 grand_mean_education 4.9916 5.0076 4.9839
## 11 cluster_error_education 0.0516 0.0386 0.0409
## 12 grand_sd_education 1.4005 1.2592 1.2286
## 13 grand_mean_income 1964.4856 1228.5364 369.6069
## 14 cluster_error_income 15.1964 30.7765 28.0879
## 15 grand_sd_income 469.1825 381.0639 410.0264
## 16 grand_mean_count 140.0000 187.8000 136.2000
## 17 cluster_error_count 3.5355 4.9699 7.1903
Remember that the function examples are based on a simple artificial dataset; interpreting the clusters here is therefore not very straightforward or useful. With real-life data, clearer patterns are likely to emerge, to be interpreted by the researcher based on domain-specific knowledge.
Within-Cluster Modelling of Outcomes and Cross-Group Comparisons
The third function runs OLS regressions for each cluster and iteration. The output is a list with two entries, which can be extracted separately. The first entry contains the OLS regression estimates for each cluster, averaged across iterations, including estimates for a model that pools all clusters. The second entry contains the averaged adjusted R-squared values per cluster, indicating model fit.
The ISCA_modeling() function requires the following arguments:
data= The dataset including all relevant variables and the random assignments from the first ISCA_random_assignments() function.
model_spec= A model specification, similar to the lm() function.
draws= Number of probabilistic draws. The number of draws should be equal to the number of draws specified in the first and second steps. If not specified, the default is 500.
n_clusters= Number of clusters. This value should be equal to the number of clusters specified in the first and second steps.
weights= A vector specifying the variable in which the weights are stored. The default is NULL.
Note that the goal here becomes to compare majority and minority groups within each cluster, as well as differences between clusters within a given categorically defined group: in particular, whether minority members consistently differ from majority members, and whether predictors affect the main outcome of interest in a consistent manner across subgroups. Additionally, if a supplementary model at the categorical (i.e., nominal) group level is run, the comparison of the cluster-specific estimates with the group-level estimates is particularly meaningful, because it allows researchers to see which subgroups drive aggregate trends and, conversely, which significant trends at the subgroup level get obscured at the aggregate level.
Currently, the package offers no built-in tests for differences in effects between clusters, although these can be obtained through ad hoc testing, for instance via the Paternoster test for the equality of regression coefficients (Paternoster et al. 1998).
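A minimal sketch of such a test (a hypothetical helper, not part of the ISCA package), applying the z-statistic from Paternoster et al. (1998) to two averaged coefficients and their standard errors:

coef_equality_z <- function(b1, se1, b2, se2) {
  # z-test for the equality of two regression coefficients:
  z <- (b1 - b2) / sqrt(se1^2 + se2^2)
  c(z = z, p = 2 * pnorm(-abs(z)))
}
# e.g., compare the 'native' coefficient in clusters 1 and 3
# (values taken from the output below):
coef_equality_z(-1.3254, 0.0934, -1.4836, 0.1506)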
ISCA_modeling_results <- ISCA_modeling(data= ISCA_step1,
model_spec="religiosity ~ native + female + age + education + discrimination",
draws = 5, n_clusters = 3, weights = NULL)
## `mutate_if()` ignored the following grouping variables:
## • Column `cluster`
estimates <- as.data.frame(ISCA_modeling_results[1])
head(estimates, 30) # OLS estimates per cluster across iterations
## cluster term mean_coefficients mean_std.error mean_p_value
## 1 1 (Intercept) 4.1783 0.1869 0.0000
## 2 1 age -0.0027 0.0043 0.6502
## 3 1 discrimination 0.2077 0.0458 0.1409
## 4 1 education 0.0930 0.0402 0.4670
## 5 1 female 0.3082 0.0648 0.3542
## 6 1 native -1.3254 0.0934 0.0001
## 7 2 (Intercept) 4.8992 0.3641 0.0000
## 8 2 age -0.0014 0.0041 0.6824
## 9 2 discrimination 0.0439 0.0706 0.5639
## 10 2 education -0.0014 0.0368 0.7705
## 11 2 female 0.9220 0.0961 0.0012
## 12 2 native -1.3263 0.1075 0.0000
## 13 3 (Intercept) 5.6342 0.3043 0.0000
## 14 3 age 0.0038 0.0018 0.6894
## 15 3 discrimination -0.1685 0.0617 0.2333
## 16 3 education 0.0264 0.0274 0.8027
## 17 3 female 0.4384 0.1354 0.2077
## 18 3 native -1.4836 0.1506 0.0000
## 19 0 (Intercept) 4.9147 0.5211 0.0000
## 20 0 native -1.3733 0.1755 0.0000
## 21 0 female 0.5760 0.1753 0.0010
## 22 0 age -0.0003 0.0048 0.9514
## 23 0 education 0.0363 0.0656 0.5798
## 24 0 discrimination 0.0269 0.0713 0.7063
rsquared <- as.data.frame(ISCA_modeling_results[2])
head(rsquared) # Corresponding Adjusted R Squared values per cluster across iterations
## mean.fit.cluster1 mean.fit.cluster2 mean.fit.cluster3 pooled.model
## 1 0.04760011 0.06899341 0.06387298 0.06104137
In this example, religiosity is regressed on native, female, age, education, and discrimination. No weights are specified. Results are averaged across 5 iterations, and estimates are calculated for each of the three clusters. Additionally, there is a model 0, which pools all clusters. Note that, unlike in the second function (i.e., ISCA_clustertable()), clusters are not represented as columns but are instead stored in the “cluster” variable (i.e., as rows). This way, users can use the output for post-processing tasks, such as plotting the estimates.
The code below helps to transform the output so that clusters are depicted as columns, which can make it easier to compare coefficients across clusters. Remember that 0 indicates the pooled model.
library(dplyr)  # provides the %>% pipe used below
cluster_as_columns <- estimates %>%
  tidyr::pivot_longer(cols = c(mean_coefficients, mean_std.error, mean_p_value),
                      names_to = "Statistics",
                      values_to = "value") %>%
  tidyr::pivot_wider(names_from = cluster, values_from = value)
print(cluster_as_columns)
## # A tibble: 18 × 6
## term Statistics `1` `2` `3` `0`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) mean_coefficients 4.18 4.90 5.63 4.91
## 2 (Intercept) mean_std.error 0.187 0.364 0.304 0.521
## 3 (Intercept) mean_p_value 0 0 0 0
## 4 age mean_coefficients -0.0027 -0.0014 0.0038 -0.0003
## 5 age mean_std.error 0.0043 0.0041 0.0018 0.0048
## 6 age mean_p_value 0.650 0.682 0.689 0.951
## 7 discrimination mean_coefficients 0.208 0.0439 -0.168 0.0269
## 8 discrimination mean_std.error 0.0458 0.0706 0.0617 0.0713
## 9 discrimination mean_p_value 0.141 0.564 0.233 0.706
## 10 education mean_coefficients 0.093 -0.0014 0.0264 0.0363
## 11 education mean_std.error 0.0402 0.0368 0.0274 0.0656
## 12 education mean_p_value 0.467 0.770 0.803 0.580
## 13 female mean_coefficients 0.308 0.922 0.438 0.576
## 14 female mean_std.error 0.0648 0.0961 0.135 0.175
## 15 female mean_p_value 0.354 0.0012 0.208 0.001
## 16 native mean_coefficients -1.33 -1.33 -1.48 -1.37
## 17 native mean_std.error 0.0934 0.108 0.151 0.176
## 18 native mean_p_value 0.0001 0 0 0
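As a final post-processing example, here is a hedged sketch of plotting the averaged ‘native’ coefficient per cluster with ggplot2 (cluster 0 is the pooled model; the ±1.96 × SE bars are a rough visual aid, not a formal test):

library(ggplot2)
native_est <- estimates[estimates$term == "native", ]
ggplot(native_est, aes(x = factor(cluster), y = mean_coefficients)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean_coefficients - 1.96 * mean_std.error,
                    ymax = mean_coefficients + 1.96 * mean_std.error),
                width = 0.2) +
  labs(x = "Cluster (0 = pooled model)", y = "Mean coefficient: native")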
1. Membership scores are calculated as follows: \(w_{ij}=\frac{1}{\sum_{k=1}^{c} \left(\lVert x_i - c_j \rVert / \lVert x_i - c_k \rVert\right)^{2/(m-1)}}\)