1 Introduction

The SNPassoc package contains facilities for data manipulation, tools for exploratory data analysis, convenient graphical facilities, and tools for assessing genetic association for both quantitative and categorical (case-control) traits in whole-genome approaches. Genome-based studies are normally analyzed using a multistage approach. In the first step, researchers are interested in assessing association between the outcome and thousands of SNPs. In a second and possibly third step, medium/large-scale studies are performed in which only a few hundred SNPs, those with a putative association found in the first step, are genotyped. SNPassoc is specifically designed for analyzing this kind of design. In addition, studies following a convenience-based approach (selecting variants on the basis of logistical considerations such as the ease and cost of genotyping) can also be analyzed using SNPassoc. Different genetic models are also implemented in the package. Multiple SNPs can be analyzed jointly using either haplotype or gene-gene interaction approaches.

This document is an updated version of the initial vignette that was published with the SNPassoc paper (González et al., 2007). It contains a more realistic example based on a real dataset. The original vignette is still available here.

2 Data loading

SNP data are typically available in text format or Excel spreadsheets, which are easily imported into R as a data frame. Here, as an illustrative example, we analyze a dataset containing epidemiological information and 51 SNPs from a case-control study on asthma. The data are available within SNPassoc and can be loaded into the R session by

data(asthma, package = "SNPassoc")
str(asthma, list.len=9)
'data.frame':   1578 obs. of  57 variables:
 $ country    : Factor w/ 10 levels "Australia","Belgium",..: 5 5 5 5 5 5 5 5 5 5 ...
 $ gender     : Factor w/ 2 levels "Females","Males": 2 2 2 1 1 1 1 2 1 1 ...
 $ age        : num  42.8 50.2 46.7 47.9 48.4 ...
 $ bmi        : num  20.1 24.7 27.7 33.3 25.2 ...
 $ smoke      : int  1 0 0 0 0 1 0 0 0 0 ...
 $ casecontrol: int  0 0 0 0 1 0 0 0 0 0 ...
 $ rs4490198  : Factor w/ 3 levels "AA","AG","GG": 3 3 3 2 2 2 3 2 2 2 ...
 $ rs4849332  : Factor w/ 3 levels "GG","GT","TT": 3 2 3 2 1 2 3 3 2 1 ...
 $ rs1367179  : Factor w/ 3 levels "CC","GC","GG": 2 2 2 3 3 3 2 3 3 3 ...
  [list output truncated]
asthma[1:5, 1:8]
  country  gender      age      bmi smoke casecontrol rs4490198 rs4849332
1 Germany   Males 42.80630 20.14797     1           0        GG        TT
2 Germany   Males 50.22861 24.69136     0           0        GG        GT
3 Germany   Males 46.68857 27.73230     0           0        GG        TT
4 Germany Females 47.86311 33.33187     0           0        AG        GT
5 Germany Females 48.44079 25.23634     0           1        AG        GG

We observe that we have case-control status (0: control, 1: asthma) and another 5 variables encoding the country of origin, gender, age, body mass index (bmi) and smoking status (0: non-smoker, 1: ex-smoker, 2: current smoker). There are 51 SNPs whose genotypes are given by the allele names.
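For data that do not ship with a package, the text-file route mentioned at the start of this section can be sketched as follows. This is only an illustration in base R: the file contents, the column names (id, rs0001, rs0002) and the choice of a tab-delimited layout are assumptions made for the example, not part of the asthma dataset.

```r
# Hypothetical toy file: written to a temporary location so that the
# snippet is self-contained. Real studies would point to their own file.
toy_file <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tcasecontrol\trs0001\trs0002",
  "1\t0\tAA\tCT",
  "2\t1\tAG\tCC",
  "3\t0\tGG\tTT",
  "4\t1\tAG\t"          # last genotype is missing
), toy_file)

# Read the tab-delimited file; empty fields are treated as missing values
# and genotypes are kept as character strings.
toy <- read.delim(toy_file, header = TRUE,
                  stringsAsFactors = FALSE,
                  na.strings = c("", "NA"))

str(toy)
```

The resulting object is an ordinary data frame with one row per subject and one column per variable, which is the same layout as asthma.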

3 Descriptive analysis

To start the analysis, we must indicate which columns of the dataset asthma contain the SNP data, using the setupSNP function. In our example, SNPs start from column 7 onwards, which we specify in argument colSNPs

library(SNPassoc)
asthma.s <- setupSNP(data=asthma, colSNPs=7:ncol(asthma), sep="")

This is an alternative way of determining the columns containing the SNPs

idx <- grep("^rs", colnames(asthma))
asthma.s <- setupSNP(data=asthma, colSNPs=idx, sep="")
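The two ways of selecting SNP columns are equivalent for asthma because all SNPs sit in one contiguous block at the end of the data frame. The following base-R sketch, on a hypothetical toy data frame (all names and values are invented for illustration), shows why name-based selection can be the safer choice when covariates are interleaved with SNPs.

```r
# Hypothetical toy data: a covariate (age) sits between two SNP columns.
toy <- data.frame(
  casecontrol = c(0, 1, 0),
  rs0001      = c("AA", "AG", "GG"),
  age         = c(40, 52, 47),
  rs0002      = c("CT", "CC", "TT"),
  stringsAsFactors = FALSE
)

# Name-based selection: only columns whose name starts with "rs".
idx_by_name <- grep("^rs", colnames(toy))

# Positional selection: everything from the first SNP column onwards.
idx_by_pos <- 2:ncol(toy)

# Columns that the positional range would wrongly treat as SNPs.
colnames(toy)[setdiff(idx_by_pos, idx_by_name)]
```

Name-based selection relies on the naming convention holding for every SNP, so it is worth checking the selected column names before passing the index on.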

The argument sep indicates the character separating the alleles. The default value is "/". In our case, there is no separating character, so we set sep="". The argument name.genotypes can be used when genotypes are available in other formats, such as 0, 1, 2 or "norm", "het", "mut". The purpose of the setupSNP function is to assign the class snp to the SNP variables, to which SNPassoc methods will be applied. The function labels the most common genotype across subjects as the reference one. When numerous SNPs are available, the function can be parallelized through the argument mc.cores, which indicates the number of processors to be used. We can verify that the SNP variables are given the new class snp

head(asthma.s$rs1422993)
[1] G/G G/T G/G G/T G/T G/G
Genotypes: G/G G/T T/T
Alleles:  G T 
class(asthma.s$rs1422993)
[1] "snp"    "factor"

and summarize their content with summary

summary(asthma.s$rs1422993)
Genotypes: 
    frequency percentage
G/G       903  57.224335
G/T       570  36.121673
T/T       105   6.653992

Alleles: 
  frequency percentage
G      2376   75.28517
T       780   24.71483

HWE (p value): 0.250093 

which shows the genotype and allele frequencies for a given SNP, together with the p-value of a test for Hardy-Weinberg equilibrium (HWE). We can also visualize the results in a plot by

plot(asthma.s$rs1422993)

Figure 1: SNP summary
Bar chart showing the basic information of a given SNP

Passing the function pie to the argument type produces a pie chart instead

plot(asthma.s$rs1422993, type=pie)
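
The allele frequencies in the summary output for rs1422993 follow directly from the genotype counts, and the HWE check compares those counts with the ones expected under equilibrium. The sketch below reproduces this reasoning in base R using the counts printed above. It uses a one-degree-of-freedom chi-square test as an assumption for illustration; the package may use a different (for example, exact) test, so the p-value obtained here need not coincide with the one reported by summary.

```r
# Genotype counts copied from the summary output for rs1422993.
geno_counts <- c("G/G" = 903, "G/T" = 570, "T/T" = 105)
n <- sum(geno_counts)                       # number of subjects

# Each homozygote carries two copies of an allele, each heterozygote one.
allele_counts <- c(
  G = 2 * geno_counts[["G/G"]] + geno_counts[["G/T"]],
  T = 2 * geno_counts[["T/T"]] + geno_counts[["G/T"]]
)
allele_freq <- allele_counts / sum(allele_counts)

# Expected genotype counts under HWE: n * (p^2, 2pq, q^2).
p <- allele_freq[["G"]]
q <- allele_freq[["T"]]
expected <- n * c(p^2, 2 * p * q, q^2)

# Chi-square goodness-of-fit statistic (illustrative choice of test).
chisq <- sum((geno_counts - expected)^2 / expected)
hwe_p <- pchisq(chisq, df = 1, lower.tail = FALSE)

allele_counts
round(100 * allele_freq, 5)
hwe_p
```

A large p-value gives no evidence against equilibrium, whereas a very small one in controls is often taken as a sign of possible genotyping problems.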