LDAK vignette

The flexible_cfdr function requires as input the indices of an independent subset of SNPs. These indices indicate the (p,q) pairs considered independent observations for the purpose of the KDE fitting procedure.

In practice, we identify independent SNPs as those assigned a non-zero weighting by the LDAK package’s weighting calculation procedure. These weightings were originally developed as a means of adjusting for the unequal tagging of (causal) SNPs across the genome when estimating heritability. The calculation procedure requires haplotype information, which we obtain in the form of the 1000 Genomes (1000G) Phase 3 data.

The purpose of this vignette is to sketch out our own approach to generating LDAK weightings for use in fcfdr. Whilst RMarkdown was used to generate this HTML document, the code snippets contained in this vignette are intended to serve as static illustrations and are not intended to be run without further modification.

This workflow was written with the use of LDAK v5.1 (download here) and PLINK v1.90b6.21 64-bit (19 Oct 2020) (download here). We also use code from the R packages bigsnpr and data.table.


Optional: Obtaining and processing the 1000G data

NB: In this section we discuss how to obtain and process haplotype data from the 1000G project. We have made available the end result, the quality-controlled, European-only data, as a ~280MB download at https://doi.org/10.5281/zenodo.4709547. If your principal p-values were taken from a GWAS conducted in a European population, you can skip the steps here and use the files in the download instead. You will still have to run LDAK. Note that the data use coordinates from the hg19 genome assembly.

We first download the 1000G Phase 3 data in the vcf format. These data use the hg19 genome assembly for SNP coordinates.

for i in {1..22}; do
    wget -O "chr"$i".vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
done

wget -O "chrX.vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz"
wget -O "chrY.vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz"

Subsequent processing depends on the data being in the PLINK-compatible bed, bim, and fam file formats. We use PLINK to convert the vcf files to these formats.

for i in {1..22}; do
    plink --vcf "chr"$i".vcf.gz" --make-bed --out "chr"$i
done

plink --vcf chrX.vcf.gz --make-bed --out chrX
plink --vcf chrY.vcf.gz --make-bed --out chrY

The 1000G haplotype data were obtained by sequencing individuals from a variety of populations. In practice, we take a subset of the 1000G samples to ensure that the ancestry of the individuals from which we obtain haplotype data matches the ancestry of the individuals in the GWAS of interest (from which our principal p-values come). Sample ancestry information for the 1000G project can be found at the data portal of the International Genome Sample Resource.

We specify the desired sample IDs in a fam file. The fam files generated from the vcf files by plink --make-bed should be identical for all chromosomes, so we use a single custom fam file, euro.fam, to downsample all files. euro.fam was obtained by subsetting chr1.fam to retain only those entries with European sample IDs. We write the new files out to the directory euro_only.

for i in {1..22}; do
  plink --bfile "chr"$i --keep euro.fam --make-bed --silent --out "euro_only/chr"$i
done 

plink --bfile chrX --keep euro.fam --make-bed --silent --out euro_only/chrX
plink --bfile chrY --keep euro.fam --make-bed --silent --out euro_only/chrY

The fam files omit sex information. This is problematic only when we wish to carry out QC on the Y chromosome SNPs: PLINK drops heterozygous Y genotypes if we do not affirm the sex of the samples in chrY.fam. We do this by changing the values in the sex code column from 0 (‘unknown’) to 1 (‘male’).

sed -i 's/0 0 0 -9/0 0 1 -9/' euro_only/chrY.fam

We then use a function from the R package bigsnpr to carry out some basic QC on the 1000G data and write out the duly filtered data to another directory, euro_only_qc.

for(i in 1:22) {
       bigsnpr::snp_plinkQC(plink.path = '/bin/plink', prefix.in = paste0('euro_only/chr', i),
                          prefix.out = paste0('euro_only_qc/chr', i),
                          geno = 0, maf = 0.01, hwe = 1e-10)
}
bigsnpr::snp_plinkQC(plink.path = '/bin/plink', prefix.in = 'euro_only/chrX',
                        prefix.out = 'euro_only_qc/chrX',
                        geno = 0, maf = 0.01, hwe = 1e-10)

bigsnpr::snp_plinkQC(plink.path = '/bin/plink', prefix.in = 'euro_only/chrY',
                        prefix.out = 'euro_only_qc/chrY',
                        geno = 0, maf = 0.01, hwe = 1e-10)

Joining the 1000G and GWAS SNPs

NB: The following processing was carried out with a GWAS containing only autosomal SNPs, so we omit reference to the sex chromosomes henceforth.

If you skipped the previous section on processing the 1000G data then you can download our pre-processed files:

wget https://zenodo.org/record/4709547/files/euro_only_qc.tar.gz
tar -xvf euro_only_qc.tar

For the purpose of fitting the KDE in flexible_cfdr we care only about the dependence structure of the SNPs at which our p and q values were obtained, so we filter the 1000G SNPs to retain those for which we have GWAS p-values. Ideally, the 1000G SNPs would form a superset of those in the GWAS, but this is typically not the case; with ~5.5 million SNPs in our GWAS we would expect to lose several tens of thousands of SNPs. The absence of this relatively paltry number of SNPs should have a neglible influence upon the KDE.

To filter the SNPs we use plink --extract, which allows one to extract SNPs based on:

  1. SNP ID (i.e. rsID); or
  2. SNP genomic coordinates

rsIDs are not always consistent between data sets and we have found that extracting SNPs using genomic coordinates typically yields a larger intersection of 1000G and GWAS SNPs. Thus we use plink --extract with the --range argument (documentation here), which requires that we pass a white space-separated text file specifying a list of genomic ‘ranges’ (hereafter the ‘range file’) in rows in the format chr bp bp ID. In our use case, each SNP corresponds to a range, albeit one of length one: the first and second bp columns which give the start and end coordinates of the range are duplicated. We use rsIDs as range identifiers, but these are essentially superfluous when using the --range argument and will not be used by plink to filter the SNPs.

The coordinates of the GWAS SNPs are usually readily available and a simple approach to generating the requisite range files for plink --extract would copy the genomic coordinates of each SNP to the appropriate chromosome-specific range file (albeit in the plink-compliant chr bp bp rsID format). However, this approach overlooks the possibility that there exist multiple SNPs with different reference/alternative allele pairings at the same locus, that is, duplicates.

We can account for this when preparing the range files by first joining the 1000G SNPs in each chromosome’s bim file data to our GWAS SNPs and checking that the allele pairings match. This approach also allows us to remove duplicated rows in the range files; as noted above, it is not essential for a good fit that we retain every SNP.

The following exemplifies the sort of R code one can use to write out the range files whilst carrying out the aforementioned checks.

library(data.table)
system("mkdir plinkRanges")
system("mkdir filtered")
system("mkdir ldak")

# We assume this contains columns SNPID, CHR19 and BP19 (so-called because of hg19), REF, and ALT
# note that gwas_dat needs to be a data.table
gwas_dat <- fread('gwas_sum_stats.tsv.gz', sep = '\t', header = T)

# Iterating over chromosomal bim files
for(i in 1:22) {
  # bim files have no header
  bim_dat <- fread(sprintf('euro_only_qc/chr%d.bim', i), sep = '\t', header = F, col.names = c('Chr', 'ID', 'Cm', 'BP19', 'A1', 'A2') )
  
  bim_join <- merge(bim_dat, gwas_dat[CHR19 == i], by.x = 'BP19', by.y = 'BP19')
  
  # Make sure alleles match, although for two-sided association p-values we don't care whether ref/alt is reversed
  bim_join <- bim_join[(REF == A1 & ALT == A2) | (REF == A2 & ALT == A1)]
  
  bim_join <- bim_join[, .(Chr, BP19, BP19, SNPID)]

  if(any(duplicated(bim_join, by='BP19'))) {
    warning(sprintf('%d duplicates removed from output', sum(duplicated(bim_join, by = 'BP19'))))
  }

  # Remove duplicates
  bim_join <- unique(bim_join, by='BP19')

  fwrite(bim_join, file = sprintf('plinkRanges/chr%d.tsv', i), row.names = F, sep = '\t', col.names = F, quote = F)
}

Note that we took care to use the hg19 assembly coordinates to match the assembly used in the 1000G data we downloaded above.

Using these range files we can now filter the 1000G data.

for i in {1..22}; do
  plink --silent --bfile "euro_only_qc/chr"$i --extract "plinkRanges/chr"$i".tsv" --range --make-bed --out "filtered/chr"$i
done

Running the LDAK weighting calculation procedure

The weighting calculation procedure entails a preprocessing step, ldak --cut-weights, and a calculation step, ldak --calc-weights-all. We refer the reader to the LDAK documentation for further guidance. An idiosyncracy of our approach is that we process the data in a set of chromosome-specific files. This is an artifact of the format of the 1000G data.

We create the directory ldak and within it subdirectories labelled chrx, where x ranges from 1 to 22, to hold the results of the procedure. (note that you may have to replace the executable ldak with ldak5.1.linux depending on how you downloaded the LDAK software).

for i in {1..22}; do 
  ldak --cut-weights "ldak/chr"$i --bfile "filtered/chr"$i
  ldak --calc-weights-all "ldak/chr"$i --bfile "filtered/chr"$i
done

LDAK will write out the procedure’s results to files named ldak/chrx/weights.all. We join these chromosome-specific files into one.

for i in {1..22}; do
 # We omit the header in each file
  sed "s/$/ $i/" <(tail -n +2 "ldak/chr$i/weights.all") >> "ldak/combined_weights.all"
done

# We add back in a single header for the combined file
sed -i '1 i\Predictor Weight Neighbours Tagging Info Check Chr' "ldak/combined_weights.all"

Using the LDAK weightings in flexible_fcfdr

In practice, it is necessary to merge the weightings contained in combined_weights.all back into the file containing the p and q values so that all three vectors can be made available to flexible_cfdr. This poses a problem, however, as the rsIDs in the Predictor column of combined_weights.all are derived from the 1000G bim files, not the GWAS file rsIDs specified in the range files (which are ignored by plink as noted above). Merging 1000G and GWAS SNPs using genomic coordinates is to be preferred over the use of rsIDs as the latter are not always consistent between data sets.

combined_weights.all as written out by LDAK does not contain SNP basepair coordinates. This means we must use the bim files contained in the filtered/chrx directories to recover the basepair coordinates and reference/alternative allele pairings. This can be accomplished by matching the SNPs in combined_weights.all to those in the bim files. As the rsIDs in combined_weights.all are derived from these bim files, it is appropriate in this case to merge these data with the use of rsIDs. We provide the following snippets as an example of how this can be accomplished.

To simplify matters, we first concatenate all chromosomal bim files.

for i in {1..22}; do
  cat "filtered/chr"$i".bim" >> "filtered/chr_all.bim"
done

We then add the genomic metadata we need to the combined_weights.all LDAK output file.

library(data.table)

bim_dat <- fread('filtered/chr_all.bim', sep = '\t', header = F, col.names = c('Chr', 'ID', 'Cm', 'BP19', 'A1', 'A2'))

weights_dat <- fread('ldak/combined_weights.all', sep = ' ', header = T)

# Drop rows with a missing ID or weight value
weights_dat <- na.omit(weights_dat, cols = c('Predictor', 'Weight'))

join_dat <- merge(weights_dat, bim_dat[, .(ID, Chr, BP19, A1, A2)], all.x = T, by.x = c('Predictor', 'Chr'), by.y = c('ID', 'Chr'), sort = F)

fwrite(join_dat, file = 'ldak/combined_weights_meta.all', sep = ' ', col.names = T, row.names = F, quote = F)

The format of the file in which your p and q are stored will of course vary, but the snippet below illustrates how combined_weights_meta.all can be merged with it.

weights_dat <- fread('ldak/combined_weights_meta.all', sep = ' ', header = T, select = c('Predictor', 'Weight', 'Chr', 'BP19', 'A1', 'A2'))

# We assume this contains columns SNPID, CHR19 and BP19 (so-called because of hg19), REF, and ALT
gwas_dat <- fread('gwas_sum_stats.tsv.gz', sep = '\t', header = T)

gwas_dat <- merge(gwas_dat, weights_dat, by.x = c('CHR19', 'BP19'), by.y = c('Chr', 'BP19'), sort = F)

# Drop rows where the ref/alt allele pairing differs from that already present
gwas_dat <- gwas_dat[((REF == A1 & ALT == A2) | (REF == A2 & ALT == A1))]

# Drop now-redundant allele columns
gwas_dat[, c('A1', 'A2') := NULL ]