New data generation - Imputing non-typed SNPs
(This lab was adapted for statsTeachR by Tu Dao, Nicholas Reich and Andrea Foulkes from our GWAS tutorial.)
Typed SNPs measured using chip array technology typically capture about 1 million polymorphisms that vary in at least 1% of the general population. More generally, interest lies in analyzing association of genotypes of non-typed SNPs with disease outcomes because functional (causal) variants may not be measured. In this lab, we will demonstrate how to impute impute SNPs on chromosome 16.
In addition to the genotyped SNPs from our study, it is useful to extend the analysis to other known SNPs, that were not typed or were removed by SNP level filtering. In this example, we impute SNPs on chromosome 16. Performance of genotype imputation requires reference data, which has typed genotypes at the SNPs of interest from a similar homogeneous sample. Sources for this data include HapMap and 1000 Genomes.
To prepare for imputing non-typed SNPs, we need to load the data saved from last module and read in 1000G data for chromosome 16. To do this, we will read in the genotype data from the .ped and .info files using the read.pedfile() function in snpStats. Note, that the .info file is similar to the .map file we’ve discussed previously, and the .ped file is very large (150 Mb) and thus will take longer to download. Please install package downloader to be able to download the 1000G files to your workspace.
Once we have the data read in, we use the which argument to specify the column in the .info file with the SNP IDs. Then we proceed to obtain genotype data for the given chromosome and each SNPs position.
load("m2_lab1_save.RData") library(snpStats) library(downloader) info <- download("https://www.mtholyoke.edu/courses/afoulkes/Data/statsTeachR/chr16_1000g_CEU.info", destfile = "chr16_1000g_CEU.info") ped <- download("https://www.mtholyoke.edu/courses/afoulkes/Data/statsTeachR/chr16_1000g_CEU.ped", destfile = "chr16_1000g_CEU.ped") thougeno <- read.pedfile("chr16_1000g_CEU.ped", snps = "chr16_1000g_CEU.info", which = 1) genoMatrix <- thougeno$genotypes support <- thougeno$map colnames(support) <- c("SNP", "position", "A1", "A2") head(support)
## SNP position A1 A2 ## 1 rs140769322 60180 3 2 ## 2 rs188810967 60288 2 1 ## 3 rs76368850 60291 2 4 ## 4 rs185537431 60778 3 1 ## 5 rs542544747 60842 2 1 ## 6 rs4021615 61349 1 3
Imputing non-typed SNPs
After obtaining the genotype data and the location of the SNPs from the reference panel, we can impute the non-typed SNPs on chromosome 16 by determining the subset of SNPs that are in 1000G but not in our data. We do this by subsetting our data for chromosome 16 and then creating “missing” and “present” snpMatrix objects. Both are needed for imputation rules step of the algorithm. The “present” matrix holds our typed SNPs, while the “missing” matrix holds the SNPs that need to be imputed.
presSnps <- colnames(genotype) presSnps <- colnames(genotype) presDatChr <- genoBim[genoBim$SNP %in% presSnps & genoBim$chr == 16, ] targetSnps <- presDatChr$SNP is.present <- colnames(genoMatrix) %in% targetSnps missing <- genoMatrix[, !is.present] print(missing)
## A SnpMatrix with 99 rows and 386404 columns ## Row names: CEU_1 ... CEU_99 ## Col names: rs140769322 ... rs111706106
present <- genoMatrix[, is.present] print(present)
## A SnpMatrix with 99 rows and 12047 columns ## Row names: CEU_1 ... CEU_99 ## Col names: rs41340949 ... rs7196459
After subsetting our typed SNPs for the given chromosome, we need to acquire their positions to be used for imputation rules. We derive these “rules” for the additional SNPs that were not typed in our study using the snp.imputation function. In other words, we are using the genotypes from the 1000G data to fill in the gaps in our data. Each rule represents a predictive model for genotypes of un-typed SNPs associated with near-by typed SNPs.
pos.pres <- support$position[is.present] pos.miss <- support$position[!is.present] rules <- snp.imputation(present, missing, pos.pres, pos.miss)
## SNPs tagged by a single SNP: 68562 ## SNPs tagged by multiple tag haplotypes (saturated model): 137495
rules <- rules[can.impute(rules)] cat("Imputation rules for", length(rules), "SNPs were estimated\n")
## Imputation rules for 206057 SNPs were estimated
Why would we want to impute non-typed SNPs?
In the last step we remove un-typed SNPs that we fail to derive imputation “rules” for. We also filter out the SNPs that have low estimated minor allele frequency (MAF), and low imputation accuracy. The latter is based on the R^2 value of the model estimated by the snp.imputation function. The end results are expected posterior values of the un-typed SNPs.
# Quality control for imputation certainty and MAF r2threshold <- 0.7 minor <- 0.01 rules <- rules[imputation.r2(rules) >= r2threshold] cat(length(rules),"imputation rules remain after uncertain imputations were removed\n")
## 159085 imputation rules remain after uncertain imputations were removed
rules <- rules[imputation.maf(rules) >= minor] cat(length(rules),"imputation rules remain after MAF filtering\n")
## 159085 imputation rules remain after MAF filtering
# Obtain posterior expectation of genotypes of imputed snps target <- genotype[,targetSnps] imputed <- impute.snps(rules, target, as.numeric=FALSE) print(imputed)
## A SnpMatrix with 1400 rows and 159085 columns ## Row names: 10002 ... 11596 ## Col names: rs28436661 ... rs62053708
We set R^2 value in this example to be 0.7. Define R^2 value and interpret the R^2 threshold is this case.
Saving work for following labs…
Before saving the data, we want to remove the data we created in this lab to open up memory. You can do this and save the necessary data from this lab by running the following commands:
rm(genoMatrix) rm(missing) rm(present) save(genotype, genoBim, clinical, pcs, imputed, target, rules, support, file="m2_lab2_save.RData")
On your own
Using the exercise_dat.Rdata
- Calculate and store imputation rules using snp.imputation()
- Impute un-typed SNPs. Use an R^2 value of 0.85 and MAF of 0.05.
- Explain in your own words why we need a reference panel like 1000 Genomes to impute our SNPs? Why can’t we just use the SNPs we have to determine imputation rules?