How to convert raw SNP genotypes into a 0/1/2 matrix

xiaoxiao2021-03-25 368

Load the example data

    library(SNPassoc)
    data(SNPs)
    SNPs[1:8, 1:8]

    id casco    sex blood.pre  protein snp10001 snp10002 snp10003
     1     1 Female      13.7 75640.52       TT       CC       GG
     2     1 Female      12.7 28688.22       TT       AC       GG
     3     1 Female      12.9 17279.59       TT       CC       GG
     4     1   Male      14.6 27253.99       CT       CC       GG
     5     1 Female      13.4 38066.57       TT       AC       GG
     6     1 Female      11.3  9872.46       TT       CC       GG
     7     1 Female      11.9 11132.90       TT       AC       GG
     8     1   Male      12.4 29973.43       TT       AC       GG

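Extract the SNP data and convert the format

The important point here is that the row.names hold the sample IDs and every remaining column contains SNP data only.

    myDat <- SNPs[, -(2:5)]        # drop the phenotype columns (casco, sex, blood.pre, protein)
    row.names(myDat) <- myDat$id   # use the id column as row names
    myDat <- myDat[, -1]           # then drop the id column itself
    myDat[1:5, 1:5]
    # str(myDat)
    myDat <- as.matrix(myDat)

    snp10001 snp10002 snp10003 snp10004 snp10005
          TT       CC       GG       GG       GG
          TT       AC       GG       GG       AG
          TT       CC       GG       GG       GG
          CT       CC       GG       GG       GG
          TT       AC       GG       GG       GG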

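Convert with the synbreed package, which can impute missing values and recode the genotypes

Recoding alleles from character/factor/numeric into the number of copies of the minor alleles, i.e. 0, 1 and 2. In codeGeno, in the first step heterozygous genotypes are coded as 1. From the other genotypes, the less frequent genotype is coded as 2 and the remaining genotype as 0. In other words, genotypes are recoded according to their frequency: the more frequent homozygote becomes 0, the heterozygote becomes 1, and the less frequent homozygote becomes 2.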
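The recoding rule can be illustrated in base R without any package. The following is only a minimal sketch under stated assumptions: genotypes are two-letter strings such as "TT" or "CT", missing values are real NA, and the helper name recode_marker is made up for this example. It is not synbreed's implementation, and it does no filtering by missing rate or maf and no imputation.

```r
# Hypothetical helper (not part of synbreed): recode one marker to 0/1/2.
recode_marker <- function(g) {
  g  <- as.character(g)
  a1 <- substr(g, 1, 1)            # first allele
  a2 <- substr(g, 2, 2)            # second allele
  het <- !is.na(g) & a1 != a2      # heterozygous genotypes
  hom <- !is.na(g) & a1 == a2      # homozygous genotypes

  out <- rep(NA_integer_, length(g))
  out[het] <- 1L                   # heterozygotes are coded as 1

  # Count the homozygous genotypes, most frequent first
  hom_counts <- sort(table(g[hom]), decreasing = TRUE)
  if (length(hom_counts) >= 1) out[hom & g == names(hom_counts)[1]] <- 0L
  if (length(hom_counts) >= 2) out[hom & g == names(hom_counts)[2]] <- 2L
  out                              # missing genotypes stay NA
}

# Toy genotype matrix (invented values, not the SNPs data set)
geno <- cbind(
  snp1 = c("TT", "TT", "CT", "CC", "TT", NA),
  snp2 = c("CC", "AC", "CC", "AC", "AA", "CC")
)
coded <- apply(geno, 2, recode_marker)
coded
```

In this toy matrix, snp1 comes out as 0, 0, 1, 2, 0, NA and snp2 as 0, 1, 0, 1, 2, 0. When two homozygotes are equally frequent, this sketch simply takes the first one in sorted order as 0, which is an arbitrary choice.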

    library(synbreed)
    cp <- create.gpData(geno = myDat)
    cp.dat <- codeGeno(gpData = cp, label.heter = "alleleCoding",
                       maf = 0.01, nmiss = 0.1,
                       impute = TRUE, impute.type = "random",
                       verbose = TRUE)

    step 1  : 1 marker(s) removed with > 10 % missing values
    step 2  : Recoding alleles
    step 4  : 12 marker(s) removed with maf < 0.01
    step 7  : Imputing of missing values
    step 7d : Random imputing of missing values
    step 8  : No recoding of alleles necessary after imputation
    step 9  : 0 marker(s) removed with maf < 0.01
    step 10 : No duplicated markers removed
    End     : 22 marker(s) remain after the check

    Summary of imputation
      total number of missing values : 37
      number of random imputations   : 37

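If you get an error saying that a marker has more than two genotypes, the missing values have not been handled. Write the data to a CSV file and read it back in with na.strings = "NA" so that missing entries are read as NA.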

    write.csv(myDat, "snps.csv")
    ge <- read.csv("snps.csv", header = TRUE, row.names = 1, na.strings = "NA")
    summary(ge)
    ge <- as.matrix(ge)
    gp <- create.gpData(geno = ge)
    cp.dat <- codeGeno(gpData = gp, label.heter = "alleleCoding",
                       maf = 0.01, nmiss = 0.1,
                       impute = TRUE, impute.type = "random",
                       verbose = TRUE)

    snp10001   snp10002   snp10003    snp10004    snp10005   snp10006   snp10007   snp10008
    CC:12      AA: 5      GG  :144    GG  :156    AA: 3      AA:157     CC:157     CC:104
    CT:53      AC:78      NA's: 13    NA's:  1    AG:70                            CG: 44
    TT:92      CC:74                              GG:84                            GG:  9

    snp10009   snp100010   snp100011   snp100012   snp100013   snp100014   snp100015
    AA  :72    TT  :147    CC:  1      CC  : 3     AA  :101    AA  :27     AG: 13
    AG  :79    NA's: 10    CG:  2      CG  :68     AG  : 35    AC  :74     GG:144
    GG  : 5                GG:154      GG  :84     GG  :  9    CC  :52
    NA's: 1                            NA's: 2     NA's: 12    NA's: 4

    snp100016   snp100017   snp100018   snp100019   snp100020   snp100021   snp100022
    GG  :152    CC  : 5     CC  : 5     CC:32       AA:  9      GG:157      AA  :156
    NA's:  5    CT  :83     CT  :84     CG:75       AG: 43                  NA's:  1
                TT  :67     TT  :67     GG:50       GG:105
                NA's: 2     NA's: 1

    snp100023   snp100024   snp100025   snp100026   snp100027   snp100028   snp100029
    AA  : 5     CC  :14     CC:157      GG  :156    CC  :68     CC  :34     AA  :14
    AT  :78     CT  :51                 NA's:  1    CG  :82     CT  :72     AG  :48
    TT  :71     TT  :91                             GG  : 5     TT  :50     GG  :94
    NA's: 3     NA's: 1                             NA's: 2     NA's: 1     NA's: 1

    snp100030   snp100031   snp100032   snp100033   snp100034   snp100035
    AA:157      TT  :102    AA  :34     AA  :34     CC  :14     TT  :146
                NA's: 55    AG  :70     AG  :69     CT  :48     NA's: 11
                            GG  :52     GG  :49     TT  :94
                            NA's: 1     NA's: 5     NA's: 1

    step 1  : 1 marker(s) removed with > 10 % missing values
    step 2  : Recoding alleles
    step 4  : 12 marker(s) removed with maf < 0.01
    step 7  : Imputing of missing values
    step 7d : Random imputing of missing values
    step 8  : No recoding of alleles necessary after imputation
    step 9  : 0 marker(s) removed with maf < 0.01
    step 10 : No duplicated markers removed
    End     : 22 marker(s) remain after the check

    Summary of imputation
      total number of missing values : 37
      number of random imputations   : 37

Check the converted result

    gee <- cp.dat$geno
    gee[1:5, 1:5]

      snp10001 snp10002 snp10005 snp10008 snp10009
    1        0        0        0        0        0
    2        0        1        1        0        1
    3        0        0        0        0        0
    4        1        0        0        0        0
    5        0        1        0        0        1
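A quick sanity check on a 0/1/2 matrix is to confirm that it contains only those three values and that the coded allele really is the minor one. The sketch below is base R only and uses a small stand-in matrix, gee_toy, typed in from the 5 x 5 excerpt in this post; the name and the check itself are illustrative additions, not part of the original workflow. Since each entry counts copies of the minor allele, colMeans(x) / 2 gives the frequency of that allele in the rows supplied.

```r
# Stand-in for cp.dat$geno: the five rows and five markers printed in this post
gee_toy <- matrix(
  c(0, 0, 0, 0, 0,
    0, 1, 1, 0, 1,
    0, 0, 0, 0, 0,
    1, 0, 0, 0, 0,
    0, 1, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(1:5, c("snp10001", "snp10002", "snp10005",
                         "snp10008", "snp10009"))
)

# Every entry should be 0, 1 or 2, with no NA left after imputation
stopifnot(!anyNA(gee_toy), all(gee_toy %in% 0:2))

# Frequency of the coded (minor) allele per marker, within these rows only
maf_toy <- colMeans(gee_toy) / 2
maf_toy
```

With only five individuals these frequencies say nothing about the full data set. On the complete matrix, running the same two lines on gee would be expected to give values between the maf threshold and 0.5.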

When reposting, please credit the original URL: https://ju.6miu.com/read-951.html
