Load the example data
library(SNPassoc)
data(SNPs)
SNPs[1:8, 1:8]
id  casco  sex     blood.pre  protein   snp10001  snp10002  snp10003
1   1      Female  13.7       75640.52  TT        CC        GG
2   1      Female  12.7       28688.22  TT        AC        GG
3   1      Female  12.9       17279.59  TT        CC        GG
4   1      Male    14.6       27253.99  CT        CC        GG
5   1      Female  13.4       38066.57  TT        AC        GG
6   1      Female  11.3       9872.46   TT        CC        GG
7   1      Female  11.9       11132.90  TT        AC        GG
8   1      Male    12.4       29973.43  TT        AC        GG
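Extract the SNP data and convert the format
The key point here is that the row names hold the sample IDs, and every remaining column contains only SNP genotypes.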
myDat <- SNPs[, -(2:5)]
row.names(myDat) <- myDat$id
myDat <- myDat[, -1]
myDat[1:5, 1:5]
# str(myDat)
myDat <- as.matrix(myDat)
   snp10001 snp10002 snp10003 snp10004 snp10005
1  TT       CC       GG       GG       GG
2  TT       AC       GG       GG       AG
3  TT       CC       GG       GG       GG
4  CT       CC       GG       GG       GG
5  TT       AC       GG       GG       GG
Convert with the synbreed package, which can impute missing values and recode the genotypes
Recoding alleles from character/factor/numeric into the number of copies of the minor alleles, i.e. 0, 1 and 2. In codeGeno, in the first step heterozygous genotypes are coded as 1. From the other genotypes, the less frequent genotype is coded as 2 and the remaining genotype as 0.
In short, genotypes are recoded by frequency: the more common homozygote becomes 0, the heterozygote 1, and the rarer homozygote 2.
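The coding rule described above can be sketched for a single marker in base R. This is only an illustration of the rule, not synbreed's implementation: recode_marker is a hypothetical helper, the toy vector is made up, and ties between the two homozygotes are broken arbitrarily here.

```r
# Hypothetical helper (not part of synbreed): recode one marker's genotype
# strings such as "TT", "CT", "CC" into 0/1/2 following the rule above.
recode_marker <- function(g) {
  a1 <- substr(g, 1, 1)
  a2 <- substr(g, 2, 2)
  het <- !is.na(g) & a1 != a2            # heterozygous calls
  hom <- !is.na(g) & !het                # homozygous calls
  out <- rep(NA_integer_, length(g))     # missing calls stay NA
  out[het] <- 1L
  hom_counts <- table(g[hom])
  if (length(hom_counts) > 0) {
    # The more frequent homozygote is coded 0, the other one 2.
    major <- names(hom_counts)[which.max(hom_counts)]
    out[hom & g == major] <- 0L
    out[hom & g != major] <- 2L
  }
  out
}

toy <- c("TT", "TT", "CT", "CC", "TT", NA)  # made-up genotypes
recoded <- recode_marker(toy)
print(recoded)
```

Here TT is the more common homozygote, so it maps to 0, CT maps to 1, CC maps to 2, and the missing call stays NA until imputation.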
library(synbreed)
cp <- create.gpData(geno = myDat)
cp.dat <- codeGeno(gpData = cp, label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)
step 1 : 1 marker(s) removed with > 10 % missing values
step 2 : Recoding alleles
step 4 : 12 marker(s) removed with maf < 0.01
step 7 : Imputing of missing values
step 7d : Random imputing of missing values
step 8 : No recoding of alleles necessary after imputation
step 9 : 0 marker(s) removed with maf < 0.01
step 10 : No duplicated markers removed
End : 22 marker(s) remain after the check
Summary of imputation
total number of missing values : 37
number of random imputations : 37
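The log above reports markers dropped for having more than 10 % missing values and for a minor allele frequency below 0.01. A rough base R sketch of those two filters on an already coded 0/1/2 matrix follows. The toy matrix and variable names are made up for illustration, and codeGeno itself applies its checks in its own order, so treat this as an approximation of the idea only.

```r
# Toy 0/1/2 matrix (rows = individuals, columns = markers); values are made up.
# matrix() fills column by column, so each line below is one marker.
geno <- matrix(c(0, 1, 2, 0,     # m1
                 0, 0, 0, 0,     # m2: monomorphic
                 1, NA, NA, 0,   # m3: half of the calls missing
                 0, 1, 1, 2),    # m4
               nrow = 4,
               dimnames = list(paste0("id", 1:4), paste0("m", 1:4)))

miss_rate <- colMeans(is.na(geno))               # share of missing calls
allele_freq <- colMeans(geno, na.rm = TRUE) / 2  # frequency of the counted allele
maf <- pmin(allele_freq, 1 - allele_freq)        # minor allele frequency

# Same thresholds as in the call above: nmiss = 0.1, maf = 0.01.
keep <- miss_rate <= 0.1 & maf >= 0.01
kept <- geno[, keep, drop = FALSE]
print(colnames(kept))
```

In this toy case m2 is dropped because it is monomorphic and m3 because too many calls are missing.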
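If an error says there are more than two genotypes, that is because the missing values were not taken into account. Save the data to a CSV file and read it back in.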
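One way to see why the round trip can help is the small base R sketch below. It assumes the missing calls ended up as the text "NA" instead of real missing values, which may or may not be what happened in your data. The toy matrix is made up, a temporary file stands in for snps.csv, and quote = FALSE is used only to keep the toy file plain.

```r
# Toy genotype matrix in which missing calls are stored as the text "NA"
# (an assumption for this illustration; the data are made up).
raw <- matrix(c("TT", "CT", "NA", "CC",
                "GG", "NA", "GG", "AG"),
              nrow = 4,
              dimnames = list(1:4, c("snpA", "snpB")))

n_before <- sum(is.na(raw))                  # the text "NA" is not a missing value
levels_before <- length(unique(raw[, "snpA"]))

# Round trip through a CSV file, telling read.csv which string means missing.
tmp <- tempfile(fileext = ".csv")
write.csv(raw, tmp, quote = FALSE)
back <- as.matrix(read.csv(tmp, header = TRUE, row.names = 1, na.strings = "NA"))
unlink(tmp)

n_after <- sum(is.na(back))                  # now parsed as real NA
levels_after <- length(unique(na.omit(back[, "snpA"])))
print(c(before = n_before, after = n_after))
```

Before the round trip snpA appears to have four distinct genotypes because "NA" counts as one of them. Afterwards only the three real genotypes remain and the missing calls are genuine NA values.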
write.csv(myDat, "snps.csv")
ge <- read.csv("snps.csv", header = TRUE, row.names = 1, na.strings = "NA")
summary(ge)
ge <- as.matrix(ge)
gp <- create.gpData(geno = ge)
cp.dat <- codeGeno(gpData = gp, label.heter = "alleleCoding", maf = 0.01, nmiss = 0.1,
                   impute = TRUE, impute.type = "random", verbose = TRUE)
snp10001   snp10002   snp10003   snp10004   snp10005   snp10006   snp10007   snp10008
CC  : 12   AA  :  5   GG  :144   GG  :156   AA  :  3   AA  :157   CC  :157   CC  :104
CT  : 53   AC  : 78   NA's: 13   NA's:  1   AG  : 70                         CG  : 44
TT  : 92   CC  : 74                         GG  : 84                         GG  :  9
snp10009   snp100010  snp100011  snp100012  snp100013  snp100014  snp100015
AA  : 72   TT  :147   CC  :  1   CC  :  3   AA  :101   AA  : 27   AG  : 13
AG  : 79   NA's: 10   CG  :  2   CG  : 68   AG  : 35   AC  : 74   GG  :144
GG  :  5              GG  :154   GG  : 84   GG  :  9   CC  : 52
NA's:  1                         NA's:  2   NA's: 12   NA's:  4
snp100016  snp100017  snp100018  snp100019  snp100020  snp100021  snp100022
GG  :152   CC  :  5   CC  :  5   CC  : 32   AA  :  9   GG  :157   AA  :156
NA's:  5   CT  : 83   CT  : 84   CG  : 75   AG  : 43              NA's:  1
           TT  : 67   TT  : 67   GG  : 50   GG  :105
           NA's:  2   NA's:  1
snp100023  snp100024  snp100025  snp100026  snp100027  snp100028  snp100029
AA  :  5   CC  : 14   CC  :157   GG  :156   CC  : 68   CC  : 34   AA  : 14
AT  : 78   CT  : 51              NA's:  1   CG  : 82   CT  : 72   AG  : 48
TT  : 71   TT  : 91                         GG  :  5   TT  : 50   GG  : 94
NA's:  3   NA's:  1                         NA's:  2   NA's:  1   NA's:  1
snp100030  snp100031  snp100032  snp100033  snp100034  snp100035
AA  :157   TT  :102   AA  : 34   AA  : 34   CC  : 14   TT  :146
           NA's: 55   AG  : 70   AG  : 69   CT  : 48   NA's: 11
                      GG  : 52   GG  : 49   TT  : 94
                      NA's:  1   NA's:  5   NA's:  1
step 1 : 1 marker(s) removed with > 10 % missing values
step 2 : Recoding alleles
step 4 : 12 marker(s) removed with maf < 0.01
step 7 : Imputing of missing values
step 7d : Random imputing of missing values
step 8 : No recoding of alleles necessary after imputation
step 9 : 0 marker(s) removed with maf < 0.01
step 10 : No duplicated markers removed
End : 22 marker(s) remain after the check
Summary of imputation
total number of missing values : 37
number of random imputations : 37
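The log reports 37 randomly imputed values. As a loose illustration of what random imputation can mean, the sketch below fills each missing entry of a marker by drawing from that marker's observed genotypes. This is an assumption made for illustration; synbreed's "random" option may sample differently (for example from allele frequencies), and impute_random and the toy matrix are hypothetical.

```r
set.seed(1)  # fixed seed so the random draws are repeatable

# Hypothetical helper: replace NA in one marker by sampling, with replacement,
# from the genotypes actually observed for that marker.
impute_random <- function(x) {
  miss <- is.na(x)
  if (any(miss) && any(!miss)) {
    obs <- x[!miss]
    # Index-based sampling avoids the sample(n) pitfall for length-1 vectors.
    x[miss] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
  }
  x
}

# Toy 0/1/2 matrix with a few missing calls (made-up values).
geno <- matrix(c(0, 1, NA, 2, 0,
                 NA, 0, 0, 1, NA),
               nrow = 5,
               dimnames = list(paste0("id", 1:5), c("m1", "m2")))

imputed <- apply(geno, 2, impute_random)
print(sum(is.na(geno)))     # number of values that needed imputing
print(sum(is.na(imputed)))  # none left afterwards
```

Observed genotypes are left untouched; only the NA positions change, and the imputed values stay within the codes already seen for that marker.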
Check the recoded result
gee <- cp.dat$geno
gee[1:5, 1:5]
  snp10001 snp10002 snp10005 snp10008 snp10009
1        0        0        0        0        0
2        0        1        1        0        1
3        0        0        0        0        0
4        1        0        0        0        0
5        0        1        0        0        1
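A few quick sanity checks on a coded matrix can be sketched as follows. The toy matrix simply retypes the five rows printed above; on real data you would run the same checks on the full matrix (gee in this post). The variable names are illustrative only.

```r
# The five rows shown above, retyped as a small matrix for illustration.
gee_head <- matrix(c(0, 0, 0, 0, 0,
                     0, 1, 1, 0, 1,
                     0, 0, 0, 0, 0,
                     1, 0, 0, 0, 0,
                     0, 1, 0, 0, 1),
                   nrow = 5, byrow = TRUE,
                   dimnames = list(1:5, c("snp10001", "snp10002", "snp10005",
                                          "snp10008", "snp10009")))

all_valid <- all(gee_head %in% c(0, 1, 2))  # only 0/1/2 codes present
no_missing <- !anyNA(gee_head)              # imputation left no NA
minor_copies <- colSums(gee_head)           # minor allele copies per marker
print(minor_copies)
```

Markers that were filtered out (snp10003, snp10004 and so on) no longer appear as columns, which is why the column names skip from snp10002 to snp10005.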
When reposting, please cite the original address: https://ju.6miu.com/read-951.html