700字范文 > 《微信小程序进阶实战之分答应用开发（中级项目）》（完整版）

《微信小程序进阶实战之分答应用开发（中级项目）》（完整版）

时间：2023-05-18 10:43:14

找到Matrix eQTL这个包，看下文章Matrix eQTL: ultra fast eQTL analysis via large matrix operations（/10.1093/bioinformatics/bts163

）

eQTL（表达数量性状位点）计算transcript-SNP 的关系，即分析SNP与基因的表达是否相关。由于计算数量巨大，很多人都用较小的数据来做。因此该作者开发了Matrix eQTL，用于处理大数据，支持additive linear and ANOVA models with covariates，并且可以将cis- and trans-eQTLs分开计算。

Matrix eQTL相较于其他软件如FastMap — 18.4 min, Merlin — 12.3 min, Plink — 9.0 min, Matrix eQTL — 5.7 min and snpMatrix — 3.3 min要快，它设置一个阈值，只有超过这个阈值的p值才会被计算。

采用的是线型回归模型，g为基因表达情况，s为SNP分型结果。

说明文档http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/runit.html

示例数据：http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/R.html

分析过程很简单，首先设置好要分析文件的路径和名称：

install.packages("MatrixEQTL")

# 设置数据目录，示例数据放在包的安装目录下了。

base.dir = find.package("MatrixEQTL")

#设置分析的模型

useModel = modelLINEAR; # modelANOVA or modelLINEAR or modelLINEAR_CROSS

#设置SNP文件的名称

SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");

# 设置表达数据文件的名称

expression_file_name = paste(base.dir, "/data/GE.txt", sep="");

# 设置协变量文件的名称

# 无协变量设置为character()

covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

output_file_name = tempfile();

提供了三种分析模型供选择

(1) modelLINEAR

Model: useModel = modelLINEAR

Equation: expression = α + ∑k βk⋅covariatek + γ⋅genotype_additive

Testing for significance of: γ

Test statistic: t-statistic

(2) modelANOVA

Model: useModel = modelANOVA

Equation: expression = α + ∑k βk⋅covariatek + γ1⋅genotype_additive + γ2⋅genotype_dominant

Testing for significance of: (γ1,γ2) pair

Test statistic: F-statistic

(3) modelLINEAR_CROSS

Model: useModel = modelLINEAR_CROSS

Equation:

expression = α + ∑k βk⋅covariatek + γ⋅genotype_additive + δ⋅genotype_additive⋅covariateK

Testing for significance of: δ

Test statistic: t-statistic

注意这里要设置一个p值的阈值，一般越大的数据量阈值设的越小，之前说过它会按这个阈值来计算结果，如果设的过大，分析耗时并且输出很多结果。输出的结果都储存在output_file_name里

pvOutputThreshold = 1e-2

# 设置协变量矩阵为 numeric()，很少用，默认

errorCovariance = numeric()

# 这里建立了一个SlicedData的新对象，用于存放martix的数据，并设置存放数据的格式

snps = SlicedData$new();

# 设置数据分隔符为tab

snps$fileDelimiter = "\t"; # the TAB character

# 设置缺失值为NA

snps$fileOmitCharacters = "NA"; # denote missing values;

snps$fileSkipRows = 1;# one row of column labels

snps$fileSkipColumns = 1; # one column of row labels

snps$fileSliceSize = 2000; # read file in pieces of 2,000 rows

snps$LoadFile( SNP_file_name );

## Load gene expression data

gene = SlicedData$new();

gene$fileDelimiter = "\t"; # the TAB character

gene$fileOmitCharacters = "NA"; # denote missing values;

gene$fileSkipRows = 1;# one row of column labels

gene$fileSkipColumns = 1; # one column of row labels

gene$fileSliceSize = 2000; # read file in slices of 2,000 rows

gene$LoadFile(expression_file_name);

## Load covariates

cvrt = SlicedData$new();

cvrt$fileDelimiter = "\t"; # the TAB character

cvrt$fileOmitCharacters = "NA"; # denote missing values;

cvrt$fileSkipRows = 1;# one row of column labels

cvrt$fileSkipColumns = 1; # one column of row labels

看下文件格式，snp文件用0,1,2表示，基因文件是表达量，cvrt是covariates：

image.png

设置好文件后可以用 Matrix_eQTL_engine主函数进行eQTL分析了，参数snps设置SNP文件，gene设置表达量文件，cvrt设置协变量。然后将每行的SNP和gene放到一块进行线性回归的分析。

me = Matrix_eQTL_engine(

snps = snps,

gene = gene,

cvrt = cvrt,

output_file_name = output_file_name,

pvOutputThreshold = pvOutputThreshold,

useModel = useModel,

errorCovariance = errorCovariance,

verbose = TRUE,

pvalue.hist = TRUE,

min.pv.by.genesnp = FALSE,

noFDRsaveMemory = FALSE);

运行完后得到的me对象是一个list：

image.png

输出文件的每行eqtl为：SNP name, a transcript name, estimate of the effect size, t- or F-statistic, p-value, and FDR。

Matrix eQTL可以区分顺式（cis，local）和反式（trans，distant）eQTL，主要用Matrix_eQTL_main函数来分析。其包括以下几个参数：

* `pvOutputThreshold.cis` – p-value threshold for cis-eQTLs.

* `output_file_name.cis` – detected cis-eQTLs are saved in this file.

* `cisDist` – maximum distance at which gene-SNP pair is considered local.

* `snpspos` – data frame with information about SNP locations, must have 3 columns - SNP name, chromosome, and position. See [sample SNP location file](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/Sample_Data/snpsloc.txt).

* `genepos` – data frame with information about gene locations, must have 4 columns - the name, chromosome, and positions of the left and right ends. See [sample gene location file](http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/Sample_Data/geneloc.txt).

下面来看具体代码：

# source("Matrix_eQTL_R/Matrix_eQTL_engine.r");

library(MatrixEQTL)

## Location of the package with the data files.

base.dir = find.package('MatrixEQTL');

# base.dir = '.';

## Settings

# Linear model to use, modelANOVA, modelLINEAR, or modelLINEAR_CROSS

useModel = modelLINEAR; # modelANOVA, modelLINEAR, or modelLINEAR_CROSS

# Genotype file name

SNP_file_name = paste(base.dir, "/data/SNP.txt", sep="");

snps_location_file_name = paste(base.dir, "/data/snpsloc.txt", sep="");

# Gene expression file name

expression_file_name = paste(base.dir, "/data/GE.txt", sep="");

gene_location_file_name = paste(base.dir, "/data/geneloc.txt", sep="");

# Covariates file name

# Set to character() for no covariates

covariates_file_name = paste(base.dir, "/data/Covariates.txt", sep="");

# Output file name

output_file_name_cis = tempfile();

output_file_name_tra = tempfile();

# Only associations significant at this level will be saved

pvOutputThreshold_cis = 2e-2;

pvOutputThreshold_tra = 1e-2;

# Error covariance matrix

# Set to numeric() for identity.

errorCovariance = numeric();

# errorCovariance = read.table("Sample_Data/errorCovariance.txt");

# Distance for local gene-SNP pairs

cisDist = 1e6;

## Load genotype data

snps = SlicedData$new();

snps$fileDelimiter = "\t"; # the TAB character

snps$fileOmitCharacters = "NA"; # denote missing values;

snps$fileSkipRows = 1;# one row of column labels

snps$fileSkipColumns = 1; # one column of row labels

snps$fileSliceSize = 2000; # read file in slices of 2,000 rows

snps$LoadFile(SNP_file_name);

## Load gene expression data

gene = SlicedData$new();

gene$fileDelimiter = "\t"; # the TAB character

gene$fileOmitCharacters = "NA"; # denote missing values;

gene$fileSkipRows = 1;# one row of column labels

gene$fileSkipColumns = 1; # one column of row labels

gene$fileSliceSize = 2000; # read file in slices of 2,000 rows

gene$LoadFile(expression_file_name);

## Load covariates

cvrt = SlicedData$new();

cvrt$fileDelimiter = "\t"; # the TAB character

cvrt$fileOmitCharacters = "NA"; # denote missing values;

cvrt$fileSkipRows = 1;# one row of column labels

cvrt$fileSkipColumns = 1; # one column of row labels

if(length(covariates_file_name)>0) {

cvrt$LoadFile(covariates_file_name);

}

## Run the analysis

snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);

genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);

me = Matrix_eQTL_main(

snps = snps,

gene = gene,

cvrt = cvrt,

output_file_name = output_file_name_tra,

pvOutputThreshold = pvOutputThreshold_tra,

useModel = useModel,

errorCovariance = errorCovariance,

verbose = TRUE,

output_file_name.cis = output_file_name_cis,

pvOutputThreshold.cis = pvOutputThreshold_cis,

snpspos = snpspos,

genepos = genepos,

cisDist = cisDist,

pvalue.hist = "qqplot",

min.pv.by.genesnp = FALSE,

noFDRsaveMemory = FALSE);

unlink(output_file_name_tra);

unlink(output_file_name_cis);

## Results:

cat('Analysis done in: ', me$time.in.sec, ' seconds', '\n');

cat('Detected local eQTLs:', '\n');

show(me$cis$eqtls)

cat('Detected distant eQTLs:', '\n');

show(me$trans$eqtls)

## Plot the Q-Q plot of local and distant p-values

plot(me)

因此，分析自己的数据需要准备

genotype

expression

covariates

gene location

SNP location

这五个文件，前三个需要每列的样本名对应且顺序一致。

作者也提供了生成模拟数据的代码：

# Create an artificial dataset and plot the histogram and Q-Q plot of all p-values

library('MatrixEQTL')

# Number of samples

n = 100;

# Number of variables

ngs = 2000;

# Common signal in all variables (population stratification)

pop = 0.2 * rnorm(n);

# data matrices

snps.mat = matrix(rnorm(n*ngs), ncol = ngs) + pop;

gene.mat = matrix(rnorm(n*ngs), ncol = ngs) + pop + snps.mat*((1:ngs)/ngs)^9/2;

# data objects for Matrix eQTL engine

snps1 = SlicedData$new( t( snps.mat ) );

gene1 = SlicedData$new( t( gene.mat ) );

cvrt1 = SlicedData$new( );

rm(snps.mat, gene.mat)

# Slice data in blocks of 500 variables

snps1$ResliceCombined(500);

gene1$ResliceCombined(500);

# name of temporary output file

filename = tempfile();

# Perform analysis recording information for

# a histogram

meh = Matrix_eQTL_engine(

snps = snps1,

gene = gene1,

cvrt = cvrt1,

output_file_name = filename,

pvOutputThreshold = 1e-100,

useModel = modelLINEAR,

errorCovariance = numeric(),

verbose = TRUE,

pvalue.hist = 100);

unlink( filename );

# png(filename = "histogram.png", width = 650, height = 650)

plot(meh, col="grey")

# dev.off();

# Perform the same analysis recording information for

# a Q-Q plot

meq = Matrix_eQTL_engine(

snps = snps1,

gene = gene1,

cvrt = cvrt1,

output_file_name = filename,

pvOutputThreshold = 1e-6,

useModel = modelLINEAR,

errorCovariance = numeric(),

verbose = TRUE,

pvalue.hist = "qqplot");

unlink( filename );

# png(filename = "QQplot.png", width = 650, height = 650)

plot(meq, pch = 16, cex = 0.7)

# dev.off();

阅读推荐：

生信技能树公益视频合辑：学习顺序是linux，r，软件安装，geo，小技巧，ngs组学！

B站链接：/space/338686099

YouTube链接：/channel/UC67sImqK7V8tSWHMG8azIVA/playlists

生信工程师入门最佳指南：https://mp./s/vaX4ttaLIa19MefD86WfUA

学徒培养：https://mp./s/3jw3_PgZXYd7FomxEMxFmw

---------------------

作者：weixin_34295316

来源：CSDN

原文：/weixin_34295316/article/details/87552069

本内容不代表本网观点和政治立场，如有侵犯你的权益请联系我们处理。

网友评论

网友评论仅供其表达个人看法，并不表明网站立场。