Transcriptome Study Notes: Transcript Assembly and Quantification with StringTie [Beginner-Friendly Notes]

Posted: 2023-05-31 12:43:33

date : .07.25

recorder : CYH-BI

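Please note: this article is a record of my own learning. It carries no authority and is meant only to give beginners some ideas and a reference.

Zhihu address of this article: /p/645770755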

Transcript assembly and quantification with StringTie

About the software

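StringTie is a fast and efficient assembler of RNA-Seq alignments into potential transcripts. Its input can include not only the short-read alignments that other transcript assemblers also use, but also alignments of longer sequences. To identify differentially expressed genes between experiments, StringTie's output can be processed with specialized software such as Ballgown, Cuffdiff, or others (DESeq2, edgeR, etc.).

StringTie applies a network flow algorithm that originated in optimization theory, together with an optional de novo assembly step, to assemble these short reads into transcripts. Compared with the other transcript assemblers it was benchmarked against, StringTie is reported to give more accurate assemblies and better expression estimates, and to recover more assembled transcripts.

A useful link: https://phantom-aria.github.io//04/17/a.html (this post solved a lot of my problems)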

Installing StringTie

Method 1: install from the official release package

1. Download the package

wget http://ccb.jhu.edu/software/stringtie/dl/stringtie-2.2.1.Linux_x86_64.tar.gz

2. Extract it

tar -zxvf stringtie-2.2.1.Linux_x86_64.tar.gz

3. Configure the environment

vim ~/.bashrc
# add the following line to ~/.bashrc, then save and exit
export PATH="$PATH:/home/cyh/biosoft/stringtie-2.2.1.Linux_x86_64"
source ~/.bashrc

Method 2: install with conda

conda install -c bioconda stringtie

Using StringTie

Use -h or --help to see the options and usage

--mix : both short and long read data alignments are provided
        (long read alignments must be the 2nd BAM/CRAM input file)
--rf : assume stranded library fr-firststrand
--fr : assume stranded library fr-secondstrand
-G reference annotation to use for guiding the assembly process (GTF/GFF)
--conservative : conservative transcript assembly, same as -t -c 1.5 -f 0.05
--ptf : load point-features from a given 4 column feature file <f_tab>
-o output path/file name for the assembled transcripts GTF (default: stdout)
-l name prefix for output transcripts (default: STRG)
-f minimum isoform fraction (default: 0.01)
-L long reads processing; also enforces -s 1.5 -g 0 (default:false)
-R if long reads are provided, just clean and collapse the reads but
   do not assemble
-m minimum assembled transcript length (default: 200)
-a minimum anchor length for junctions (default: 10)
-j minimum junction coverage (default: 1)
-t disable trimming of predicted transcripts based on coverage
   (default: coverage trimming is enabled)
-c minimum reads per bp coverage to consider for multi-exon transcript
   (default: 1)
-s minimum reads per bp coverage to consider for single-exon transcript
   (default: 4.75)
-v verbose (log bundle processing details)
-g maximum gap allowed between read mappings (default: 50)
-M fraction of bundle allowed to be covered by multi-hit reads (default:1)
-p number of threads (CPUs) to use (default: 1)
-A gene abundance estimation output file
-E define window around possibly erroneous splice sites from long reads to
   look out for correct splice sites (default: 25)
-B enable output of Ballgown table files which will be created in the
   same directory as the output GTF (requires -G, -o recommended)
-b enable output of Ballgown table files but these files will be
   created under the directory path given as <dir_path>
-e only estimate the abundance of given reference transcripts (requires -G)
--viral : only relevant for long reads from viral data where splice sites
   do not follow consensus (default:false)
-x do not assemble any transcripts on the given reference sequence(s)
-u no multi-mapping correction (default: correction enabled)
--ref/--cram-ref reference genome FASTA file for CRAM input

Transcript merge usage mode:

  stringtie --merge [Options] {gtf_list | strg1.gtf ...}
With this option StringTie will assemble transcripts from multiple
input files generating a unified non-redundant set of isoforms. In this mode
the following options are available:
  -G <guide_gff> reference annotation to include in the merging (GTF/GFF3)
  -o <out_gtf>   output file name for the merged transcripts GTF
                 (default: stdout)
  -m <min_len>   minimum input transcript length to include in the merge
                 (default: 50)
  -c <min_cov>   minimum input transcript coverage to include in the merge
                 (default: 0)
  -F <min_fpkm>  minimum input transcript FPKM to include in the merge
                 (default: 1.0)
  -T <min_tpm>   minimum input transcript TPM to include in the merge
                 (default: 1.0)
  -f <min_iso>   minimum isoform fraction (default: 0.01)
  -g <gap_len>   gap between transcripts to merge together (default: 250)
  -i             keep merged transcripts with retained introns; by default
                 these are not kept unless there is strong evidence for them
  -l <label>     name prefix for output transcripts (default: MSTRG)

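Assembling a single sample

For each sample that has been sorted and converted to BAM, run StringTie with the genome annotation file to produce a .gtf for the later merge step. (Note: the input file must be sorted.)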

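stringtie -p 3 -e -G /home/cyh/Desktop/hugene_dir/GCF_000001405.40_GRCh38.p14_genomic.gff -o ly1.gtf /home/cyh/Desktop/his_result_sample1/sample1_sorted.bam

-p 3: use 3 threads

-G: genome annotation file (.gff or .gtf)

-o: output file for this sample (.gtf)

Input: the sorted sample file (.bam), given as a positional argument. StringTie has no -i option for input; in the help text above, -i appears only in --merge mode, where it keeps merged transcripts with retained introns.

-e: if you do not need novel transcripts, be sure to add -e.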

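If the organism under study is poorly annotated (few people work on it and the existing annotation is incomplete), we need to reconstruct transcripts ourselves, and in that case -e should be left out. If the annotation is very complete, as for a model organism such as Arabidopsis, there is no need to reconstruct novel transcripts: the existing reference annotation is enough, so use -e and no new transcripts will be predicted.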

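One more important point about -e: only output generated with -e can be passed to the prepDE.py3 script to obtain a read count matrix (that is, to do the quantification).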
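The choice described above (add -e for a well-annotated organism, drop it when novel transcripts are wanted) can be sketched as a small command builder. This is only an illustration: the function name `build_stringtie_cmd` and the `reference_only` switch are my own, and the file names in the example are placeholders. It uses only the flags already shown in this section (-p, -G, -o, -e) and builds the argument list without running anything.

```python
from typing import List


def build_stringtie_cmd(bam: str, annotation: str, out_gtf: str,
                        threads: int = 3,
                        reference_only: bool = True) -> List[str]:
    """Build a StringTie argument list (hypothetical helper, not part of StringTie)."""
    cmd = ["stringtie", "-p", str(threads), "-G", annotation, "-o", out_gtf]
    if reference_only:
        # -e: only estimate abundance of the reference transcripts, no novel ones
        cmd.append("-e")
    # the sorted BAM file is passed as a positional argument
    cmd.append(bam)
    return cmd


if __name__ == "__main__":
    # placeholder file names, for illustration only
    print(" ".join(build_stringtie_cmd("sample1_sorted.bam", "genes.gff", "ly1.gtf")))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` once the paths point at real files.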

Recommended links for this part:

1. StringTie usage guide - Jianshu

2. https://phantom-aria.github.io//04/17/a.html

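Merging assemblies from multiple samples

Once every sample has been assembled individually, the per-sample assemblies can be merged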

stringtie --merge -p 3 \
ly1.gtf \
ly2.gtf \
... (omitted) \
lyn.gtf \
-G /home/cyh/Desktop/hugene_dir/GCF_000001405.40_GRCh38.p14_genomic.gff \
-o stringtied_merged.gtf

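The input is the set of .gtf files produced by the single-sample assemblies

-G: genome annotation file

The output is one merged .gtf file (named stringtied_merged.gtf here)

stringtie --merge [options] gtf.list: transcript merge mode. In this mode StringTie takes a list of GTF files and consolidates their transcripts into a non-redundant set. This is useful when processing multiple RNA-seq samples: because the transcriptome varies across time and tissue, the transcripts assembled from each sample can be combined without redundancy. If a reference annotation is supplied with -G, it is merged in as well. The final output is a single complete .gtf file that can be used for quantification.

The resulting stringtied_merged.gtf can be used to generate results for the Ballgown package; see the quantification section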

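Quantification

There are several ways to quantify

Option 1 (not recommended):

The results of this approach are meant for the Ballgown package. The -B option generates *.ctab files that ballgown uses for differential expression analysis. Taking sample1 as an example, six files are produced (one .gtf and five *.ctab). Give each sample its own output directory; otherwise the results of different samples overwrite each other. The results are then read with the Ballgown package (the RStudio side is not covered here).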

stringtie -e -B -p 4 -G stringtied_merged.gtf -o sample1-ballgown.gtf /home/cyh/Desktop/his_result_sample1/sample1_sorted.bam

-G is followed by a gtf or gff file; using the stringtied_merged.gtf produced by --merge above is recommended

-o: the output .gtf file

In the output GTF file, three expression values are given for each transcript:

1. coverage

2. TPM

3. FPKM
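To make these three values concrete, here is a minimal sketch that pulls them out of a StringTie GTF. It assumes the usual layout in which each transcript line carries `cov`, `FPKM` and `TPM` as quoted attributes in the ninth column. The function name and the demo lines are made up for illustration.

```python
import re
from typing import Dict, Iterable, Iterator

# matches attribute pairs such as: TPM "5.1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def transcript_expression(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield coverage, FPKM and TPM for every transcript line of a GTF."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # exon lines do not carry FPKM/TPM
        attrs = dict(ATTR_RE.findall(fields[8]))
        yield {
            "transcript_id": attrs.get("transcript_id"),
            "cov": float(attrs["cov"]),
            "FPKM": float(attrs["FPKM"]),
            "TPM": float(attrs["TPM"]),
        }


# made-up demo lines, for illustration only
demo = [
    "# StringTie version 2.2.1\n",
    'chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t'
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; cov "12.5"; FPKM "3.2"; TPM "5.1";\n',
    'chr1\tStringTie\texon\t100\t900\t1000\t+\t.\t'
    'gene_id "STRG.1"; transcript_id "STRG.1.1"; exon_number "1"; cov "12.5";\n',
]

if __name__ == "__main__":
    for row in transcript_expression(demo):
        print(row)
```

For a real file, pass an open file handle, for example `transcript_expression(open("sample1-ballgown.gtf"))`.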

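An example from my own scripts. Each sample gets its own directory, because apart from the .gtf file the output files have the same names for every sample and would otherwise be overwritten. I have three .bam files and quantify them against the stringtie_merged.gtf from the multi-sample merge; each run produces one .gtf file and five .ctab files. The .ctab files are then read by the Ballgown package.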

for i in {1,2,3}
do
    mkdir sample_ly${i}
    cd ./sample_ly${i}
    stringtie -e -B -p 20 -G /home/chenyh/ly_NT_RNAseq/stringtie_result/stringtie_merged.gtf -o ly${i}-ballgown.gtf /home/chenyh/ly_NT_RNAseq/samtools_result/ly${i}.bam
    cd ../
done

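With the -B option, StringTie generates five *.ctab files for each sample:

e_data.ctab: exon-level expression values
i_data.ctab: intron-level expression values
t_data.ctab: transcript-level expression values
e2t.ctab: a table with two columns, e_id and t_id, showing which exons belong to which transcripts. These ids match the ids in the e_data and t_data tables.
i2t.ctab: a table with two columns, i_id and t_id, showing which introns belong to which transcripts. These ids match the ids in the i_data and t_data tables.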
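Because results get overwritten when samples share a directory, it can help to check each sample directory before loading it into Ballgown. The sketch below is a hypothetical helper (the name `missing_ctab_files` is mine); it only checks that the five files listed above exist and does not validate their contents.

```python
from pathlib import Path
from typing import List

# the five Ballgown tables that stringtie -B writes for each sample
CTAB_FILES = ("e_data.ctab", "i_data.ctab", "t_data.ctab", "e2t.ctab", "i2t.ctab")


def missing_ctab_files(sample_dir: str) -> List[str]:
    """Return the names of expected .ctab files that are absent from sample_dir."""
    directory = Path(sample_dir)
    return [name for name in CTAB_FILES if not (directory / name).is_file()]


if __name__ == "__main__":
    # directory names follow the loop above; adjust to your own layout
    for i in (1, 2, 3):
        sample = f"sample_ly{i}"
        print(sample, "missing:", missing_ctab_files(sample))
```

An empty list means all five tables are present for that sample.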

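For how to use the Ballgown package for the downstream analysis, please see other tutorials.

Option 2 (recommended):

Quantify with the Python script that ships with StringTie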

prepDE.py

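In essence, StringTie's GTF output reports expression at the transcript level, as TPM and FPKM values. To obtain raw counts, the developers provide the prepDE.py script, which computes raw count values. It is used as follows

python prepDE.py \
-i sample_list.txt \
-g gene_count_matrix.csv \
-t transcript_count_matrix.csv

The input file is sample_list.txt, a tab-separated (\t) file with two columns: the first is the sample name and the second is the path to that sample's quantified gtf file. For example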

sampleA	A.stringtie.gtf
sampleB	B.stringtie.gtf
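Writing this file by hand is error-prone once there are many samples, so here is a minimal sketch that builds the two-column, tab-separated text. The function name `format_sample_list` is my own, and the sample names and paths in the example are placeholders modelled on the Ballgown loop earlier in this article.

```python
from typing import Iterable, Tuple


def format_sample_list(samples: Iterable[Tuple[str, str]]) -> str:
    """Return the text of a sample list: one 'name<TAB>gtf_path' line per sample."""
    lines = []
    for name, gtf_path in samples:
        if "\t" in name or "\t" in gtf_path:
            raise ValueError("sample names and paths must not contain tabs")
        lines.append(f"{name}\t{gtf_path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # placeholder names and paths, for illustration only
    samples = [(f"ly{i}", f"sample_ly{i}/ly{i}-ballgown.gtf") for i in (1, 2, 3)]
    print(format_sample_list(samples), end="")
```

The returned text can be saved with `open("sample_list.txt", "w").write(...)` and then passed to prepDE.py with -i.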

The .gtf files used here can be the ones produced in the single-sample assembly step.

The script outputs raw count values at both the gene and the transcript level, producing two files: gene_count_matrix.csv and transcript_count_matrix.csv. These can then be analysed further with DESeq2.
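Before handing the matrix to DESeq2 it is worth a quick look at what was produced. The sketch below assumes that gene_count_matrix.csv has a header row, the gene ID in the first column and one integer count column per sample; that layout is my assumption and is worth checking against your own file. It loads the counts and drops genes that are zero in every sample. The function name is hypothetical and the demo data is made up.

```python
import csv
import io
from typing import Dict, List, TextIO, Tuple


def load_nonzero_counts(handle: TextIO) -> Tuple[List[str], Dict[str, List[int]]]:
    """Read a count matrix and keep only genes with at least one non-zero count."""
    reader = csv.reader(handle)
    header = next(reader)
    samples = header[1:]  # first column is assumed to hold the gene ID
    counts: Dict[str, List[int]] = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        values = [int(v) for v in row[1:]]
        if any(values):
            counts[row[0]] = values
    return samples, counts


# made-up demo matrix, for illustration only
demo_csv = "gene_id,ly1,ly2,ly3\ngeneA,10,0,3\ngeneB,0,0,0\ngeneC,7,8,9\n"

if __name__ == "__main__":
    samples, counts = load_nonzero_counts(io.StringIO(demo_csv))
    print(samples)
    print(counts)
```

For a real run, open the file with `open("gene_count_matrix.csv", newline="")` and pass the handle in.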

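That is the end of this article. It is the result of my own study and practice and draws on many references; if any expert can point out mistakes, I would be grateful.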
