If you have RNA-seq reads aligned to the genome, how do you determine whether reads are evenly distributed along each transcript, genome-wide?
For example, if a transcript has a lot of depth of coverage at the 3′ end, but not at the 5′ end, the reads would not be evenly distributed.
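To make the idea concrete, here is a minimal Python sketch that compares coverage at the two ends of a single transcript. It assumes you already have per-base depth for the transcript as a list ordered 5′ to 3′ in transcript orientation. The function name `end_bias` and the `window` parameter are made up for illustration. The calculation (mean depth in an end window divided by mean depth over the whole transcript) is a simplification and is not meant to reproduce any particular tool's exact numbers.

```python
from statistics import mean

def end_bias(coverage, window=100):
    """Return (five_prime_bias, three_prime_bias, five_to_three_ratio).

    coverage: per-base read depth, ordered 5' -> 3' along the transcript.
    window:   number of bases at each end to average (hypothetical default).
    Returns None when the transcript has no coverage at all, or when the
    3' window is empty of reads and the ratio is undefined.
    """
    if not coverage:
        raise ValueError("coverage must not be empty")
    overall = mean(coverage)
    if overall == 0:
        return None
    w = min(window, len(coverage))
    five = mean(coverage[:w]) / overall    # 5' end depth relative to the transcript mean
    three = mean(coverage[-w:]) / overall  # 3' end depth relative to the transcript mean
    if three == 0:
        return None
    return five, three, five / three

# Toy transcript: shallow at the 5' end, deep at the 3' end.
biased = [10] * 50 + [30] * 50
print(end_bias(biased, window=10))

# Perfectly even coverage gives a ratio of 1.
print(end_bias([5] * 200))
```

A ratio well below 1 means the 3′ end is covered more deeply than the 5′ end, and a ratio near 1 means the two ends are covered about equally.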
Many RNA-seq protocols select for polyadenylated transcripts by targeting the poly(A) tail. This often gives RNA-seq data a coverage bias toward the 3′ end of genes. One easy way to measure this, along with other RNA-seq QC metrics, is to run Picard’s CollectRnaSeqMetrics, which reports “MEDIAN_5PRIME_TO_3PRIME_BIAS” in the metrics file it generates. The closer this number is to 1, the less biased the gene coverage. If you ask for chart output, Picard can also draw a plot of normalized coverage along the transcript, which might be of use.
High 3′ bias can have many causes. More biased libraries usually come from samples with limited or degraded RNA. Some protocols (like ribo-depletion) are usually less biased because they don’t rely on pulling down the poly(A) tail.
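If you want to pull that number out of the report programmatically, something like the following should work. It is a sketch that assumes the usual Picard metrics layout: a line starting with `## METRICS CLASS`, followed by a tab-separated header row and one row of values. The helper name `read_picard_metrics` is hypothetical, and the sample text below is an abbreviated stand-in with invented values, not real output.

```python
def read_picard_metrics(text):
    """Parse the first metrics table of a Picard metrics file into a dict.

    Assumes the header row and the value row directly follow the
    '## METRICS CLASS' line and are tab-separated.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            if i + 2 >= len(lines):
                raise ValueError("metrics section is truncated")
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no '## METRICS CLASS' section found")

# Abbreviated, made-up example of a metrics file (only a few columns shown).
columns = ["PF_BASES", "PF_ALIGNED_BASES", "MEDIAN_5PRIME_BIAS",
           "MEDIAN_3PRIME_BIAS", "MEDIAN_5PRIME_TO_3PRIME_BIAS"]
values = ["1000000", "950000", "0.31", "0.74", "0.42"]
sample = "\n".join([
    "## METRICS CLASS\tpicard.analysis.RnaSeqMetrics",
    "\t".join(columns),
    "\t".join(values),
    "",
    "## HISTOGRAM\tjava.lang.Integer",
])

metrics = read_picard_metrics(sample)
bias = float(metrics["MEDIAN_5PRIME_TO_3PRIME_BIAS"])
print(f"median 5'-to-3' bias: {bias}")
```

For a real run you would read the file with `open(path).read()` and pass the text in. Comparing this value across samples in a batch is a quick way to spot libraries that are much more 3′-biased than the rest.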
Cufflinks tries to address this…
(from the Cufflinks website)
“Cufflinks then estimates the relative abundances of these transcripts based on how many reads support each one, taking into account biases in library preparation protocols.”
An in-depth description of how they do this can be found here: