FORUM: SNP calling of transcriptomes built without a reference genome
  • Dear all, 
    I’m reposting this question from a prior thread in this forum in hopes of getting more feedback. I’m trying to separate transcripts assembled de novo (without a reference genome) into two alleles. 
    Ideally, I would use a pipeline/program that has been previously published and has a low error rate. Do you have a preferred approach for SNP calling / phasing RNA-seq data built without a genome? I would love to read about it. 

    link to previous discussion thread

    2 Answers
  • Hi Vanessa,
    This is definitely not an easy one, but after some internal discussion with the team we’ve come up with the following advice:

    1. It is possible to call RNA-seq variants with the GATK RNAseq pipeline, using your assembled transcriptome in place of the reference genome. There is a brief discussion on the GATK forum where they say this is possible but untested, and therefore outside their ‘best practices’ guidelines. With that said, it might be worth a shot.
    2. In the original thread, you mentioned that you tried KisSplice but did not get good results. Perhaps you could contact the authors; your results might improve with different settings.
    3. Another method is outlined in the PLOS ONE paper “De novo Transcriptome Assembly and SNP Discovery in the Wing Polymorphic Salt Marsh Beetle”, so you can have a look. Essentially, they use the Trinity short-read assembler for the transcriptome assembly, then BWA to align the reads back to the assembled transcripts, and finally SAMtools for the variant calling.
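
    As a rough illustration of option 3, the align-then-call steps could be scripted along these lines. This is only a sketch, not the paper's actual commands: the file names (`Trinity.fasta`, `reads_1.fq`, `reads_2.fq`), the `sample` prefix, and the helper name `build_snp_pipeline` are all hypothetical, and it assumes paired-end reads and uses `bcftools mpileup`/`bcftools call`, which is where the SAMtools variant caller lives in current releases. The function only builds the command lines; nothing runs unless you pass them to `subprocess.run` yourself.

```python
import shlex
from typing import List


def build_snp_pipeline(transcripts: str, reads_1: str, reads_2: str,
                       prefix: str, threads: int = 4) -> List[List[str]]:
    """Return the commands, in order, for aligning reads back to the
    assembled transcripts and calling variants. Nothing is executed here."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.sorted.bam"
    bcf = f"{prefix}.bcf"
    vcf = f"{prefix}.vcf"
    return [
        # Index the de novo assembly so it can serve as the "reference".
        ["bwa", "index", transcripts],
        # Align the paired-end reads back to the assembled transcripts.
        ["bwa", "mem", "-t", str(threads), "-o", sam,
         transcripts, reads_1, reads_2],
        # Coordinate-sort and index the alignments.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        ["samtools", "index", bam],
        # Compute genotype likelihoods, then call variant sites only.
        ["bcftools", "mpileup", "-f", transcripts, "-Ou", "-o", bcf, bam],
        ["bcftools", "call", "-mv", "-Ov", "-o", vcf, bcf],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_snp_pipeline("Trinity.fasta", "reads_1.fq",
                                  "reads_2.fq", "sample"):
        print(shlex.join(cmd))
```

    To actually run it, loop over the returned list with `subprocess.run(cmd, check=True)` once the tools are on your PATH. Default settings are used throughout, so filtering thresholds would still need to be chosen for your data.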

    Hopefully at least one of these leads proves fruitful, and if so, please make sure to post your solution so others might benefit.
    Good luck!

    Hi Jamie and others,

    Thank you for the feedback. I also thought about these avenues as potential solutions, but I have encountered some limitations with them (including being told by one of the program authors that SNPs not explained by amplification errors occur naturally in mitochondrial genes – plopl!).

    Trinity incorporated the GATK pipeline into their variant calling protocol in their last release. Cedar has the current version of their pipeline, and it includes a modified program that one of the Trinity developers edited to make it friendlier to the large amount of data that I have. A new beta version of Trinity and of their variant calling pipeline was just announced, so I hope to keep working with the GATK pipeline through Trinity. We are not sure yet whether the SuperTranscript approach is ideal for our work, but I hope to know for sure in the next month or two.

    For those trying these options for the first time, I would not suggest using BWA to align RNA-seq data to your reference transcripts (BWA has been found to insert unnecessary gaps in isoforms, and transcriptomes are packed with isoforms). There are now programs made for RNA-seq data, such as STAR, that are more sensitive to alternative splicing events (

    Thank you for your time and help,

  • Hi Vanessa,
