A good place to start would be to check out the DREAM Challenges website.
The DREAM Challenges are a series of competitions that evaluate the accuracy (amongst other things) of a wide variety of bioinformatics algorithms, software, and platforms.
For your specific question, have a look at the mutation caller challenge results from 2014.
The top performers include DELLY, MantaStrelka, and the Ken_Chen_Lab entry (novoBreak algorithm), which took the top spot in 3 of the 5 challenges, with accuracy scores in the high 80s and low 90s (%).
I personally use the GATK pipeline described below to perform SNP calling analysis:
I have used this pipeline for a couple of years and have not taken the time to re-evaluate whether it is still the most accurate. I did find this 2015 article that compares different variant calling pipelines. Maybe it could help you. Here is the link to the article:
I would add that this might change depending on what kind of variant you’re trying to call (SNV, CNV, structural variant) and under what conditions. Calling de novo variants in a proband but not in control family members, or calling somatic variants from a tumour sample, might require completely different pipelines.
For example, I’ve been thinking about implementing a pipeline from Stephen Scherer’s group to deal with similar cases for de novo variant calling.
Setup details: https://images.nature.com/original/nature-assets/npjgenmed/2016/npjgenmed201627/extref/npjgenmed201627-s2.pdf
Source publication: https://www.nature.com/articles/npjgenmed201627
They describe the different tools and filtering steps they used reasonably well.
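To make the de novo case a bit more concrete, here is a minimal Python sketch of the core idea: keep proband calls that are absent from every family member. This is only an illustration under my own assumptions, not the Scherer group's pipeline. The function name `candidate_de_novo`, the tuple representation of a variant, and the toy call sets are all hypothetical.

```python
from typing import Iterable, Set, Tuple

# Assumed representation: (contig, 1-based position, ref allele, alt allele).
Variant = Tuple[str, int, str, str]


def candidate_de_novo(proband: Iterable[Variant],
                      relatives: Iterable[Iterable[Variant]]) -> Set[Variant]:
    """Return proband variants that are absent from every relative's call set."""
    seen_in_family: Set[Variant] = set()
    for calls in relatives:
        seen_in_family.update(calls)
    return set(proband) - seen_in_family


# Toy, made-up call sets for illustration only.
proband_calls = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")]
mother_calls = [("chr1", 1000, "A", "G")]
father_calls = [("chr3", 42, "G", "A")]

de_novo = candidate_de_novo(proband_calls, [mother_calls, father_calls])
print(sorted(de_novo))
```

A set difference like this only yields candidates. A variant can be missing from a relative's calls simply because coverage was too low at that site, so a real pipeline would also need depth and genotype-quality checks in the relatives.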
I’m also dealing with the same problem that Karen is working on. I’m working with transcriptomes built with a de novo approach (i.e. no genome was used for the assembly). No genomes are available for my species.
I’m trying to figure out the best approach to follow for a variant calling analysis on my type of data. Ideally, I would find a pipeline that has already been used in publications (beyond the publication of the program itself), that is not based on the creation of super transcripts, and that has a low error rate (I know this is a lot to ask, but the approaches that we have taken so far are calling variants in our mitochondrial DNA!! plop!).
Hmm. This is a tough question. If I understand correctly, you have de novo assembled RNA data in which you want to identify variants (SNVs, or large scale?). As you noted, the usual paradigm for this sort of thing is that you have either a reference genome or a database of variants against which you would identify variation. Without multiple samples or a reference genome, what sort of “variation” are you hoping to capture? Perhaps you are trying to identify heterozygous positions in your de novo assembly?
Perhaps others have better ideas, but it seems you are in a difficult situation if you have a single RNA sample of an organism with no reference genome in which you are trying to identify variants.
EDIT: I re-read your comment and noticed that you said you have multiple transcriptomes. So are you trying to compare sequence data across multiple transcriptomes to identify regions/positions of variability in your sequenced population?
You are capturing the essence of my challenges. I don’t have a genome. On the bright side, I do have 2-9 transcriptomes per species (~25 species total).
For my research, I want to study the signatures of selection on multiple genes within a population and between species. I’m trying to broaden the list of genes that are under strong selection in the reproductive systems of echinoderms. To do so, I’m trying to phase each individual transcriptome so that I get two alleles for each of my genes (or regions) of interest.
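As a rough illustration of what "two alleles per gene" means in practice, here is a small Python sketch that applies already-phased heterozygous SNPs to an assembled transcript and writes out both haplotype sequences. It assumes the phasing has been done by some other tool. The function `build_alleles`, the `(position, base on haplotype 1, base on haplotype 2)` tuple layout, and the toy sequence are hypothetical and only meant to show the idea.

```python
from typing import List, Tuple

# Assumed layout: (0-based position, base on haplotype 1, base on haplotype 2).
PhasedSNP = Tuple[int, str, str]


def build_alleles(transcript: str, snps: List[PhasedSNP]) -> Tuple[str, str]:
    """Apply phased SNPs to a transcript and return the two allele sequences."""
    hap1 = list(transcript)
    hap2 = list(transcript)
    for pos, base1, base2 in snps:
        if not 0 <= pos < len(transcript):
            raise ValueError(f"SNP position {pos} is outside the transcript")
        hap1[pos] = base1
        hap2[pos] = base2
    return "".join(hap1), "".join(hap2)


# Toy, made-up transcript and phased SNPs for illustration only.
transcript = "ATGGCCAAGT"
phased = [(3, "G", "A"), (7, "A", "C")]

allele_a, allele_b = build_alleles(transcript, phased)
print(allele_a)
print(allele_b)
```

This handles substitutions only. Indels would shift coordinates and need extra bookkeeping, and the result is only as good as the SNP calls and phasing that feed into it.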
I am not familiar with SNP calling on de novo data, but I stumbled upon this paper called “SNP calling from RNA-seq data without a reference genome”: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5100560/
They have a pipeline available at http://kissplice.prabi.fr/TWAS/ which might do what you want.
Great suggestion. We also thought it looked interesting, so we tried kissplice last year. In general, I found the pipeline to be straightforward and compatible with currently popular pipelines for RNA-seq studies. The problem was that the program was calling SNPs in our mitochondrial data (haploid DNA), so we thought of trying a different pipeline.
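One way to turn the mitochondrial observation into a quick sanity check is to separate calls on contigs known to be haploid from everything else and count them. The Python sketch below assumes the calls are available as tab-delimited, VCF-style lines with the contig name in the first column, and that the mitochondrial contigs have already been identified by some other means. The function `split_by_contig`, the contig names, and the records are hypothetical, and this is not the native kissplice output format.

```python
from typing import Iterable, List, Set, Tuple


def split_by_contig(vcf_lines: Iterable[str],
                    haploid_contigs: Set[str]) -> Tuple[List[str], List[str]]:
    """Split VCF-style records into (kept, flagged) by contig name.

    Header lines (starting with '#') are always kept.
    """
    kept: List[str] = []
    flagged: List[str] = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("#"):
            kept.append(line)
            continue
        contig = line.split("\t", 1)[0]
        (flagged if contig in haploid_contigs else kept).append(line)
    return kept, flagged


# Hypothetical contig names and records, for illustration only.
mito_contigs = {"contig_0042"}
records = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "contig_0001\t120\t.\tA\tG",
    "contig_0042\t87\t.\tC\tT",
    "contig_0007\t300\t.\tG\tA",
]

kept, flagged = split_by_contig(records, mito_contigs)
print(f"{len(flagged)} call(s) on haploid contigs")
```

The number of flagged calls gives a crude feel for how often a pipeline reports heterozygous sites where few are expected, which makes it easier to compare callers on the same data.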