OTU ID in Vsearch
Genevieve asked 2 months ago



  • My problem is… I use vsearch with usearch_global and the SILVA_132_LSUParc_tax_silva.fasta reference database. It works well, but when I open the otutabout file (the OTU table in the classic tab-separated plain-text format, a matrix containing the abundances of the OTUs in the different samples), the first line starts with the string ’#OTU ID’ and only the accession number is written in that column, not the complete taxonomy. So I cannot tell whether an OTU is a bacterium, a plant, etc. In the SILVA reference database that I used, the sequence IDs are written like this: >AY187551.1.496 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Xanthobacteraceae;Bradyrhizobium;Bradyrhizobium sp. Ih3-2. I only see the part before Bacteria in the OTU ID column. Is there a way to modify this to get more detail in the OTU ID column?

    Thanks 🙂

    5 Answers
    jflucier Staff answered 2 months ago



  • I have never used vsearch; I mostly use QIIME 2.

    From what I have read in the vsearch documentation, I think your reference FASTA is missing taxonomic information. Your headers only include the OTU IDs that are reported in the output.

    So, to solve this problem, you need to include the taxonomic information in your reference FASTA headers (SILVA_132_LSUParc_tax_silva.fasta). I found and adapted existing code that should do the trick.

    Example reference FASTA:

    $ head ref.fa
    >KC716084.1.1445
    GATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGGGAACAGGGGGCTTGCACCGCTGACGACCGGCGCACGGGGGTGCGTAACGCGTATACAATCTACCTTTTACAGAGAGATACCCCAGAGAAATTTGGAATAATACCTCATAATATTTTTGCTCGGCATCGAGTGATAATTAAAGTTTCGGCGGTAAAAGATGAGTATGCGTCCTATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCGACGATAGGTAGGGGTCCTGAGAGGGAGATCCCCCACACTGGTACTGAGACACGGACCAGACTCTTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGTCGCAAGACTGAACCAGCCATGCCGCGTGCAGGATGAAGGTTCTATGAATCGTAAACTGCTTTTATACACCAAGAAAAACACCCACGTGTGCGCAAATGCCGGTAGGGTATGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGATTCATTGGGTTTAAAGGGTGCGTAGGCGGATCGTTAAGTCAGCGGTGAAATACTGCCGCTTAACTGGAAAATTGCCATTGATACTGTTTATCTAGAGTATGGTAGAGGTAGGTGGAATGTGTTGTGTAGCGGTGAAATGCATAGATATGACACAGAACGCCGATTGCGAAGGCAGCTTACTAAGCCATTACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGATCACTCGCTGTTTGCGATATACAGCAAGCGGCTGAGCGAAAGCATTAAGTGATCCACCTGGGGAGTACGATCGCAAGGTTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTTAAATGTAGGCTGCATTCGGCTGAAAGGCTGATTCCCTTCGGGGCTGCTTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTTCTTAGTTATTACCAAGTTAAGTTGGGGACTCTAAGGAGACTGCCGATGAAACTCGTGAGGAAGGTGGGATGACGTCAAATCAGCGCGGCCCTTATGTCCTGGGCTACACACGTGTTACAATGGTCGGTACAAAGAGCAGCCACTTGGTGACAAGGCGCTAATCTCAAAAGCCGATCTCAGTTCGGATCGAAGTCTGCAACTCGACTTCGTGAAGTTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCAAGCCATGGAAGCTGGGGGTGCCTGAAGTCCGTAACCGCAAGGAGCGGCCTAGGGTAAAACTAGTAACTGGGGCT
    >GBKB01000906.322.1853
    AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAACGGACGAGAAGCTTGCTTCTCTGATGTTAGCGGCGGACGGGTGAGTAACACGTGGATAACCTACCTATAAGACTGGGATAACTTCGGGAAACCGGAGCTAATACCGGATAATATTTTGAACCGCATGGTTCAAAAGTGAAAGACGGTCTTGCTGTCACTTATAGATGGATCCGCGCTGCATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCAACGATGCATAGCCGACCTGAGAGGGTGATCGGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGTTATTAGGGAAGAACATATGTGTAAGTAACTGTGCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAAGTGTTAGGGGGTTTCCGCCCCTTAGTGCTGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGACCGCAAGGTTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAATCTTGACATCCTTTGACAACTCTAGAGATAGAGCCTTCCCCTTCGGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTAAGCTTAGTTGCCATCATTAAGTTGGGCACTCTAAGTTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGACAATACAAAGGGCAGCGAAACCGCGAGGTCAAGCAAATCCCATAAAGTTGTTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGTAGATCAGCATGCTACGGTGAATACGTTCCCGGGTCTTGTACACACCGCCCGTCACACCACGAGAGTTTGTAACACCCGAAGCCGGTGGAGTAACCTTTTAGGAGCTAGCCGTCGAAGGTGGGACAAATGATTGGGGTGAAGTCGTAACAAGGTAGCCGTATCGGAAGGTGCGGCTG
    >JULO01000037.98060.99588
    AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGAACGCTGAGAACTGGTGCTTGCACCGGTTCAAGGAGTTGCGAACGGGTGAGTAACGCGTAGGTAACCTACCTCATAGCGGGGGATAACTATTGGAAACGATAGCTAATACCGCATAAGAGAGACTAACGCATGTTAGTAATTTAAAAGGGGCAATTGCTCCACTATGAGATGGACCTGCGTTGTATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCGACGATACATAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCGGCAATGGGGGCAACCCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTAGAGAAGAATGATGGTGGGAGTGGAAAATCCACCAAGTGACGGTAACTAACCAGAAAGGGACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGAAGTTAAAGGCATTGGCTCAACCAATGTACGCTTTGGAAACTGGAGAACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGGTGTTAGGCCCTTTCCGGGGCTTAGTGCCGGAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGACCGCAAGGTTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGCCCGCTCTAGAGATAGAGTTTTACTTCGGTACATCGGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCATCATTAAGTTGGGCACTCTAGCGAGACTGCCGGTAATAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGTTGGTACAACGAGTCGCAAGCCGGTGACGGCAAGCTAATCTCTTAAAGCCAATCTCAGTTCGGATTGTAGGCTGCAACTCGCCTACATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAGAGTTTGTAACACCCGAAGTCGGTGAGGTAACCTATTAGGAGCCAGCCGCCTAAGGTGGGATAGATGATTGGGGTGAAGTCGTAACAAGGTAGCCGTATCGGAAGGTGCGGCTG
    >HM248444.1.1359
    ATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCGGGCGAGGTTGCTTCGGTAACTGAGCTAGCGGCGGACGGGTGAGTAATGCTTAGGAATCTGCCTATTAGTGGGGGACAACATTCCGAAAGGAATGCTAATACCGCATACGCCCTACGGGGGAAAGCAGGGGATCTTCGGACCTTGCGCTAATAGATGAGCCTAAGTCAGATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGGAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTACTTGGATTAATACTCTAGGATAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTCTTAAGTCGGATGTGAAATCCCTGAGCTTAACTTAGGAATTGCATTCGATACTGGGAAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGATGTCTACTAGCCGTTGGGGCCTTTGAGGCTTTAGTGGCGCAGCTAACGCGATAAGTAGACCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATAGTAAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTTACATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTTTCCTTATTTGCCAGCGGGTTAAGCCGGGAACTTTAAGGATACTGCCAGTGACAAACTGGAGGAAGGCGGGGACGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCTACCTAGCGATAGGATGCTAATCTCAAAAAGCCGATCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGCCT
    >AB650512.1.1469
    ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGCAGCACAGAAAGAGCTTGCTCTTTGGGTGGCGAGTGGCGAACGGGTGAGTAATGTCTGGGAAACTGCCCAATGGAGGGGGATAACCATTGGAAACGATGGCTAATACCGCATAATGTCGATAAGACCAAAGTGGGGGACCTATTTGGCCTCATACCATTGGATGTGCCCAGATGGGATTAGCTAGTAGGTAGGGTAATGGCTTACCTAGGCAACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGGATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCTTTCGAGTTGTAAAGTACTTTCAGTAGGAAGGAAGGCAGTAAACCTAATATGTTTATTGATTGACATTACCTGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGAGCACGTAGGCGGCCTATTAAGTCAGATGTGAAATCCCTGGGCTTAACCTAGGAACTGCATTTGAAACTGGCAGGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGTGTAGATATCTGGAGGAATACCAGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTCGACTTGGAAGTTGTATCCTTTGAGATGTGGCTTCCGAAGCTAACGCATTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACACGAAGAACCTTACCTGGCCTTGACATCCAGAGAACATTCTAGAAATAGAATAGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGTGGTTCGGCCAGGAACTCAAAGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGCGTATACAAAGAGAAGCAACCTCGTAAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTAGATCAGAATGCTACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCCTCGTGGAGAGCGCTTACCACTTTGTGATTCATGACTGGGGTG

    Example taxonomic annotation file:

    $ head taxo.txt
    KC716084.1.1445 D_0__Bacteria;D_1__Proteobacteria;D_2__Epsilonproteobacteria;D_3__Campylobacterales;D_4__Helicobacteraceae;D_5__Sulfuricurvum;Ambiguous_taxa
    GBKB01000906.322.1853 D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Peptococcaceae;D_5__Thermincola;D_6__uncultured bacterium
    JULO01000037.98060.99588 D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;D_4__Porphyromonadaceae;D_5__Porphyromonas;D_6__uncultured bacterium
    HM248444.1.1359 D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Enterobacteriales;D_4__Enterobacteriaceae;D_5__Klebsiella;D_6__uncultured organism
    AB650512.1.1469 D_0__Bacteria;D_1__Chloroflexi;D_2__Ktedonobacteria;D_3__Ktedonobacterales;D_4__HSB OF53-F07;D_5__uncultured bacterium;D_6__uncultured bacterium

    The Perl code that merges the taxonomic information into the FASTA headers:

    $ cat reformat_fasta_headers.pl
    #!/usr/bin/perl -w
    =usage

    reformat_fasta_headers.pl -f fasta_file -a annotation file (2 columns tab delimited: find col 1 and replace with col2)

    =cut

    use strict;
    use warnings;
    use Bio::SeqIO;
    use Getopt::Long;

    #set command line arguments
    my ($fasta, $annot) = @ARGV;
    my $version="reformat_fasta_headers.pl\tv0.0.1";
    GetOptions(
    'f|fasta:s'=>\$fasta,
    'a|annot:s'=>\$annot,
    'v|version'=>sub{print $version."\n"; exit;},
    );

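    # Load the annotation file into a hash (sequence ID => taxonomy).
    # Split on the first run of whitespace only: taxonomy strings may contain spaces.
    open my $fh, '<', $annot or die "Cannot open $annot: $!";
    my %annot;
    while ( my $line = <$fh> ) {
    chomp $line;
    my ($id, $tax) = split /\s+/, $line, 2;
    $annot{$id} = $tax if defined $tax;
    }
    close $fh;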

    my $in = Bio::SeqIO->new( -file => $fasta, -format => 'Fasta' );

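    while ( my $seq = $in->next_seq() ) {
    # Append the taxonomy when the ID is annotated; otherwise keep the bare ID.
    # (The original "." bound tighter than "//", so the fallback never applied.)
    my $id = $seq->id;
    my $seqID = defined $annot{$id} ? "$id $annot{$id}" : $id;
    print ">$seqID\n" . $seq->seq . "\n";
    }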

    Output of code execution:

    $ perl reformat_fasta_headers.pl -f ref.fa -a taxo.txt
    >KC716084.1.1445 D_0__Bacteria;D_1__Proteobacteria;D_2__Epsilonproteobacteria;D_3__Campylobacterales;D_4__Helicobacteraceae;D_5__Sulfuricurvum;Ambiguous_taxa
    GATGAACGCTAGCGGCAGGCCTAACACATGCAAGTCGAGGGGGAACAGGGGGCTTGCACCGCTGACGACCGGCGCACGGGGGTGCGTAACGCGTATACAATCTACCTTTTACAGAGAGATACCCCAGAGAAATTTGGAATAATACCTCATAATATTTTTGCTCGGCATCGAGTGATAATTAAAGTTTCGGCGGTAAAAGATGAGTATGCGTCCTATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCGACGATAGGTAGGGGTCCTGAGAGGGAGATCCCCCACACTGGTACTGAGACACGGACCAGACTCTTACGGGAGGCAGCAGTGAGGAATATTGGTCAATGGTCGCAAGACTGAACCAGCCATGCCGCGTGCAGGATGAAGGTTCTATGAATCGTAAACTGCTTTTATACACCAAGAAAAACACCCACGTGTGCGCAAATGCCGGTAGGGTATGAATAAGCATCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGATGCAAGCGTTATCCGGATTCATTGGGTTTAAAGGGTGCGTAGGCGGATCGTTAAGTCAGCGGTGAAATACTGCCGCTTAACTGGAAAATTGCCATTGATACTGTTTATCTAGAGTATGGTAGAGGTAGGTGGAATGTGTTGTGTAGCGGTGAAATGCATAGATATGACACAGAACGCCGATTGCGAAGGCAGCTTACTAAGCCATTACTGACGCTGAGGCACGAAAGCGTGGGGATCGAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGATCACTCGCTGTTTGCGATATACAGCAAGCGGCTGAGCGAAAGCATTAAGTGATCCACCTGGGGAGTACGATCGCAAGGTTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGAGGAACATGTGGTTTAATTCGATGATACGCGAGGAACCTTACCTGGGCTTAAATGTAGGCTGCATTCGGCTGAAAGGCTGATTCCCTTCGGGGCTGCTTACAAGGTGCTGCATGGTTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTGTTCTTAGTTATTACCAAGTTAAGTTGGGGACTCTAAGGAGACTGCCGATGAAACTCGTGAGGAAGGTGGGATGACGTCAAATCAGCGCGGCCCTTATGTCCTGGGCTACACACGTGTTACAATGGTCGGTACAAAGAGCAGCCACTTGGTGACAAGGCGCTAATCTCAAAAGCCGATCTCAGTTCGGATCGAAGTCTGCAACTCGACTTCGTGAAGTTGGATTCGCTAGTAATCGCGCATCAGCCATGGCGCGGTGAATACGTTCCCGGGCCTTGCACACACCGCCCGTCAAGCCATGGAAGCTGGGGGTGCCTGAAGTCCGTAACCGCAAGGAGCGGCCTAGGGTAAAACTAGTAACTGGGGCT
    >GBKB01000906.322.1853 D_0__Bacteria;D_1__Firmicutes;D_2__Clostridia;D_3__Clostridiales;D_4__Peptococcaceae;D_5__Thermincola;D_6__uncultured bacterium
    AGAGTTTGATCCTGGCTCAGGATGAACGCTGGCGGCGTGCCTAATACATGCAAGTCGAGCGAACGGACGAGAAGCTTGCTTCTCTGATGTTAGCGGCGGACGGGTGAGTAACACGTGGATAACCTACCTATAAGACTGGGATAACTTCGGGAAACCGGAGCTAATACCGGATAATATTTTGAACCGCATGGTTCAAAAGTGAAAGACGGTCTTGCTGTCACTTATAGATGGATCCGCGCTGCATTAGCTAGTTGGTAAGGTAACGGCTTACCAAGGCAACGATGCATAGCCGACCTGAGAGGGTGATCGGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGTTATTAGGGAAGAACATATGTGTAAGTAACTGTGCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAAGTGTTAGGGGGTTTCCGCCCCTTAGTGCTGCAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGACCGCAAGGTTGAAACTCAAAGGAATTGACGGGGACCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAAATCTTGACATCCTTTGACAACTCTAGAGATAGAGCCTTCCCCTTCGGGGGACAAAGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTAAGCTTAGTTGCCATCATTAAGTTGGGCACTCTAAGTTGACTGCCGGTGACAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGATTTGGGCTACACACGTGCTACAATGGACAATACAAAGGGCAGCGAAACCGCGAGGTCAAGCAAATCCCATAAAGTTGTTCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGTAGATCAGCATGCTACGGTGAATACGTTCCCGGGTCTTGTACACACCGCCCGTCACACCACGAGAGTTTGTAACACCCGAAGCCGGTGGAGTAACCTTTTAGGAGCTAGCCGTCGAAGGTGGGACAAATGATTGGGGTGAAGTCGTAACAAGGTAGCCGTATCGGAAGGTGCGGCTG
    >JULO01000037.98060.99588 D_0__Bacteria;D_1__Bacteroidetes;D_2__Bacteroidia;D_3__Bacteroidales;D_4__Porphyromonadaceae;D_5__Porphyromonas;D_6__uncultured bacterium
    AGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAATACATGCAAGTAGAACGCTGAGAACTGGTGCTTGCACCGGTTCAAGGAGTTGCGAACGGGTGAGTAACGCGTAGGTAACCTACCTCATAGCGGGGGATAACTATTGGAAACGATAGCTAATACCGCATAAGAGAGACTAACGCATGTTAGTAATTTAAAAGGGGCAATTGCTCCACTATGAGATGGACCTGCGTTGTATTAGCTAGTTGGTGAGGTAAAGGCTCACCAAGGCGACGATACATAGCCGACCTGAGAGGGTGATCGGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTAGGGAATCTTCGGCAATGGGGGCAACCCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTAGAGAAGAATGATGGTGGGAGTGGAAAATCCACCAAGTGACGGTAACTAACCAGAAAGGGACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTTTTAAGTCTGAAGTTAAAGGCATTGGCTCAACCAATGTACGCTTTGGAAACTGGAGAACTTGAGTGCAGAAGGGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCGGTGGCGAAAGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAGTGCTAGGTGTTAGGCCCTTTCCGGGGCTTAGTGCCGGAGCTAACGCATTAAGCACTCCGCCTGGGGAGTACGACCGCAAGGTTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGAAGAACCTTACCAGGTCTTGACATCCCGATGCCCGCTCTAGAGATAGAGTTTTACTTCGGTACATCGGTGACAGGTGGTGCATGGTTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTATTGTTAGTTGCCATCATTAAGTTGGGCACTCTAGCGAGACTGCCGGTAATAAACCGGAGGAAGGTGGGGATGACGTCAAATCATCATGCCCCTTATGACCTGGGCTACACACGTGCTACAATGGTTGGTACAACGAGTCGCAAGCCGGTGACGGCAAGCTAATCTCTTAAAGCCAATCTCAGTTCGGATTGTAGGCTGCAACTCGCCTACATGAAGTCGGAATCGCTAGTAATCGCGGATCAGCACGCCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCACGAGAGTTTGTAACACCCGAAGTCGGTGAGGTAACCTATTAGGAGCCAGCCGCCTAAGGTGGGATAGATGATTGGGGTGAAGTCGTAACAAGGTAGCCGTATCGGAAGGTGCGGCTG
    >HM248444.1.1359 D_0__Bacteria;D_1__Proteobacteria;D_2__Gammaproteobacteria;D_3__Enterobacteriales;D_4__Enterobacteriaceae;D_5__Klebsiella;D_6__uncultured organism
    ATTGAACGCTGGCGGCAGGCTTAACACATGCAAGTCGAGCGGGCGAGGTTGCTTCGGTAACTGAGCTAGCGGCGGACGGGTGAGTAATGCTTAGGAATCTGCCTATTAGTGGGGGACAACATTCCGAAAGGAATGCTAATACCGCATACGCCCTACGGGGGAAAGCAGGGGATCTTCGGACCTTGCGCTAATAGATGAGCCTAAGTCAGATTAGCTAGTTGGTGGGGTAAAGGCCTACCAAGGCGACGATCTGTAGCGGGTCTGAGAGGATGATCCGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGGACAATGGGCGGAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGCCTTTTGGTTGTAAAGCACTTTAAGCGAGGAGGAGGCTACTTGGATTAATACTCTAGGATAGTGGACGTTACTCGCAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCGAGCGTTAATCGGATTTACTGGGCGTAAAGCGTGCGTAGGCGGCTTCTTAAGTCGGATGTGAAATCCCTGAGCTTAACTTAGGAATTGCATTCGATACTGGGAAGCTAGAGTATGGGAGAGGATGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGATGGCGAAGGCAGCCATCTGGCCTAATACTGACGCTGAGGTACGAAAGCATGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCATGCCGTAAACGATGTCTACTAGCCGTTGGGGCCTTTGAGGCTTTAGTGGCGCAGCTAACGCGATAAGTAGACCGCCTGGGGAGTACGGTCGCAAGACTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATAGTAAGAACTTTCCAGAGATGGATTGGTGCCTTCGGGAACTTACATACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTTTCCTTATTTGCCAGCGGGTTAAGCCGGGAACTTTAAGGATACTGCCAGTGACAAACTGGAGGAAGGCGGGGACGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGTCGGTACAAAGGGTTGCTACCTAGCGATAGGATGCTAATCTCAAAAAGCCGATCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGCGGATCAGAATGCCGCGGTGAATACGTTCCCGGGCCT
    >AB650512.1.1469 D_0__Bacteria;D_1__Chloroflexi;D_2__Ktedonobacteria;D_3__Ktedonobacterales;D_4__HSB OF53-F07;D_5__uncultured bacterium;D_6__uncultured bacterium
    ATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGCAGCACAGAAAGAGCTTGCTCTTTGGGTGGCGAGTGGCGAACGGGTGAGTAATGTCTGGGAAACTGCCCAATGGAGGGGGATAACCATTGGAAACGATGGCTAATACCGCATAATGTCGATAAGACCAAAGTGGGGGACCTATTTGGCCTCATACCATTGGATGTGCCCAGATGGGATTAGCTAGTAGGTAGGGTAATGGCTTACCTAGGCAACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGGATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCTTTCGAGTTGTAAAGTACTTTCAGTAGGAAGGAAGGCAGTAAACCTAATATGTTTATTGATTGACATTACCTGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGAGCACGTAGGCGGCCTATTAAGTCAGATGTGAAATCCCTGGGCTTAACCTAGGAACTGCATTTGAAACTGGCAGGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGTGTAGATATCTGGAGGAATACCAGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCTGTAAACGATGTCGACTTGGAAGTTGTATCCTTTGAGATGTGGCTTCCGAAGCTAACGCATTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACACGAAGAACCTTACCTGGCCTTGACATCCAGAGAACATTCTAGAAATAGAATAGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGTGGTTCGGCCAGGAACTCAAAGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGGCCAGGGCTACACACGTGCTACAATGGCGTATACAAAGAGAAGCAACCTCGTAAGAGCAAGCGGACCTCATAAAGTATGTCGTAGTTCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTAGATCAGAATGCTACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCCTCGTGGAGAGCGCTTACCACTTTGTGATTCATGACTGGGGTG

    In the new FASTA header, a space separates the OTU ID from the taxonomic information.
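
    If installing BioPerl is a hurdle, the same merge can be sketched with core Perl only. This is a rough sketch, not a tested replacement for the script above: the helper names load_annotations and relabel_header are made up for illustration, and it assumes each annotation line is an ID, then whitespace, then the taxonomy.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read "ID<whitespace>taxonomy" lines into a hash.
sub load_annotations {
    my ($path) = @_;
    my %tax;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;    # tolerate Windows line endings
        next unless length $line;
        # Split once only, because taxonomy strings may contain spaces.
        my ( $id, $taxonomy ) = split /\s+/, $line, 2;
        $tax{$id} = $taxonomy if defined $taxonomy;
    }
    close $fh;
    return \%tax;
}

# Hypothetical helper: rebuild one FASTA header line from the lookup table.
sub relabel_header {
    my ( $header, $tax ) = @_;
    my ($id) = $header =~ /^>(\S+)/
        or return $header;       # not a usable header: leave it alone
    return defined $tax->{$id} ? ">$id $tax->{$id}" : ">$id";
}

# Run only when given two arguments: FASTA file, then annotation file.
if ( @ARGV == 2 ) {
    my ( $fasta, $annot ) = @ARGV;
    my $tax = load_annotations($annot);
    open my $in, '<', $fasta or die "Cannot open $fasta: $!";
    while ( my $line = <$in> ) {
        $line =~ s/\r?\n\z//;
        my $out = $line =~ /^>/ ? relabel_header( $line, $tax ) : $line;
        print "$out\n";
    }
    close $in;
}
```

    Called as perl merge_headers.pl ref.fa taxo.txt > ref_tax.fa (the script name is hypothetical), it prints the relabelled FASTA and leaves the sequence lines as they are.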

    Good luck

    Genevieve answered 2 months ago



  • Thanks, but I’m using Windows, so I’m not sure I understand how to do this. But I understand that the headers of the FASTA sequences in SILVA_132_LSUParc_tax_silva.fasta are missing some information. Earlier I ran a search with another reference database (UNITE), and its sequence IDs contain a tax=… field. Here is an example:

    >LC146734|SH497095.07FU;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Agaricales,f:Tricholomataceae,g:Flagelloscypha,s:Flagelloscypha_japonica_SH497095.07FU;

    Is there a way to add these missing letters to the headers of the sequences from the SILVA database?
    >AY187551.1.496 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Xanthobacteraceae;Bradyrhizobium;Bradyrhizobium sp. Ih3-2

    jflucier Staff answered 2 months ago




  • #!/usr/bin/perl -w
    =usage

    reformat_fasta_headers.pl -f fasta_file -a annotation file (2 columns tab delimited: find col 1 and replace with col2)

    =cut

    use strict;
    use warnings;
    use Bio::SeqIO;
    use Getopt::Long;

    #set command line arguments
    my ($fasta, $annot) = @ARGV;
    my $version="reformat_fasta_headers.pl\tv0.0.1";
    GetOptions(
    'f|fasta:s'=>\$fasta,
    'a|annot:s'=>\$annot,
    'v|version'=>sub{print $version."\n"; exit;},
    );

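    # Load the annotation file into a hash (sequence ID => taxonomy).
    # Split on the first run of whitespace only: taxonomy strings may contain spaces.
    open my $fh, '<', $annot or die "Cannot open $annot: $!";
    my %annot;
    while ( my $line = <$fh> ) {
    chomp $line;
    my ($id, $tax) = split /\s+/, $line, 2;
    $annot{$id} = $tax if defined $tax;
    }
    close $fh;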

    my $in = Bio::SeqIO->new( -file => $fasta, -format => 'Fasta' );

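    while ( my $seq = $in->next_seq() ) {
    # Append " tax=" plus the taxonomy when the ID is annotated; otherwise keep the bare ID.
    my $id = $seq->id;
    my $seqID = defined $annot{$id} ? "$id tax=$annot{$id}" : $id;
    print ">$seqID\n" . $seq->seq . "\n";
    }

    Based on what I read in the documentation, I think it is a space that is needed rather than tax=… but I am not 100% sure.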

    Genevieve answered 2 months ago



  • Thanks for helping me, but I’m working with Windows (I don’t understand Linux). The tool that I used to change the headers of my own Illumina sequences was SeqKit, and it was someone from your staff who gave me the script to do it ( FOR %f IN (*fasta) DO seqkit.exe replace -p (.*) -r %f %f > renamed_%f). We succeeded in changing the headers to the ID of my sample and obtained >301.fasta.1 instead of >M03992:221:000000000-BGWB9:1:2107:24792:2786. Do you think that with SeqKit we can add ;tax= after the accession number and before Bacteria for all the sequences in the FASTA file?

    jflucier Staff answered 2 months ago



  • This is Perl code. You can install Perl on your Windows system and execute the code above. I am not sure that SeqKit can do this: your command above works with information in one file (a single FASTA), not two files (the FASTA and the taxonomy file).
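
    When the taxonomy is already in the reference headers after the first space, as in the >AY187551.1.496 example quoted in the question, one file may be enough. A minimal sketch, assuming that layout: the hypothetical helper add_tax_tag replaces the first space with ;tax= and turns the remaining spaces into underscores so the label contains no whitespace. It does not reproduce the d:,p:,c: rank prefixes of the UNITE example, and whether vsearch then shows the taxonomy in the OTU table has not been checked here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn ">ID taxonomy" into ">ID;tax=taxonomy".
# Remaining whitespace becomes "_" so the whole label is a single token.
sub add_tax_tag {
    my ($header) = @_;
    my ( $id, $taxonomy ) = $header =~ /^>(\S+)\s+(.*\S)/
        or return $header;    # no description after the ID: leave unchanged
    $taxonomy =~ s/\s+/_/g;
    return ">$id;tax=$taxonomy";
}

# Filter mode: rewrite header lines, pass sequence lines through untouched.
if (@ARGV) {
    while ( my $line = <> ) {
        $line =~ s/\r?\n\z//;    # tolerate Windows line endings
        $line = add_tax_tag($line) if $line =~ /^>/;
        print "$line\n";
    }
}
```

    From a Windows command prompt this could be run as perl add_tax_tag.pl SILVA_132_LSUParc_tax_silva.fasta > SILVA_tax.fasta (both the script name and the output name are placeholders).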