Blast to multiple genomes
spongemicrobiome asked 4 months ago



  • I am running BLAST on the command line. I’d like to blast one protein against >100 genomes. Normally I’d concatenate the genomes and make them into a database. However, the contig headers inside each genome file are not unique to the sample. For example:
    Genome 1

    >contig1
    >contig2

    Genome 2

    >contig1
    >contig2

    If I concatenate them, I’d have no way to know which sample the query matched.
    I tried: for i in *faa; do blastp -db protein1.faa -query $i -out $i.blasted; done
    But I get one output file for each genome (hundreds of files). Ideally, I’d like one output. Has anyone tried to do this before?

    2 Answers
    zhibin Staff answered 4 months ago



  • How about adding the genome name in front of each contig header?
    sed -i -e 's/^>/>genome1_/' genome1.faa

    spongemicrobiome replied 4 months ago

    I was thinking about that, but I have a directory with >500 genomes. I was wondering which is easier: looping the blast or looping the renaming of the headers? I am stuck on both!

    spongemicrobiome replied 4 months ago

    thanks for sed -i -e 's/^>/>genome1_/' genome1.faa
    Super handy!

    spongemicrobiome replied 4 months ago

    is there a way to loop that through a directory of genomes with different file names?

    jhgalvez Staff replied 4 months ago

    Use wildcards to substitute differences in the names so you can loop through directories with different names: https://ryanstutorials.net/linuxtutorial/wildcards.php

    spongemicrobiome replied 4 months ago

    cool! thank you for your help this is a great forum!

    zhibin Staff replied 4 months ago

    You can try this

    for i in *.faa; do sed -i -e "s/^>/>$(echo "$i" | cut -f 1 -d '.')_/" "$i"; done

    It takes the part of the file name before the first "." as the genome name.

    Please test a couple of files first!
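
Since sed -i edits the files in place, one cautious way to test is to write the prefixed copies into a separate directory and leave the originals alone. A minimal sketch, assuming a throwaway temp directory, two made-up genome files, and a hypothetical output directory called renamed/:

```shell
# Demo setup (hypothetical data): two tiny FASTA files with clashing headers.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>contig1\nMKT\n>contig2\nMLV\n' > genome1.faa
printf '>contig1\nMAA\n>contig2\nMGG\n' > genome2.faa

# Write prefixed copies into renamed/ instead of editing in place.
mkdir -p renamed
for f in *.faa; do
    name=${f%%.*}   # text before the first "." in the file name
    # Assumes the file name contains no "/" or "&", which sed would misread.
    sed "s/^>/>${name}_/" "$f" > "renamed/$f"
done

# Show the new headers so they can be checked by eye.
grep -h '>' renamed/*.faa
```

Once the headers in renamed/ look right, the same loop can be pointed at the full directory.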

    spongemicrobiome replied 4 months ago

    i get:
    bash: /: Is a directory

    zhibin Staff replied 4 months ago

    Make sure the " and ' characters are still plain straight quotes after you copy the code.

    When I copied it, " became “ and ”, and ' became ‘ and ’.
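
If retyping the quotes by hand is error-prone, the pasted text can be cleaned with sed before running it. A small sketch, assuming the command was saved to a temp file (the file and the variable name fixed are only for illustration):

```shell
# A pasted command with typographic quotes, stored in a file for the demo.
tmp=$(mktemp)
printf '%s\n' 'sed -i -e ‘s/^>/>genome1_/’ genome1.faa' > "$tmp"

# Replace curly double and single quotes with their plain ASCII forms.
fixed=$(sed -e 's/“/"/g' -e 's/”/"/g' -e "s/‘/'/g" -e "s/’/'/g" "$tmp")
echo "$fixed"
rm -f "$tmp"
```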

    spongemicrobiome replied 4 months ago

    That’s exactly what it was: the " and ' got changed when I copied it. It works now! Thank you so much for your help! Again, really useful forum!

    jhgalvez Staff answered 4 months ago



  • You can include more than one FASTA file when creating the indices: just add them as arguments separated by spaces (as opposed to manually concatenating them into a single file). Still, if the headers are exactly the same and the file names are exactly the same, you might keep running into issues. In that case, the best practice would be to relabel your headers so that they are unique, removing any ambiguity.
     
    Hope this helps!
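
The multi-file approach could look roughly like this once the headers are unique. This is a sketch under assumptions: the file names, the database name all_genomes, and the output name protein1_vs_all.tsv are made up for the demo, and the BLAST+ calls are skipped if the tools are not on the PATH. As far as I know, makeblastdb expects -out and -title when -in lists several files.

```shell
# Demo setup (hypothetical data): two small protein FASTA files whose
# headers already carry the genome name, plus a one-sequence query.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>genome1_contig1\nMKTAYIAKQRQISFVKSHFSRQ\n' > genome1.faa
printf '>genome2_contig1\nMKTAYIAKQRQISFVKSHFSRQ\n' > genome2.faa
printf '>protein1\nMKTAYIAKQRQISFVKSHFSRQ\n' > protein1.faa

# Space-separated list of genome files, passed to -in as ONE quoted argument.
inputs=$(echo genome*.faa)

if command -v makeblastdb >/dev/null 2>&1 && command -v blastp >/dev/null 2>&1; then
    # Build a single protein database from all genome files.
    makeblastdb -in "$inputs" -dbtype prot -out all_genomes -title all_genomes
    # One search, one tabular output file.
    blastp -query protein1.faa -db all_genomes -outfmt 6 -out protein1_vs_all.tsv
    # Columns 1 and 2 are the query ID and the subject (hit) ID.
    cut -f 1,2 protein1_vs_all.tsv
else
    echo "BLAST+ not found; would index: $inputs"
fi
```

With -outfmt 6, the second column is the subject ID, so the genome prefix in each header shows which sample a hit came from.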