FORUMCleaning UCSC PED files for Haploview
Silven asked 7 months ago



  •  
    I use Haploview to create Heat Maps and Linked-Gene-Block tables. I download the sample PED and INFO files from UCSC. There are often invalid samples in the range I am researching, so I am told I need to use PLINK to find and remove the problem samples, then write a new PED file that Haploview can read.
    Does someone already have this code written? If not, who could I ask for help writing the code, and then running it on the correct CCDB server(s)?
    I do NOT know UNIX, so the PLINK help pages don’t make any sense to me. I took a genomics workshop from WestGrid so did get a brief intro to PLINK. I do a bit of coding in R, and that’s it.
    Thank you!

    Silven replied 7 months ago

    I need to make slight correction to the above: I download the actual files from Ensembl; I use UCSC to find the gene location details.

    2 Answers
    Eloi Mercier Staff answered 7 months ago



  • Hi Silven,
    I am not familiar with Haploview. One would think that it would be able to extract the information necessary and ignore the extra samples.
    Are you able to copy the error message you have?
     

    Silven replied 7 months ago

    Hello,

    I don’t have access to Haploview right now, I can look up exact error code later. The error code basically says “Error at location ###”, where ### varies with every different download I open. As far as I can tell from trial-and-error, the ### refers to a specific sample that has wrong data, whatever may be wrong. Probably too many bases at a SNP site: instead of just “T”, it might have “T/G” perhaps, which Haploview won’t accept. My work-around has been to download the SNPs/gene region on either side of the bad SNP sample, and those downloads work fine. So I can pinpoint where the bad SNP/Sample is, and most any text editor can remove the bad sample, but I can’t write a new PED file with regular software.

    Once the error message is clicked on, it just goes back to the last file opened, and it won’t do anything at all with the bad file.

    Silven

    Silven replied 7 months ago

    Here are exact quotes of two error messages:
    “More than two alleles at marker 339”
    “File input error on line 22, marker 1193. For any marker, an individual’s genotype must be only letters or only numbers.”

    Eloi Mercier Staff replied 7 months ago

    The Haploview documentation at https://www.broadinstitute.org/haploview/input-file-formats states:

    Please note that Haploview can only interpret biallelic markers with greater than two alleles (e.g. microsatellites) will not work correctly

    Haploview is here complaining that there is “More than two alleles at marker 339”. You need to remove the problematic snps.

    This can be done by Plink version 1.9 using the –biallelic-only option. Please not that Plink 1.9 is not currently installed on CC server and you need to install in locally. You can download it here: https://www.cog-genomics.org/plink2

    The command line to run should look like:

    plink --noweb --file myfile --recode --biallelic-only --out myfile_filtered

    Let me know if you still need help.
    Eloi

    jrosner Staff answered 7 months ago



  • Hi Silven,
    Yes, the PLINK pages can be a bit daunting, especially if you’re not particularly familiar with UNIX.  With that said, you will need to use the command line in order to run your file cleanup.
    So let’s start with this exceprt from the PLINK documentation:

    To remove certain individuals from a file:
    plink --file data --remove mylist.txt
    where the file mylist.txt is just a list of Family ID / Individual ID pairs, one set per line, i.e. one person per line.  And data refers to 2 files, the .ped and .map.
    One must combine this option with the desired analytic (e.g. –assoc), summary statistic (e.g. –freq) or data-generation (e.g. –make-bed) option.

    Does this make sense?  Perhaps you can give it a try and let me know if you get it to work.

    Silven replied 7 months ago

    Hi,

    Okay, that’s a good place for me to start. I will try this in a few days with a friend who has more experience with UNIX than I do, and will report back what we’ve managed.

    I was genuinely hoping someone had done this before so I wouldn’t have to work entirely from scratch.

    Thank you!
    Silven

    jrosner Staff replied 7 months ago

    ok great, here’s the link to the plink page i referenced as well in case you need to come back to it…
    http://zzz.bwh.harvard.edu/plink/dataman.shtml#remove

    good luck!

    Eloi Mercier Staff replied 7 months ago

    Since you are not familiar with the command line, let me guide you a bit more thoroughly:
    1. create a file called mylist.txt with the family and individual ID such as:

    FAM1<tab>IND1
    FAM2<tab>IND2

    (replace by a tabulation)

    2. connect to a CC server and load PLINK:

    sshfs <username>@guillimin-p2.hpc.mcgill.ca (for instance)
    module avail (to find the plink module)
    module load mugqic_dev/plink/1.07 (to load the module)

    3. run plink

     plink --noweb --file myfile --remove mylist.txt --recode --out newfile

    where:
    myfile is the basename for myfile.ped and myfile.fam
    – -remove can be replaced by –keep to only keep the individuals listed in mylist.txt
    – -noweb to stop plink from tying to access the web
    – -recode to produce a filtered ped file

    Let me know if you still need help.
    Eloi

    Silven replied 7 months ago

    Hi,

    I will work with that code and post here if I get stuck. This won’t be until next week, but I’m looking forward to trying these coding tips.

    Thank again,
    Silven

    Silven replied 6 months ago

    So far I have successfully logged into two different servers, and one has PLINK installed. Next week I will actually try manipulating files.

    jrosner Staff replied 6 months ago

    Great! Good luck with the file manipulation, and keep us posted.

    Silven replied 6 months ago

    Hello again,

    So my friend isn’t familiar enough with PLINK to run those commands right now. We found a fix using UNIX commands only, like the sample below. Yes, this code is clunky compared to the lovely PLINK code above. And I want to find the time to learn how to use PLINK, it just won’t be this moment. Probably a project over the holiday break.

    # HaploView said there’s a problem with Marker 189; this file had 253 Markers/SNPs.
    # The PED file first 6 columns have non-marker info, so column numbers are adjusted by 6.
    # These lines remove that one problem marker/column from the PED file.
    rev Chr15_full-size.ped | cut -f66- | rev > Chr15_before-mrkr-189.ped
    cut -f196- Chr15_full-size.ped > Chr15_after-mrkr-189.ped
    paste Chr15_before-mrkr-189.ped Chr15_after-mrkr-189.ped > Chr15_removed-189.ped

    # These lines remove the same problem marker/row from the INFO file.
    head -188 Chr15_full-size.info > Chr15_removed-189_start.info
    tail -64 Chr15_full-size.info > Chr15_removed-189_end.info
    cat Chr15_removed-189_start.info Chr15_removed-189_end.info > Chr15_removed-189.info

    And voilá! We got files that fixed that problem marker, and then HaploView gave us a new problem marker. So we fixed the second marker the same way, and got a fabulous ~50,000 bp region that HaploView turned into pretty haplotype blocks.

    Yup, I really want to learn PLINK. Thanks again,
    Silven

    jrosner Staff replied 6 months ago

    Based on the above UNIX code, i’d say yes, PLINK might be worth learning 🙂
    But hey, sometimes whatever works, works right?
    Anyway, glad to hear you’ve got it worked out for the time being.
    And just let us know if you need any help in the future.

    Cheers,
    Jamie