I use Haploview to create Heat Maps and Linked-Gene-Block tables. I download the sample PED and INFO files from UCSC. There are often invalid samples in the range I am researching, so I am told I need to use PLINK to find and remove the problem samples, then write a new PED file that Haploview can read.
Does someone already have this code written? If not, who could I ask for help writing the code, and then running it on the correct CCDB server(s)?
I do NOT know UNIX, so the PLINK help pages don’t make any sense to me. I took a genomics workshop from WestGrid so did get a brief intro to PLINK. I do a bit of coding in R, and that’s it.
I need to make slight correction to the above: I download the actual files from Ensembl; I use UCSC to find the gene location details.
I am not familiar with Haploview. One would think that it would be able to extract the information necessary and ignore the extra samples.
Are you able to copy the error message you have?
I don’t have access to Haploview right now, I can look up exact error code later. The error code basically says “Error at location ###”, where ### varies with every different download I open. As far as I can tell from trial-and-error, the ### refers to a specific sample that has wrong data, whatever may be wrong. Probably too many bases at a SNP site: instead of just “T”, it might have “T/G” perhaps, which Haploview won’t accept. My work-around has been to download the SNPs/gene region on either side of the bad SNP sample, and those downloads work fine. So I can pinpoint where the bad SNP/Sample is, and most any text editor can remove the bad sample, but I can’t write a new PED file with regular software.
Once the error message is clicked on, it just goes back to the last file opened, and it won’t do anything at all with the bad file.
Here are exact quotes of two error messages:
“More than two alleles at marker 339”
“File input error on line 22, marker 1193. For any marker, an individual’s genotype must be only letters or only numbers.”
The Haploview documentation at https://www.broadinstitute.org/haploview/input-file-formats states:
Please note that Haploview can only interpret biallelic markers with greater than two alleles (e.g. microsatellites) will not work correctly
Haploview is here complaining that there is “More than two alleles at marker 339”. You need to remove the problematic snps.
This can be done by Plink version 1.9 using the –biallelic-only option. Please not that Plink 1.9 is not currently installed on CC server and you need to install in locally. You can download it here: https://www.cog-genomics.org/plink2
The command line to run should look like:
plink --noweb --file myfile --recode --biallelic-only --out myfile_filtered
Let me know if you still need help.
Yes, the PLINK pages can be a bit daunting, especially if you’re not particularly familiar with UNIX. With that said, you will need to use the command line in order to run your file cleanup.
So let’s start with this exceprt from the PLINK documentation:
To remove certain individuals from a file:
plink --file data --remove mylist.txt
where the file
mylist.txtis just a list of Family ID / Individual ID pairs, one set per line, i.e. one person per line. And
datarefers to 2 files, the .ped and .map.
One must combine this option with the desired analytic (e.g. –assoc), summary statistic (e.g. –freq) or data-generation (e.g. –make-bed) option.
Does this make sense? Perhaps you can give it a try and let me know if you get it to work.
Okay, that’s a good place for me to start. I will try this in a few days with a friend who has more experience with UNIX than I do, and will report back what we’ve managed.
I was genuinely hoping someone had done this before so I wouldn’t have to work entirely from scratch.
ok great, here’s the link to the plink page i referenced as well in case you need to come back to it…
Since you are not familiar with the command line, let me guide you a bit more thoroughly:
1. create a file called mylist.txt with the family and individual ID such as:
(replace by a tabulation)
2. connect to a CC server and load PLINK:
sshfs <username>@guillimin-p2.hpc.mcgill.ca (for instance) module avail (to find the plink module) module load mugqic_dev/plink/1.07 (to load the module)
3. run plink
plink --noweb --file myfile --remove mylist.txt --recode --out newfile
myfile is the basename for myfile.ped and myfile.fam
– -remove can be replaced by –keep to only keep the individuals listed in mylist.txt
– -noweb to stop plink from tying to access the web
– -recode to produce a filtered ped file
Let me know if you still need help.
I will work with that code and post here if I get stuck. This won’t be until next week, but I’m looking forward to trying these coding tips.
So far I have successfully logged into two different servers, and one has PLINK installed. Next week I will actually try manipulating files.
Great! Good luck with the file manipulation, and keep us posted.
So my friend isn’t familiar enough with PLINK to run those commands right now. We found a fix using UNIX commands only, like the sample below. Yes, this code is clunky compared to the lovely PLINK code above. And I want to find the time to learn how to use PLINK, it just won’t be this moment. Probably a project over the holiday break.
# HaploView said there’s a problem with Marker 189; this file had 253 Markers/SNPs.
# The PED file first 6 columns have non-marker info, so column numbers are adjusted by 6.
# These lines remove that one problem marker/column from the PED file.
rev Chr15_full-size.ped | cut -f66- | rev > Chr15_before-mrkr-189.ped
cut -f196- Chr15_full-size.ped > Chr15_after-mrkr-189.ped
paste Chr15_before-mrkr-189.ped Chr15_after-mrkr-189.ped > Chr15_removed-189.ped
# These lines remove the same problem marker/row from the INFO file.
head -188 Chr15_full-size.info > Chr15_removed-189_start.info
tail -64 Chr15_full-size.info > Chr15_removed-189_end.info
cat Chr15_removed-189_start.info Chr15_removed-189_end.info > Chr15_removed-189.info
And voilá! We got files that fixed that problem marker, and then HaploView gave us a new problem marker. So we fixed the second marker the same way, and got a fabulous ~50,000 bp region that HaploView turned into pretty haplotype blocks.
Yup, I really want to learn PLINK. Thanks again,
Based on the above UNIX code, i’d say yes, PLINK might be worth learning 🙂
But hey, sometimes whatever works, works right?
Anyway, glad to hear you’ve got it worked out for the time being.
And just let us know if you need any help in the future.