Correlation between Raw and refined RNA-Seq data from TCGA breast cancer cohort
Helimeli asked 3 months ago



  • Hi,
    I used to work with TCGA RNA-Seq data downloaded from cBioPortal. They were very well correlated with the microarray data that we have on our infrastructure.
    But I wanted to have a look at the isoform expression for a certain gene (GJA1), so I wanted to look directly at the data from TCGA. We have the Level 3 RNASeqV2 data, from 09 Dec 2014.
    I first used the normalized RNA-Seq data to check whether it was consistent with our RNA-Seq numbers. So I used the normalized results (example of a file name: “unc.edu.cdc523d2-da82-4a3d-a97e-9745c8a802d1.2045613.rsem.genes.normalized_results”). These files have two columns, one with the gene ID, the other with “normalized_count”. It seemed easy and I expected a nice r = 1 correlation between the two. But when I plotted the numbers we have against the ones in these files, I found a very strange pattern:
    https://www.facebook.com/photo.php?fbid=10157724962313206&set=a.10150696589063206&type=3&theater

    What could be the problem? Why are there three levels of data? Any ideas? These are normalized data! Do I have to process them in a certain way?
    The color is for the breast cancer subtype. I can see that the blue points on the top line are tumors that express higher levels of the gene. Don’t pay too much attention to that.
    Thanks
    Mel

    1 Answer
    jhgalvez Staff answered 3 months ago



  • Hello Mel, 
    The fact that the data you downloaded from the TCGA has been “normalized” does not mean that it can be compared 1:1 with the RNA-Seq data from your facility. It really depends on the type of normalization, so I would suggest one of two options:
    a) Re-process your data according to the procedure used by the TCGA. From the file name you sent, it seems that the data you selected was processed using an RSEM pipeline [https://deweylab.github.io/RSEM/]. Make sure to follow the procedure as closely as possible (try to find any specific parameters used).
    b) Download the *raw* data from the cBioPortal (or from the Genomic Data Commons portal [https://portal.gdc.cancer.gov/]) and run your RNA-Seq pipeline on all the data at once.
    Of those options, I think the second one is better, because that way you have full control over how the samples are normalized and counted, which will make comparisons between them simpler.
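    Whichever option you choose, a quick sanity check is to compare the two sources on the same scale and on the same set of genes. Here is a minimal sketch of that check, assuming the two-column, tab-separated layout with a header line that you describe for the normalized_results files, and assuming both sources use identical gene IDs. The function names and the log2(x + 1) transform are my own choices for illustration, not part of the TCGA procedure.

```python
import csv
import math


def read_normalized_results(handle):
    """Parse a two-column, tab-separated file (gene ID, normalized_count).

    Assumes the first line is a header, as described for the
    *.rsem.genes.normalized_results files in this thread.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    return {row[0]: float(row[1]) for row in reader if row}


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def log_scale_correlation(tcga, local):
    """Correlate two {gene_id: value} tables on a log2(x + 1) scale.

    Only genes present in both tables are used, so an ID mismatch
    shows up as a small number of shared genes rather than as noise.
    """
    shared = sorted(set(tcga) & set(local))
    x = [math.log2(tcga[g] + 1) for g in shared]
    y = [math.log2(local[g] + 1) for g in shared]
    return pearson(x, y), len(shared)
```

    You would call read_normalized_results on an open file handle for each TCGA sample, build the matching dictionary from your own data, and look at both the correlation and the number of shared genes. If few genes are shared, the IDs do not match and need to be harmonized before any comparison of the values makes sense.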
    I hope this helps! 
    Hector