Correlation between Raw and refined RNA-Seq data from TCGA breast cancer cohort
Helimeli asked 2 years ago

I used to work with RNA-Seq data from TCGA downloaded from cBioPortal. They were very well correlated with the microarray data that we have on our infrastructure.
But I wanted to have a look at the isoform expression for a certain gene (GJA1), so I wanted to look directly at the data from TCGA. We have the Level 3 RNASeqV2 data, from 09 Dec 2014.
I first used the normalized RNA-Seq data to check whether it was consistent with our numbers for RNA-Seq. So I used the normalized results (example of a file). These files have two columns, one with the gene ID, the other with “normalized_count”. It seemed easy and I expected a nice r = 1 correlation between the two. But when I compared the numbers we have with the ones in these files by plotting them, I found a very strange pattern:

What could be the problem? Why are there three levels of data? Any idea? These are normalized data! Do I have to process them in a certain way?
The color indicates the breast cancer subtype. I can see that the blue ones on the top line are tumors that express higher levels of the gene. Don’t pay too much attention to that.
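
For reference, the comparison described above can be sketched roughly as follows in Python. This is only an illustration under assumptions: the files are taken to be tab-separated with a header line `gene_id` / `normalized_count`, and the gene IDs, sample names, and values are made up (they are not real TCGA data). The correlation is computed on a log2 scale with a pseudocount of 1, which is a common choice but not necessarily what was used for the plot.

```python
import csv
import io
import math


def gene_value(handle, gene_id):
    """Return normalized_count for one gene from a two-column, tab-separated table."""
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["gene_id"] == gene_id:
            return float(row["normalized_count"])
    raise KeyError(gene_id)


def pearson(x, y):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy stand-ins for one normalized-results file per sample (hypothetical content).
tcga_files = {
    "sample_1": "gene_id\tnormalized_count\nGENE_A|1\t7\nGENE_B|2\t100\n",
    "sample_2": "gene_id\tnormalized_count\nGENE_A|1\t31\nGENE_B|2\t90\n",
    "sample_3": "gene_id\tnormalized_count\nGENE_A|1\t127\nGENE_B|2\t80\n",
}
# Values for the same gene and samples from the other source (hypothetical).
other_source = {"sample_1": 15.0, "sample_2": 63.0, "sample_3": 255.0}

samples = sorted(tcga_files)
tcga = [gene_value(io.StringIO(tcga_files[s]), "GENE_A|1") for s in samples]
other = [other_source[s] for s in samples]

# Compare on a log2 scale with a pseudocount of 1 to tame large counts.
r = pearson([math.log2(v + 1) for v in tcga], [math.log2(v + 1) for v in other])
print(f"Pearson r on log2 scale: {r:.3f}")
```

In real use, each `io.StringIO(...)` would be replaced by an open file handle, and the samples would have to be matched by barcode between the two sources.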

1 Answer
jhgalvez Staff answered 2 years ago

Hello Mel, 
The fact that the data you downloaded from the TCGA has been “normalized” does not mean that it can be compared 1:1 with the RNA-Seq data from your facility. It really depends on the type of normalization, and my suggestion would be one of the following:
a) Re-process your data according to the procedure used by the TCGA. From what I can see in the title you sent, it seems that the data you selected was processed using an RSEM pipeline. Make sure to follow the procedure as closely as possible (try to find any specific parameters used).
b) Download the *raw* data from the cBioPortal (or from the Genomic Data Commons portal) and run your RNA-Seq pipeline with all the data at once.
Of those options, I think the second one is better, because that way you have full control over how the samples were normalized and counted, which will make comparisons between them simpler.
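
To illustrate why two “normalized” datasets need not line up 1:1, here is a minimal Python sketch that applies two different scalings to the same toy raw counts. The gene names and counts are hypothetical. The upper-quartile scaling to a target of 1000 is meant to resemble what the TCGA normalized counts are generally described as using, but the details here (for example, ignoring zero counts and the exact quantile method) are my assumptions, so please check them against the TCGA documentation.

```python
import statistics


def upper_quartile_scale(counts, target=1000.0):
    """Scale a sample so the 75th percentile of its nonzero counts equals `target`."""
    nonzero = sorted(c for c in counts.values() if c > 0)
    q75 = statistics.quantiles(nonzero, n=4, method="inclusive")[2]
    return {gene: c * target / q75 for gene, c in counts.items()}


def counts_per_million(counts):
    """Scale a sample so its counts sum to one million."""
    total = sum(counts.values())
    return {gene: c * 1e6 / total for gene, c in counts.items()}


# Toy raw counts (hypothetical genes and values); only GENE_E differs.
sample_1 = {"GENE_A": 10, "GENE_B": 20, "GENE_C": 30, "GENE_D": 40, "GENE_E": 50}
sample_2 = {"GENE_A": 10, "GENE_B": 20, "GENE_C": 30, "GENE_D": 40, "GENE_E": 500}

# Same gene, same raw count, two samples, two normalization schemes.
uq = [upper_quartile_scale(s)["GENE_A"] for s in (sample_1, sample_2)]
cpm = [counts_per_million(s)["GENE_A"] for s in (sample_1, sample_2)]

print("upper quartile:", uq)
print("counts per million:", cpm)
```

In this toy case GENE_A gets the same upper-quartile value in both samples, while its counts-per-million values differ fourfold, because one highly expressed gene changes the library total but not the 75th percentile. Per-sample scaling factors like these are one reason values from different pipelines can fall on separate lines instead of a single diagonal.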
I hope this helps!