I used to work with TCGA RNA-Seq data downloaded from cBioPortal. They were very well correlated with the microarray data that we have on our infrastructure.
But I wanted to look at isoform expression for a particular gene (GJA1), so I decided to look directly at the data from TCGA. We have the Level 3 RNASeqV2 data, dated 09 Dec 2014.
I first used the normalized RNA-Seq data to check whether it was consistent with the RNA-Seq numbers we already have. So I used the normalized results files (example file name: "unc.edu.cdc523d2-da82-4a3d-a97e-9745c8a802d1.2045613.rsem.genes.normalized_results"). These files have two columns, one with the gene ID, the other with "normalized_count". It seemed easy, and I expected a nice r = 1 correlation between the two. But when I plotted the numbers we have against the ones in these files, I found a very strange pattern:
What could be the problem? Why are there three levels of data? Any ideas? These are normalized data! Do I have to process them in a certain way?
The color indicates the breast cancer subtype. I can see that the blue points on the top line are tumors that express higher levels of the gene. Don’t pay too much attention to that.
The fact that the data you downloaded from TCGA have been "normalized" does not mean that they can be compared 1:1 with the RNA-Seq data from your facility. It really depends on the type of normalization, and my suggestion would be one of the following:
a) Re-process your data according to the procedure used by TCGA. From what I can see in the file name you sent, it seems that the data you selected were processed using an RSEM pipeline [https://deweylab.github.io/RSEM/]. Make sure to follow the procedure as closely as possible (try to find any specific parameters used).
b) Download the *raw* data from cBioPortal (or from the Genomic Data Commons portal [https://portal.gdc.cancer.gov/]) and run your RNA-Seq pipeline on all the data at once.
Of those options, I think the second one is better, because that way you have full control over how the samples are counted and normalized, which will make comparisons between them simpler.
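Before re-processing anything, it may also be worth checking that both sets of values are on the same scale (for example raw versus log-transformed), since that alone can distort a scatter plot. Below is a minimal Python sketch of such a check, not a definitive implementation. It assumes that the normalized_results file is tab-separated with one header line, and that the gene ID column looks like `SYMBOL|EntrezID`; please verify both against your own files. The function names, sample names, and all numbers are hypothetical toy values, and the log2(x + 1) transform is only one plausible choice.

```python
import csv
import io
import math


def read_gene_count(handle, symbol):
    """Return normalized_count for one gene symbol, or None if it is absent.

    Assumes a tab-separated file with a header line and a first column
    of the form "SYMBOL|EntrezID" (an assumption, check your files).
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    for row in reader:
        if len(row) >= 2 and row[0].split("|")[0] == symbol:
            return float(row[1])
    return None


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy stand-in for one normalized_results file (made-up values).
toy_file = "gene_id\tnormalized_count\nGJA1|2697\t1523.7\nTP53|7157\t880.2\n"
gja1 = read_gene_count(io.StringIO(toy_file), "GJA1")

# Hypothetical per-sample GJA1 values: file-derived versus in-house.
tcga = {"sampleA": gja1, "sampleB": 310.0, "sampleC": 4200.5}
in_house = {"sampleA": 10.4, "sampleB": 8.1, "sampleC": 12.2}

# Compare only samples present in both sets, on a log2(x + 1) scale
# for the file-derived values (assuming the in-house ones are already logged).
shared = sorted(set(tcga) & set(in_house))
xs = [math.log2(tcga[s] + 1) for s in shared]
ys = [in_house[s] for s in shared]
print("r on shared samples:", pearson(xs, ys))
```

If the correlation only looks reasonable after a transform like this, the two datasets are probably on different scales rather than in genuine disagreement.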
I hope this helps!