Hello Statalist,
I am trying to use a data repository called GTEX (Genotype-Tissue Expression portal). It is a repository supported by American taxpayers through the National Institutes of Health. Gene expression data managed by GTEX are freely downloadable here: https://www.gtexportal.org/home/down...-gtex/overview
The GTEX repository is composed of donors who donate their bodies to science upon death. As of now, it includes 948 adult donors (636 males and 312 females). From each donor, a panel of anatomical tissues is collected. The RNA is then extracted from each tissue and sequenced, and the gene expression results are provided as Transcripts Per Million (TPM). A total of 52 different tissue types can be collected, but no donor provides all 52 (for example, male donors do not have uterus tissue).
There are a total of 56,200 human RNA transcripts. These transcripts are the observations (rows) in all GTEX files, and they are always the same. The variables (columns) of the GTEX files are the combinations of a given tissue from a given donor, so the dataset has a WIDE layout. The latest version of GTEX has 17,382 samples (that is, tissue/donor combinations), so the number of variables is enormous: 17,382 variables, plus one variable indicating the gene code (which is unique to each of the 56,200 transcripts) and another variable indicating the gene acronym (which is not unique in the dataset).
The GTEX files are provided as .gct files (Gene Cluster Text), which are basically text files with 3 extra rows at the top: the first row is a single cell containing #1.2 (the ID for a GCT dataset), the second row is two cells giving the number of data rows and columns (equivalent to the Stata "describe, short" command), and the third row holds the names of the variables. The version of Stata I have (Stata 18 SE) can import the big GTEX file (called "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct", the second one in the list on the GTEX download page, 1.5 GB in size) without problems (it just takes about 4 minutes).
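To make the file layout concrete, here is a minimal Python sketch (not Stata) that reads the three header rows described above. The function name read_gct_header is hypothetical, the file is assumed to be tab-delimited, and the tiny in-memory example uses made-up gene codes, column labels, and values rather than real GTEX data.

```python
import io


def read_gct_header(handle):
    """Read the three header rows of a GCT file from an open text handle.

    Assumes a tab-delimited layout: row 1 is the format ID, row 2 holds the
    two dimension counts, and row 3 holds the column names.
    """
    version = handle.readline().strip()
    n_rows, n_cols = (int(x) for x in handle.readline().split())
    columns = handle.readline().rstrip("\n").split("\t")
    return version, n_rows, n_cols, columns


# Made-up miniature file in the layout described above (not real GTEX values).
sample = io.StringIO(
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tGTEX-P44H-2226-SM-E9U4P\tGTEX-AAAAA-0001-SM-BBBBB\n"
    "ENSG0001.1\tGENE1\t0.0\t1.5e-05\n"
    "ENSG0002.1\tGENE2\t3.2\t1.2e+03\n"
)

version, n_rows, n_cols, columns = read_gct_header(sample)
print(version, n_rows, n_cols)
print(columns[2:])  # the sample columns, with their dashes intact
```

Because only the first three lines are read, the same function could be pointed at the full 1.5 GB file without loading the data rows.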
The problem I have originates from the way GTEX names the TPM variables. In particular, GTEX uses a dash to separate the segments of each variable name. For example, GTEX-P44H-2226-SM-E9U4P is the name of one variable. Each of the 17,382 variable names is made up of 5 segments (as in this example) or 6 segments, all separated by dashes. The first segment is always GTEX. The second segment is the donor ID (in this example, P44H). Unfortunately, the donor ID can be 4 (as shown here) or 5 alphanumeric characters long, so its length is not uniform. The remaining 3 (or 4) segments indicate the tissue type. For example, 2226-SM-E9U4P is the urinary bladder from donor P44H.
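The naming scheme can be illustrated with a short Python sketch (again, not Stata). The function names split_sample_id and to_underscores are hypothetical, and the 6-segment name used below is made up for illustration. Splitting on the dash recovers the donor ID whether it is 4 or 5 characters long, and replacing dashes with underscores in the name only, never in the data values, is one possible way to keep the segments distinguishable.

```python
def split_sample_id(sample_id):
    """Split a GTEX sample name into (donor ID, tissue/sample code).

    Assumes the structure described above: 5 or 6 dash-separated segments,
    the first always "GTEX" and the second the donor ID.
    """
    segments = sample_id.split("-")
    if segments[0] != "GTEX" or len(segments) not in (5, 6):
        raise ValueError(f"unexpected sample name: {sample_id!r}")
    donor = segments[1]
    tissue = "-".join(segments[2:])
    return donor, tissue


def to_underscores(sample_id):
    """Replace the dashes in a sample name with underscores.

    Intended for the header row only; data values such as 1.5e-05 must be
    left untouched so they still parse as numbers.
    """
    return sample_id.replace("-", "_")


five = "GTEX-P44H-2226-SM-E9U4P"       # example from the text above
six = "GTEX-AAAAA-0001-R1a-SM-BBBBB"   # made-up 6-segment name

print(split_sample_id(five))
print(split_sample_id(six))
print(to_underscores(five))
```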
If I import the GTEX file using "import delimited" with the file name and the option "varnames(3)", Stata imports the file correctly, that is, all 17,382 variables are interpreted as numbers (they contain the TPM found for that gene in that tissue/donor combination). However, the dashes disappear from the variable names because Stata does not allow them in variable names. So, I cannot extract the donor ID from the variable name. This extraction is needed because later on I want to match the donor ID to a GTEX dictionary file that has sex, age, and circumstances of death for each donor.
And if I import the GTEX file using "import delimited" with the file name and no options, Stata imports it, but all 17,382 variables are now strings. I can then change all dashes to underscores with a loop, drop the first two rows, use Nick Cox's "renvars" command to turn the first row into the variable names, and finally destring the 17,382 variables with a loop. The problem with this approach (besides its length: the loop to destring all variables took several hours to complete) is that some of the TPM variables remain strings because their values also contained a dash, which got changed to an underscore and so was no longer recognized as scientific notation (GTEX uses scientific notation, with a - followed by a number for very low gene expression results and a + followed by a number for very high ones).
So, my question is: is there a way to import the GTEX text file indicated above so that the variable names keep their 5 or 6 segments distinguishable and the TPM results are interpreted as numbers?
Many thanks for your help.
Patrizio Caturegli
(Johns Hopkins Hospital)