  • About importing gene expression data

    Hello Statalist,
    I am trying to use a data repository called GTEX (the Genotype-Tissue Expression portal). It is a repository supported by American taxpayers through the National Institutes of Health. Gene expression data managed by GTEX are freely downloadable here: https://www.gtexportal.org/home/down...-gtex/overview

    The GTEX repository is built from donors who donate their bodies to science upon death. As of now, it includes 948 adult donors (636 males and 312 females). From each donor, a panel of anatomical tissues is collected. The RNA is then extracted from each tissue and sequenced, and the gene expression results are provided as Transcripts Per Million (TPM). A total of 52 different tissue types can be collected, but not every donor provides all 52 tissues (for example, male donors do not have uterus tissue). There are 56,200 human RNA transcripts in total. These transcripts are the observations (rows) in all GTEX files, and they are always the same. The variables (columns) of the GTEX files are the combinations of a given tissue from a given donor, so the dataset has a WIDE layout. The latest version of GTEX has 17,382 samples (that is, tissue/donor combinations), so the number of variables is enormous: 17,382 variables, plus one variable holding the gene code (which is unique to each of the 56,200 transcripts) and another holding the gene acronym (which is not unique in the dataset).

    The GTEX files are provided as .gct files (Gene Cluster Text), which are basically text files with 3 extra rows at the top: the first row is a single cell containing #1.2 (the ID for a GCT dataset), the second row has two cells giving the number of data rows and columns (equivalent to Stata's "describe, short" command), and the third row holds the variable names. The version of Stata I have (Stata 18 SE) can import the big GTEX file (called "GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_tpm.gct", the second one in the list on the GTEX download page, 1.5 GB in size) without problems (it just takes about 4 minutes).
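
    The header layout described above can be checked before committing to a full import. The following is only a minimal sketch: the file name myfile.gct is a placeholder, and it reads just the first two header rows, leaving the long third row of sample names alone.

    ```stata
    * Peek at the first two GCT header rows without importing the data.
    * "myfile.gct" is a placeholder; point it at the downloaded file.
    tempname fh
    file open `fh' using "myfile.gct", read text
    file read `fh' line              // row 1: format tag, e.g. #1.2
    display `"`line'"'
    file read `fh' line              // row 2: number of data rows and columns
    display `"`line'"'
    file close `fh'
    ```

    If the second row does not show the expected 56,200 rows, the download is probably incomplete or is a different file.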

    The problem I have originates from the way GTEX names the TPM variables. In particular, GTEX uses a dash to separate the segments of the variable name. For example, GTEX-P44H-2226-SM-E9U4P is the name of one variable. Each of those 17,382 variable names has 5 segments (as in this example) or 6 segments, all separated by dashes. The first segment is always GTEX. The second segment is the donor ID (in this example, P44H). Unfortunately, the donor ID can be 4 (as shown here) or 5 alphanumeric characters long, so it is not uniform. The remaining 3 (or 4) segments indicate the tissue type. For example, 2226-SM-E9U4P is the urinary bladder from donor P44H.
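
    Because the donor ID is always the second dash-separated segment, it can be picked out by position rather than by character count, which sidesteps the 4-versus-5 character issue. A minimal sketch on the example name (the local macro names are illustrative only):

    ```stata
    * Split a GTEX sample name on dashes and take the second segment.
    local sample "GTEX-P44H-2226-SM-E9U4P"
    local parts = subinstr("`sample'", "-", " ", .)   // dashes -> spaces
    local donor : word 2 of `parts'                   // donor ID
    local nseg  : word count `parts'                  // 5 or 6 segments
    display "donor ID: `donor' (`nseg' segments)"
    ```

    For the example name this displays donor ID: P44H (5 segments).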

    If I import the GTEX file with "import delimited", the file name, and the option "varnames(3)", Stata imports the file correctly; that is, all 17,382 variables are read as numbers (they contain the TPM found for that gene in that tissue/donor combination). However, the dashes disappear from the variable names because Stata does not allow them in variable names. So I cannot extract the donor ID from the variable name. This extraction is needed because later on I want to match the donor ID to a GTEX dictionary file that has sex, age, and circumstances of death for each donor.

    If instead I import the GTEX file with "import delimited" and the file name, without any options, Stata imports it, but all 17,382 variables are now strings. I can then change all dashes to underscores with a loop, drop the first two rows, use Nick Cox's "renvars" command to turn the first row into the variable names, and finally destring the 17,382 variables with a loop. Besides the time this takes (the loop to destring all variables took several hours to complete), the problem is that some of the TPM variables remain strings. Their values also contained a dash, which got changed to an underscore, so they are no longer recognized as scientific notation. (GTEX uses scientific notation, with a - followed by the exponent for very low expression values and a + followed by the exponent for very high ones.)
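
    One way to keep the scientific notation intact in this all-strings approach is to change dashes only in the row that holds the sample names, never in the data rows. This is a sketch under an assumption: that the sample names sit in observation 3 after the plain import (the local hdr is hypothetical and should be adjusted to wherever the header row actually landed).

    ```stata
    * Assumption: all variables were imported as strings and the GCT header
    * (sample names) is in observation `hdr'. Adjust hdr if it is elsewhere.
    local hdr 3
    quietly ds, has(type string)
    foreach v in `r(varlist)' {
        * Touch the header row only; data rows keep their dashes.
        quietly replace `v' = subinstr(`v', "-", "_", .) in `hdr'
    }
    * Why this matters: a dash inside a value is part of the number.
    display real("1.5e-05")    // converts to a number
    display real("1.5e_05")    // missing (.): no longer numeric
    ```

    With the data rows untouched, values such as 1.5e-05 should still convert when the variables are destrung.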

    So, my question is: is there a way to import the GTEX text file indicated above so that the variable names keep their 5 or 6 segments distinct, and the TPM results are read as numbers?

    Many thanks for your help.

    Patrizio Caturegli
    (Johns Hopkins Hospital)

  • #2
    People are unlikely to go and download the data from its source. What you could do is provide a sample of how the data looks, which can be copied and pasted into a text file and imported locally.

    • #3
      Thank you, Andrew. Here is one of the 54 individual GTEX tissue files, the one for the Fallopian tube. Only 9 women contributed to it. I copied all variables (9 TPM variables, plus the two gene variables, plus one row ID variable) and the first 11 data rows (the file has 56,203 rows in total, including the three header rows).

      #1.3
      56200 11 0 0
      id Name Description GTEX-OHPK-2326-SM-3MJH2 GTEX-PLZ4-2326-SM-EYYV5 GTEX-S32W-1326-SM-4AD5Q GTEX-S341-0826-SM-4AD73 GTEX-SE5C-0926-SM-4BRUF GTEX-T5JW-0326-SM-4DM6J GTEX-T6MO-1026-SM-4DM72 GTEX-TSE9-2326-SM-EZ6ME GTEX-U3ZN-1126-SM-4DXUL
      0 ENSG00000223972.5 DDX11L1 0.0000 0.0000 0.0000 0.0000 0.0000 0.0305 0.0000 0.0416 0.0332
      1 ENSG00000227232.5 WASH7P 4.0780 6.1340 8.5180 3.6760 4.5450 4.7680 13.0500 7.1510 6.6110
      2 ENSG00000278267.1 MIR6859-1 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
      3 ENSG00000243485.5 MIR1302-2HG 0.0000 0.0619 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
      4 ENSG00000237613.2 FAM138A 0.0000 0.0000 0.0000 0.0000 0.0521 0.0000 0.0497 0.0000 0.0471
      5 ENSG00000268020.3 OR4G4P 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0428 0.0000
      6 ENSG00000240361.1 OR4G11P 0.0360 0.0570 0.0000 0.0547 0.1352 0.0000 0.0000 0.0000 0.1221
      7 ENSG00000186092.4 OR4F5 0.0369 0.0000 0.0700 0.0000 0.0000 0.0000 0.0000 0.0000 0.0625
      8 ENSG00000238009.6 RP11-34P13.7 0.0000 0.0000 0.0000 0.7244 0.0926 0.0000 0.0000 0.0000 0.0558
      9 ENSG00000233750.3 CICP27 0.0098 0.0000 0.0371 0.2226 0.1468 0.0305 0.0350 0.0831 0.0166
      10 ENSG00000268903.1 RP11-34P13.15 3.8110 6.9560 10.5600 18.7100 11.4400 4.8220 8.3450 11.1000 7.1440

      • #4
        Thanks. Copying the data to a text file "as is" and specifying the following

        Code:
        import delimited "myfile.txt", rowrange(3) delimiter(" ") clear varnames(3)
        results in all variables after "Description" being imported properly as numbers, as you state. Note that the original variable names are maintained as variable labels.


        Res.:

        Code:
        . desc
        
        Contains data
         Observations:            11                  
            Variables:            12                  
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        Variable      Storage   Display    Value
            name         type    format    label      Variable label
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        id              byte    %8.0g                
        name            str17   %17s                  Name
        description     str13   %13s                  Description
        gtexohpk2326s~2 float   %9.0g                 GTEX-OHPK-2326-SM-3MJH2
        gtexplz42326s~5 float   %9.0g                 GTEX-PLZ4-2326-SM-EYYV5
        gtexs32w1326~5q float   %9.0g                 GTEX-S32W-1326-SM-4AD5Q
        gtexs3410826~73 float   %9.0g                 GTEX-S341-0826-SM-4AD73
        gtexse5c0926s~f float   %9.0g                 GTEX-SE5C-0926-SM-4BRUF
        gtext5jw0326~6j float   %9.0g                 GTEX-T5JW-0326-SM-4DM6J
        gtext6mo1026~72 float   %9.0g                 GTEX-T6MO-1026-SM-4DM72
        gtextse92326s~e float   %9.0g                 GTEX-TSE9-2326-SM-EZ6ME
        gtexu3zn1126s~l float   %9.0g                 GTEX-U3ZN-1126-SM-4DXUL
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        Sorted by:
             Note: Dataset has changed since last saved.
        You can rename the variables as follows, but beware of Stata's 32-character limit on variable names.

        Code:
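        * List every variable except the three identifier columns.
        qui ds id name description, not
        foreach var in `r(varlist)'{
            * The label holds the original GTEX name; strtoname() turns dashes into underscores.
            rename `var' `=strtoname("`:var lab `var''")'
        }
        describe, fullnames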
        Res.:

        Code:
        . describe, fullnames
        
        Contains data
         Observations:            11                  
            Variables:            12                  
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        Variable      Storage   Display    Value
            name         type    format    label      Variable label
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        id              byte    %8.0g                
        name            str17   %17s                  Name
        description     str13   %13s                  Description
        GTEX_OHPK_2326_SM_3MJH2
                        float   %9.0g                 GTEX-OHPK-2326-SM-3MJH2
        GTEX_PLZ4_2326_SM_EYYV5
                        float   %9.0g                 GTEX-PLZ4-2326-SM-EYYV5
        GTEX_S32W_1326_SM_4AD5Q
                        float   %9.0g                 GTEX-S32W-1326-SM-4AD5Q
        GTEX_S341_0826_SM_4AD73
                        float   %9.0g                 GTEX-S341-0826-SM-4AD73
        GTEX_SE5C_0926_SM_4BRUF
                        float   %9.0g                 GTEX-SE5C-0926-SM-4BRUF
        GTEX_T5JW_0326_SM_4DM6J
                        float   %9.0g                 GTEX-T5JW-0326-SM-4DM6J
        GTEX_T6MO_1026_SM_4DM72
                        float   %9.0g                 GTEX-T6MO-1026-SM-4DM72
        GTEX_TSE9_2326_SM_EZ6ME
                        float   %9.0g                 GTEX-TSE9-2326-SM-EZ6ME
        GTEX_U3ZN_1126_SM_4DXUL
                        float   %9.0g                 GTEX-U3ZN-1126-SM-4DXUL
        --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
        Sorted by:
             Note: Dataset has changed since last saved.
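
        Since the original dashed names survive as variable labels, the donor ID can also be collected into a small lookup dataset, one row per sample, for a later match against the donor dictionary. This is a sketch only: it assumes the data were imported with varnames(3) as above, and the variable donor and the tempfile samples are illustrative names. The same dataset makes it easy to count original names longer than 32 characters.

        ```stata
        * Assumption: each sample variable carries the original GTEX name
        * (with dashes) as its variable label, as in the output above.
        preserve
        describe, replace clear          // one row per variable: name, varlab, ...
        drop if inlist(name, "id", "name", "description")
        * Donor ID is the second dash-separated segment of the original name.
        generate donor = word(subinstr(varlab, "-", " ", .), 2)
        * Original names too long to survive as Stata variable names.
        count if strlen(varlab) > 32
        list name varlab donor in 1/5
        tempfile samples
        save `samples'
        restore
        ```

        The saved file links each Stata variable name to its donor, so the wide expression data never has to be reshaped just to look up donor characteristics.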

        • #5
          Thank you very much, Andrew!! Your loop worked great. Even with the large dataset of 17K variables or so, it changed all dashes to underscores in a matter of seconds.
          Best regards.
