Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Importing variable names in unpleasant Excel format

    I have been sent some data in an Excel spreadsheet that is formatted in an unfriendly way for importing to Stata. It uses merged cells to categorise repeating column names, so that a screenshot is, for once, probably the best way to show sample data:
    Click image for larger version

Name:	Screen Shot 2022-04-02 at 10.06.41 am.png
Views:	5
Size:	71.3 KB
ID:	1657474
    The actual data has many more Groups and 10 variables within each group. My problem is how to keep the grouping and column information when importing each variable into Stata. When I import the data using the options
    Code:
    cellrange(A2) firstrow
    , Group B's variables become "F", "G" etc. so as not to duplicate the variable names from Group A.

    Is there some clever way after the import to generate variable names A_Age, A_Sex, ..., B_Age, B_Sex,...etc? For reference, the .csv of the example data is:
    Code:
    Group A,,,,,Group B,,,,,Group C,,,,
    Age,Sex,Fed,Score,Total,Age,Sex,Fed,Score,Total,Age,Sex,Fed,Score,Total
    10,M,AUS,5,5,5,F,AUS,5,4,8,M,AUS,2,2
    Click image for larger version

Name:	Screen Shot 2022-04-02 at 10.06.41 am.png
Views:	5
Size:	71.3 KB
ID:	1657472
    Attached Files

  • #2
    Here is some example code that may start you in a useful direction. I first import your csv data into Excel and reproduce your spreadhsheet.
    Code:
    import excel data.xlsx, cellrange(A1:O2)
    rename (*) (v#), addnumber
    generate id = _n
    reshape long  v, i(id) j(var)
    reshape wide v, i(var) j(id)
    list, clean
    replace v1 = v1[_n-1] if missing(v1)
    replace v1 = subinstr(v1,"Group ","",.)
    generate oldname = "v"+string(var)
    generate newname = v2 + "_" + v1
    list, clean
    forvalues i = 1/`c(N)' {
        local oldnames `oldnames' `=oldname[`i']'
        local newnames `newnames' `=newname[`i']'
    }
    
    import excel data.xlsx, cellrange(A3) clear
    rename (*) (v#), addnumber
    rename (`oldnames') (`newnames')
    list *_A, clean 
    list *_B, clean
    list *_C, clean
    Code:
    . import excel data.xlsx, cellrange(A1:O2)
    (15 vars, 2 obs)
    
    . rename (*) (v#), addnumber
    
    . generate id = _n
    
    . reshape long  v, i(id) j(var)
    (j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)
    
    Data                               Wide   ->   Long
    -----------------------------------------------------------------------------
    Number of observations                2   ->   30          
    Number of variables                  16   ->   3           
    j variable (15 values)                    ->   var
    xij variables:
                              v1 v2 ... v15   ->   v
    -----------------------------------------------------------------------------
    
    . reshape wide v, i(var) j(id)
    (j = 1 2)
    
    Data                               Long   ->   Wide
    -----------------------------------------------------------------------------
    Number of observations               30   ->   15          
    Number of variables                   3   ->   3           
    j variable (2 values)                id   ->   (dropped)
    xij variables:
                                          v   ->   v1 v2
    -----------------------------------------------------------------------------
    
    . list, clean
    
           var        v1      v2  
      1.     1   Group A     Age  
      2.     2               Sex  
      3.     3               Fed  
      4.     4             Score  
      5.     5             Total  
      6.     6   Group B     Age  
      7.     7               Sex  
      8.     8               Fed  
      9.     9             Score  
     10.    10             Total  
     11.    11   Group C     Age  
     12.    12               Sex  
     13.    13               Fed  
     14.    14             Score  
     15.    15             Total  
    
    . replace v1 = v1[_n-1] if missing(v1)
    (12 real changes made)
    
    . replace v1 = subinstr(v1,"Group ","",.)
    (15 real changes made)
    
    . generate oldname = "v"+string(var)
    
    . generate newname = v2 + "_" + v1
    
    . list, clean
    
           var   v1      v2   oldname   newname  
      1.     1    A     Age        v1     Age_A  
      2.     2    A     Sex        v2     Sex_A  
      3.     3    A     Fed        v3     Fed_A  
      4.     4    A   Score        v4   Score_A  
      5.     5    A   Total        v5   Total_A  
      6.     6    B     Age        v6     Age_B  
      7.     7    B     Sex        v7     Sex_B  
      8.     8    B     Fed        v8     Fed_B  
      9.     9    B   Score        v9   Score_B  
     10.    10    B   Total       v10   Total_B  
     11.    11    C     Age       v11     Age_C  
     12.    12    C     Sex       v12     Sex_C  
     13.    13    C     Fed       v13     Fed_C  
     14.    14    C   Score       v14   Score_C  
     15.    15    C   Total       v15   Total_C  
    
    . forvalues i = 1/`c(N)' {
      2.     local oldnames `oldnames' `=oldname[`i']'
      3.     local newnames `newnames' `=newname[`i']'
      4. }
    
    . 
    . import excel data.xlsx, cellrange(A3) clear
    (15 vars, 1 obs)
    
    . rename (*) (v#), addnumber
    
    . rename (`oldnames') (`newnames')
    
    . list *_A, clean 
    
           Age_A   Sex_A   Fed_A   Score_A   Total_A  
      1.      10       M     AUS         5         5  
    
    . list *_B, clean
    
           Age_B   Sex_B   Fed_B   Score_B   Total_B  
      1.       5       F     AUS         5         4  
    
    . list *_C, clean
    
           Age_C   Sex_C   Fed_C   Score_C   Total_C  
      1.       8       M     AUS         2         2  
    
    .

    Comment

    Working...
    X