Importing variable names in unpleasant Excel format

David Smerdon

Join Date: Jul 2019

Posts: 24
#1

Importing variable names in unpleasant Excel format

01 Apr 2022, 19:14

I have been sent some data in an Excel spreadsheet that is formatted in an unfriendly way for importing to Stata. It uses merged cells to categorise repeating column names, so that a screenshot is, for once, probably the best way to show sample data:

The actual data has many more Groups and 10 variables within each group. My problem is how to keep the grouping and column information when importing each variable into Stata. When I import the data using the options

Code:

cellrange(A2) firstrow

, Group B's variables become "F", "G" etc. so as not to duplicate the variable names from Group A.

Is there some clever way after the import to generate variable names A_Age, A_Sex, ..., B_Age, B_Sex,...etc? For reference, the .csv of the example data is:

Code:

Group A,,,,,Group B,,,,,Group C,,,, Age,Sex,Fed,Score,Total,Age,Sex,Fed,Score,Total,Age,Sex,Fed,Score,Total 10,M,AUS,5,5,5,F,AUS,5,4,8,M,AUS,2,2

Attached Files
Tags: Excel, import

William Lisowski

Join Date: Dec 2014
Posts: 10150

01 Apr 2022, 21:07

Here is some example code that may start you in a useful direction. I first import your csv data into Excel and reproduce your spreadhsheet.

Code:

import excel data.xlsx, cellrange(A1:O2)
rename (*) (v#), addnumber
generate id = _n
reshape long  v, i(id) j(var)
reshape wide v, i(var) j(id)
list, clean
replace v1 = v1[_n-1] if missing(v1)
replace v1 = subinstr(v1,"Group ","",.)
generate oldname = "v"+string(var)
generate newname = v2 + "_" + v1
list, clean
forvalues i = 1/`c(N)' {
    local oldnames `oldnames' `=oldname[`i']'
    local newnames `newnames' `=newname[`i']'
}

import excel data.xlsx, cellrange(A3) clear
rename (*) (v#), addnumber
rename (`oldnames') (`newnames')
list *_A, clean 
list *_B, clean
list *_C, clean

Code:

. import excel data.xlsx, cellrange(A1:O2)
(15 vars, 2 obs)

. rename (*) (v#), addnumber

. generate id = _n

. reshape long  v, i(id) j(var)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations                2   ->   30          
Number of variables                  16   ->   3           
j variable (15 values)                    ->   var
xij variables:
                          v1 v2 ... v15   ->   v
-----------------------------------------------------------------------------

. reshape wide v, i(var) j(id)
(j = 1 2)

Data                               Long   ->   Wide
-----------------------------------------------------------------------------
Number of observations               30   ->   15          
Number of variables                   3   ->   3           
j variable (2 values)                id   ->   (dropped)
xij variables:
                                      v   ->   v1 v2
-----------------------------------------------------------------------------

. list, clean

       var        v1      v2  
  1.     1   Group A     Age  
  2.     2               Sex  
  3.     3               Fed  
  4.     4             Score  
  5.     5             Total  
  6.     6   Group B     Age  
  7.     7               Sex  
  8.     8               Fed  
  9.     9             Score  
 10.    10             Total  
 11.    11   Group C     Age  
 12.    12               Sex  
 13.    13               Fed  
 14.    14             Score  
 15.    15             Total  

. replace v1 = v1[_n-1] if missing(v1)
(12 real changes made)

. replace v1 = subinstr(v1,"Group ","",.)
(15 real changes made)

. generate oldname = "v"+string(var)

. generate newname = v2 + "_" + v1

. list, clean

       var   v1      v2   oldname   newname  
  1.     1    A     Age        v1     Age_A  
  2.     2    A     Sex        v2     Sex_A  
  3.     3    A     Fed        v3     Fed_A  
  4.     4    A   Score        v4   Score_A  
  5.     5    A   Total        v5   Total_A  
  6.     6    B     Age        v6     Age_B  
  7.     7    B     Sex        v7     Sex_B  
  8.     8    B     Fed        v8     Fed_B  
  9.     9    B   Score        v9   Score_B  
 10.    10    B   Total       v10   Total_B  
 11.    11    C     Age       v11     Age_C  
 12.    12    C     Sex       v12     Sex_C  
 13.    13    C     Fed       v13     Fed_C  
 14.    14    C   Score       v14   Score_C  
 15.    15    C   Total       v15   Total_C  

. forvalues i = 1/`c(N)' {
  2.     local oldnames `oldnames' `=oldname[`i']'
  3.     local newnames `newnames' `=newname[`i']'
  4. }

. 
. import excel data.xlsx, cellrange(A3) clear
(15 vars, 1 obs)

. rename (*) (v#), addnumber

. rename (`oldnames') (`newnames')

. list *_A, clean 

       Age_A   Sex_A   Fed_A   Score_A   Total_A  
  1.      10       M     AUS         5         5  

. list *_B, clean

       Age_B   Sex_B   Fed_B   Score_B   Total_B  
  1.       5       F     AUS         5         4  

. list *_C, clean

       Age_C   Sex_C   Fed_C   Score_C   Total_C  
  1.       8       M     AUS         2         2  

.

Announcement

Importing variable names in unpleasant Excel format

Comment