Reading an Excel file with one mereged cells

Hatim Tareq

Join Date: Aug 2014

Posts: 103
#1

Reading an Excel file with one mereged cells

11 Nov 2014, 06:05

Hi All,

I have an Excel file, as shown below (this is only an example of the data), I would like to be able to imported to Stata

Sup MF Id output

M F

1 136 78 1 0

1 8 11 1 1

1 0 0 1 2

1 0 0 1 3

1 0 2 1 4

Regards
Tags: None
Svend Juul

Join Date: Apr 2014

Posts: 515
#2

11 Nov 2014, 06:25

Before you can use -import excel- in Stata, you must manually edit the spreadsheet so that you have only one line with variable names, and they must obey Stata's rules for variable names. You could have these names:

Code:

Sup MF_M MF_F id output
Comment
daniel klein

Join Date: Mar 2014

Posts: 3824
#3

11 Nov 2014, 06:39

Depending on how complex your actual dataset(s) look like, you can just

Code:

import excel Sup MF_M MF_F Id output using <foo.xlsx> ,cellrange(A3)

Best
Daniel
1 like
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#4

11 Nov 2014, 06:58

In general, I would advise against editing your data in Excel. It leaves no audit trail when you do so, and when you come back to this some weeks from now, you won't remember what you did. In fact, the only thing I recommend doing with data in Excel is -import- it to Stata. Once in Stata you can do with it whatever you want and keep a complete record of it in your do-files and log-files. In line with that, I endorse Daniel's solution above.

By the way, when using -import excel- with the -firstrow- option it is not necessary that the Excel column headers all be legal Stata variable names. Stata will accept those that are, and make modifications to those that are not, or, where it can't figure out how to do that, name them after the Excel columns they are found in.
2 likes
Comment
ben earnhart

Join Date: May 2014

Posts: 1027
#5

11 Nov 2014, 07:36

"In general, I would advise against editing your data in Excel. It leaves no audit trail when you do so..." Interesting and extreme approach. I actually wrote a 20+ item list of what to do and what not to do with data, with #1 being "Always Use Syntax." But, especially with complexly formatted spreadsheets or data with double-headers (variable names one row, what should be variable labels in the second row), I've always fixed things up manually, then kept the original and modified files around. Maybe next time I'll try a pure-syntax approach, even if it does take me much more time and trial-and-error. In the long run (say I have 20 census files to process), might even save time. Hmm...
Comment
Svend Juul

Join Date: Apr 2014

Posts: 515
#6

11 Nov 2014, 08:54

Clyde:
In Hatim's Excel sheet it is necessary to change the first two lines to one before trying to import in Stata. I see no other way than doing it manually. But I would do it on a copy of the original sheet.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29958
#7

11 Nov 2014, 09:02

Svend,

I don't think it is necessary to do it that way. Daniel Klein's solution simply ignores those first two lines using the -cellrange()- option and creates the variable names directly in the -import excel- command.
Comment

daniel klein

Join Date: Mar 2014
Posts: 3824

11 Nov 2014, 09:15

Actually, there is more than one way of doing this and I have already demonstrated a possible solution. Another possibility would be importing the first two rows, extract the information and create the variable names, then import everything from row 3 on, passing the constructed names to the import excel call I have given earlier. Consider

Code:

// import the first two rows holding the names
import excel using <foo.xlsx> ,clear
keep in 1/2

// get the names in the first row
unab allvars : _all
foreach v of loc allvars {
    
    // get the name in the first row
    loc name = `v'[1]
    if ("`name'" == "") {
        // if empty, replace with previous name
        loc name `prev'
    }
    loc prev `name' // save previous name
    
    // get additional name in second row
    // (this is for the merged cells)
    loc secondrow = `v'[2]
    if ("`secondrow'" != "") {
        
        // add the name in the secon row
        // if there is one
        loc name `name'_`secondrow'
    }
    
    // make it a valid Stata name
    // and build a namelist
    loc name = strtoname("`name'")
    loc names `names' `name'
}

// now get the complete data, passing the namelist to -import excel-
// (this is same as show earlier, but automated)
import excel `names' using <foo.xlsx> ,cellra(A3) clear

// done
// no manipulation of the Excel file necessary

This will fail if the first two cells are merged, but one could think of a solution for this as well.

Best
Daniel

Last edited by daniel klein; 11 Nov 2014, 09:18.

Comment

Svend Juul

Join Date: Apr 2014

Posts: 515
#9

11 Nov 2014, 09:21

Daniel, Clyde:
Nice; I learned something.

Svend
Comment
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#10

11 Nov 2014, 09:36

Originally posted by ben earnhart View Post

"In general, I would advise against editing your data in Excel. It leaves no audit trail when you do so..." Interesting and extreme approach. I actually wrote a 20+ item list of what to do and what not to do with data, with #1 being "Always Use Syntax." But, especially with complexly formatted spreadsheets or data with double-headers (variable names one row, what should be variable labels in the second row), I've always fixed things up manually, then kept the original and modified files around. Maybe next time I'll try a pure-syntax approach, even if it does take me much more time and trial-and-error. In the long run (say I have 20 census files to process), might even save time. Hmm...

It's an interesting topic. Usually, my preference would to do as much as possible via syntax with limiting the editing in Excel to minimum. The point is that, very often, the data we have to work with is presented in a format that is not necessarily most convenient for statistical analysis (more on that). In terms of importing from Excel presumably it would be worthwhile to explore possibilities offered by the cellrange() and maybe limit the import to the desired set of cells.

Kind regards,
Konrad
Version: Stata/IC 13.1
Comment

ziluo

Join Date: Oct 2014
Posts: 7

#11

09 Nov 2023, 15:08

I really love this solution. However, when I run these codes, I got the following error message. Could you please tell me what the problem is? Many thanks!

variables multiply defined
Q17a_Count
Q17a_Percent

[QUOTE=daniel klein;n419097]Actually, there is more than one way of doing this and I have already demonstrated a possible solution. Another possibility would be importing the first two rows, extract the information and create the variable names, then import everything from row 3 on, passing the constructed names to the import excel call I have given earlier. Consider

Code:

// import the first two rows holding the names
import excel using <foo.xlsx> ,clear
keep in 1/2

// get the names in the first row
unab allvars : _all
foreach v of loc allvars {

// get the name in the first row
loc name = `v'[1]
if ("`name'" == "") {
// if empty, replace with previous name
loc name `prev'
}
loc prev `name' // save previous name

// get additional name in second row
// (this is for the merged cells)
loc secondrow = `v'[2]
if ("`secondrow'" != "") {

// add the name in the secon row
// if there is one
loc name `name'_`secondrow'
}

// make it a valid Stata name
// and build a namelist
loc name = strtoname("`name'")
loc names `names' `name'
}

// now get the complete data, passing the namelist to -import excel-
// (this is same as show earlier, but automated)
import excel `names' using <foo.xlsx> ,cellra(A3) clear

// done
// no manipulation of the Excel file necessary

This will fail if the first two cells are merged, but one could think of a solution for this as well.

Best
Daniel[/QUO

Sup	MF		Id	output
	M	F
1	136	78	1	0
1	8	11	1	1
1	0	0	1	2
1	0	0	1	3
1	0	2	1	4

Announcement