Setting a substr of the first row as variable name

Falco Wolf

Join Date: Oct 2018
Posts: 83

Setting a substr of the first row as variable name

10 Nov 2018, 06:34

Hello,

I Imported a file (from google trends) that I want to use for stata. I'm interested in using the names that appear in the second row (Alabama, Alaska, Arizona, Arkansas,...) as the variable names of each row. I can generate a substring and create a new variable, but that seems a pretty big hassle for the complete variable set, and I was wondering if there was a more effcient way of doing this.

I can use:
import delimited "filename.csv", varnames(2) rowrange(2)
However, I still have the problem of not that the variable name will be housepricesAlabama... and so on. Is there an efficient way to do this?

Also, my second step will be creating an annual trend, monthly trend and growth from month to month. Would it be better to transpose the rous and columns? If I do this, how can I tell stata to tranpose correctly?

Thank you for the help,

Falco

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 v1 str24(v2 v3 v4 v5)
""        "Category..All.categories" "Category..All.categories" "Category..All.categories" "Category..All.categories"
"Month"   "house prices: (Alabama)"  "house prices: (Alaska)"   "house prices: (Arizona)"  "house prices: (Arkansas)"
"2004-01" "77"                       "100"                      "37"                       "0"                       
"2004-02" "74"                       "0"                        "36"                       "0"                       
"2004-03" "100"                      "0"                        "0"                        "0"                       
"2004-04" "0"                        "0"                        "100"                      "100"                     
"2004-05" "0"                        "0"                        "92"                       "0"                       
"2004-06" "0"                        "100"                      "62"                       "0"                       
"2004-07" "68"                       "0"                        "32"                       "0"                       
"2004-08" "0"                        "0"                        "0"                        "0"                       
"2004-09" "0"                        "0"                        "65"                       "41"                      
"2004-10" "61"                       "0"                        "69"                       "41"                      
"2004-11" "63"                       "0"                        "57"                       "0"                       
"2004-12" "0"                        "77"                       "50"                       "42"                      
"2005-01" "0"                        "0"                        "33"                       "0"                       
"2005-02" "0"                        "0"                        "23"                       "0"                       
"2005-03" "45"                       "0"                        "49"                       "32"                      
"2005-04" "50"                       "0"                        "33"                       "0"                       
"2005-05" "0"                        "65"                       "20"                       "0"                       
"2005-06" "0"                        "0"                        "20"                       "54"                      
end

Tags: None

Lance Erickson

Join Date: Apr 2014

Posts: 144
#2

10 Nov 2018, 07:22

Falco,

This should pull out the state names, rename the variables appropriately, and get rid of the non-data rows...

Code:

drop in 1 foreach v of varlist v2-v5 { replace `v' = substr(`v',strpos(`v',"(")+1,strpos(`v',")")-strpos(`v',"(")-1) in 1 rename `v' `=`v'[1]' } drop in 1

Most of the recommendations I see on Statalist suggest working with long data is easier. I don't know what the data are supposed to represent so I can't make any suggestions on how you could get the trends you want.

Lance
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#3

10 Nov 2018, 07:42

If you're using -import delimited- to load the data, the -varnames- and -rowrange- options would likely be of use to you. You're also likely going to have problems because you are importing numeric data as strings, about which see -destring-.
Comment

Falco Wolf

Join Date: Oct 2018
Posts: 83

10 Nov 2018, 12:49

Lance,

Your code worked perfectly. But I ran into a similar problem with the rest of the strings of the dataset. I was trying to modify your formula but it wouldn't work.

Now all the values in the dataset are strings. And I would like to replace each string value, for the numerical value it represents. However, I have 179 observations and 52 variables so it's a bit inneficient.

Is there a way to tell stata to change the strings into values and place them where the string was?
And for the monthly data, whenever I convert it into month year value, it gives me a number that seems random, why is that?

Thank you for the help.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 Month str24(Alabama Alaska) str26 California str24 Idaho
"2004-01" "77"  "100" "69"  "0"  
"2004-02" "74"  "0"   "85"  "0"  
"2004-03" "100" "0"   "57"  "0"  
"2004-04" "0"   "0"   "72"  "0"  
"2004-05" "0"   "0"   "81"  "87" 
"2004-06" "0"   "100" "61"  "0"  
"2004-07" "68"  "0"   "88"  "0"  
"2004-08" "0"   "0"   "55"  "0"  
"2004-09" "0"   "0"   "50"  "0"  
"2004-10" "61"  "0"   "69"  "0"  
"2004-11" "63"  "0"   "57"  "0"  
"2004-12" "0"   "77"  "62"  "77" 
"2005-01" "0"   "0"   "63"  "68" 
"2005-02" "0"   "0"   "80"  "0"  
"2005-03" "45"  "0"   "59"  "0"  
"2005-04" "50"  "0"   "72"  "70" 
"2005-05" "0"   "65"  "100" "0"  
"2005-06" "0"   "0"   "81"  "0"  
"2005-07" "0"   "66"  "69"  "70" 
"2005-08" "85"  "61"  "72"  "0"  
"2005-09" "39"  "56"  "71"  "0"  
"2005-10" "74"  "0"   "59"  "100"
"2005-11" "37"  "0"   "45"  "0"  
"2005-12" "54"  "50"  "43"  "47" 
"2006-01" "0"   "0"   "63"  "42" 
"2006-02" "0"   "0"   "85"  "0"  
"2006-03" "0"   "41"  "50"  "0"  
"2006-04" "31"  "0"   "53"  "0"  
"2006-05" "29"  "43"  "60"  "0"  
"2006-06" "31"  "46"  "56"  "0"  
"2006-07" "44"  "0"   "77"  "41" 
"2006-08" "26"  "0"   "56"  "0"  
"2006-09" "26"  "0"   "39"  "0"  
"2006-10" "25"  "36"  "52"  "65" 
"2006-11" "25"  "0"   "37"  "31" 
"2006-12" "0"   "0"   "54"  "0"  
end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#5

10 Nov 2018, 13:35

Now all the values in the dataset are strings. And I would like to replace each string value, for the numerical value it represents. However, I have 179 observations and 52 variables so it's a bit inneficient.

For all of the variables except Month, you can do this with -destring-.

Code:

ds Month, not destring `r(varlist)', replace

Note: Make sure you save a copy of the original data before doing that, in case something goes wrong. -replace- options lose the original data, so if it somehow blows up, there's no going back. So be sure to save a copy of the original data first.
Also make sure you look at the messages you get from -destring-. You may find that it refuses to convert some of the variables. This happens if these variables are contaminated with non-numeric material (as is often the case with data imported from spreadsheets or other non-statistical data set sources). If this arises, you'll need to inspect those variables to find the offending observations and then decide what to do about them. For example, if Arkansas doesn't convert to numeric, run

Code:

browse Month Arkansas if missing(real(Arkansas))

to see the offending observations. Also, be sure to read -help destring- so you can see the options that -destring- itself provides for dealing with certain common types of simple non-numeric contamination.

Is there a way to tell stata to change the strings into values and place them where the string was?

The aproach outlined in response to the first question does this.

And for the monthly data, whenever I convert it into month year value, it gives me a number that seems random, why is that?

Well, you don't say how you went about converting it, so we can't be sure. But what you should have done is this:

Code:

gen month = monthly(Month, "YM") format month %tm

Now, the numbers that this produces follow Stata's regular rules for encoding date and time information. The base month is January 1960, which is represented as 0. So any given month is represented in Stata as the number of months after that. February 1960 is 1, March 1960 is 2, etc. The current month, November 208 is 706. If these numbers appear "random" to you, well, you just need to get used to it.

If you are going to work with data that involves dates and times on a regular basis, you absolutely need to familiarize yourself with Stata's approach to datetime variables. Read the section of the PDF manual that lays this out. (Run -help datetime- and click on the blue link "View complete PDF manual entry" near the top.) It's a very long read, and pretty complicated. You won't remember all of it, and depending on what you do, there are probably some parts of it you will never actually use. But it is important to know the underlying concepts and principles, and to have at least a passing familiarity with the various functions available in Stata for managing dates and times. With that under your belt, you will have an easier time getting comfortable with the functions you will use most. And you will at least know what to look for when you encounter a new situation involving dates and times.
Comment

Announcement

Setting a substr of the first row as variable name

Comment

Comment

Comment

Comment