Reshaping data with multiple obs per id

Michael Foreman

Join Date: Mar 2018
Posts: 54

Reshaping data with multiple obs per id

04 Mar 2020, 07:00

Dear Stata users,

I have the following panel data set covering all 50 states over 13 years.

Code:

. describe
  obs:       452,755                          
 vars:             4                          3 Mar 2020 23:39
 size:     7,244,080                          
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
year                   long    %12.0g                CCES year
state                  long    %12.0g     state      State (FIPS)
st                     long    %12.0g     st         State abbreviation (FIPS)
approval_gov           long    %12.0g     approval_gov  Governor approval
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: 



. tab approval_gov,m

               Governor approval |      Freq.     Percent        Cum.
---------------------------------+-----------------------------------
                Strongly Approve |     67,404       14.89       14.89
      Approve / Somewhat Approve |    139,940       30.91       45.80
Disapprove / Somewhat Disapprove |     84,271       18.61       64.41
             Strongly Disapprove |    117,282       25.90       90.31
          Never Heard / Not Sure |     40,318        8.91       99.22
  Neither Approve Nor Disapprove |      1,414        0.31       99.53
                               . |      2,126        0.47      100.00
---------------------------------+-----------------------------------
                           Total |    452,755      100.00

. tab approval_gov,m nolab

   Governor |
   approval |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     67,404       14.89       14.89
          2 |    139,940       30.91       45.80
          3 |     84,271       18.61       64.41
          4 |    117,282       25.90       90.31
          5 |     40,318        8.91       99.22
          6 |      1,414        0.31       99.53
          . |      2,126        0.47      100.00
------------+-----------------------------------
      Total |    452,755      100.00



.    tab year

    CCES year    Freq.    Percent    Cum.
                
    2006    36,421    8.04    8.04
    2007    9,999    2.21    10.25
    2008    32,800    7.24    17.50
    2009    13,800    3.05    20.55
    2010    55,400    12.24    32.78
    2011    20,150    4.45    37.23
    2012    54,535    12.05    49.28
    2013    16,400    3.62    52.90
    2014    56,200    12.41    65.31
    2015    14,250    3.15    68.46
    2016    64,600    14.27    82.73
    2017    18,200    4.02    86.75
    2018    60,000    13.25    100.00
                
    Total    452,755    100.00

The dataset is 4 variable wide and 452,755 observations long because each state year contains individual responses to a survey question about gubernatorial approval. Thus, there are 495 observations for Alabama in 2006, 131 obs for Alaska in 2006 etc. I would like to reshape the dataset and create 6 variables " Strongly_Approve", "Approve", "Disapprove", "Strongly_Disapprove", "Never_Heard", "Neither_Approve_Nor_Disapprove" instead of "approval_gov". Each of these 6 new variables would count the number of occurrences of each type of response in a given state year so that my dataset is 650 observations long (50 states by 13 years). There are about 2,000 missing values in approval_gov, which can be ignored/dropped. I tried to use `reshape`, but it does not seem to be able to give me what I want. I would be grateful for any help.

Tags: None

Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

04 Mar 2020, 08:17

At this very moment, I cannot test if the code works in Stata but you can try something like:

Code:

bysort state year: tabulate approval_gov, gen(newvar)

Then, just - rename - the variables accordingly.

Hopefully it works.

Best regards,

Marcos
Comment
Michael Foreman

Join Date: Mar 2018

Posts: 54
#3

04 Mar 2020, 12:49

Dear Marcos,

Thank you for the advice! I've tested the code and I get the following error:

Code:

. bysort state year: tabulate approval_gov, gen(newvar) ---------------------------------------------------------------------------------- -> state = Alabama, year = 2006 request may not be combined with by r(190); end of do-file r(190);

And, of course, if i get rid of bysort, there is no error but it gives me wrong counts...
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

04 Mar 2020, 19:07

I tried to use `reshape`, but it does not seem to be able to give me what I want.

Perhaps if you were to tell us the reshape command you issued and what it gave you, we could tell you how to change the command to get what you want. It is a lot easier to solve a problem by starting from a partial solution than to invent an entire solution from scratch, as you are asking us to do.

In post #3 you did exactly what you should have done in post #1: copy the command and it's output from Stata's Results window and paste it into a code block in the forum, accompanying it if necessary with enough commentary to explain how you got differed from what you wanted. (Of course, if reshape quit with an error message, that doesn't take much explaining!).
Comment

Michael Foreman

Join Date: Mar 2018
Posts: 54

04 Mar 2020, 19:34

Dear William,

Thank you for your reply. Here is the reshape code I have used, but I think it won't do the job because i need the sums of all occurrence of each value of approval_gov per state-year, whereas the `reshape` function only changes the structure of the data. This is why I assumed it would not be useful to post the code.

Code:

. reshape wide approval_gov, i(state) j(year)
(note: j = 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018)
values of variable year not unique within state
    Your data are currently long.  You are performing a reshape wide.  You
    specified i(state) and j(year).  There are observations within i(state) with
    the same value of j(year).  In the long data, variables i() and j() together
    must uniquely identify the observations.

         long                                wide
        +---------------+                   +------------------+
        | i   j   a   b |                   | i   a1 a2  b1 b2 |
        |---------------| <--- reshape ---> |------------------|
        | 1   1   1   2 |                   | 1   1   3   2  4 |
        | 1   2   3   4 |                   | 2   5   7   6  8 |
        | 2   1   5   6 |                   +------------------+
        | 2   2   7   8 |
        +---------------+
    Type reshape error for a list of the problem variables.
r(9);

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

04 Mar 2020, 19:57

Well, I started down the reshape path, which took me far enough to understand what Marcos recommended in post #2.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(year state approval_gov)
2006 1 3
2006 1 3
2006 1 2
2006 1 1
2006 1 .
2006 2 3
2006 2 1
2006 2 3
2006 2 5
2006 2 2
2007 1 5
2007 1 3
2007 1 .
2007 1 4
2007 1 1
2007 2 .
2007 2 6
2007 2 6
2007 2 4
2007 2 1
end

tabulate approval_gov, generate(ag)
list in 1/5
collapse (sum) ag1-ag6, by(year state)
rename (ag1-ag6) (SA A D SD NH NAOD)
list, clean noobs

Code:

. list in 1/5

     +-------------------------------------------------------------+
     | year   state   approv~v   ag1   ag2   ag3   ag4   ag5   ag6 |
     |-------------------------------------------------------------|
  1. | 2006       1          3     0     0     1     0     0     0 |
  2. | 2006       1          3     0     0     1     0     0     0 |
  3. | 2006       1          2     0     1     0     0     0     0 |
  4. | 2006       1          1     1     0     0     0     0     0 |
  5. | 2006       1          .     .     .     .     .     .     . |
     +-------------------------------------------------------------+

. collapse (sum) ag1-ag6, by(year state)

. rename (ag1-ag6) (SA A D SD NH NAOD)

. list, clean noobs

    year   state   SA   A   D   SD   NH   NAOD  
    2006       1    1   1   2    0    0      0  
    2006       2    1   1   2    0    1      0  
    2007       1    1   0   1    1    1      0  
    2007       2    1   0   0    1    0      2

Comment

Michael Foreman

Join Date: Mar 2018

Posts: 54
#7

04 Mar 2020, 20:43

William,

I tested the code with my dataset and it gives me exactly what I needed! Thank you so much for your time and this elegant solution!
Comment
John Staehl

Join Date: Aug 2018

Posts: 2
#8

04 Mar 2020, 20:47

I have a merged data set for mutual funds and etfs. I want to create a bar graph to compare the number of mutual funds and etfs over the sample period (2010-2019). I created a dummy variable that distinguished between mutual funds and etfs (mutual == 1 if the observation is a mutual fund). How would I use this dummy variable to create a comparative bar graph of the number of mutual funds and etfs over the sample period?
Comment
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#9

05 Mar 2020, 03:06

Please start a new thread on this new matter.

Also, please take a at FAQ, particularly the recommendations about uses - dataex - or code delimiters to share a fraction of data.

Best regards,

Marcos
Comment

Announcement

Reshaping data with multiple obs per id

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment