Reshaping Data

Budu Gulo

Join Date: Feb 2018
Posts: 238

11 Jun 2018, 14:23

Dear Statalister,

My data currently has the following shape.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 A str12 B str4 C str6 D str5 E str12 F str5 G str12 H
"Name"  "CA1956151098" "Name" "#ERROR" "Name"  "KR7071970008" "Name"  "TW0002437001"
"42754" "6.25"         ""     ""       "42754" "0"            "42761" "0"           
"42755" "17.65"        ""     ""       "42755" "0"            "42762" "0"           
"42758" "-15"          ""     ""       "42758" "0"            "42765" "0"           
"42759" "14.71"        ""     ""       "42759" "0"            "42766" "0"           
"42760" "-2.56"        ""     ""       "42760" "0"            "42767" "0"           
"42761" "2.63"         ""     ""       "42761" "0"            "42768" ".95"         
"42762" "2.56"         ""     ""       "42762" "0"            "42769" "1.13"        
"42765" "0"            ""     ""       "42765" "0"            "42772" "2.43"        
"42766" "-2.5"         ""     ""       "42766" "0"            "42773" "3.83"        
"42767" "-5.13"        ""     ""       ""      ""             ""      ""            
end

How can I reshape it like the following ? Thanks in advance.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 date str12 name str5 return
"42754" "CA1956151098" "6.25" 
"42755" "CA1956151099" "17.65"
"42758" "CA1956151100" "-15"  
"42759" "CA1956151101" "14.71"
"42760" "CA1956151102" "-2.56"
"42761" "CA1956151103" "2.63" 
"42762" "CA1956151104" "2.56" 
"42765" "CA1956151105" "0"    
"42766" "CA1956151106" "-2.5" 
"42767" "CA1956151107" "-5.13"
"."     "#ERROR"       "."    
"42754" "KR7071970008" "0"    
"42755" "KR7071970008" "0"    
"42758" "KR7071970008" "0"    
"42759" "KR7071970008" "0"    
"42760" "KR7071970008" "0"    
"42761" "KR7071970008" "0"    
"42762" "KR7071970008" "0"    
"42765" "KR7071970008" "0"    
"42766" "KR7071970008" "0"    
"42761" "TW0002437001" "0"    
"42762" "TW0002437002" "0"    
"42765" "TW0002437003" "0"    
"42766" "TW0002437004" "0"    
"42767" "TW0002437005" "0"    
"42768" "TW0002437006" ".95"  
"42769" "TW0002437007" "1.13" 
"42772" "TW0002437008" "2.43" 
"42773" "TW0002437009" "3.83" 
end

Tags: None

Matt Warkentin

Join Date: May 2016

Posts: 104
#2

11 Jun 2018, 16:12

Hi Budu,

The first snippet of data you provided is improper if the variable A is meant to be a unique ID. I see that generally the same ID's appear for the variables named E and G. But you can see for observation 2, they take the following values for variables A, E, and G --> 42754, 42754, 42761. If these are meant to be uniquely identifying, they aren't.

In addition, the first observation in your top snippet of data appears to be the column headers. I recommend taking the top dataex code you provided and run it in Stata and view it using --browse--. Does it look right to you? I'm guessing not. If you can provide a properly structured dataset, I'd be happy to help you reshape it.
Comment
Budu Gulo

Join Date: Feb 2018

Posts: 238
#3

11 Jun 2018, 16:40

Hi Matt,

Thank you for your response. Then number starting with 4 are Excel format dates.

I provided unstructured dataset because "Name" appears many times. When I import first row as variables, the third variable becomes C, fifth variable becomes E, and so on. Put another way, my intention was to show how my data appears.

Here is the data when I import first row as variable:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long Name double CA1956151098 byte(C ERROR) long E byte KR7071970008 long G double TW0002437001 42754 6.25 . . 42754 0 42761 0 42755 17.65 . . 42755 0 42762 0 42758 -15 . . 42758 0 42765 0 42759 14.71 . . 42759 0 42766 0 42760 -2.56 . . 42760 0 42767 0 42761 2.63 . . 42761 0 42768 .95 42762 2.56 . . 42762 0 42769 1.13 42765 0 . . 42765 0 42772 2.43 42766 -2.5 . . 42766 0 42773 3.83 42767 -5.13 . . . . . . end

Please let me know if there is any confusion. Thank you again for your help.
Comment
Matt Warkentin

Join Date: May 2016

Posts: 104
#4

11 Jun 2018, 16:52

Hi Budu,

I think you clarified some of my confusion, however, the issue remains that some variable in your data set needs to uniquely identify the rows and I am not sure what it is. You may want to consider cleaning up the Excel spreadsheet before importing it into Stata (remember not to save over any original files).

Maybe I could offer better advice if I knew what the data were meant to represent.
Comment
Budu Gulo

Join Date: Feb 2018

Posts: 238
#5

11 Jun 2018, 17:00

Hi Matt,

Please have a look at the Excel file Target Stock Return Reshape.xlsx . Many thanks.
Comment
Matt Warkentin

Join Date: May 2016

Posts: 104
#6

12 Jun 2018, 08:32

Hi Budu,

Some members of this forum, myself somewhat included, are generally uncomfortable downloading a file we don't know the contents or origin of. This is why we prefer posters use the --dataex-- command.

I took the risk and opened your data set. As mentioned before, I would consider cleaning this file up. Multiple columns have identical names ('Name') which isn't very informative and are not unique for Stata when importing. Secondly, the even numbered columns are named with unacceptable characters. Stata recommends using letters (A-Z, capital or lowercase), numbers, and underscores only. Consider fixing your columns to adhere to these rules. Thirdly, you have some empty and nearly empty columns, I will let you decide whether they should be removed.

I would say the biggest issue for reshaping and analysis is I still do not know if there are variables which uniquely identify your observations. What is the unit of measure? Are these all measures of the same company on different dates? I am very unclear.
Comment
Budu Gulo

Join Date: Feb 2018

Posts: 238
#7

12 Jun 2018, 11:48

Hi Matt,

Thank you for your reply. These are stock return (percentage) data corresponding to different companies, where time period varies for some companies. For example company A has return data from 7 January to 8 March , and company B has return data from 14 January to 15 March. Note that for some companies return data is not available which is why there are some empty columns ( "#ERROR" as company names).

A somewhat similar reshape problem of mine was answered by Clyde Schechter here: https://www.statalist.org/forums/for...441970-reshape .

Please let me know if there is any confusion. Thanks.
Comment
Matt Warkentin

Join Date: May 2016

Posts: 104
#8

12 Jun 2018, 15:29

HI Budu,

You'll see that Clyde pointed out many of the same issues I have noted in our discussion. I don't think I have a Stata solution for you. I don't know how to use Stata to structure the data the way you want because your data is neither appropriately wide nor appropriately long. You need a column in your data that contains unique identifiers and you don't have one.

In order to go from wide to long data format, you need a column that uniquely identifies the rows, and then you need additional columns that are repeat-measures of the row observations (companies, in your case). Without these I cannot help. Sorry.
Comment
Budu Gulo

Join Date: Feb 2018

Posts: 238
#9

12 Jun 2018, 16:32

No worries Matt! Thanks for the try.
Comment
Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#10

12 Jun 2018, 16:45

The below code should give out the solution.

Code:

stack*, into(date return) clear bys _stack: gen name=return[1] bys _stack: drop if _n==1
3 likes
Comment
Budu Gulo

Join Date: Feb 2018

Posts: 238
#11

12 Jun 2018, 17:17

Wow! Thanks for the magic Romalpa Akzo.
Comment

Budu Gulo

Join Date: Feb 2018
Posts: 238

#12

13 Jun 2018, 08:40

Hi Romalpa Akzo ,

The codes work in both cases -- import or do not import first row as variable names. However, results do not match. For example, in the "import" case, I get

Code:

. stack*, into(date return) clear

. bys _stack: gen name=return[1]
(10 missing values generated)

. bys _stack: drop if _n==1
(4 observations deleted)

. drop _stack

. 
end of do-file

. sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        date |         25    42763.24    4.789572      42755      42773
      return |         25        .828    5.813459        -15      17.65
        name |         27    2.083333    3.002403          0       6.25

In the "do not import" case, I get

Code:

. stack*, into(date return) clear

. bys _stack: gen name=return[1]

. bys _stack: drop if _n==1
(4 observations deleted)

. drop _stack

. 
end of do-file

. destring return, replace
return has all characters numeric; replaced as double
(12 missing values generated)

. sum

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        date |          0
      return |         28       .9625    5.582293        -15      17.65
        name |          0

.

Here is the "import" data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long Name double CA1956151098 byte(C ERROR) long E byte KR7071970008 long G double TW0002437001
42754  6.25 . . 42754 0 42761    0
42755 17.65 . . 42755 0 42762    0
42758   -15 . . 42758 0 42765    0
42759 14.71 . . 42759 0 42766    0
42760 -2.56 . . 42760 0 42767    0
42761  2.63 . . 42761 0 42768  .95
42762  2.56 . . 42762 0 42769 1.13
42765     0 . . 42765 0 42772 2.43
42766  -2.5 . . 42766 0 42773 3.83
42767 -5.13 . .     . .     .    .
end

and "do not import" data

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 A str12 B str4 C str6 D str5 E str12 F str5 G str12 H
"Name"  "CA1956151098" "Name" "#ERROR" "Name"  "KR7071970008" "Name"  "TW0002437001"
"42754" "6.25"         ""     ""       "42754" "0"            "42761" "0"           
"42755" "17.65"        ""     ""       "42755" "0"            "42762" "0"           
"42758" "-15"          ""     ""       "42758" "0"            "42765" "0"           
"42759" "14.71"        ""     ""       "42759" "0"            "42766" "0"           
"42760" "-2.56"        ""     ""       "42760" "0"            "42767" "0"           
"42761" "2.63"         ""     ""       "42761" "0"            "42768" ".95"         
"42762" "2.56"         ""     ""       "42762" "0"            "42769" "1.13"        
"42765" "0"            ""     ""       "42765" "0"            "42772" "2.43"        
"42766" "-2.5"         ""     ""       "42766" "0"            "42773" "3.83"        
"42767" "-5.13"        ""     ""       ""      ""             ""      ""            
end

Which option should I pick?

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#13

13 Jun 2018, 15:13

NO firstrow is the right choice. If you browse the data just after running each line of the code, you would see it out more clearly.
1 like
Comment
Budu Gulo

Join Date: Feb 2018

Posts: 238
#14

13 Jun 2018, 16:53

Thank you for your feedback Romalpa Akzo .
Comment

Announcement

Reshaping Data

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment