Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reshaping Data

    Dear Statalister,

    My data currently has the following shape.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str5 A str12 B str4 C str6 D str5 E str12 F str5 G str12 H
    "Name"  "CA1956151098" "Name" "#ERROR" "Name"  "KR7071970008" "Name"  "TW0002437001"
    "42754" "6.25"         ""     ""       "42754" "0"            "42761" "0"           
    "42755" "17.65"        ""     ""       "42755" "0"            "42762" "0"           
    "42758" "-15"          ""     ""       "42758" "0"            "42765" "0"           
    "42759" "14.71"        ""     ""       "42759" "0"            "42766" "0"           
    "42760" "-2.56"        ""     ""       "42760" "0"            "42767" "0"           
    "42761" "2.63"         ""     ""       "42761" "0"            "42768" ".95"         
    "42762" "2.56"         ""     ""       "42762" "0"            "42769" "1.13"        
    "42765" "0"            ""     ""       "42765" "0"            "42772" "2.43"        
    "42766" "-2.5"         ""     ""       "42766" "0"            "42773" "3.83"        
    "42767" "-5.13"        ""     ""       ""      ""             ""      ""            
    end
    How can I reshape it like the following ? Thanks in advance.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str5 date str12 name str5 return
    "42754" "CA1956151098" "6.25" 
    "42755" "CA1956151099" "17.65"
    "42758" "CA1956151100" "-15"  
    "42759" "CA1956151101" "14.71"
    "42760" "CA1956151102" "-2.56"
    "42761" "CA1956151103" "2.63" 
    "42762" "CA1956151104" "2.56" 
    "42765" "CA1956151105" "0"    
    "42766" "CA1956151106" "-2.5" 
    "42767" "CA1956151107" "-5.13"
    "."     "#ERROR"       "."    
    "42754" "KR7071970008" "0"    
    "42755" "KR7071970008" "0"    
    "42758" "KR7071970008" "0"    
    "42759" "KR7071970008" "0"    
    "42760" "KR7071970008" "0"    
    "42761" "KR7071970008" "0"    
    "42762" "KR7071970008" "0"    
    "42765" "KR7071970008" "0"    
    "42766" "KR7071970008" "0"    
    "42761" "TW0002437001" "0"    
    "42762" "TW0002437002" "0"    
    "42765" "TW0002437003" "0"    
    "42766" "TW0002437004" "0"    
    "42767" "TW0002437005" "0"    
    "42768" "TW0002437006" ".95"  
    "42769" "TW0002437007" "1.13" 
    "42772" "TW0002437008" "2.43" 
    "42773" "TW0002437009" "3.83" 
    end

  • #2
    Hi Budu,

    The first snippet of data you provided is improper if the variable A is meant to be a unique ID. I see that generally the same ID's appear for the variables named E and G. But you can see for observation 2, they take the following values for variables A, E, and G --> 42754, 42754, 42761. If these are meant to be uniquely identifying, they aren't.

    In addition, the first observation in your top snippet of data appears to be the column headers. I recommend taking the top dataex code you provided and run it in Stata and view it using --browse--. Does it look right to you? I'm guessing not. If you can provide a properly structured dataset, I'd be happy to help you reshape it.

    Comment


    • #3
      Hi Matt,

      Thank you for your response. Then number starting with 4 are Excel format dates.

      I provided unstructured dataset because "Name" appears many times. When I import first row as variables, the third variable becomes C, fifth variable becomes E, and so on. Put another way, my intention was to show how my data appears.

      Here is the data when I import first row as variable:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input long Name double CA1956151098 byte(C ERROR) long E byte KR7071970008 long G double TW0002437001
      42754  6.25 . . 42754 0 42761    0
      42755 17.65 . . 42755 0 42762    0
      42758   -15 . . 42758 0 42765    0
      42759 14.71 . . 42759 0 42766    0
      42760 -2.56 . . 42760 0 42767    0
      42761  2.63 . . 42761 0 42768  .95
      42762  2.56 . . 42762 0 42769 1.13
      42765     0 . . 42765 0 42772 2.43
      42766  -2.5 . . 42766 0 42773 3.83
      42767 -5.13 . .     . .     .    .
      end
      Please let me know if there is any confusion. Thank you again for your help.

      Comment


      • #4
        Hi Budu,

        I think you clarified some of my confusion, however, the issue remains that some variable in your data set needs to uniquely identify the rows and I am not sure what it is. You may want to consider cleaning up the Excel spreadsheet before importing it into Stata (remember not to save over any original files).

        Maybe I could offer better advice if I knew what the data were meant to represent.

        Comment


        • #5
          Hi Matt,

          Please have a look at the Excel file Target Stock Return Reshape.xlsx . Many thanks.

          Comment


          • #6
            Hi Budu,

            Some members of this forum, myself somewhat included, are generally uncomfortable downloading a file we don't know the contents or origin of. This is why we prefer posters use the --dataex-- command.

            I took the risk and opened your data set. As mentioned before, I would consider cleaning this file up. Multiple columns have identical names ('Name') which isn't very informative and are not unique for Stata when importing. Secondly, the even numbered columns are named with unacceptable characters. Stata recommends using letters (A-Z, capital or lowercase), numbers, and underscores only. Consider fixing your columns to adhere to these rules. Thirdly, you have some empty and nearly empty columns, I will let you decide whether they should be removed.

            I would say the biggest issue for reshaping and analysis is I still do not know if there are variables which uniquely identify your observations. What is the unit of measure? Are these all measures of the same company on different dates? I am very unclear.

            Comment


            • #7
              Hi Matt,

              Thank you for your reply. These are stock return (percentage) data corresponding to different companies, where time period varies for some companies. For example company A has return data from 7 January to 8 March , and company B has return data from 14 January to 15 March. Note that for some companies return data is not available which is why there are some empty columns ( "#ERROR" as company names).

              A somewhat similar reshape problem of mine was answered by Clyde Schechter here: https://www.statalist.org/forums/for...441970-reshape .

              Please let me know if there is any confusion. Thanks.

              Comment


              • #8
                HI Budu,

                You'll see that Clyde pointed out many of the same issues I have noted in our discussion. I don't think I have a Stata solution for you. I don't know how to use Stata to structure the data the way you want because your data is neither appropriately wide nor appropriately long. You need a column in your data that contains unique identifiers and you don't have one.

                In order to go from wide to long data format, you need a column that uniquely identifies the rows, and then you need additional columns that are repeat-measures of the row observations (companies, in your case). Without these I cannot help. Sorry.

                Comment


                • #9
                  No worries Matt! Thanks for the try.

                  Comment


                  • #10
                    The below code should give out the solution.
                    Code:
                    stack*, into(date return) clear
                    bys _stack: gen name=return[1]
                    bys _stack: drop if _n==1

                    Comment


                    • #11
                      Wow! Thanks for the magic Romalpa Akzo.

                      Comment


                      • #12
                        Hi Romalpa Akzo ,

                        The codes work in both cases -- import or do not import first row as variable names. However, results do not match. For example, in the "import" case, I get

                        Code:
                        . stack*, into(date return) clear
                        
                        . bys _stack: gen name=return[1]
                        (10 missing values generated)
                        
                        . bys _stack: drop if _n==1
                        (4 observations deleted)
                        
                        . drop _stack
                        
                        . 
                        end of do-file
                        
                        . sum
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                                date |         25    42763.24    4.789572      42755      42773
                              return |         25        .828    5.813459        -15      17.65
                                name |         27    2.083333    3.002403          0       6.25
                        In the "do not import" case, I get
                        Code:
                        . stack*, into(date return) clear
                        
                        . bys _stack: gen name=return[1]
                        
                        . bys _stack: drop if _n==1
                        (4 observations deleted)
                        
                        . drop _stack
                        
                        . 
                        end of do-file
                        
                        . destring return, replace
                        return has all characters numeric; replaced as double
                        (12 missing values generated)
                        
                        . sum
                        
                            Variable |        Obs        Mean    Std. Dev.       Min        Max
                        -------------+---------------------------------------------------------
                                date |          0
                              return |         28       .9625    5.582293        -15      17.65
                                name |          0
                        
                        .

                        Here is the "import" data

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input long Name double CA1956151098 byte(C ERROR) long E byte KR7071970008 long G double TW0002437001
                        42754  6.25 . . 42754 0 42761    0
                        42755 17.65 . . 42755 0 42762    0
                        42758   -15 . . 42758 0 42765    0
                        42759 14.71 . . 42759 0 42766    0
                        42760 -2.56 . . 42760 0 42767    0
                        42761  2.63 . . 42761 0 42768  .95
                        42762  2.56 . . 42762 0 42769 1.13
                        42765     0 . . 42765 0 42772 2.43
                        42766  -2.5 . . 42766 0 42773 3.83
                        42767 -5.13 . .     . .     .    .
                        end

                        and "do not import" data

                        Code:
                        * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input str5 A str12 B str4 C str6 D str5 E str12 F str5 G str12 H
                        "Name"  "CA1956151098" "Name" "#ERROR" "Name"  "KR7071970008" "Name"  "TW0002437001"
                        "42754" "6.25"         ""     ""       "42754" "0"            "42761" "0"           
                        "42755" "17.65"        ""     ""       "42755" "0"            "42762" "0"           
                        "42758" "-15"          ""     ""       "42758" "0"            "42765" "0"           
                        "42759" "14.71"        ""     ""       "42759" "0"            "42766" "0"           
                        "42760" "-2.56"        ""     ""       "42760" "0"            "42767" "0"           
                        "42761" "2.63"         ""     ""       "42761" "0"            "42768" ".95"         
                        "42762" "2.56"         ""     ""       "42762" "0"            "42769" "1.13"        
                        "42765" "0"            ""     ""       "42765" "0"            "42772" "2.43"        
                        "42766" "-2.5"         ""     ""       "42766" "0"            "42773" "3.83"        
                        "42767" "-5.13"        ""     ""       ""      ""             ""      ""            
                        end
                        Which option should I pick?

                        Comment


                        • #13
                          NO firstrow is the right choice. If you browse the data just after running each line of the code, you would see it out more clearly.

                          Comment


                          • #14
                            Thank you for your feedback Romalpa Akzo .

                            Comment

                            Working...
                            X