
  • #16
    but, I do not want to change the number of observations
    Why not? In fact, even if you weren't engaged in this particular problem, reshaping to long would probably be a good idea anyway. There isn't much you can do with wide data in Stata; its commands are mostly designed to work best (or only) with long layout data.

    I suppose there is some convoluted way to do what you are looking for with the data in wide layout. But it's clunky, and kludgy. And not wanting to increase the number of observations sounds like an arbitrary whim that doesn't justify writing opaque error-prone code to do something that is reducible to a transparent one-liner once the data is long. Even if you do have a compelling reason to have the data in wide layout, you can always just -reshape wide- after the logit and you'll have it back (though, I emphasize, you will probably only make your life more difficult in whatever comes next if you do that.)
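
    For anyone who wants to see the round trip, here is a minimal sketch on made-up toy data. The names member_key, interview, tp, t2m and the_date follow the layout described in this thread; the values and the two-date range are assumptions for illustration only.
    Code:
    clear
    input member_key interview tp20852 tp20853 t2m20852 t2m20853
    1 20853 0.1 0.3 290 291
    2 20852 0.0 0.2 289 292
    end
    * wide -> long: one observation per member per date
    reshape long tp t2m, i(member_key) j(the_date)
    * ... run the restricted -logit- here ...
    * long -> wide: Stata remembers the previous -reshape- specification
    reshape wide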



    • #17
      Clyde,
      I read your previous posts carefully to make sure I understood them. Indeed, the code runs very well with the real data.

      As I mentioned before, I was concerned about the increase in the number of observations caused by the -reshape- command.

      Looking at the browse window, each unique observation is multiplied by the number of dates that appeared as suffixes in the names of the "tp" and "t2m" variables. The rest of the variables are unchanged, so they are simply copies of the values in the original observation. Up to this point, that is fine with me.

      My key questions are below:
      * Could more observations alter the results of the logit regression?


      I made some modifications to your code:
      Code:
      logit sick improv_water toilet tp if interview == the_date
      Code:
      logit sick improv_water toilet tp if interview == the_date - 1
      Code:
      logit sick improv_water toilet tp if interview == the_date - 2
      Each of these commands reports a different number of observations used. For example, a) the first command uses 406 observations, b) the second uses 298, and c) the third uses 150. I guess, though I am not quite sure, that the number of observations used affects the estimates: coefficients, standard errors and p-values.

      * Why does the number of observations used differ according to the relationship between "interview" and "the_date"?

      Thanks again



      • #18
        But you would observe exactly the same changing of the N's if you were to work this out in wide layout. The source of this is not the change in number of observations arising from -reshape long-. The source of this is varying numbers of observations with interview dates that relate in those ways to the value of the_date.
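
        One way to see this is to count the qualifying observations directly, without fitting any model. A sketch, assuming the long-layout variable names used in #17:
        Code:
        * observations eligible for each restricted -logit- (complete cases only)
        count if interview == the_date & !missing(sick, improv_water, toilet, tp)
        count if interview == the_date - 1 & !missing(sick, improv_water, toilet, tp)
        count if interview == the_date - 2 & !missing(sick, improv_water, toilet, tp)
        These counts should match the N reported by the corresponding -logit-, apart from any observations -logit- itself drops (for example, because of perfect prediction).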



        • #19
          Hi again,
          I would like to raise a few more points and questions:

          1. I tested what you mentioned before:
          But you would observe exactly the same changing of the N's if you were to work this out in wide layout. The source of this is not the change in number of observations arising from -reshape long-
          using the sample data posted in #15. I ran regressions on this sample before and after reshaping, including only:

          Code:
          logit sick improv_water toilet
          The two runs reported different numbers of observations: 406 before reshaping and 2,030 after. So I can see that there is a difference, but the question remains: why?

          If the regression includes the remaining variables (which can only be done after reshaping the data, as you suggested):
          Code:
          logit sick tp improv_water toilet
          In this case, 406 observations are used, exactly as in the regression on the data before reshaping with only "improv_water" and "toilet".


          2. Also, you mentioned:
          The source of this is varying numbers of observations with interview dates that relate in those ways to the value of the_date.
          I understood that the variation in the number of observations is due to the relationship between the "interview" and "the_date" variables: for example, whether they are equal (interview == the_date) or differ by one or more days (interview == the_date - n), where "n" is the number of lagged days. To test this, I ran the following:

          Code:
          logit sick improv_water tp if interview == the_date
          logit sick improv_water tp if interview == the_date - 1
          logit sick improv_water tp if interview == the_date - 2
          The first regression showed 406 obs, the second one 298 obs, and the third one, 150 obs. So, the # of obs varied according to the link between "interview" and "the_date". This confirms what you said.


          3. The reshaped sample data has two date variables: "interview" and "the_date". "interview" has a fixed value for each observation (member_key): it is the date on which the household where a particular member lives was visited. Because of the reshaping, each unique observation (member_key) is multiplied by the number of dates that appeared as suffixes on "tp" and "t2m". So the copies of an observation share the same value of "interview" but not of "the_date". "the_date" is the date for which the "tp" or "t2m" variable reports a value.

          I want to regress logit sick improv_water tp if the value of "tp" occurs in the same day a household (thereby a member) is interviewed or in previous days. Bearing in mind this, I have 2 options:
          • logit sick improv_water tp if interview == the_date - n
          • logit sick improv_water tp if the_date == interview - n
          Where "n" is the number of lagged days. So, I guess though not quite sure, the correct code is the second one, am I right?


          4. All of the above used the sample data. When I tried to replicate it with the real data, I could not. The reason: the -reshape- command runs indefinitely or stops responding.
          • Sample data has 10 vars and 2,105 obs
          • Real data has 2,688 vars and 21,970 obs. Also, the "tp" and "t2m" vars cover a range of dates between 20852 (02feb2017) and 21173 (20dec2017): tp20852, t2m20852,....tp21173, t2m21173. There are nearly 321 days.
          I tried twice to run the -reshape- command. Each time it ran for more than 2 hours, only for me to find that it had not worked. I think this is due to the size of the real data or the number of "tp" and "t2m" variables to be reshaped. Is there any way to solve this?



          • #20
            1. Yes, of course, the -reshape- to long changes the number of observations, and when you do an unrestricted -logit- on all observations without regard to the relationship between interview and the_date, the long data set has many more observations included. But you are, according to your original question, only interested in the restricted analyses where there is a specific relationship between interview and the_date. That will be the same either way.

            2. OK.

            3.
            I want to regress logit sick improv_water tp if the value of "tp" occurs in the same day a household (thereby a member) is interviewed or in previous days. Bearing in mind this, I have 2 options:
            • logit sick improv_water tp if interview == the_date - n
            • logit sick improv_water tp if the_date == interview - n
            Where "n" is the number of lagged days. So, I guess though not quite sure, the correct code is the second one, am I right?
            If you want tp to be before the interview, then the second one is indeed correct.
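
            As a sketch of that second form, the loop below fits the model for tp measured on the interview day and on each of the two preceding days. The lag range 0/2 is only an example.
            Code:
            forvalues n = 0/2 {
                * keep only rows whose the_date falls n days before the interview
                logit sick improv_water tp if the_date == interview - `n'
            }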

            4. Yes, -reshape- is going to be painfully slow with a data set this size. There is a faster command, -tolong- written by Rafal Raciborski, available from SSC. This will speed things up considerably compared to -reshape-, but it will still take a while to run in a data set that large. The -tolong- syntax is very similar to that of -reshape long- but there are some subtle differences, so be sure to read the -help tolong- help-file to see if you have to do it slightly differently.
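
            A sketch of getting started with it. The -tolong- call shape below simply mirrors -reshape long- and is an assumption, so check it against the help file before relying on it.
            Code:
            ssc install tolong
            help tolong
            * assumed call shape, patterned on -reshape long-; verify in the help file
            tolong tp t2m, i(member_key) j(the_date)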



            • #21
              The command -tolong- works well and it's faster than -reshape-. It only took approx. 8 min. The reshaped data has 1,727 vars and 6,415,240 obs.

              Another round of questions:

              1. Given the size of the reshaped data, even simple commands like codebook or saving the data take a long time. This is a disadvantage because I will run many regressions.

              I thought a solution would be to slim down the dataset. As I mentioned before, there are two date variables: "interview" and "the_date". The values of "the_date" run from 02feb2017 to 20dec2017. There are 321 days. The variable "interview" has one value for each observation.

              My main interest is in values of "the_date" before "interview", in particular a range of 7 days, i.e., between 2 weeks and 1 week before "interview". So I think that if some observations can be dropped, the data size will decrease considerably and the regressions will run faster. Each observation will only be multiplied seven times (7 days).

              Intuitively, I tried to write a command to achieve this. This is what I wrote:

              Code:
              keep if the_date == interview - 14 & the_date == interview - 7
              I have not tried it yet because Stata either takes a very long time or stops responding. Is the above code correct? What other code would do what I want?


              2. Considering...
              Yes, of course, the -reshape- to long changes the number of observations, and when you do an unrestricted -logit- on all observations without regard to the relationship between interview and the_date, the long data set has many more observations included. But you are, according to your original question, only interested in the restricted analyses where there is a specific relationship between interview and the_date. That will be the same either way.
              I mentioned that the data come from a survey. I will run regressions that take the survey design into account, which means including information from variables like sample weights and clustering. The reshaped data has many copies of each unique observation ("member_key"); most variables are copied with a fixed value, except "the_date", which varies. So a repeated observation has repeated values of "sample weight" and "cluster".

              My question is: if I run a restricted logit (where the link between "the_date" and "interview" is included), will the presence of copied observations of sample weight and cluster affect the results of the regression?
              Last edited by Brian Yalle; 05 Feb 2021, 20:56.



              • #22
                keep if the_date == interview - 14 & the_date == interview - 7
                That code will leave you with no observations at all. Think about it. Whatever the value of the_date is in an observation, it cannot possibly be equal to both interview - 14 and interview -7. You don't want & there, you want | (which is Stata for or). Better still, use the inlist() function:
                Code:
                keep if inlist(interview - the_date, 7, 14)
                Regarding the regressions, when you do your restricted regressions, even though the long layout contains many more observations than the wide layout started with, the regressions will only use the same number of observations they would have in wide layout. In the original wide layout, an observation would only participate in some of the regressions, namely those for which there was an appropriate difference between the interview date and the date that was suffixed onto one of those variables (tp and t2m). All that has happened with -reshape- is that each of those different date-suffixed variables now lives in a different observation. When you impose the restriction that interview - the_date = whatever, only those new observations that meet that restriction will participate in the regression. These "observations" are just the pieces of the original wide-layout observations, and their participation in the regressions is exactly the same either way.
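
                The difference between & and | is easy to check on toy data. In this sketch the single interview date 21000 and the 20 lagged rows are made up; only the variable names come from this thread.
                Code:
                clear
                set obs 20
                generate interview = 21000
                * one row per lag, from 1 to 20 days before the interview
                generate the_date = interview - _n
                * no row can satisfy both equalities at once: counts 0
                count if the_date == interview - 14 & the_date == interview - 7
                * either equality is enough: counts 2
                count if the_date == interview - 14 | the_date == interview - 7
                * same selection written with inlist(): counts 2
                count if inlist(interview - the_date, 7, 14)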



                • #23
                  I've just run your code but the results are not what I expected.

                  Looking in the browse window, I noticed that each observation was only doubled. I thought observations would be multiplied seven times. The values in the "interview" and "the_date" columns helped me understand why: the command kept only two date values, one 7 days before the value of "interview" and the other 14 days before it.

                  I want to keep date values from 7 to 14 days before "interview". After searching the Internet, I guess a more suitable command for this could be:

                  Code:
                  keep if inlist(interview - the_date, 7, 8, 9, 10, 11, 12, 13, 14)
                  Is there a more efficient way to code this?



                  • #24
                    Yes, you can code that more succinctly as
                    Code:
                    keep if inrange(interview-the_date, 7, 14)
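                    The two forms select the same rows here because the difference between two daily dates is always a whole number. A quick check on made-up toy data (the interview date 21000 and the 20 lagged rows are assumptions for illustration):
                    Code:
                    clear
                    set obs 20
                    generate interview = 21000
                    generate the_date = interview - _n
                    generate byte by_inlist = inlist(interview - the_date, 7, 8, 9, 10, 11, 12, 13, 14)
                    generate byte by_inrange = inrange(interview - the_date, 7, 14)
                    * stops with an error if the two flags ever disagree
                    assert by_inlist == by_inrange
                    * lags 7 through 14 inclusive: counts 8
                    count if by_inrange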



                    • #25
                      Yes, it works!

                      Clyde, thank you so much for all your answers. I appreciate the time you've devoted to resolving my doubts.



                      • #26
                        Hello!
                        I want to add explanatory variables to the regression sequentially. How can I do that using a loop?
                        For example, the dependent variable is Y and my explanatory variables are x1, x2, x3, z1, z2, z3. The common control variables x1, x2, x3 appear in all the regressions, but I want to add z1, z2, z3 one at a time.
                        The first regression would be
                        Code:
                        reg Y x1 x2 x3 z1
                        The second regression would be
                        Code:
                        reg Y x1 x2 x3 z2
                        The third regression would be
                        Code:
                        reg Y x1 x2 x3 z3
                        How can I add variables sequentially in a regression using a loop?



                        • #27
                          Code:
                          forvalues i=1/3 {
                            reg Y x1 x2 x3 z`i'
                          }
                          See https://www.stata.com/manuals13/pforvalues.pdf
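                          If the added variables do not share a numeric suffix, -foreach- over a variable list does the same job. A sketch, assuming the variable names from #26; the stored-estimate names m_z1, m_z2, m_z3 are made up for illustration:
                          Code:
                          foreach z of varlist z1 z2 z3 {
                              regress Y x1 x2 x3 `z'
                              * keep each fit so the models can be compared afterwards
                              estimates store m_`z'
                          }
                          estimates table m_z1 m_z2 m_z3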



                          • #28
                            Thanks Daniel Feenberg

                            Will try this out
