Forloops in panel data - am I indexing incorrectly and using 'if' statements correctly?

David Lade

Join Date: Nov 2019
Posts: 9

26 Nov 2019, 22:08

I am trying to use a loop in Stata to achieve the following.

My example data is below, but first a brief description:

id = ID; date = mofd(date variable); month = 1-12, the number of the calendar month; rank = an integer from 1 to 10 that is prescribed every 6 months.

Example data:

Code:

clear
input byte id float date byte(month) byte(rank)
1  324  1   .
1  325  2   .
1  326  3   .
1  327  4   .
1  328  5   .
1  329  6   4
1  330  7   .
1  331  8   .
1  332  9   .
1  333  10  .
1  334  11  .
1  335  12  .
1  336  1   .
1  337  2   .
1  338  3   .
1  339  4   .
1  340  5   .
1  341  6   7
1  342  7   .
1  343  8   .
1  344  9   .
1  345  10  .
1  347  11  .
1  348  12  .
2  326  3   .
2  327  4   .
2  328  5   .
2  329  6   9
2  330  7   .
2  331  8   .
2  332  9   .
2  333  10  .
2  334  11  .
2  335  12  .
2  336  1   .
2  337  2   .
2  338  3   .
2  339  4   .
2  340  5   .
2  341  6   .
2  342  7   .
2  343  8   .
2  344  9   .
2  345  10  .
2  347  11  .
2  348  12  .
end

It is also xtset:

Code:

 xtset id date

In my output, I am trying to get the following to happen:

After a rank has been prescribed, I want to copy that value down for the 11 periods below it, until we reach the new rank value at the next month == 6.

The tricky part is the case where rank is missing (.) when month == 6: my code then incorrectly carries over the previous rank, when it should just assign missing (.). This can be seen for id == 2 at date == 341, where rank is missing.

My required output would then look something like this.

Code:

clear
input byte id float date byte(month) byte(rank)  byte(continued_rank)
1  324  1   .  .
1  325  2   .  .
1  326  3   .  .
1  327  4   .  .
1  328  5   .  .
1  329  6   4  4
1  330  7   .  4
1  331  8   .  4
1  332  9   .  4
1  333  10  .  4
1  334  11  .  4
1  335  12  .  4
1  336  1   .  4
1  337  2   .  4
1  338  3   .  4
1  339  4   .  4
1  340  5   .  4
1  341  6   7  7
1  342  7   .  7
1  343  8   .  7
1  344  9   .  7
1  345  10  .  7
1  347  11  .  7
1  348  12  .  7
2  326  3   .  .
2  327  4   .  .
2  328  5   .  .
2  329  6   9  9
2  330  7   .  9
2  331  8   .  9
2  332  9   .  9
2  333  10  .  9
2  334  11  .  9
2  335  12  .  9
2  336  1   .  9
2  337  2   .  9
2  338  3   .  9
2  339  4   .  9
2  340  5   .  9
2  341  6   .  .
2  342  7   .  .
2  343  8   .  .
2  344  9   .  .
2  345  10  .  .
2  347  11  .  .
2  348  12  .  .
end

What I tried was:

Code:

gen continued_rank = rank
forvalues l=1/12{
replace continued_rank = L`i'.rank if continued_rank == .
}

but this returns the exact same output as the "rank" column, without carrying anything over.

I feel like I am not sure what the rule of

Code:

 forval i=1/12{ ... }

is when it comes to panel data. Am I wrong in saying that

Code:

 forval i = 1/10{...}

or any other "1/x" would give me the same output?

A caveat is that the FIRST observation of a certain ID may not start with month == 1. In the above example it starts at month ==3, for id ==2.

Any help will be greatly appreciated.

Moreover, I was also wondering if anyone had a link to practice examples or guides on looping in panel data, and using conditional statements (if, else if etc), since I don't feel like I quite get the hang of it yet.

Last edited by David Lade; 26 Nov 2019, 22:12.

Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

26 Nov 2019, 23:39

I could explain to you why your loop produces the results it does rather than the results you want, but there's no reason to use a loop for this:

Code:

by id (date), sort: replace rank = rank[_n-1] if missing(rank)

By the way, I notice in your example data, the date variable sometimes skips a month. For example date always jumps from 345 to 347, skipping 346. Is that an error in your data, or is it correct? If it's an error, and in the real data the dates are always consecutive, you can simplify it even more, eliminating the -by- by doing:

Code:

replace rank = L.rank if missing(rank)

Now, as for what is wrong with your loop: your -forvalues- command creates a loop index named l (letter ell), but inside the loop you refer to a non-existent `i'. Since `i' does not exist, Stata treats it as an empty string, so L`i'.rank is parsed as L.rank, and your loop only looks back 1 time period, regardless of which iteration of the loop you are in.
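
The macro expansion described above can be checked without any data. The following is a minimal sketch, assuming it is run from a do-file in which no local macro named i has been defined; the locals wrong and right are names made up for this illustration.

```stata
* Loop index is l (ell), but the body refers to the undefined local i
forvalues l = 1/3 {
    local wrong "L`i'.rank"
    display "iteration `l': Stata sees `wrong'"
}

* Loop index and the reference inside the body now match
forvalues i = 1/3 {
    local right "L`i'.rank"
    display "iteration `i': Stata sees `right'"
}
```

The first loop shows L.rank on every pass, while the second shows L1.rank, L2.rank and L3.rank in turn.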
David Lade

Join Date: Nov 2019

Posts: 9
#3

27 Nov 2019, 01:07

Hello Clyde, thank you for your response, that's such a silly error on my behalf, I think I should finally invest in a pair of glasses..

So, actually, the date shouldn't have any jumps (although, it doesn't necessarily have to start at month = 1 for every new ID observation) - that was a mistake in my inputting the data - sorry about that.

Looking at your proposed solution, I believe it will fail for the case where ID = 2. For instance, at id =2, date = 341, the rank = .

In that case, rank for the following 11 months should also be = .

But I think with your solution, it will look behind and incorrectly pick up the previous rank = 9, which is not what I want it to do. I want the ranks in between to be based on each new rank at month = 6.

Also, I am curious about whether you knew how to do this in a loop anyway? A lot of my data-cleaning work seems to integrate loops at one stage or another, especially when working with panel data, and I just can't seem to get the hang of them!

Thanks again for all your help, I really appreciate it. People like you make this forum an invaluable learning resource.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

27 Nov 2019, 13:40

Oh, I didn't realize that you wanted the spread of the value to go only as far as the next month == 6 observation; I thought you wanted to keep going until a new non-missing rank was encountered. So for what you want, then, the code would be:

Code:

by id (date), sort: gen six_to_six_run = sum(month == 6)
by id six_to_six_run (date), sort: gen c_rank = rank[1]
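
To see why the running sum marks out the blocks, here is a small sketch on a single toy series. The dates 327 to 342 and the variable name run are assumptions made for illustration only; with several ids the same command would sit under the -by id (date), sort:- prefix as in the code above.

```stata
clear
set obs 16
gen date = 326 + _n                  // 327, 328, ..., 342
gen byte month = mod(date, 12) + 1   // calendar month of a Stata monthly date
* The counter goes up by 1 at every month == 6 and stays constant in between
gen run = sum(month == 6)
list date month run, sepby(run)
```

Observations before the first June get run == 0, and each June opens a new block that lasts until the observation before the next June.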

I suppose there is a way to do this using a loop instead of -by-, but it would probably take me a really long time to figure out, and the result would be really complicated. Also, given the nature of your "stopping condition" it might be better done with -while- than with -foreach-. While it's important to learn how to use loops in Stata, it is at least as important to learn when not to. Loops are most useful in Stata when you want to apply the same calculations repetitively to a list of different variables, or to some list of data files, or list of names. They can also be useful for applying repetitive calculations over a list of the values of a variable--but always look for a -by:- solution first in this situation. If a -by:- solution is available, it is usually better.
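
For comparison only, one possible loop version is sketched below. It is not the approach recommended in this thread, and it assumes the dates are consecutive within each id (as the real data are said to be, unlike the posted example, which skips 346). The toy data and the variable name c_rank_loop are made up for the illustration.

```stata
* Toy panel: 2 ids, 24 consecutive months each (324 to 347)
clear
set obs 48
gen byte id = ceil(_n / 24)
bysort id: gen date = 323 + _n
gen byte month = mod(date, 12) + 1
gen byte rank = .
replace rank = 4 if id == 1 & date == 329
replace rank = 7 if id == 1 & date == 341
* id 2 has a rank at the first June only, none at date 341
replace rank = 9 if id == 2 & date == 329
xtset id date

* Loop version: an observation i months after a June takes that June's rank
gen c_rank_loop = rank if month == 6
forvalues i = 1/11 {
    replace c_rank_loop = L`i'.rank if month == mod(5 + `i', 12) + 1
}

* -by- version from the post above
by id (date), sort: gen six_to_six_run = sum(month == 6)
by id six_to_six_run (date), sort: gen c_rank = rank[1]

* Both approaches agree on this toy panel
assert c_rank_loop == c_rank
```

The loop needs eleven passes over the data and an -xtset- panel without gaps, whereas the -by- version needs neither, which illustrates the point about preferring -by:- when it is available.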
David Lade

Join Date: Nov 2019

Posts: 9
#5

27 Nov 2019, 16:59

Thanks for that Clyde - it works beautifully. Curious thing about my forloop code above is that my dofile editor actually had 'i' instead of 'l' (ell). Oh well.

Would you mind explaining what gen new_var = var[1] does? I couldn't find the answer on google (didn't know what to google).
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

27 Nov 2019, 17:35

In general, var[1] refers to the value of variable var in the first observation. When done specifically in the context of a by: prefix it takes on a slightly different meaning.

The -by:- prefix, in effect, partitions the data set into subsets of observations defined by the values of the non-parenthesized variables mentioned in the prefix. In this case, it means the data set is (conceptually, not physically) divided into groups of observations, each group consisting of observations with the same value of id and six_to_six_run. The mention of date in parentheses does not change the grouping, but specifies that within those groupings, the observations must be sorted on date. In this context, the [1] tells Stata to refer to the first observation in the group, rather than the first observation in the data set (except in the first group, where those are the same thing).

So the command -by id six_to_six_run (date), sort: gen c_rank = rank[1]- tells Stata to conceptually partition the data into subsets defined by values of id and six_to_six_run, sort them chronologically within each group, and then set the value of new variable c_rank equal to the value of the rank in the group's first (i.e. earliest) observation.
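
The difference between the two meanings of [1] can be seen on a tiny made-up data set. This is only an illustrative sketch; the variables g, t and x are hypothetical and do not come from the thread's data.

```stata
clear
input byte g byte t byte x
1 1 10
1 2 20
2 1 30
2 2 40
end

* Without by: x[1] is the first observation of the whole data set
gen first_overall = x[1]

* With by: x[1] is the first observation within each group of g, sorted on t
by g (t), sort: gen first_in_group = x[1]

list, sepby(g)
```

Here first_overall is 10 in every observation, while first_in_group is 10 for g == 1 and 30 for g == 2.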

Let me make a suggestion. You are clearly serious about learning to use Stata effectively and efficiently. Open up the PDF documentation that comes installed with Stata and read the Getting Started [GS] and User's Guide [U] volumes. It's a somewhat lengthy read, but it really covers the basics that every Stata user needs to know. You won't remember every detail, but having seen the lay of the land, when you need the details, you will likely remember what commands are applicable, and reading the -help- files for those will probably get you most or all of what you need at that point.
David Lade

Join Date: Nov 2019

Posts: 9
#7

27 Nov 2019, 18:08

Fantastic, thank you so much. And I'll do just that. I've spent enough time dilly-dallying around trying to find a specific Stata guide that was more focused on what I need, but given my poor grasp of a lot of the general tools I think your suggestion is my best bet for a starting point.

Thanks again!