How to keep selected observations using a criterion linked to other variables' values?

Daniel McAdams

Join Date: Nov 2022

Posts: 26
#1

How to keep selected observations using a criterion linked to other variables' values?

16 Nov 2022, 13:36

Hello!
I am working with a big panel dataset so here is a peek generated using randomtag:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str3 countrycode str20 indicatorcode int year str3 incomegrpcode double value int lastyear double lastvalue "COL" "SE.TER.GRAD.FE.SI.ZS" 2002 "UMC" 36.78437 2018 33.40987014770508 "LIC" "SL.EMP.VULN.MA.ZS" 2004 "" 11.470940657134927 2019 10.029830932617188 "LIC" "SL.EMP.VULN.MA.ZS" 2014 "" 10.21737611206278 2019 10.029830932617188 "LMC" "SL.EMP.VULN.MA.ZS" 2016 "" 9.993775550015 2019 10.029830932617188 "HIC" "SL.UEM.NEET.FE.ZS" 2004 "" 13.25578375763103 2020 13.660029411315918 "HIC" "SL.EMP.VULN.FE.ZS" 2007 "" 34.93653748381432 2019 33.77582931518555 "LCN" "SL.EMP.VULN.MA.ZS" 2016 "" 32.627529254827486 2019 33.37175750732422 "LCN" "SL.UEM.NEET.FE.ZS" 2009 "" 27.604067393450375 2020 29.49407386779785 "UMC" "SL.EMP.VULN.MA.ZS" 2016 "" 37.18845687356563 2019 35.69723892211914 "UMC" "SP.ADO.TFRT" 2014 "" 30.77510272630119 2020 29.038925170898438 end

The "countrycode" of interest is COL (for country Colombia) and I want to:
1. Keep observations for "countrycode" if "countrycode" equals COL, LCN, and either HIC, or UMC or LMC or LIC. To make this decision, I want to generate a command that keeps the "countrycode" value that matches COL's "incomegrpcode" (in this case "UMC"). So, I will keep "countrycode" for COL, LCN, and UMC. (I have 30 other files for other 30 countries which is why I want to automatize this selection process).

2. Lastly, I want to keep the observations for variables "year" and "value" but only if "countrycode" is COL. Note that I do not want to delete other observations for LCN and UMC. I tried:
-keep value if countrycode == "COL"-
but the syntax is wrong.

This is my first week on statalist so appreciate the help and patience!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

16 Nov 2022, 14:38

Keep observations for "countrycode" if "countrycode" equals COL, LCN, and either HIC, or UMC or LMC or LIC.

Code:

keep if inlist(countrycode, "COL", "LCN", "HIC", "UMC", "LMC")

To make this decision, I want to generate a command that keeps the "countrycode" value that matches COL's "incomegrpcode" (in this case "UMC"). So, I will keep "countrycode" for COL, LCN, and UMC

I don't understand what you want here. I can understand that you want to keep observations where countrycode == "COL" and where countrycode == "UMC" because "UMC" is the value associated with "COL" in variable incomegrpcode. But why "LCN", and if there is some reason for observations with "LCN" to be kept, why not also "HIC", nor "LIC" and "LMC"? What is the connection?

I tried:
-keep value if countrycode == "COL"-
but the syntax is wrong.

Not only is the syntax wrong, but it reflects a misunderstanding of the two ways that -keep- can be used and how Stata data sets are organized.

FIrst, let's be clear about terminology. In Stata, an observation is what looks like a row when you view the data in the Browser. A variable is what looks like a column. You can -keep- observations with commands that specify observations, like:

Code:

keep if value > 100 keep in 1/5 keep if inlist(countrycode, "COL", "LCN", "HIC", "UMC", "LMC")

Notice that in these commands, there is no variable named after -keep-. And when executed, they keep only the observations that meet the criterion specified: all the other observations ("rows") are removed in their entirety.

-keep- has another use. You can choose to keep particular variables, with commands like:

Code:

keep countrycode value keep *value

Notice that these have no -if- or -in- clauses specifying observations. When these commands are executed, the variables named in the command are retained, and all other variables are removed, in their entirety.

You cannot combine these two syntaxes into a single command, because the meaning of such a command would be nonsensical. What, after all, should Stata understand -keep value if countrycode == "COL"- to mean? If we -keep value-, then there is no longer any countrycode variable to compare to "COL". And if we just "keep if countrycode == "COL", then the variable value, if it was there before, is still there, so why even mention it.

I can imagine that you have in mind something like this: get rid of the numbers that appear in the value variable, except retain them if countrycode == "COL". OK, but as noted, if an observation is in the data set, it is there for every single variable in the data set. You cannot have an observation that has only some variables and not others. What you want to do is not "get rid of" those numbers but replace them with missing values. So the command to do that would be:

Code:

replace value = . if countrycode != "COL"
Comment
Daniel McAdams

Join Date: Nov 2022

Posts: 26
#3

16 Nov 2022, 15:31

I don't understand what you want here. I can understand that you want to keep observations where countrycode == "COL" and where countrycode == "UMC" because "UMC" is the value associated with "COL" in variable incomegrpcode. But why "LCN", and if there is some reason for observations with "LCN" to be kept, why not also "HIC", nor "LIC" and "LMC"? What is the connection?

Clyde Schechter You understood correctly that I want to keep UMC for COL because UMC is COL's incomegrpcode. Here UMC stands for "upper middle country" (HIC is high income country, LMC is lower middle income, and LIC is lower income country). I also want to keep LCN because that is the average value for "Latin America and Caribbean". I ultimately want to compare the country at hand (in this case COL) with the average of its respective income level group (in this case UMC) and the average of Latin America and Caribbean (i.e. LCN). Hope this gives more sense to what I want ultimately want to do here and then repeat with the rest of the other 30 countries that I have in other 30 separate dta files, all under the same folder.

And thanks a lot for the clarification regarding my second question. The code you provided is exactly what I will needed:

Code:

replace value = . if countrycode != "COL"
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#4

16 Nov 2022, 16:01

Well, I can help you with the inomegrpcode part, but not the other.

Here is a scheme that willl allow you to automate selecting a group of countries to work on, going through the data one such country at a time and picking observations for that country and its corresponding income group code, processing those observations, and then moving on to the next country to be worked on.

Code:

by countrycode (incomegrpcode), sort: assert incomegrpcode[1] == incomegrpcode[_N] encode incomegrpcode, gen(ig_code) local countries_to_work_on COL // ADD THE OTHER COUNTRIES OF INTEREST HERE foreach c of local countries_to_work_on { summ ig_code if countrycode == `"`c'"', meanonly local target_code `r(mean)' local target_code: label (ig_code) `target_code' preserve keep if inlist(countrycode, `"`c'"', `"`target_code'"') // PLACE CODE TO PROCESS OBSERVATIONS FOR THIS COUNTRY CODE AND ITS // CORRESPONDING INCOMEGRPCODE restore }

I cannot, however, help you with the selection of "LCN" because there is nothing in the data set that says that COL belongs to the group of LCN countries, so there is no way for Stata to figure that out. Stata does not know that COL stands for Colombia, nor that LCN stands for Latin American and Caribbean. And even if Stata knew that, it does not know that Colombia is a Latin American nation. So, to solve that problem, you will need to augment your data set with a new variable like incomegrpcode that shows which other code (I guess they are all geographic region codes) applies. When you do that, it is not hard to modify the code to accommodate that.

One other thing: this code depends upon, and verifies at the beginning, the assumption that any given country is always associated with the same incomegrpcode value wherever it appears in the data set. This may be a problem if your data extends over a long period of time, because some countries can change income groups. This code will would produce incorrect results in that situation, and so the first command ensures it will halt with an error message if that happens. If your data set allows for the same country to be in different income groups at different times, then post back with a data example that shows that, and I can modify the code to accommodate this.
Comment
Daniel McAdams

Join Date: Nov 2022

Posts: 26
#5

17 Nov 2022, 09:54

Clyde Schechter Thanks for the help. I ran your code but the only change made was the generation of the new variable "ig_code" while the observations for HIC, LMC, and HIC stayed (i.e. UMC was not the only kept for COL). My data does keep the the same incomgrpcode value for any given country at ALL TIMES.

I should note that my intention is to run each country file separately using a code like:

Code:

foreach base in ARG BRA COL *list all other 28 countries { use "${cntrs}/`base'_LAC.dta", clear

and then run a code to keep observations for 1) country at hand, 2)LCN, and 3) relevant incomegroup. But first, I want to identify the countrycode that matches the country's incomegrpcode. I tried to combine your code but my syntax was understandably wrong. I don't think I should be listing all the country names of interest under the local "countries_to_work_on, right?

Code:

foreach base in ARG BRA COL{ use "${cntrs}/`base'_LAC.dta", clear by countrycode (incomegrpcode), sort: assert incomegrpcode[1] == incomegrpcode[_N] encode incomegrpcode, gen(ig_code) local countries_to_work_on COL BRA ARG foreach c of local countries_to_work_on { summ ig_code if countrycode == `"`c' " ', meanonly local target_code `r(mean)' local target_code: label (ig_code) `target_code' preserve keep if inlist(countrycode, `"`c'"', `"`target_code'"') // PLACE CODE TO PROCESS OBSERVATIONS FOR THIS COUNTRY CODE AND ITS // CORRESPONDING INCOMEGRPCODE replace incomegrpcode = "." if countrycode != "`c' " restore } save "${cntrs}/`base'_LAC.dta", replace }

Re:

I cannot, however, help you with the selection of "LCN" because there is nothing in the data set that says that COL belongs to the group of LCN countries, so there is no way for Stata to figure that out. Stata does not know that COL stands for Colombia, nor that LCN stands for Latin American and Caribbean. And even if Stata knew that, it does not know that Colombia is a Latin American nation. So, to solve that problem, you will need to augment your data set with a new variable like incomegrpcode that shows which other code (I guess they are all geographic region codes) applies. When you do that, it is not hard to modify the code to accommodate that.

All countries have LCN because they are all in the La. America and Caribbean region. This is why I was hoping to simply drop the other observations for other income groups like HIC, LMC, LIC, and only keep 1) country at hand, 2)LCN, and 3) relevant incomegroup. Is this possible to run for all files? Hope I am not aiming for the impossible.

Last edited by Daniel McAdams; 17 Nov 2022, 09:56.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#6

17 Nov 2022, 11:02

I did not understand from #1 that each country's data was in a separate file. I had imagined a file that contained all of the countries and you wanted to serially process each country within that one file. The code was written for that specific purpose. The reason you did not see the changes made is because of the -preserve- and -restore- commands. They resulted in the original data being brought back into memory at the end of each iteration of the loop, so the work you did in the loop was lost. I had assumed that the "CODE TO PROCESS OBSERVATIONS FOR THIS COUNTRY" would include some commands that would save the country-specific results somewhere. But lacking that, the changes made got lost.

Knowing that you are using a separate file for each country, the code looks slightly different:

Code:

foreach base in ARG BRA COL{ use "${cntrs}/`base'_LAC.dta", clear by countrycode (incomegrpcode), sort: assert incomegrpcode[1] == incomegrpcode[_N] encode incomegrpcode, gen(ig_code) summ ig_code if countrycode == `"`base'" ', meanonly local target_code `r(mean)' local target_code: label (ig_code) `target_code' keep if inlist(countrycode, `"`base'"', `"`target_code'"', `"LCN"') // PLACE CODE TO PROCESS OBSERVATIONS FOR THIS COUNTRY CODE AND ITS // CORRESPONDING INCOMEGRPCODE replace incomegrpcode = "." if countrycode != "`base'" save "${cntrs}/`base'_LAC.dta", replace }

The key differences are that -preserve- and -restore- are gone, and the -foreach c of local countries_to_work_on is gone, because you are already inside a -foreach base in ARG BRA COL- loop that handles that.

Last edited by Clyde Schechter; 17 Nov 2022, 11:06.
Comment

Daniel McAdams

Join Date: Nov 2022
Posts: 26

17 Nov 2022, 12:35

You are a real hero, Clyde Schechter! Thanks so much. Just noting that there was small mistake in the code--it was a matter of deleting ' ' in -summ ig_code if countrycode == `"`base'" ', meanonly-

Here is the code that worked for future reference

Code:

foreach base in ARG BRA COL {
    use "${cntrs}/`base'_LAC.dta", clear
    by countrycode (incomegrpcode), sort: assert incomegrpcode[1] == incomegrpcode[_N]
    encode incomegrpcode, gen(ig_code)
    summ ig_code if countrycode == "`base'", meanonly
    local target_code `r(mean)'
    local target_code: label (ig_code) `target_code'
    keep if inlist(countrycode, `"`base'"', `"`target_code'"', `"LCN"')
    //  PLACE CODE TO PROCESS OBSERVATIONS FOR THIS COUNTRY CODE AND ITS
    //  CORRESPONDING INCOMEGRPCODE
    replace incomegrpcode = "." if countrycode != "`base'"
    save "${cntrs}/`base'_LAC.dta", replace
    }

Last edited by Daniel McAdams; 17 Nov 2022, 12:37.

Comment

Ridwan Sheikh

Join Date: Apr 2021

Posts: 167
#8

14 Sep 2023, 03:39

Dear Clyde Schechter
I want to keep selected group of countries (87 ) for my sample . The dataset contains almost 180 countries. The sample country codes are strings. I am writing the following command and want to include all the 87 country code in inlist. Therefore, i write

Code:

keep if inlist(iso3, "AGO", "ARG", "BDI", "BEN", "BFA", "BGD", "BGR", "BOL", "BRA", "BWA", "CAF", "CHL", "CHN", "CMR", "COL", "CRI", "CZE")

But this throws an error expression is too long.
Is there any alternative way ?
Thanks and regards
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35433
#9

14 Sep 2023, 04:11

https://www.stata.com/support/faqs/d...s-for-subsets/ is an answer to #8.
Comment
Ridwan Sheikh

Join Date: Apr 2021

Posts: 167
#10

14 Sep 2023, 10:43

Thanks Nick Cox will have a look at it
Comment

Announcement

How to keep selected observations using a criterion linked to other variables' values?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment