How to create a variable of other variables' names repeated for the number of their categories

Laura Leone

Join Date: Feb 2022

Posts: 32
#1

How to create a variable of other variables' names repeated for the number of their categories

04 Feb 2022, 12:37

Dear Stata experts,

I would need to generate a variable whose observations are other variables' names repeated for the number of categories they have. Let's say, the variable "age" has two categories (e.g. "0-18" "19-25") and the variable "gender" has two categories (e.g. "female" "male"). I would like to generate a new variable "var" which looks like this:

var

age
age
gender
gender

Is there a way to do it? Thank you in advance!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

04 Feb 2022, 13:03

Here's one way, illustrated in the auto.dta

Code:

sysuse auto, clear local newvar_values local list_of_vbles_to_process foreign rep78 make foreach l of local list_of_vbles_to_process { distinct `l' local lcount = r(ndistinct) forvalues i = 1/`lcount' { local newvar_values `newvar_values' `l' } } local i = 1 gen newvar = "" set obs `=max(_N, `:word count `newvar_values'')' foreach nvv of local newvar_values { replace newvar = `"`nvv'"' in `i' local ++i }

Now, if the total number of different values of all the variables you process exceeds the number of observations in the data set, then the program will extend the data set to accommodate all of them.

By the way, his strikes me as an odd thing to do, and I'm curious what the purpose is.

Added: -distinct- is available from SSC. It is written by Gary Longton and Nick Cox.

Last edited by Clyde Schechter; 04 Feb 2022, 13:05.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#3

04 Feb 2022, 13:42

Thanks for the mention. distinct is to be considered as coming from the Stata Journal, so

Code:

search distinct, sj

to see a link to the 2008 write-up and later updates.
Comment
Laura Leone

Join Date: Feb 2022

Posts: 32
#4

04 Feb 2022, 14:01

Thank you very much for the answer, very helpful!

Basically I am running a number of univariate regression for the subpopulations in my sample identified by the categories of some factor variables, like gender, age, place of residence, etc. The goal is to identify the effect of a treatment (which has two versions) on my outcome variable for each subpopulation. I would like then to create two forest plots, one for each treatment, representing the coefficient for each subpopulation together. I thought about creating a table with variable name, variable' categories names and coefficients and confidence intervals for each, and then use the metan command to create the plot from it.

I was able to create a table with coefficients, C.I. and categories' numeric codes, but I am not able to add categories labels and variable names. Thus, I thought about creating a dataset of just variables names and categories by using label save and then merging the table I created by generating the variable I asked you about.

I guess it's not the best/most efficient way to create my forest plot, but I am having some problems in adding the variables and categories labels directly in the table and other commands like coefplot does not seem to work for my case. Below you can see the code I am using:

postutil clear
postfile table category or lb ub using results, replace
foreach var in Age_enc Gender_enc Region_enc Placeresidence_enc {
levelsof(`var'), local(levels`var')
foreach l of local levels`var' {
quietly logit Vax_kid_01 i.Treatchild_enc if children_018==1 & `var'==`l', or vce(cluster ResponseId)
matrix table=r(table)
post table (`l') (table[rownumb(table,"b"),colnumb(table,"Vax_kid_01:1.Tre atchild_enc")]) (table[rownumb(table,"ll"),colnumb(table,"Vax_kid_01:1.Tr eatchild_enc")]) (table[rownumb(table,"ul"),colnumb(table,"Vax_kid_01:1.Tr eatchild_enc")])
}
}
postclose table
use results, clear
list, clean noobs

metan or ul ll, nooverall forestplot(xlab(-4 1 5)) null (1) label(namevar=category)

I would really appreciate if you had any suggestion about this, and I hope I explained myself clearly! Thank you again
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

04 Feb 2022, 20:41

Try this:

Code:

postutil clear
postfile table str32 var float (category or lb ub) using results, replace
tempfile labeler
local labels_to_save
local variables_of_interest Age_enc Gender_enc Region_enc Placeresidence_enc
foreach v of local variables_of_interest {
    local labels_to_save `labels_to_save' `:value label `v''
}
label save `labels_to_save' using `labeler'
foreach var of local variables_of_interest {
    levelsof(`var'), local(levels`var')
    foreach l of local levels`var' {
        quietly logit Vax_kid_01 i.Treatchild_enc if children_018==1 & `var'==`l', ///
            or vce(cluster ResponseId)
        matrix table=r(table)
        post table ("`var'") (`l') (table[rownumb(table,"b"),colnumb(table,"Vax_kid_01:1.Tre atchild_enc")]) ///
            (table[rownumb(table,"ll"),colnumb(table,"Vax_kid_01:1.Tr eatchild_enc")]) ///
            (table[rownumb(table,"ul"),colnumb(table,"Vax_kid_01:1.Tr eatchild_enc")])
    }
}
postclose table
use results, clear
run `labeler'
local n_labels: word count `labels_to_save'
forvalues i = 1/`n_labels' {
    label values `:word `i' of `variables_of_interest'' `:word `i' of `labels_to_save''
}
list, clean noobs

I'm not sure I fully understand what you need, but I think the above is it, or close. (I don't use -coefplot- or -metan-, so my understanding of the context is limited.) The above is, of course, untested, and may contain typos or other errors, but hopefully you can figure out any fixes it needs. Or post back.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

05 Feb 2022, 04:03

A rather different approach is to copy and paste the output of distinct and pull out what you want as new variables. Then apply expand.

. sysuse auto, clear
(1978 automobile data)

. distinct

-------------------------------------
| total distinct
--------------+----------------------
make | 74 74
price | 74 74
mpg | 74 21
rep78 | 69 5
headroom | 74 8
trunk | 74 18
weight | 74 64
length | 74 47
turn | 74 18
displacement | 74 31
gear_ratio | 74 36
foreign | 74 2
-------------------------------------

. edit

. *(1 variable, 12 observations pasted into data editor)

.. l var13 if !missing(var13)

+--------------------------------------+
| var13 |
|--------------------------------------|
1. | make | 74 74 |
2. | price | 74 74 |
3. | mpg | 74 21 |
4. | rep78 | 69 5 |
5. | headroom | 74 8 |
|--------------------------------------|
6. | trunk | 74 18 |
7. | weight | 74 64 |
8. | length | 74 47 |
9. | turn | 74 18 |
10. | displacement | 74 31 |
|--------------------------------------|
11. | gear_ratio | 74 36 |
12. | foreign | 74 2 |
+--------------------------------------+

. gen varname = word(var13, 1)
(62 missing values generated)

. gen ndistinct = real(word(var13, -1))
(62 missing values generated)

. keep varname ndistinct

. keep in 1/12
(62 observations deleted)

. l

+-------------------------+
| varname ndisti~t |
|-------------------------|
1. | make 74 |
2. | price 74 |
3. | mpg 21 |
4. | rep78 5 |
5. | headroom 8 |
|-------------------------|
6. | trunk 18 |
7. | weight 64 |
8. | length 47 |
9. | turn 18 |
10. | displacement 31 |
|-------------------------|
11. | gear_ratio 36 |
12. | foreign 2 |
+-------------------------+

[/CODE]

Now apply expand.

I agree with whatever anyone says about this being terrible coding style, but it works.

I imagine that in the real problem the need is only for a subset of variables. That doesn't underlmine this approach, as you just need to be more selective with what you feed to distinct.

(In principle, you could clone distinct and get it to save results optionally as a new dataset. I am not interesting in doing that myself (yet) without more evidence that this is often needed.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

05 Feb 2022, 05:07

My apologies. I was distracted and did not edit my previous post carefully enough,

A rather different approach is to copy and paste the output of distinct and pull out what you want as new variables. Then apply expand.

Code:

 
. sysuse auto, clear
(1978 automobile data)

. distinct

-------------------------------------
| total distinct
--------------+----------------------
make | 74 74
price | 74 74
mpg | 74 21
rep78 | 69 5
headroom | 74 8
trunk | 74 18
weight | 74 64
length | 74 47
turn | 74 18
displacement | 74 31
gear_ratio | 74 36
foreign | 74 2
-------------------------------------

. edit

. *(1 variable, 12 observations pasted into data editor)

.. l var13 if !missing(var13)

+--------------------------------------+
| var13 |
|--------------------------------------|
1. | make | 74 74 |
2. | price | 74 74 |
3. | mpg | 74 21 |
4. | rep78 | 69 5 |
5. | headroom | 74 8 |
|--------------------------------------|
6. | trunk | 74 18 |
7. | weight | 74 64 |
8. | length | 74 47 |
9. | turn | 74 18 |
10. | displacement | 74 31 |
|--------------------------------------|
11. | gear_ratio | 74 36 |
12. | foreign | 74 2 |
+--------------------------------------+

. gen varname = word(var13, 1)
(62 missing values generated)

. gen ndistinct = real(word(var13, -1))
(62 missing values generated)

. keep varname ndistinct

. keep in 1/12
(62 observations deleted)

. l

+-------------------------+
| varname ndisti~t |
|-------------------------|
1. | make 74 |
2. | price 74 |
3. | mpg 21 |
4. | rep78 5 |
5. | headroom 8 |
|-------------------------|
6. | trunk 18 |
7. | weight 64 |
8. | length 47 |
9. | turn 18 |
10. | displacement 31 |
|-------------------------|
11. | gear_ratio 36 |
12. | foreign 2 |
+-------------------------+

Now apply expand.

I agree with whatever anyone says about this being terrible coding style, but it works.

I imagine that in the real problem the need is only for a subset of variables. That doesn't undermine this approach, as you just need to be more selective with what you feed to distinct.

(In principle, you could clone distinct and get it to save results optionally as a new dataset. I am not interested in doing that myself (yet) without more evidence that this is often needed.)

Comment

Laura Leone

Join Date: Feb 2022

Posts: 32
#8

06 Feb 2022, 05:27

Thank you very much to both of you! The code to create the new variable with other variables' names works perfectly!

About Clyde's advice on how to create the table with the variables' name already in it, I don't understand what is not working in the code, but the or lb and ub values in the table appear as missing for some reason. However the code potentially does exactly what I need, I am going to look closer into it, thanks!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#9

06 Feb 2022, 10:46

I'm not sure why you're getting a problem with the lower and upper bounds. I didn't alter that part of the code, and if it was working before, it should still be, as far as I can tell.

One thing I see that looks problematic is -colnumb(table,"Vax_kid_01:1.Tr eatchild_enc")-. That blank between Tr and eatchild has to be wrong, no? But you also have it in the expression for capturing the odds ratio, so I would expect this problem to affect that as well.

Another thing I would suggest, if you are using the current version of Stata (17), or version 16, is that you do not need to use -rownumb()- and -colnumb()-. You can just put:

Code:

(table["b","Vax_kid_01:1.Treatchild_enc"])

in the -post- command's place for the odds ratio, and similarly for the ll and ul of the confidence interval. (This is a relatively recent change in Stata. I no longer have versions 15 or earlier, and I'm not sure when this was introduced. If you are running version 15 or earlier, you might try testing it to see if this works in your version as well.)

This simplification won't fix any existing errors, but it makes the code easier to read and modify.

If reomving the blank space between Tr and eatchild_enc doesn't fix the problem, if you post back with example data, using the -dataex- command, I will try to troubleshoot the code.
Comment
Laura Leone

Join Date: Feb 2022

Posts: 32
#10

06 Feb 2022, 17:57

Thank you very much Clyde for the advice, the space was the issue, I didn't notice it. Now it works! And thank you also for the advice about colnumb, the code is more readable now.
The only issue now remains the variables' labels, because when running the last portion of the code, namely:

run `labeler'
local n_labels: word count `labels_to_save'
forvalues i = 1/`n_labels' {
label values `:word `i' of `variables_of_interest'' `:word `i' of `labels_to_save''
}
list, clean noobs

An error message appears: "variable Age_enc not found r(111)". I guess it is because "label values" should normally refer to variables, while in my dataset they are observations. Something is probably not working with `:word `i' of `variables_of_interest'. I'll try to understand what it might be, maybe you already have some suggestion?
Thank you again, you are being very helpful!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#11

06 Feb 2022, 18:18

No, this part of the code is just wrong. Sorry--I wasn't thinking clearly when I wrote it. The problem is just what you note: labels apply to variables, whereas here we are dealing with a single variable that has values of multiple variables. So that needs a complete rewrite and a complete re-conceptualization. It will not be possible to label those values in the new data set, because the same numeric value will mean one thing if it refers to an age group and another thing if it refers to gender or region. Instead a new variable containing those labels has to be created. The end result, actually is simpler and shorter, because it can be done back in the creation of the postfile and does not require extra code at the end.

Code:

postutil clear postfile table str32 (var cat_label) float (category or lb ub) using results, replace local variables_of_interest Age_enc Gender_enc Region_enc Placeresidence_enc foreach var of local variables_of_interest { levelsof(`var'), local(levels`var') foreach l of local levels`var' { quietly logit Vax_kid_01 i.Treatchild_enc if children_018==1 & `var'==`l', /// or vce(cluster ResponseId) matrix table=r(table) post table ("`var'") (`"`:label (`var') `l''"') (`l') /// (table["b", "Vax_kid_01:1.Treatchild_enc"]) /// (table["ll", "Vax_kid_01:1.Treatchild_enc"]) /// (table["ul", "Vax_kid_01:1.Treatchild_enc"]) } } postclose table use results, clear list, noobs clean

Code added in this post is shown in bold face. But most of the changes from before are deletions. As you can see the code is substantially shorter.

The data set results.dta will now contain, in addition to what was there before, a variable, cat_label, that contains the labels that were associated with the values of category for that var in the original data set.

Sorry for the false start yesterday. I'm really not sure what I was thinking.
Comment
Laura Leone

Join Date: Feb 2022

Posts: 32
#12

07 Feb 2022, 01:17

Thank you again Clyde, it works like this. And no worries for yesterday! Also, if I wanted to add also variables' labels additionally to variables' names and variables' categories labels to the table, I guess the code would look something like this (changes in bold):

postutil clear
postfile table str32 (var var_label cat_label) float (category or lb ub) using results, replace
local variables_of_interest Age_enc Gender_enc Region_enc Placeresidence_enc
foreach var of local variables_of_interest {
levelsof(`var'), local(levels`var')
foreach l of local levels`var' {
quietly logit Vax_kid_01 i.Treatchild_enc if children_018==1 & `var'==`l', or vce(cluster ResponseId)
matrix table=r(table)
post table ("`var'") (`"`:label `var' "') (`"`:label (`var') `l''"') (`l') (table["b", "Vax_kid_01:1.Treatchild_enc"]) (table["ll", "Vax_kid_01:1.Treatchild_enc"]) (table["ul", "Vax_kid_01:1.Treatchild_enc"])
}
}
postclose table
use results, clear
list, noobs clean

But it does not work, and it gives me the following error messages:

invalid syntax
too few quotes
post: above message corresponds to expression 2, variable var_label
r(132);

I guess it is a matter of quotes put in the wrong place, what do you think?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#13

07 Feb 2022, 11:15

=-"`:label `var' "'- should be -`"`:var label `v''"'-.

Also, in the -postfile- declaration, you assigned str32 storage type to this. If your variable labels are short enough, that will be fine. But variable labels (as opposed to value labels and variable names) are often rather lengthy. The postfile will truncate all variable labels you send it to their first 32 characters if they are longer than that. So you might want to give var_label a larger storage type. If you allocate a size that is too large, thereby making -browse- or -list- displays too spreaad out and having too much whitespace, you can always correct that at the end of the code by using -compress-.

Note also that I am using compound double-quotes (`"..."') around this one. Again, it is not uncommon for variable labels, unlike variable names (which can't contain them) and value labels (which could contain them but seldom do) to contain quotes themselves, so the use of single quotes in the -post- command would result in errors. If you are certain that your variable labels don't contain any quotes, then this doesn't matter. But it never hurts to use compound double quotes, even when they are not strictly needed.
Comment
Laura Leone

Join Date: Feb 2022

Posts: 32
#14

08 Feb 2022, 17:47

Thank you very much, now it works! And thank you again both for the suggestions.
Comment

Announcement