In Stata, how can I run margins with the interaction between a categorical and a z transferred (standardized) independent variable

Iqbal Chowdhury

Join Date: May 2023

Posts: 33
#1

24 Jun 2024, 21:10

Hello everyone,

I have run an OLS regression with an interaction between a z transferred (standardized) variable and a categorical independent variable. I am planning to run margins for this interaction term. I used the following command:
margins CAT-var, at(z-var=(.(.).))

However, I am not sure which range of the z transferred variable I should choose.

Could any of you please suggest one?

Thank you in advance
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#2

24 Jun 2024, 21:33

Well, the usual approach is to pick "interesting" values of the continuous variable. Interesting values are typically a range of values that span and represent the bulk of the distribution of values. Far tail values are usually not selected, unless there is something particular about that variable that makes them of interest. Since your continuous variable is standardized, one might typically choose -1, 0, and 1. (With a non-standardized variable, mean - 1 SD, mean, and mean + 1 SD is, similarly, a common choice.) But if there are other values of z that have an inherent interest, choosing those would make sense. If, for example, your variable were a conventional scale score used to screen for depression, choosing the values that are generally recognized as the cutoffs for possible mild and severe depression would make more sense than following some arbitrary statistical recipe.

The above represents what is typically done on a more or less knee-jerk reflex basis. But you are free to pick whatever values you think are most relevant to the actual meaning of the variables in question and to your research goals: values whose associated margins give the most insight into your substantive research questions.
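
To make the -1, 0, 1 suggestion concrete, here is a minimal sketch. It uses the nlsw88 data shipped with Stata as a stand-in, not the data from this thread; the variable name z_wage and the use of race as the categorical variable are assumptions made purely for illustration.

```stata
// Stand-in data shipped with Stata (not the thread's data)
sysuse nlsw88, clear

// Standardize the continuous predictor (hypothetical name z_wage)
egen z_wage = std(wage)

// Categorical-by-continuous interaction
regress hours i.race##c.z_wage

// Predicted hours for each category at 1 SD below the mean,
// at the mean, and at 1 SD above the mean
margins race, at(z_wage=(-1 0 1))
marginsplot
```

Any other set of substantively meaningful values can be listed inside at() in the same way.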

Last edited by Clyde Schechter; 24 Jun 2024, 21:36.
Iqbal Chowdhury

Join Date: May 2023

Posts: 33
#3

25 Jun 2024, 01:05

Thank you so much Clyde Schechter,
My standardized variables are regional GDP per capita and regional proportion (%) of minority people. So, if I use -1, 0, and +1, I think it will help me show low, average, and high GDP or proportion of minority people. What do you think?

Thank you once again.
Maarten Buis

Join Date: Mar 2014

Posts: 3404
#4

25 Jun 2024, 01:56

Why did you standardize those variables? You had variables with perfectly interpretable units, and now you threw that away. The loss of unit is the root cause of you being uncertain what values to choose.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Nick Cox

Join Date: Mar 2014

Posts: 35208
#5

25 Jun 2024, 02:05

I'd add to @Maarten Buis's comment that whenever I see data on GDP per head, I want to use a log scale. Even within a country, I'd expect marked skewness and perhaps outliers. Much depends on how fine the regional breakdown is, but I'd expect that with about 10 regions -- in which very often regions including major cities have much higher GDP per head than other, more rural regions -- and even more with about 100 regions.

The larger point is that you are asking for advice about data analysis without really telling us much about your data.

"transferred" is, I think, a misunderstanding of or typo for "transformed", even though you use the word consistently.
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#6

25 Jun 2024, 10:08

Just want to add my note of agreement with #4. (This doesn't mean that I disagree with #5, but it's a different issue.) I was tempted, when I wrote #2, to inquire about the nature of this standardized variable since, in my experience, most standardization serves only to obfuscate the results. But I decided to let it go, partly, I think, because I was pressed for time, and to spare you my usual rant against standardizing variables. So thanks to Maarten for filling in that gap in my response.
Iqbal Chowdhury

Join Date: May 2023

Posts: 33
#7

25 Jun 2024, 11:08

Thank you so much, Maarten Buis, Nick Cox, and Clyde Schechter, for your valuable insights.

Please allow me to provide you with the background of the project. I am going to examine whether the relationship between immigration status (Canadian-born, recent immigrant, and long-residing immigrant) and mental health varies with regional GDP per capita. In this context, I have constructed a variable based on the GDP per capita of the 10 Canadian provinces. In the model, I first used the log of GDP. When I use an interaction term between log GDP and immigration status (Canadian-born as the reference category), the interaction coefficients seem OK. However, the individual coefficients for the immigration status variable seem a bit big, like 17 for recent immigrants and 18 for long-residing immigrants. That's why I am a bit worried and am trying to use the standardized value of GDP.

The code I used is as follows, where PMH is a positive mental health scale:
svy: regress PMH ib1.immigrant Z_GDP controls c.Z_GDP#ib1.immigrant, absorb(region)

Nick Cox, I am sorry, it should be transformed.

Thank you all again for your support.
Maarten Buis

Join Date: Mar 2014

Posts: 3404
#8

26 Jun 2024, 02:57

So, you are worried about the size of the main effect after adding an interaction. You need to remember that the main effect of one variable is the effect of that variable when the other variable is 0. In your case: the effect of being a recent immigrant in a province with a log GDP of 0 (or a GDP of 1). I haven't looked it up, but I think it is a pretty safe assumption that no Canadian province has a GDP of 1 Canadian dollar. In other words: your estimates aren't wrong, but they are meaningless extrapolations. The solution is not to standardize, but to make sure that the value 0 of log GDP represents a reasonable value within the range of the data (we sometimes call that centering a variable, but terminology is hopelessly non-uniform in statistics). So you have GDP, and you make a new variable, log of GDP. Then you choose a nice round number somewhere in the middle of the distribution of GDP (not log GDP) and take the logarithm of that value. Then you make another new variable, which is log(GDP) - log(nice value). This is the variable you use in your analysis. Here is an example:

Code:

// open and prepare example data
sysuse nlsw88, clear
gen byte urban = c_city + smsa
label define urban 2 "central city" ///
                   1 "suburban"     ///
                   0 "rural"
label value urban urban
label variable urban "urbanicity"

// create log wage (analogous to your log GDP)
gen lnwage = ln(wage)

// look at the distribution of wage
sum wage , d

// I choose 6 dollars/hour as my "nice value"
// (closest round number near the median)
gen lnwagec = lnwage - ln(6)

// If you remember your rules for logarithms,
// you could also create that variable in one command:
// gen lnwagec = ln(wage/6)

// now you estimate your model
reg hours c.lnwagec##i.urban
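
As a follow-up sketch on the same example data, the point that a main effect is the effect when the other variable is 0 can be checked with margins. The commands below assume the same setup as the example (the labels are omitted for brevity). After centering at 6 dollars/hour, the contrasts between urban groups at lnwagec = 0 should match the 1.urban and 2.urban coefficients in the regression table.

```stata
sysuse nlsw88, clear
gen byte urban = c_city + smsa
gen lnwagec = ln(wage/6)
regress hours c.lnwagec##i.urban

// Differences from the reference group at lnwagec = 0,
// i.e. at a wage of 6 dollars/hour
margins r.urban, at(lnwagec=0)

// Predicted hours for each group at that same reference wage
margins urban, at(lnwagec=0)
```

Because the model is linear and contains no other covariates, the first margins call is just another way of reading the main-effect coefficients, now at a wage that actually occurs in the data.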

Last edited by Maarten Buis; 26 Jun 2024, 03:00.

Nick Cox

Join Date: Mar 2014

Posts: 35208
#9

26 Jun 2024, 05:22

#7 makes clear that you are looking at GDP per head on a log scale, which really wasn't clear to me previously.

In addition to @Maarten Buis's comments, I would add that working with

log GDP per head for province MINUS log GDP per head for Canada

= log (GDP per head for province / GDP per head for Canada)

would give you a zero that is, if a little arbitrary, at least fairly easy to defend and work with. Using any particular province as comparator might make sense too.
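
A minimal sketch of that construction follows. The province names and all GDP figures are invented for illustration, and the names gdp_pc, canada_gdp_pc, and rel_lngdp are hypothetical; none of them come from the poster's data.

```stata
clear
// Hypothetical provinces and GDP-per-head figures, for illustration only
input str10 province double gdp_pc
"Prov_A" 48000
"Prov_B" 58000
"Prov_C" 72000
end

// Hypothetical national GDP per head used as the reference value
scalar canada_gdp_pc = 58000

// log(province / Canada): zero means "same as the national figure"
gen double rel_lngdp = ln(gdp_pc / scalar(canada_gdp_pc))

// Same quantity written as a difference of logs
gen double check = ln(gdp_pc) - ln(scalar(canada_gdp_pc))
assert reldif(rel_lngdp, check) < 1e-12

list province gdp_pc rel_lngdp
```

A province below the national figure gets a negative value and one above it gets a positive value, so 0 falls inside the data rather than far outside it.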
Iqbal Chowdhury

Join Date: May 2023

Posts: 33
#10

26 Jun 2024, 11:56

Thank you so much, Maarten Buis and Nick Cox, for your wonderful suggestions.
I have tried what you suggested. The independent effect of immigration status is reduced and now looks good, so I am planning to follow your suggestions.

However, I am not sure what this technique is called when I subtract a nice value, the Canadian GDP, or the GDP of a particular province. I understand mean centering, but I don't know about subtracting a different value.
So I would appreciate it a lot if you could kindly tell me:

1. What is this technique called?
2. If I subtract the value for a particular province, say Ontario, which is the province with the highest GDP per capita, or the GDP of Canada, how can I interpret the results? I know that, in the case of log GDP, a one percent change in GDP is associated with a certain change in mental health, and that, in the case of the interaction term, a one percent change in GDP corresponds to a certain increase in the mental health outcomes of recent or long-residing immigrants compared to the Canadian-born. However, I am a bit confused about the technique you have suggested.
3. Which range should I use when calculating margins for the interaction between immigration status and GDP?
4. Could you please provide me with a couple of references so that I can defend this choice when my committee members ask me why I have done that?

Thank you so, so much for being my saviors.
Nick Cox

Join Date: Mar 2014

Posts: 35208
#11

26 Jun 2024, 12:57

I can't see that you need a name for what I am suggesting. What's the audience or readership you are addressing? If you are addressing lay people, it might be harder work to get everyone thinking on a logarithmic scale at all, even though just about anybody knows that a price or income change of say 5% means what it does. But you're focusing on your committee. Is the issue selling the idea to them or showing that you understand it?

You are just looking at relative values and using an elementary property of logarithms that log (A / B) = log A - log B. That's an identity, but which way round you want to explain it is up to you.

In detail, I would not use Ontario here. Although the principle is the same, it can help to have reference values somewhere in the middle of your data.

I think the biggest point is as Maarten Buis pointed out that a reference level of 0 = log 1 is in practice absurd, as in effect you realised by yourself.
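
A small sketch may help with the interpretation question. It reuses the nlsw88 stand-in data from the earlier example and compares two arbitrary reference wages, 6 and 10 dollars/hour, chosen only for illustration. Since log(A / B) = log A - log B, changing the reference value only shifts the variable by a constant, so the slope and interaction rows of the two models should be identical, and only the urban main effects and the constant should change.

```stata
sysuse nlsw88, clear
gen byte urban = c_city + smsa

// Reference value 6 dollars/hour
gen double lnwc = ln(wage/6)
regress hours c.lnwc##i.urban
estimates store ref6

// Reference value 10 dollars/hour: same variable, shifted by a constant
replace lnwc = ln(wage/10)
regress hours c.lnwc##i.urban
estimates store ref10

// Side-by-side coefficients and standard errors
estimates table ref6 ref10, b(%9.4f) se(%9.4f)
```

The interpretation of the slopes in terms of relative changes therefore carries over unchanged; what the reference value decides is the point at which the group differences are reported.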
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#12

26 Jun 2024, 14:41

What's the audience or readership you are addressing?

I think O.P. answered that question in #10.

...so that I can defend when my committee members ask me why I have done that? [emphasis added]

My interpretation is that O.P. is writing a doctoral dissertation and is, unsurprisingly, concerned about what his dissertation committee and readers will say.

I'm going to indulge myself with a little rant here.

I have served as a reader or committee member on many dissertations over the years--I haven't kept count. All of them have either been in the fields of public health or health psychology. So perhaps it is different in non-health related fields, but I doubt it. I have become convinced that the doctoral dissertation writing and defense process is a major roadblock to good statistical training of the upcoming generations of researchers. There is, if you will, a template of statistical practices that these committees expect to see in a dissertation, and they enforce it. The problem is that some of the components of that template are practices that are obsolete (e.g. mandatory statistical testing of residuals for normality, even when the sample is clearly large enough to make it unnecessary, or when using robust standard errors deals with the problem), or perhaps were never a good idea (e.g. testing dependent variables for normality--which makes no sense at all and never did, or a required table of Pearson correlations among all variables, with "significance stars"). Moreover, they reinforce the notion that statistical analysis is a rigidly structured set of procedures that must be followed without thought to the nature of the data, the research goals, and which analyses elucidate the findings and which are just obfuscation. The questions raised by O.P. in #10 are a small example of this: he has been taught that every technique must have a name and a citation. Actually just applying simple mathematics that everyone past secondary school should know, and examining the meaning of what was done, will not do.

End of rant.
Iqbal Chowdhury

Join Date: May 2023

Posts: 33
#13

29 Jun 2024, 06:13

Hello Maarten Buis Nick Cox and Clyde Schechter,
Thank you so much for your suggestions and clear explanations, which make total sense to me. Following your suggestions, I have run the models, and the results look OK to me. Now I am a bit confused about the marginsplot.
My research question is whether the relationship between immigration status (Canadian-born, recent immigrant, and long-residing immigrant) and mental health varies with regional GDP per capita. So I have run a model like this, where Canadian-born is the reference group and the controls include age, sex, marital status, etc.:
svy: regress MH ib1.immigrant centered_log_GDP controls c.centered_log_GDP#ib1.immigrant, absorb(regions)

I have then run the margins command:
margins immigrant, at(centered_log_GDP=(0(5)50))
marginsplot
The marginsplot shows centered_log_GDP on the x axis and three lines, one for each immigration group. I am sharing a similar marginsplot below. Could you please tell me whether this will help me answer the research question? If this graph does not help answer my research question and I need to change the x dimension to the immigration category, how can I do that? I am a bit confused about how to construct a line for GDP with the immigrant groups on the x axis. I will always remember your help with respect.
Clyde Schechter

Join Date: Apr 2014

Posts: 29792
#14

29 Jun 2024, 10:42

As I am neither a demographer nor an economist nor a psychologist, I can't advise you with regard to substantive questions. What I can comment on is your choice of the values in the -at()- option from a statistical perspective. You are using a centered log GDP variable. I don't know which value you centered it at, but typically that would be the mean, or median, or something near the middle of the (uncentered) log GDP distribution. So 0 in the centered value corresponds to that near-middle value in the log-GDP distribution. That being the case, it would make more sense to have your graphs and analyses consider values of log GDP that go both above and below that near-middle value. Consequently, I would use some negative number, not 0, as the lowest value for centered log GDP. Then there is the matter of the range. I think going up to 50 makes no sense. If the log is going up to 50, then that corresponds to values of uncentered and non-log-transformed GDP that are about 10²¹ (if you used natural logarithm for the transformation) or 10⁵⁰ (if you used base 10 logarithm) times the near-middle value. Even the smaller 10²¹ factor is much higher than is realistic. The highest national GDP in the world is about 25 trillion dollars, and the lowest is around 10 million. That is a ratio of about 10⁶. So I think your -at()- specification should cover a range something like -at(centered_log_GDP=(-3(0.5)3))- if you used base 10 logarithm in your transform. If you used natural logarithm for the transform, then -at(centered_log_GDP=(-7(1)7))- would get it about right. These would range from 1/1000 as large as to 1000 times as large as the middle-range value of GDP.

Your graph clearly shows three separate, non-parallel lines, so there is some interesting interaction going on between immigrant group and GDP effect on mental health.
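
As a rough sketch of how such a range might be chosen and plotted, the commands below again use the nlsw88 stand-in data with log wage centered at 6 dollars/hour. The range -1(0.5)1 is an assumption that happens to suit that example; with other data the summarize output would suggest different endpoints. The second marginsplot call shows one way to put the categorical variable on the x axis instead, using the xdimension() option.

```stata
sysuse nlsw88, clear
gen byte urban = c_city + smsa
gen lnwagec = ln(wage/6)

// Check where the bulk of the centered variable lies before choosing at() values
summarize lnwagec, detail

regress hours c.lnwagec##i.urban

// Values on both sides of 0, inside the observed range
margins urban, at(lnwagec=(-1(0.5)1))

// Default layout: lnwagec on the x axis, one line per urban group
marginsplot, name(by_wage, replace)

// Alternative layout: urban group on the x axis, one line per value of lnwagec
marginsplot, xdimension(urban) name(by_urban, replace)

// Translate the endpoints back to dollars/hour for reporting
display 6*exp(-1) "  to  " 6*exp(1)
```

Reporting the endpoints in the original units makes it easy to see whether the chosen range is realistic.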

Last edited by Clyde Schechter; 29 Jun 2024, 10:44.
Iqbal Chowdhury

Join Date: May 2023

Posts: 33
#15

03 Jul 2024, 23:39

Thank you so much Clyde Schechter,
It is very helpful for my work. I think I can manage the analysis part and defend with a bit more confidence.

I am really grateful to you.

Iqbal Chowdhury