Returning the name of a variable that takes on the maximum value within a set of other variables?

Greg Boudreaux

Join Date: Dec 2019

Posts: 24
#1

Returning the name of a variable that takes on the maximum value within a set of other variables?

08 Mar 2020, 12:39

Hey folks. I have a dataset with respondents reporting how many visits they took to certain recreation sites in the past year, and I have created a variable which records the maximum number of visits the respondents report across all sites.Here is an example dataset showing the structure:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(Site1 Site2 Site3 MaxVisits) 3 5 1 5 6 1 0 6 12 6 11 12 3 0 0 3 end

I want to generate a new variable which takes on the value of the most visited site, like this :

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(Site1 Site2 Site3 MaxVisits) str5 MostVisitedSite 3 5 1 5 "Site2" 6 1 0 6 "Site1" 12 6 11 12 "Site1" 3 0 0 3 "Site1" end

Any idea how to go about this? Also is there any way to deal with the possibility here that more than one site may be the most visited (like randomly returning one of the tied sites for the value of MostVisitedSite)? Thank you so much in advance!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

08 Mar 2020, 13:31

Code:

gen long obs_no = _n rename Site* visitsSite* reshape long visits, i(obs_no) j(site) string by obs_no (visits), sort: gen most_visited_site = site[_N] reshape wide

Note: Just as this problem was facilitated by reshaping the data to long layout, most thing in Stata are most easily done with long data. So consider omitting the final -reshape wide- command and leaving your data in wide layout. It will probably make everything you do from this point on easier.

By the way, this code does break ties randomly.
Comment

Mike Lacy

Join Date: Apr 2014
Posts: 2416

08 Mar 2020, 13:49

If there's some compelling reason to remain in the wide format, which is unlikely, as Clyde points out, you can do this:

Code:

set seed 7554
// Make a list of all the variables tied for MaxVisits -- a standard kind of trick
gen str MostVisited = ""
foreach v of varlist Site1 Site2 Site3 {
   quiet replace MostVisited = MostVisited + " " + "`v'" ///
      if (`v' == MaxVisits)
}
// Randomly pick one out of the list.
gen byte pickone = runiformint(1, wordcount(MostVisited))
replace MostVisited = word(MostVisited, pickone)

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35698

09 Mar 2020, 06:45

In #3 I wouldn't overwrite multiple names with just one. Someone downstream (examiner, reviewer) might insist on knowing the number of ties, that you ignore ties, or whatever. So the helpful code of Mike Lacy I would tweak to

Code:

 set seed 7554

// Make a list of all the variables tied for MaxVisits -- a standard kind of trick
gen str MostVisited = ""
quietly foreach v of varlist Site1 Site2 Site3 {    
    replace MostVisited = MostVisited + " " + "`v'" ///       if (`v' == MaxVisits)
}  

// Randomly pick one out of the list.
gen byte pickone = word(MostVisited, runiformint(1, wordcount(MostVisited))

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

09 Mar 2020, 09:45

As it happens, this problem is the focus of a column to appear in the Stata Journal 20(2), due out in about 3-4 months time. Here's a version of that, long-winded as it may seem.

A previous column (Cox 2010) gave a review of methods for working
rowwise in Stata. Here "rows" mean observations in a dataset and
the concern is calculations in each observation with a bundle of
variables. For example, a row mean variable can be generated as the
mean of some numeric variables in each observation.

This column is an update. It is briefly flagged that official Stata now
has -rowmedian()- and -rowpctile()- functions for -egen-.
The main focus is on returning which variable or variables are equal to
the maximum or minimum in a row. The twist that requires care is that
two or more variables can tie for minimum or maximum. That may entail a
decision on what is to be recorded, such as all of them, or just the
first or last occurrence of an extreme.

Introduction

The main data structure of Stata is asymmetric: rows of a dataset define
observations and columns define variables. The great majority of Stata
commands, starting with workhorses like -summarize-, -regress-,
and -graph-, are designed to work with that structure. Such a
structure is not eccentric; it has long been regarded as standard or at
least default across many statistical programs. Despite that there has
recently been much fuss in some quarters about the benefits of this
model, as if they were novel. References are suppressed on this point
to protect the innocent and the guilty alike.

Other way round, in a previous column (Cox 2009) I reviewed a variety of
ideas and methods for working rowwise. Perhaps the most valuable single
suggestion, which is repeated here but not elaborated beyond this
paragraph, is that the impulse to work rowwise may signal that your
dataset layout (format, if you wish) is not ideal for Stata purposes.
Often on Statalist and elsewhere, people have panel or longitudinal data
in a wide layout in which observations define individuals (patients,
firms, countries, sites, or whatever else) and different variables
define various properties at different times. The answer here is,
usually, to -reshape long-, so that times as well as identifiers
define observations.

This column is an update on that previous column. First, it is easy and
congenial to flag that official Stata added -rowmean()- and
-rowpctile()- functions to -egen- in Stata 11 in 2009, just a few
months after the column was published. Section 5 of the column discussed
an -egen- function for row median written by myself. That Section 5
may retain some interest, but in practice users naturally should reach
for the official functions (unless exceptionally they are on Stata 9 or
10).

An assumption here is that you are broadly familiar with loops in Stata.
If not, and in any case, the previous column (Cox 2009) may be worth
reading first, at least briskly.

Which variables hold the extremes in a row?

The remainder of this column focuses on a single class of problems,
studiously avoided earlier. Suppose within an observation variables
-x1-, -x2-, -x3- have values 3.14159, 7, and 42. Which variable
holds the minimum? Clearly -x1-. Which holds the maximum? Clearly
-x3-. The problem is trivial in this and many other examples, and it
is not too difficult to code for such problems.

You can imagine what comes next. What do we do if there are ties? When
filling in surveys, particularly when people are asked to give opinions
or beliefs on integer scales, ties are not just common but highly
likely. It seems that some people even give the same response (say, 3 3
3 ...) on purpose, perhaps out of boredom or as a personal protest of
some kind. Here we bite the bullet, grasp the nettle, and seize the bull
by its horns and show how to program return of variable names even if
ties are present.

As a sandbox dataset we use what appear to be some real data on percent
marks (grades, if you prefer) for a group of students on five
mathematics examinations in an unnamed institution in an unnamed year.
The anonymity here is all to the good. The data are from Mardia, Kent,
and Bibby (1979, pp.3--4) and have been used by many
statistically-minded writers, typically for purposes or with methods
much more subtle than in this column.

In essence, even if we used the functions -max()- or -min()- to
find the maximum or the mininum, we still need to loop over the
variables to find precisely which variable or variables are equal to
either, so we might as well loop in the first place. Similarly, those
familiar with -rowsort- and -rowranks- (Cox 2009) might be
tempted to start there, but the approach here is more direct.

The dataset can be read in from the media for this column.
% NJC note 7 March 2020 you can't do this yet

. use mathsmarks, clear

. describe

Contains data from mathsmarks.dta
obs: 88
vars: 5 27 Jan 2020 18:32
size: 440
--------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
--------------------------------------------------------------------------------------------------------
mechanics byte %10.0g
vectors byte %10.0g
algebra byte %10.0g
analysis byte %10.0g
statistics byte %10.0g
--------------------------------------------------------------------------------------------------------
Sorted by:

Evidently we have splendidly evocative variable names like
-mechanics- and -statistics-, rather than tediously unoriginal names
like -x1- and -x2-.

Figure 1. Distributions of the sandbox dataset. Quantiles, median and
quartile boxes and extra reference lines showing means.

No harm comes from looking at a graph. Figure 1 is a quantile-box plot
(Parzen 1979). Each set of marks is plotted in ascending order (a
quantile plot) with cumulative probability scales implicit on the
horizontal axis. Boxes show median and quartiles, as in a conventional
box plot. Plotting all the data points underlines the key point about
such a box: 50% of the data points for each variable lie within each
box (and so 50% lie outside). Finally, horizontal reference lines show
the mean for each variable. The details of how to get such a plot are
suppressed for a forthcoming column. Anyone intrigued by the result can
download the program and help file using

. ssc install stripplot

Apart from allowing the usual kind of assessment of the distributions
here -- very simply, the variables are well behaved and, as advertised,
percent marks -- the graph serves as propaganda for a point of view.
There is often, especially with small or modest datasets, scope for
showing all the data without making the display too busy or too complex.

Here is a chunk of code confronting the ties issue, which will be
followed by commentary.

. generate which_max = ""
. gen which_min = ""

. generate max = -1
. generate min = 101

. foreach v of var mech-stat {
. replace which_max = "`v'" if `v' > max
. replace which_max = which_max + " `v'" if `v' == max
. replace max = `v' if `v' > max
. replace which_min = "`v'" if `v' < min
. replace which_min = which_min + " `v'" if `v' == min
. replace min = `v' if `v' < min
. }

The general logic is first to initialize variables to hold the minimum
and maximum values and the names of the variables in which they occur.
Taste can supervene here in detail. Some people would use the name and
values of the first variable in the loop, on the grounds that they could
hold the minimum and/or the maximum; and if they do not, we will change
our minds accordingly. Remember that in different observations a
particular variable may hold both row minima and row maxima; and also
that the loop rowwise includes implicitly a loop across observations. I
have no feelings or recommendations against using the first name or the
first value in initialization. Indeed, it is my practice sometimes, as a
matter of caprice.

Caprice or not, a flag should be waved vigorously here. A -float-
storage type, the default, works fine for this example for new
variables. The original data are stored as -byte-, as
-describe- told us. In other problems, be careful to think about whether
a -double- or -long- is needed.

Here the code deliberately uses impossible values to initialise. We know
that the marks range from 0 to 100 in principle and a quick plot, such
as Figure 1, or a -summarize- confirms that no marks occur outside
that range. So, -1 as initial value for maximum is safe as we expect
to find at least one higher value. Similarly 101 is safe for minimum as
we expect to find at least one lower value. But why, you may ask, not
use say 0 and 100 as initial values? There is not much in it, but I have
sometimes found that initializing with impossible values makes logic
bugs a little easier to catch. If the initial values are unchanged by a
look at the data, either the data are not as expected or the code is.

The overarching logic inside the loop has one theme. If the value we are
looking at now is greater than the maximum so far, it is the new
maximum. Symmetrically, if the value we are looking at now is less than
the minimum so far, it is the new minimum.

The tricky detail inside the loop concerns the handling of ties. If the
value we are looking at now is as extreme as any seen yet, so equal to
the maximum or minimum so far, then we add its name to the name(s) we
have stored far. Again, this addition does or does not occur within
individual observations, depending on the values of the variables.

Note a crucial tiny point: Two or more variable names should be
separated by spaces. Other punctuation such as commas is not of order,
but may not be so convenient. We need not worry about being forced into
a dead end with no retreat, as a fresh minimum or maximum will lead to
overwriting the names tentatively stored.

I would slap a -quietly- on the entire loop once I am happy that it
is working.

Let's look at the results.

. tab1 which_max which_min

-> tabulation of which_max

which_max | Freq. Percent Cum.
--------------------+-----------------------------------
algebra | 19 21.59 21.59
analysis | 17 19.32 40.91
analysis statistics | 1 1.14 42.05
mechanics | 7 7.95 50.00
mechanics algebra | 1 1.14 51.14
mechanics vectors | 1 1.14 52.27
statistics | 13 14.77 67.05
vectors | 28 31.82 98.86
vectors algebra | 1 1.14 100.00
--------------------+-----------------------------------
Total | 88 100.00

-> tabulation of which_min

which_min | Freq. Percent Cum.
-----------------------------+-----------------------------------
algebra | 1 1.14 1.14
algebra analysis | 1 1.14 2.27
analysis | 12 13.64 15.91
mechanics | 34 38.64 54.55
mechanics vectors | 2 2.27 56.82
mechanics vectors statistics | 1 1.14 57.95
statistics | 33 37.50 95.45
vectors | 4 4.55 100.00
-----------------------------+-----------------------------------
Total | 88 100.00

The results are not too complicated. 4 students had 2 subjects at which
they were equally good and 4 had 2 or 3 subjects at which they were
equally bad. Knowing that ties are uncommon, we might be confident at
setting aside the ties and focusing on rank order. As the -which*-
variables are string, and spaces were used for separation, a convenient
function for finding single subjects is -wordcount()-. Stata's rule
that variable names must be single words frustrates some goals, but
provides the needed simplicity for this purpose. -wordcount()- is in
my personal list of functions experienced users should know (Cox 2011).
As alluded earlier, commas or other punctuation apart from spaces do not
allow this simple device.

. tab which_max if wordcount(which_max) == 1, sort

which_max | Freq. Percent Cum.
--------------------+-----------------------------------
vectors | 28 33.33 33.33
algebra | 19 22.62 55.95
analysis | 17 20.24 76.19
statistics | 13 15.48 91.67
mechanics | 7 8.33 100.00
--------------------+-----------------------------------
Total | 84 100.00

. tab which_min if wordcount(which_min) == 1, sort

which_min | Freq. Percent Cum.
-----------------------------+-----------------------------------
mechanics | 34 40.48 40.48
statistics | 33 39.29 79.76
analysis | 12 14.29 94.05
vectors | 4 4.76 98.81
algebra | 1 1.19 100.00
-----------------------------+-----------------------------------
Total | 84 100.00

It looks bad for mechanics and statistics. Teachers (some of us) and
past or present students (all of us) know the Law of Interpreting Exam
Results: It's always the teachers' fault. Good marks? The teaching was
too elementary, the exam was too easy, or you were lucky that so many
smart students chose your class. Bad marks? The teaching was poor or the
exam was too hard. Naturally, there are no students who are not smart.

Back to the problem: Let's imagine a variant in which we are happy to
record the maximum or maximum the first time it occurs. There might be
some rationale for this, say that variables have some time order and
`first occurrence' is intrinsically interesting. Or, it's just a variant
on the problem and we are thinking through the logic of the code.

The code is simpler and we won't show results here, partly to save space
and partly because there is no information that the variables in the
example dataset have any kind of natural order. The initializations are
the same; it is just the loop that differs.

. foreach v of var mech-stat {
. replace which_max = "`v'" if `v' > max
. replace max = `v' if `v' > max
. replace which_min = "`v'" if `v' < min
. replace min = `v' if `v' < min
. }

In words, the idea that the first variable met is the only variable that
counts means that we ignore ties, ties being with a maximum or minimum
previously observed.

As a final flourish, what about the opposite problem in which the last
occurrence only is remembered?

. foreach v of var mech-stat {
. replace which_max = "`v'" if `v' >= max
. replace max = `v' if `v' > max
. replace which_min = "`v'" if `v' <= min
. replace min = `v' if `v' < min
. }

Here we overwrite names if we see a new instance of the maximum or
minimum so far. Hence we exit the loop with the last occurrence of a
maximum or minimum recorded as the only occurrence remembered with a
name.

Conclusion

Apart from the belated update flagging the introduction of -egen}
functions -rowmedian()- and -rowpctile()- in Stata 11, this
column has focused on one further problem. We wish to record not just
the row maximum and row minimum across a bunch of numeric variables, but
also in which variables they occur. The aim calls for a loop across
variables; use of string variables to store the results; and careful
attention to the possibility of tied values.

References

Cox, N. J. 2009. Speaking Stata: Rowwise.
Stata Journal 9: 137--157.

Cox, N. J. 2011.
Speaking Stata: Fun and fluency with functions.
Stata Journal 11: 460--471.

Mardia, K. V., J. T. Kent and J. M. Bibby. 1979.
Multivariate Analysis.
London: Academic Press.

Parzen, E. 1979.
Nonparametric statistical data modeling.
Journal, American Statistical Association 74: 105--121.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35698
#6

09 Mar 2020, 11:25

#5 starts with a reference Cox 2010 and nearly ends with a reference to Cox 2009, which is that intended. Sorry about that.
Comment
Greg Boudreaux

Join Date: Dec 2019

Posts: 24
#7

11 Mar 2020, 22:43

Thanks so much everyone!
Comment

Announcement

Returning the name of a variable that takes on the maximum value within a set of other variables?

Comment

Comment

Comment

Comment

Comment

Comment