Multiple values in same cell

Tarjei W. Havneraas

Join Date: Nov 2016

Posts: 136
#1

Multiple values in same cell

16 Sep 2020, 10:20

Hi Statalisters

I am working on a fairly simple question that I have not confronted before and hope that I may get input here.

Basically, I have created a dataset with an ID where some variables contain multiple values in a single cell. My aim is a data set in long format. I wonder what the best approach is. Do I make several rows for each ID?

I have created a data example to illustrate:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float ID str25 Discipline str23 Country 1 `""Economics""' "USA" 2 `""Statistics, Economics""' "Canada" 3 `""Sociology""' "USA, Germany" 4 `""Economics, Epidemiology""' "USA" 5 `""Sociology""' "UK" 6 `""Sociology""' "Norway, Sweden, Denmark" 7 `""Economics""' "UK, USA" 8 `""Sociology""' "Italy" 9 `""Epidemiology, Economics""' "USA" 10 `""Economics""' "USA" end

For analytic purposes, it's important to be able to count the number of e.g. "Economics", "Sociology", etc., as well as number of countries.

Note that there is not perfect overlap between number of values in Discipline and Country. It would be great if someone had input on how this may be solved through code instead of manual work as my real data set is larger.

Best
Tarjei
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 29962

16 Sep 2020, 10:48

Well, for starters, the -split- command (see -help split-) will break up the Discipline and Country variables into however many variable are needed to accommodate each separate response. It isn't clear to me where you go from there: that depends on what you will be doing. -reshape long- may not be appropriate here because I take

Note that there is not perfect overlap between number of values in Discipline and Country.

to mean that the first listed discipline does not necessarily have any connection to the first listed country, etc., and -reshape long- would necessarily associate those by putting them in the same observation. A small hand-worked example of what you want the resulting data set to actually look like might be helpful.

If all you want to do is count the number of disciplines and the number of countries, you don't even need to do -split-. Just create a couple of variables with the -wordcount()- function:

Code:

. gen d_count = wordcount(Discipline)

. gen c_count = wordcount(Country)

. list, noobs clean

    ID                  Discipline                   Country   d_count   c_count  
     1                 "Economics"                       USA         1         1  
     2     "Statistics, Economics"                    Canada         2         1  
     3                 "Sociology"              USA, Germany         1         2  
     4   "Economics, Epidemiology"                       USA         2         1  
     5                 "Sociology"                        UK         1         1  
     6                 "Sociology"   Norway, Sweden, Denmark         1         3  
     7                 "Economics"                   UK, USA         1         2  
     8                 "Sociology"                     Italy         1         1  
     9   "Epidemiology, Economics"                       USA         2         1  
    10                 "Economics"                       USA         1         1

Note: The above requires that no discipline or country name contains any embedded blanks. If you have some that do, such as "Asian Studies" or "Burkina Faso" then you might have to do something a little fancier:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float ID str25 Discipline str24 Country
 1 `""Economics""'               "USA"                    
 2 `""Statistics, Economics""'   "Canada"                 
 3 `""Sociology""'               "USA, Germany"           
 4 `""Economics, Epidemiology""' "USA"                    
 5 `""Sociology""'               "UK"                     
 6 `""Sociology""'               "Norway, Sweden, Denmark"
 7 `""Economics""'               "UK, USA"                
 8 `""Sociology""'               "Italy"                  
 9 `""Epidemiology, Economics""' "USA"                    
10 `""Economics""'               "USA" 
11 `""Economics, Asian Studies""'  "UK, France, Burkina Faso"                 
end

foreach v of varlist Discipline Country {
    gen _`v' = subinstr(`v', ",", "", .)
    gen `v'_count = strlen(`v') - strlen(_`v') + 1
    drop _`v'
}
list, noobs clean

Note, this in turn assumes that the responses are consistently separated by commas and contain no embedded commas.

Comment

Tarjei W. Havneraas

Join Date: Nov 2016
Posts: 136

16 Sep 2020, 12:32

Thanks for your reply, Clyde! I'll follow up on your tip to provide a small manually created version of what I want to obtain.

First, some context: I am collecting data from studies (id = study). I want to count numbers of a particular method by year. Some studies use several versions of the method and data from several countries, causing cells to contain multiple values. I want to plot bar charts with e.g. versions of method and country the data are from over years.

I imagine that the easiest way to proceed is to first make a long formatted data set and calculate summary statistics then perhaps collapse by year for graphs.

Here is some anonymous observations from my data set in its current format:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float id long study float year str19 iv_cat strL country
 1 26 2007 "Physician"           "USA"                                                                                                       
 2 25 2015 "Facility"            "USA"                                                                                                       
 3 24 2009 "Facility"            "USA"                                                                                                       
 4 23 2010 "Physician"           "USA"                                                                                                       
 5 22 2020 "Facility"            "Denmark"                                                                                                   
 6 21 2011 "Facility"            "USA"                                                                                                       
 7 20 2020 "Facility"            "USA"                                                                                                       
 8 18 2014 "Physician"           "Netherlands"                                                                                               
 9 19 2016 "Physician"           "UK"                                                                                                        
10 17 2014 "Physician"           "USA"                                                                                                       
11 16 2015 "Regional"            "USA"                                                                                                       
12 15 2013 "Regional"            "USA"                                                                                                       
13 13 2014 "Physician, Regional" "USA"                                                                                                       
14 11 2007 "Regional"            "USA"                                                                                                       
15 12 2011 "regional"            "USA"                                                                                                       
16 14 2015 "Regional"            "USA"                                                                                                       
17 10 2009 "physician"           "USA"                                                                                                       
18  9 2006 "Physician"           "USA"                                                                                                       
19  8 2017 "physician"           "Canada"                                                                                                    
20  7 2015 "facility"            "Australia, Belgium, Canada, France, Germany, Italy, Japan, New Zealand, Spain, Sweden, United Kingdom, USA"
21  6 2020 "Regional"            "USA"                                                                                                       
22  5 2018 "Regional, facility"  "Sweden"                                                                                                    
23  4 2013 "Physician"           "USA"                                                                                                       
24  2 2016 "Physician"           "Canada"                                                                                                    
25  3 2018 "Physician"           "Austria"                                                                                                   
26  1 2011 "Facility"            "Canada"                                                                                                    
end

This is what I imagine a correctly formatted dataset would look like (created manually):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id long study int year str9 iv_cat str14 country
 1 26 2007 "Physician" "USA"           
 2 25 2015 "Facility"  "USA"           
 3 24 2009 "Facility"  "USA"           
 4 23 2010 "Physician" "USA"           
 5 22 2020 "Facility"  "Denmark"       
 6 21 2011 "Facility"  "USA"           
 7 20 2020 "Facility"  "USA"           
 8 18 2014 "Physician" "Netherlands"   
 9 19 2016 "Physician" "UK"            
10 17 2014 "Physician" "USA"           
11 16 2015 "Regional"  "USA"           
12 15 2013 "Regional"  "USA"           
13 13 2014 "Physician" "USA"           
13 13 2014 "Regional"  "USA"           
14 11 2007 "Regional"  "USA"           
15 12 2011 "regional"  "USA"           
16 14 2015 "Regional"  "USA"           
17 10 2009 "physician" "USA"           
18  9 2006 "Physician" "USA"           
19  8 2017 "physician" "Canada"        
20  7 2015 "facility"  "Australia"     
20  7 2015 "facility"  "Belgium"       
20  7 2015 "facility"  "Canada"        
20  7 2015 "facility"  "France"        
20  7 2015 "facility"  "Germany"       
20  7 2015 "facility"  "Italy"         
20  7 2015 "facility"  "Japan"         
20  7 2015 "facility"  "New Zealand"   
20  7 2015 "facility"  "Spain"         
20  7 2015 "facility"  "Sweden"        
20  7 2015 "facility"  "United Kingdom"
20  7 2015 "facility"  "USA"           
21  6 2020 "Regional"  "USA"           
22  5 2018 "Regional"  "Sweden"        
22  5 2018 "facility"  "Sweden"        
23  4 2013 "Physician" "USA"           
24  2 2016 "Physician" "Canada"        
25  3 2018 "Physician" "Austria"       
26  1 2011 "Facility"  "Canada"        
end

In this case observation 13, 20 and 22 is repeated to "solve" the multiple values in cell format.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29962

16 Sep 2020, 13:05

So it looks like what you want is this:

Code:

preserve
keep id study year iv_cat?*
reshape long iv_cat, i(id study year)
drop if missing(iv_cat)
drop _j
tempfile ivcats
save `ivcats'

restore
keep id study year country?*
reshape long country, i(id study year)
drop if missing(country)
drop _j
joinby id study year using `ivcats', unmatched(both)

Comment

Tarjei W. Havneraas

Join Date: Nov 2016

Posts: 136
#5

16 Sep 2020, 13:50

Thanks, I appreciate your time and suggestions. I tried the code you provided, and also some edits, but the same error keeps occurring:

Code:

. preserve . keep id study year iv_cat?* . reshape long iv_cat, i(id study year) variable _j contains all missing values r(498);
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29962
#6

16 Sep 2020, 15:34

Oops! When I copied the code from the do-editor here, I missed the top lines:

Code:

foreach v of varlist iv_cat country { split `v', parse(",") gen(`v') }

Sorry about that. Put that before all of the code in #4. It will create the additional variables needed to avoid that -reshape- problem.
Comment
Tarjei W. Havneraas

Join Date: Nov 2016

Posts: 136
#7

16 Sep 2020, 23:58

Wow, that did the trick - thanks for the helpful feedback! I would surely have spent quite some more time on this issue without your input. I also found the code you provided in a related thread (post #3) relevant for summary statistics and include the link for others who may find it useful.
Comment

Announcement