Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Multiple values in same cell

    Hi Statalisters

    I am working on a fairly simple question that I have not confronted before and hope that I may get input here.

    Basically, I have created a dataset with an ID where some variables contain multiple values in a single cell. My aim is a data set in long format. I wonder what the best approach is. Do I make several rows for each ID?

    I have created a data example to illustrate:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float ID str25 Discipline str23 Country
     1 `""Economics""'               "USA"                    
     2 `""Statistics, Economics""'   "Canada"                 
     3 `""Sociology""'               "USA, Germany"           
     4 `""Economics, Epidemiology""' "USA"                    
     5 `""Sociology""'               "UK"                     
     6 `""Sociology""'               "Norway, Sweden, Denmark"
     7 `""Economics""'               "UK, USA"                
     8 `""Sociology""'               "Italy"                  
     9 `""Epidemiology, Economics""' "USA"                    
    10 `""Economics""'               "USA"                    
    end
    For analytic purposes, it's important to be able to count the number of e.g. "Economics", "Sociology", etc., as well as number of countries.

    Note that there is not perfect overlap between number of values in Discipline and Country. It would be great if someone had input on how this may be solved through code instead of manual work as my real data set is larger.

    Best
    Tarjei

  • #2
    Well, for starters, the -split- command (see -help split-) will break up the Discipline and Country variables into however many variable are needed to accommodate each separate response. It isn't clear to me where you go from there: that depends on what you will be doing. -reshape long- may not be appropriate here because I take
    Note that there is not perfect overlap between number of values in Discipline and Country.
    to mean that the first listed discipline does not necessarily have any connection to the first listed country, etc., and -reshape long- would necessarily associate those by putting them in the same observation. A small hand-worked example of what you want the resulting data set to actually look like might be helpful.

    If all you want to do is count the number of disciplines and the number of countries, you don't even need to do -split-. Just create a couple of variables with the -wordcount()- function:
    Code:
    . gen d_count = wordcount(Discipline)
    
    . gen c_count = wordcount(Country)
    
    . list, noobs clean
    
        ID                  Discipline                   Country   d_count   c_count  
         1                 "Economics"                       USA         1         1  
         2     "Statistics, Economics"                    Canada         2         1  
         3                 "Sociology"              USA, Germany         1         2  
         4   "Economics, Epidemiology"                       USA         2         1  
         5                 "Sociology"                        UK         1         1  
         6                 "Sociology"   Norway, Sweden, Denmark         1         3  
         7                 "Economics"                   UK, USA         1         2  
         8                 "Sociology"                     Italy         1         1  
         9   "Epidemiology, Economics"                       USA         2         1  
        10                 "Economics"                       USA         1         1
    Note: The above requires that no discipline or country name contains any embedded blanks. If you have some that do, such as "Asian Studies" or "Burkina Faso" then you might have to do something a little fancier:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float ID str25 Discipline str24 Country
     1 `""Economics""'               "USA"                    
     2 `""Statistics, Economics""'   "Canada"                 
     3 `""Sociology""'               "USA, Germany"           
     4 `""Economics, Epidemiology""' "USA"                    
     5 `""Sociology""'               "UK"                     
     6 `""Sociology""'               "Norway, Sweden, Denmark"
     7 `""Economics""'               "UK, USA"                
     8 `""Sociology""'               "Italy"                  
     9 `""Epidemiology, Economics""' "USA"                    
    10 `""Economics""'               "USA" 
    11 `""Economics, Asian Studies""'  "UK, France, Burkina Faso"                 
    end
    
    foreach v of varlist Discipline Country {
        gen _`v' = subinstr(`v', ",", "", .)
        gen `v'_count = strlen(`v') - strlen(_`v') + 1
        drop _`v'
    }
    list, noobs clean
    Note, this in turn assumes that the responses are consistently separated by commas and contain no embedded commas.

    Comment


    • #3
      Thanks for your reply, Clyde! I'll follow up on your tip to provide a small manually created version of what I want to obtain.

      First, some context: I am collecting data from studies (id = study). I want to count numbers of a particular method by year. Some studies use several versions of the method and data from several countries, causing cells to contain multiple values. I want to plot bar charts with e.g. versions of method and country the data are from over years.

      I imagine that the easiest way to proceed is to first make a long formatted data set and calculate summary statistics then perhaps collapse by year for graphs.

      Here is some anonymous observations from my data set in its current format:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float id long study float year str19 iv_cat strL country
       1 26 2007 "Physician"           "USA"                                                                                                       
       2 25 2015 "Facility"            "USA"                                                                                                       
       3 24 2009 "Facility"            "USA"                                                                                                       
       4 23 2010 "Physician"           "USA"                                                                                                       
       5 22 2020 "Facility"            "Denmark"                                                                                                   
       6 21 2011 "Facility"            "USA"                                                                                                       
       7 20 2020 "Facility"            "USA"                                                                                                       
       8 18 2014 "Physician"           "Netherlands"                                                                                               
       9 19 2016 "Physician"           "UK"                                                                                                        
      10 17 2014 "Physician"           "USA"                                                                                                       
      11 16 2015 "Regional"            "USA"                                                                                                       
      12 15 2013 "Regional"            "USA"                                                                                                       
      13 13 2014 "Physician, Regional" "USA"                                                                                                       
      14 11 2007 "Regional"            "USA"                                                                                                       
      15 12 2011 "regional"            "USA"                                                                                                       
      16 14 2015 "Regional"            "USA"                                                                                                       
      17 10 2009 "physician"           "USA"                                                                                                       
      18  9 2006 "Physician"           "USA"                                                                                                       
      19  8 2017 "physician"           "Canada"                                                                                                    
      20  7 2015 "facility"            "Australia, Belgium, Canada, France, Germany, Italy, Japan, New Zealand, Spain, Sweden, United Kingdom, USA"
      21  6 2020 "Regional"            "USA"                                                                                                       
      22  5 2018 "Regional, facility"  "Sweden"                                                                                                    
      23  4 2013 "Physician"           "USA"                                                                                                       
      24  2 2016 "Physician"           "Canada"                                                                                                    
      25  3 2018 "Physician"           "Austria"                                                                                                   
      26  1 2011 "Facility"            "Canada"                                                                                                    
      end
      This is what I imagine a correctly formatted dataset would look like (created manually):
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte id long study int year str9 iv_cat str14 country
       1 26 2007 "Physician" "USA"           
       2 25 2015 "Facility"  "USA"           
       3 24 2009 "Facility"  "USA"           
       4 23 2010 "Physician" "USA"           
       5 22 2020 "Facility"  "Denmark"       
       6 21 2011 "Facility"  "USA"           
       7 20 2020 "Facility"  "USA"           
       8 18 2014 "Physician" "Netherlands"   
       9 19 2016 "Physician" "UK"            
      10 17 2014 "Physician" "USA"           
      11 16 2015 "Regional"  "USA"           
      12 15 2013 "Regional"  "USA"           
      13 13 2014 "Physician" "USA"           
      13 13 2014 "Regional"  "USA"           
      14 11 2007 "Regional"  "USA"           
      15 12 2011 "regional"  "USA"           
      16 14 2015 "Regional"  "USA"           
      17 10 2009 "physician" "USA"           
      18  9 2006 "Physician" "USA"           
      19  8 2017 "physician" "Canada"        
      20  7 2015 "facility"  "Australia"     
      20  7 2015 "facility"  "Belgium"       
      20  7 2015 "facility"  "Canada"        
      20  7 2015 "facility"  "France"        
      20  7 2015 "facility"  "Germany"       
      20  7 2015 "facility"  "Italy"         
      20  7 2015 "facility"  "Japan"         
      20  7 2015 "facility"  "New Zealand"   
      20  7 2015 "facility"  "Spain"         
      20  7 2015 "facility"  "Sweden"        
      20  7 2015 "facility"  "United Kingdom"
      20  7 2015 "facility"  "USA"           
      21  6 2020 "Regional"  "USA"           
      22  5 2018 "Regional"  "Sweden"        
      22  5 2018 "facility"  "Sweden"        
      23  4 2013 "Physician" "USA"           
      24  2 2016 "Physician" "Canada"        
      25  3 2018 "Physician" "Austria"       
      26  1 2011 "Facility"  "Canada"        
      end
      In this case observation 13, 20 and 22 is repeated to "solve" the multiple values in cell format.

      Comment


      • #4
        So it looks like what you want is this:
        Code:
        preserve
        keep id study year iv_cat?*
        reshape long iv_cat, i(id study year)
        drop if missing(iv_cat)
        drop _j
        tempfile ivcats
        save `ivcats'
        
        restore
        keep id study year country?*
        reshape long country, i(id study year)
        drop if missing(country)
        drop _j
        joinby id study year using `ivcats', unmatched(both)

        Comment


        • #5
          Thanks, I appreciate your time and suggestions. I tried the code you provided, and also some edits, but the same error keeps occurring:
          Code:
          . preserve
          
          . keep id study year iv_cat?*
          
          . reshape long iv_cat, i(id study year)
          variable _j contains all missing values
          r(498);

          Comment


          • #6
            Oops! When I copied the code from the do-editor here, I missed the top lines:

            Code:
            foreach v of varlist iv_cat country {
                split `v', parse(",") gen(`v')
            }
            Sorry about that. Put that before all of the code in #4. It will create the additional variables needed to avoid that -reshape- problem.

            Comment


            • #7
              Wow, that did the trick - thanks for the helpful feedback! I would surely have spent quite some more time on this issue without your input. I also found the code you provided in a related thread (post #3) relevant for summary statistics and include the link for others who may find it useful.

              Comment

              Working...
              X