Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting the repeated values in one column and assign the count to another column as cumulative for each group

    I am using Stata BE/18 and working with about 1.6m observations and about 40 columns. I am attempting to create a new column called "number_year" to count the repeated values in another column called "name" for each group sorted by "firms_id" and "year". If the name changes within the group, the count resets to 1. I have provided a picture of the table accompanying my question as follows: I sorted the data by "firms_id" and "year" as in the picture and I want to count the number of repeated names as shown in the "number_year" column and then reset to 1 if either the new name appears within the group (in this example, firms_id) or when starting counting the new group.

    I appreciate your help.


    Attached Files

  • #2
    Code:
    by firms_id (year), sort: gen group = sum(name != name[_n-1])
    by firms_id group (year), sort: gen wanted = _n
    sort firms_id year
    This code assumes that if the same name has a run of years, then is followed by a different name, and then ultimately returns again, that name's two runs should each start at 1, and are not treated as a single, interrupted run. (Imagine if emily and luna were william. I'm saying that when william returns in 2012, we restart from 1, we do not continue with 4, which is where william previously left off.) If that's not what you want, post back.

    Note: Although you meant well showing the image of your data, it isn't as helpful as you imagine. It isn't possible to import data from an image into Stata to develop and test code. So this code is untested and might not work. In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
    Last edited by Clyde Schechter; 24 Jan 2024, 16:54.

    Comment


    • #3
      Hello Clyde, your codes are exactly what I wanted to do. Thank you very much for your help and I will keep it in mind to use dataex- command in the future.

      Comment


      • #4
        I have questions following up on my previous question and instead of firmd_id, year, and name, I use facility_chemical, year_reporting, and nameofcertifyingofficial, respectively. Now I also want to count the same name that appears later as Clyde mentioned above if the name returns within the same group it should be added. Using the above example that if emily and luna were william, when william returns in 2012, we continue with 4 and 5, which is where william previously left off.)

        I tried the following code as Clyde suggested:
        Code:
        by facility_chemical (year_reporting), sort: gen group = sum( nameofcertifyingofficial !=  nameofcertifyingofficial[_n-1]) 
        by facility_chemical (year_reporting), sort: gen number_of_years = _n 
        sort facility_chemical year_reporting
        It works for the most part; however, there were some errors that the number of years keeps adding up even though the new name appears within the group of facility_chemical.

        I included my data here
        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str26 facility_chemical int year_reporting str45(nameofcertifyingofficial publiccontactname)
        "01001BRKSH36MYL_0000110543" 2004 "FRANK BASILE"          "FRANK BASILE"          
        "01001BRKSH36MYL_0000110543" 2005 "FRANK BASILE"          "FRANK BASILE"          
        "01001BRKSH36MYL_0000110543" 2007 "FRANK BASILE"          "FRANK BASILE"          
        "01001BRKSH36MYL_0000110543" 2008 "FRANK BASILE"          "FRANK BASILE"          
        "01001BRKSH36MYL_0007664417" 2004 "FRANK BASILE"          "FRANK BASILE"          
        "01001BRKSH36MYL_0007664417" 2005 "FRANK BASILE"          "FRANK BASILE"          
        "01001BRKSH36MYL_0007664417" 2006 "FRANK BASILE"          "FRANK BASILE"          
        "01001BRKSH36MYL_0007664417" 2007 "FRANK BASILE"          "FRANK BASILE"          
        "01001BRKSH36MYL_0007664417" 2008 "FRANK BASILE"          "FRANK BASILE"          
        "01001BRKSH36MYL_0007664417" 2009 "FRANK BASILE"          "FRANK BASILE"          
        "01001CSPRN188MS_N150"       2004 "E. SCOTT PORTER"       "MEG MORRIS"            
        "01001CSPRN188MS_N420"       2004 "E. SCOTT PORTER"       "MEG MORRIS"            
        "01001CSPRN188MS_N458"       2004 "E. SCOTT PORTER"       "MEG MORRIS"            
        "01001DFMCR99ABR_0007440508" 1993 "ERNEST E. DENBY"       "ALAN PARO"             
        "01001DFMCR99ABR_0007440508" 1994 "ERNEST E. DENBY"       "ALAN PARO"             
        "01001DFMCR99ABR_0007440508" 1995 "ERNEST E. DENBY"       "EDWARD S.G. HICKS, JR."
        "01001DFMCR99ABR_0007440508" 1996 "EDWARD S.G. HICK JR."  "EDWARD S.G. HICKS JR." 
        "01001DFMCR99ABR_0007440508" 1997 "EDWARD S.G. HICKS"     "EDWAARD S.G. HICKS"    
        "01001DFMCR99ABR_0007440508" 1998 "ERNEST E. DENBY"       "ERNEST E. DENBY"       
        "01001GVRNR720SI_0007439921" 2001 "DAVID OVELETTE"        "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2002 "DAVID OUELLETTE"       "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2003 "DAVID OUELETTE"        "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2004 "DAVID OUELLETTE"       "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2005 "DAVID OUELLETTE"       "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2006 "DAVID OUELLETTE"       "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2007 "DAVID OUELLETTE"       "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2008 "DAVID OUELLETTE"       "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2009 "DAVID OUELLETTE"       "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2010 "DAVID OUELLETTE"       "RANDAL CARLTON"        
        "01001GVRNR720SI_0007439921" 2011 "COLLEEN SHELDON-FERRY" "COLLEEN SHELDON-FERRY" 
        end
        I already sorted the facility_chemical and year_reporting, I want to create a new column to count a number of names and keep adding up for names that appear in the next row for each facility_chemical group. For example, looking at the last group where facility_chemical = "01001GVRNR720SI_0007439921", I want to create a new column to counts the same names that appear within each group by comparing the names from "nameofcertifyingofficial" column and the values for each row in a new column should be 1,1, 1, 2, 3, 4, 5, 6, 7, 8, 1 as "DAVID OVELETTE" is the first name that appears so we set the new value equals to 1 then the following row is "DAVID OUELLETTE" which is different from the first name since the name is spelled a bit differently (spelling of the names are sensitive so the names must match) so it will also count as 1. In the third row "DAVID OUELETTE" which is different from the first two names (the spelling of the names are not the same) so it will be set to 1, then the 4th row is "DAVID OUELLETTE" which is similar to the 2nd row so this will add to 2 and after that, we see that "DAVID OUELLETTE" keeps showing up so we add it until the new name "COLLEEN SHELDON-FERRY" shows up which we will set this to 1 again. I also wanted to do the same thing for "publiccontactname" by creating another column.

        Please let me know if you need more explanations. Thank you for your time.

        Comment


        • #5
          This is actually simpler than what I original thought was required.
          Code:
          by facility_chemical nameofcertifyingofficial (year_reporting), sort: ///
              gen seq_certifier= _n
          by facility_chemical publiccontactname (year_reporting), sort: ///
              gen seq_contact = _n
          sort facility_chemical year_reporting

          Comment


          • #6
            Thank you, Clyde. I greatly appreciate your help.

            Comment

            Working...
            X