Counting the repeated values in one column and assign the count to another column as cumulative for each group

Amolwan Jittamai

Join Date: Jan 2024

Posts: 7
#1

Counting the repeated values in one column and assign the count to another column as cumulative for each group

24 Jan 2024, 16:24

I am using Stata BE/18 and working with about 1.6m observations and about 40 columns. I am attempting to create a new column called "number_year" to count the repeated values in another column called "name" for each group sorted by "firms_id" and "year". If the name changes within the group, the count resets to 1. I have provided a picture of the table accompanying my question as follows: I sorted the data by "firms_id" and "year" as in the picture and I want to count the number of repeated names as shown in the "number_year" column and then reset to 1 if either the new name appears within the group (in this example, firms_id) or when starting counting the new group.

I appreciate your help.

Attached Files
Tags: codes, data, loop, panel data
Clyde Schechter

Join Date: Apr 2014

Posts: 29799
#2

24 Jan 2024, 16:50

Code:

by firms_id (year), sort: gen group = sum(name != name[_n-1]) by firms_id group (year), sort: gen wanted = _n sort firms_id year

This code assumes that if the same name has a run of years, then is followed by a different name, and then ultimately returns again, that name's two runs should each start at 1, and are not treated as a single, interrupted run. (Imagine if emily and luna were william. I'm saying that when william returns in 2012, we restart from 1, we do not continue with 4, which is where william previously left off.) If that's not what you want, post back.

Note: Although you meant well showing the image of your data, it isn't as helpful as you imagine. It isn't possible to import data from an image into Stata to develop and test code. So this code is untested and might not work. In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Last edited by Clyde Schechter; 24 Jan 2024, 16:54.
1 like
Comment
Amolwan Jittamai

Join Date: Jan 2024

Posts: 7
#3

24 Jan 2024, 17:06

Hello Clyde, your codes are exactly what I wanted to do. Thank you very much for your help and I will keep it in mind to use dataex- command in the future.
Comment

Amolwan Jittamai

Join Date: Jan 2024
Posts: 7

30 Jan 2024, 23:06

I have questions following up on my previous question and instead of firmd_id, year, and name, I use facility_chemical, year_reporting, and nameofcertifyingofficial, respectively. Now I also want to count the same name that appears later as Clyde mentioned above if the name returns within the same group it should be added. Using the above example that if emily and luna were william, when william returns in 2012, we continue with 4 and 5, which is where william previously left off.)

I tried the following code as Clyde suggested:

Code:

by facility_chemical (year_reporting), sort: gen group = sum( nameofcertifyingofficial !=  nameofcertifyingofficial[_n-1]) 
by facility_chemical (year_reporting), sort: gen number_of_years = _n 
sort facility_chemical year_reporting

It works for the most part; however, there were some errors that the number of years keeps adding up even though the new name appears within the group of facility_chemical.

I included my data here

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str26 facility_chemical int year_reporting str45(nameofcertifyingofficial publiccontactname)
"01001BRKSH36MYL_0000110543" 2004 "FRANK BASILE"          "FRANK BASILE"          
"01001BRKSH36MYL_0000110543" 2005 "FRANK BASILE"          "FRANK BASILE"          
"01001BRKSH36MYL_0000110543" 2007 "FRANK BASILE"          "FRANK BASILE"          
"01001BRKSH36MYL_0000110543" 2008 "FRANK BASILE"          "FRANK BASILE"          
"01001BRKSH36MYL_0007664417" 2004 "FRANK BASILE"          "FRANK BASILE"          
"01001BRKSH36MYL_0007664417" 2005 "FRANK BASILE"          "FRANK BASILE"          
"01001BRKSH36MYL_0007664417" 2006 "FRANK BASILE"          "FRANK BASILE"          
"01001BRKSH36MYL_0007664417" 2007 "FRANK BASILE"          "FRANK BASILE"          
"01001BRKSH36MYL_0007664417" 2008 "FRANK BASILE"          "FRANK BASILE"          
"01001BRKSH36MYL_0007664417" 2009 "FRANK BASILE"          "FRANK BASILE"          
"01001CSPRN188MS_N150"       2004 "E. SCOTT PORTER"       "MEG MORRIS"            
"01001CSPRN188MS_N420"       2004 "E. SCOTT PORTER"       "MEG MORRIS"            
"01001CSPRN188MS_N458"       2004 "E. SCOTT PORTER"       "MEG MORRIS"            
"01001DFMCR99ABR_0007440508" 1993 "ERNEST E. DENBY"       "ALAN PARO"             
"01001DFMCR99ABR_0007440508" 1994 "ERNEST E. DENBY"       "ALAN PARO"             
"01001DFMCR99ABR_0007440508" 1995 "ERNEST E. DENBY"       "EDWARD S.G. HICKS, JR."
"01001DFMCR99ABR_0007440508" 1996 "EDWARD S.G. HICK JR."  "EDWARD S.G. HICKS JR." 
"01001DFMCR99ABR_0007440508" 1997 "EDWARD S.G. HICKS"     "EDWAARD S.G. HICKS"    
"01001DFMCR99ABR_0007440508" 1998 "ERNEST E. DENBY"       "ERNEST E. DENBY"       
"01001GVRNR720SI_0007439921" 2001 "DAVID OVELETTE"        "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2002 "DAVID OUELLETTE"       "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2003 "DAVID OUELETTE"        "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2004 "DAVID OUELLETTE"       "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2005 "DAVID OUELLETTE"       "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2006 "DAVID OUELLETTE"       "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2007 "DAVID OUELLETTE"       "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2008 "DAVID OUELLETTE"       "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2009 "DAVID OUELLETTE"       "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2010 "DAVID OUELLETTE"       "RANDAL CARLTON"        
"01001GVRNR720SI_0007439921" 2011 "COLLEEN SHELDON-FERRY" "COLLEEN SHELDON-FERRY" 
end

I already sorted the facility_chemical and year_reporting, I want to create a new column to count a number of names and keep adding up for names that appear in the next row for each facility_chemical group. For example, looking at the last group where facility_chemical = "01001GVRNR720SI_0007439921", I want to create a new column to counts the same names that appear within each group by comparing the names from "nameofcertifyingofficial" column and the values for each row in a new column should be 1,1, 1, 2, 3, 4, 5, 6, 7, 8, 1 as "DAVID OVELETTE" is the first name that appears so we set the new value equals to 1 then the following row is "DAVID OUELLETTE" which is different from the first name since the name is spelled a bit differently (spelling of the names are sensitive so the names must match) so it will also count as 1. In the third row "DAVID OUELETTE" which is different from the first two names (the spelling of the names are not the same) so it will be set to 1, then the 4th row is "DAVID OUELLETTE" which is similar to the 2nd row so this will add to 2 and after that, we see that "DAVID OUELLETTE" keeps showing up so we add it until the new name "COLLEEN SHELDON-FERRY" shows up which we will set this to 1 again. I also wanted to do the same thing for "publiccontactname" by creating another column.

Please let me know if you need more explanations. Thank you for your time.

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 29799

31 Jan 2024, 09:55

This is actually simpler than what I original thought was required.

Code:

by facility_chemical nameofcertifyingofficial (year_reporting), sort: ///
    gen seq_certifier= _n
by facility_chemical publiccontactname (year_reporting), sort: ///
    gen seq_contact = _n
sort facility_chemical year_reporting

Comment

Amolwan Jittamai

Join Date: Jan 2024

Posts: 7
#6

31 Jan 2024, 21:42

Thank you, Clyde. I greatly appreciate your help.
Comment

Announcement