Counting the number of people who have matched scores across time

gi peters

Join Date: Sep 2020

Posts: 44
#1

Counting the number of people who have matched scores across time

10 Jan 2024, 10:17

Hello! I am trying to count the number of people who have scores over consecutive time periods for different assessment areas. I've searched through the forum but didn't get an answer specific to my needs so I am hoping to get some direction here.
I will try to be as clear as possible in my explanation.

So my data looks like this :
dummyid - this is the id for identifying observations
whichreading - this indicates the time points that I have. It could go up to 9 readings
assess1 to assess3 - these are the assessment areas.
So for the first line, this would mean that person 12345 got a score of 9 for assessment area 1, missing score for assessment area2 and missing score for assessment area 3 and all of these scores are for the third reading.

I did duplicates report systemid and that gave me the number of duplicate entries that a person has so in essence, that told me that x number of people had entries for 4 time points/readings for example. However, it could not tell me how many have scores for the first and second reading alone, or how many have scores for first second and third reading alone, or how many have scores first second third and fourth reading alone and so on. I will need to have these counts for each assessment area.

I also tried crosstabulating whichreading (time variable) with each assessment area but that only showed me the number of people with a score of 1 (or 2 or 3) for each reading.

I am clueless on how to proceed so I would be grateful if someone could help. Thank you!

Code:

* Example generated by -dataex-. For more info, type help dataex clear input long dummyid str23 whichreading byte(assess1 assess2 assess3) 12345 "03 - Third reading" 9 . . 12345 "04 - Fourth reading" 10 7 4 12345 "05 - Fifth reading" 8 7 7 12345 "06 - Sixth reading" 10 8 6 12345 "08 - Eighth reading" 9 8 8 113345 "01 - First reading" 10 2 10 22222 "01 - First reading" 7 7 9 33333 "01 - First reading" 7 6 6 33333 "09 - Ninth reading" 10 . 8 33333 "02 - Second reading" 9 6 7 44444 "01 - First reading" 8 8 4 44444 "02 - Second reading" 8 8 9 44444 "03 - Third reading" 9 10 9 44444 "05 - Fifth reading" 5 7 6 44444 "10 - Tenth reading" 7 7 7 44444 "06 - Sixth reading" 8 8 3 21345 "04 - Fourth reading" 6 8 1 21345 "05 - Fifth reading" 9 8 8 21345 "06 - Sixth reading" 9 8 8 21345 "07 - Seventh reading" 9 9 8 21345 "08 - Eighth reading" 9 9 9 54345 "01 - First reading" 8 5 5 54345 "02 - Second reading" 10 7 10 54345 "04 - Fourth reading" 10 8 10 54345 "05 - Fifth reading" 8 7 9 54345 "06 - Sixth reading" 8 7 10 54345 "07 - Seventh reading" 9 7 10 54345 "08 - Eighth reading" 8 7 10 end
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#2

10 Jan 2024, 10:29

I'm very confused by your presentation. You ran -duplicates report systemid-, but your example contains no such variable. Did you mean dummyid? You say that the readings range up to 9, but your data clearly contains at least one 10th reading. You imply that scores must be 1, 2, or 3, but nearly all the scores in your example data are in the 7 to 10 range. And while you describe several things you tried that didn't give you what you want, after reading your post a few times, I am left clueless as to what you actually do want.

Perhaps somebody else will better understand what you are getting at and respond. But if that doesn't happen soon, I suggest you post back showing a hand-worked example of what you would like the results to look like.
1 like
Comment

gi peters

Join Date: Sep 2020
Posts: 44

10 Jan 2024, 13:09

Thanks for pointing out the sources of confusion Clyde Schechter

So you're right, systemid=dummy id.
Also the whichreading (time variable) ranges from first reading to 10th reading
And the scores for each assessment area range from 1-10 - it's just that the example pulled out from the dataset and displayed here are mostly on the higher range thats why they're mostly 7 8 9 but there were also low scores like for these cases:
1) id 113345 got a score of 2 for first reading in assessment area 2
2) id 44444 got a score of 3 for the sixth reading in assessment area 3

So what I was trying to do is

	With scores in Reading 1 & Reading 2	With scores in reading 1, 2 & 3	With scores in Reading 1, 2, 3 & 4	And so on...
Number of cases (based on dummy id)	?	?	?	?
Average scores for each column	?	?	?	?

I hope this is clearer than my original post. Thanks so much in advance for your help!

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#4

10 Jan 2024, 14:23

I think this will do what you want:

Code:

gen byte reading = real(substr(whichreading, 1, 2)) // VERIFY DATA MEETS ASSUMPTIONS REQUIRED FOR CODE TO BE CORRECT assert ustrregexm(whichreading, "\d\d - .*") assert !(missing(assess1) & missing(assess2) & missing(assess3)) isid dummyid reading, sort by dummyid (reading): assert reading >= _n by dummyid (reading): gen in_sync = (reading == _n) by dummyid (in_sync reading), sort: gen readings_group = reading[_N] if in_sync[_N] keep if !missing(readings_group) egen id_tag = tag(dummyid) collapse (sum) n_dummy_ids = id_tag (mean) assess*, by(readings_group) label var n_dummy_ids "Number of Cases" forvalues i = 1/3 { label var assess`i' "Mean Assessment `i' Score" } label var readings_group "Readings 1 through #" table (var) (readings_group), /// statistic(max n_dummy_ids) statistic(mean assess*) nformat(%2.1f mean) /// style(Table-1) nototals

Notes: For this code to work correctly, some assumptions about the data must be true:
The reading number is given as a two digit number in the first two characters .

There are no observations where all three of assess1, assess2, and assess3 have missing values.

Observations are uniquely identified by the combination of dummyid and whichreading.

These assumptions are verified by assert statements near the beginning of the code. If any of these assert statements halts with an error message, then your data will not be correctly processed by this code and you should not proceed. All of these assumptions are true of your example data.

I'm a little worried about what you mean by "And so on" in your desired results table. I'm interpreting it to mean that what you are trying to get here is, for each number n between 1 and 10, the number of patients who have results for exactly readings 1, 2, 3, ..., n with no skips, but did not have reading n+1, and may or may not have had additional readings from n+2 onward. That would give you a 10 column table (assuming there are some people who get all 10 assessments, and some with every number of assessments between 1 and 10) and the columns would reflect disjoint groups of people.

If your intent is to have a column of all possible combinations of any number of readings from 1 through 10, allowing for skips, and for starting later than reading 1, then I would say you need a new plan. That's because there are 1,024 such combinations, and there is no way you can make a usable table out of that.
Comment
gi peters

Join Date: Sep 2020

Posts: 44
#5

10 Jan 2024, 15:16

Thanks so much for the detailed explanation Clyde Schechter ! I will try the code and share whether it worked or not so that others who are in the same boat will benefit as well.
And yes, my intent is to have a 10-column table reflecting how many have additional readings n+1, n+2, n+3 starting at reading 1 and not to see all possible combinations.
Comment
gi peters

Join Date: Sep 2020

Posts: 44
#6

11 Jan 2024, 15:25

Hi Clyde Schechter,

I tried the code using the data example I posted here and everything went well except for this code
Is (var) supposed to be a specific variable name?

Code:

table (var) (readings_group), /// statistic(max n_dummy_ids) statistic(mean assess*) nformat(%2.1f mean) /// style(Table-1) nototals
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 29956
#7

11 Jan 2024, 16:45

I can think of two reasons you might be getting this response from Stata:
You are using an old version of Stata. The -table- command that accepts this syntax was introduced in version 17. If you are using version 16 or earlier, you cannot run this code. There is a -table- command in earlier versions of Stata, but it works differently and is not compatible with this syntax.

You are running a recent version of Stata, but you are trying to run this line-by-line. The entire thing from -table- through -nototals- is a single command and must be run all at once.

If you are using an old version of Stata, please read the Forum FAQ for advice on how to get the most out of your Statalist experience. It states that when you are not running the current version, you are supposed to say that in your post. That would avoid both you and me wasting our time on code that is not available to you.

Assuming the problem is your version of Stata pre-dating the current table syntax, do this instead of the -table- command previously shown:

Code:

gen junk = "-" label var junk " " table junk readings_group, /// c(mean n_dummy_ids mean assess1 mean assess2 mean assess3)

Notes:
A. Do this only if you are running version 16 or earlier. This code will be rejected as invalid syntax in version 17 or later.
B. The output you get will not be as pretty as you would get from current or recent Stata, but in substance it will be the same.
Comment
gi peters

Join Date: Sep 2020

Posts: 44
#8

11 Jan 2024, 20:30

I ran it again and it worked! Thank you so much! I realized that Im using STATA 16
Comment

Announcement