
  • Transforming z-scores into NCEs

    Hello,

    I have a dataset where my dependent variable is student_score (from a math standardized test) and a series of independent variables, e.g. class size, household characteristics, age, gender, etc. Students can get a maximum of 30 points in the standardized exam. I converted raw scores into z-scores at the classroom level. Because Stata does not allow combining -egen-'s std() function with -by()-, I did this conversion manually as shown below:

    egen mean_score=mean(student_score), by(classroom)
    egen sd_score=sd(student_score), by(classroom)
    gen z_score=(student_score-mean_score)/sd_score

    Though the commands above worked fine, interpreting the findings in terms of standard deviations is difficult for the organization we are producing the study for, as their staff does not have a statistical background. I therefore decided to convert the z-scores into normal curve equivalents (NCEs), i.e. a 1-99 scale. I did this by computing the following in Stata:

    z_score*21.06 + 50

    However, I ended up with scores under 0 (negative) and scores over 99. I believe this is because I calculated the z-scores by classroom. How could I standardize scores by classroom and convert them into a different, more user-friendly scale? Is there a command or set of commands in Stata that would do this score standardization more accurately?
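
    A minimal sketch of the kind of alternative I have in mind (untested, and the new variable names below are just placeholders) would map within-classroom percentile ranks onto the NCE scale instead of applying the linear transformation to the z-scores:

    * within-classroom midpoint percentile ranks mapped onto an NCE-style 1-99 scale
    bysort classroom: egen n_class = count(student_score)
    bysort classroom: egen rank_class = rank(student_score)
    gen pct_class = (rank_class - 0.5)/n_class         // midpoint percentiles stay strictly inside (0,1)
    gen nce_class = invnormal(pct_class)*21.06 + 50    // same 21.06/50 rescaling as above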

    Many thanks!
    Patricia

  • #2
    Why are you standardizing the scaled scores? It would probably be easier for everyone to interpret results in units of the scaled score. If you are standardizing so you can pool across grade levels, that is a bad practice that is only justified under extremely rigid assumptions about the underlying assessment that are unlikely ever to be met. More importantly, the classroom sizes are likely to be fairly small, making the standardization fairly unstable.



    • #3
      Thank you for your response, wbuchanan. I am standardizing the raw scores (not the scaled scores) by classroom. I'm standardizing at the classroom level because I want to make sure that z-scores are truly comparable. In other words, I want to account for the fact that, for example, a raw score of 15 points in a classroom with a mean of 12 is not comparable with 15 points in a classroom with a mean of 7. Does this make sense? I do have some small classrooms - is there a standard for the minimum class size needed to calculate the z-scores rigorously?

      As for the normal curve equivalents (NCEs), I wanted to turn my z-score variable into something easier to interpret for the client. That's why I thought of converting the z-scores into NCEs. I was able to calculate the NCEs with the commands shown below:

      * creating the z-score percentiles *
      sort z_scores
      egen pcrank = rank(z_scores)
      * rescale the ranks onto a 1-99 range (2347 = total number of observations)
      gen scale = ((99-1)*(pcrank-1)/(2347-1)) + 1

      * creating the NCEs *
      * a new variable name is needed here, since student_score already holds the raw scores
      gen nce_score = invnorm(scale/100)*21.06 + 50



      • #4
        Patricia Alfonzo standardizing the scores does not achieve that goal. For example, a student who answers the 8 easiest items incorrectly but the 7 most difficult questions correctly is fundamentally different - from a measurement perspective - from a student who randomly answers 7 questions correctly. Do you have access to the item responses? If you do, you'd be better served by either fitting an IRT model and developing a scale that way (if the item parameters are unknown, you can run - net inst raschjmle, from("http://www.paces-consulting.org/stata") - to get a Joint Maximum Likelihood estimator for a Rasch model) or fitting a CFA and using the value of the latent variable as a substitute. The other problem is that means and standard deviations are dependent on sample size:

        Code:
        clear
        set seed 7779311
        * small sample: 10 simulated scores on a 1-20 scale
        set obs 10
        g score = round(runiform(1, 20), 1)
        su score
        * replace a single observation with the maximum possible score
        replace score = 20 in 1
        su score
        * larger sample: 30 simulated scores
        clear
        set obs 30
        g score = round(runiform(1, 20), 1)
        su score
        replace score = 20 in 28
        su score
        In the first example I randomly replaced the highest observed score with 20 and in the second I replaced a single instance of the lowest score with 20. In both cases, however, the mean shifts by approximately 0.5 points. The variance is also sensitive to changes in sample sizes:

        Code:
        clear
        forv i = 5/35 {
             * draw a fresh sample of the current size and summarize it
             set obs `i'
             g score = round(runiform(1, 20), 1)
             su score
             clear
        }
        Lastly, this makes a huge assumption about the actual measurement scale being used. If the assessment uses polytomous items, a z-score has no interpretation since there is no definition of a mean for an ordinal scale. This doesn't mean that it occurred in your case, but it is something that is constantly overlooked in education and carries a lot more assumptions (e.g., parallel odds regarding the difficulty/discrimination parameters for the thresholds, etc...) in order to even be a relatively decent approximation.

        In short, your best bet would be to build the measure/scale from the items so you could test for invariance/DIF. Once you have a scale in place, you can install Nick Cox's egenmore package from SSC, which contains an xtile() function that you can define by groups to get percentile rankings (although if you have a single scale you could do this outside of the bounds of a single classroom). If you're fitting some type of value-added model or something along those lines and you use the shrunken estimators for the random effects, the values should be N(0, 1) and already adjusted for sample size. Then you could transform those into percentiles/rankings.
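
        Something like the following sketch (not run here - check -help egenmore- for the exact option names; pct100 is just a placeholder, and z_score is the variable from your first post):

        Code:
        * group-wise percentile bins via the xtile() egen function from egenmore
        ssc install egenmore
        egen pct100 = xtile(z_score), by(classroom) nq(100)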



        • #5
          @wbuchanan, the scale of the exams is the same (0-30 points) and it does not use polytomous items. The class sizes do vary a lot - some classrooms are as small as 8 students whereas others are around 30 students. I don't have the item responses but I'm going to look more into the Joint Maximum Likelihood estimator for a Rasch model.



          • #6
            @wbuchanan, I'm trying to download - net inst raschjmle, from "http://www.paces-consulting.org/stata" - and I can't get it to install. I saw in other Statalist posts that you need JRE 1.8 or above to run it, but even with a JRE above 1.8 it is not working. It gives me the following error message: "The operation couldn’t be completed. (com.apple.installer.pagecontroller error -1.)" I'm working on a Mac - could that be the problem?



            • #7
              Wrong syntax.

              Code:
               net inst raschjmle, from("http://www.paces-consulting.org/stata")
              works for me, but I am not on a Mac. Either way, the syntax you posted is not correct.



              • #8
                Thanks for pointing out the issue with the syntax, Nick Cox. I had included it as wbuchanan typed it but I missed the last ")". I tried it again and it worked fine on my Mac. Many thanks!



                • #9
                  I cannot run "raschjmle" with the Stata version I currently have (version 11) - see error message below. Is there any other way I could get a Joint Maximum Likelihood estimator in my current Stata version?

                  this is version 11.2 of Stata; it cannot run version 14.0 programs
                  r(9);



                  • #10
                    I'd do a search and then filter out anything requiring Stata 12 or later. I am afraid that this will bite you again. In this case the program is a wrapper that calls a Java library, so other methods could exist.



                    • #11
                      Patricia Alfonzo to my knowledge all the other IRT programs developed by users implement marginal maximum likelihood estimators of the model parameters. This is fine if the item or person parameters are known a priori, but the JML estimator is specifically for applications where both the person and item parameters are unknown. You could install jMetrik directly from http://www.itemanalysis.com, but it wouldn't be quite the same in terms of integration. Additionally, the Statalist FAQ states that everyone operates under the assumption that, unless otherwise specified, you are using the most current version of Stata. The Java API changed a bit from Stata 13 to Stata 14 in a way that would break the wrapper, and I'm not sure when/if I'll be able to port some of the classes/methods used to move the data from Stata into the JVM any time soon. I also tend to do nearly all of my coding on *nix-based systems, so if you work on a Mac you typically shouldn't have any issues. Outside of jMetrik, the only other FOSS options for IRT are in R and/or lower-level programming libraries like ETIRM.



                      • #12
                        wbuchanan I purchased Stata 14 and was able to run the IRT model. To clarify from my messages above, I have scores for individual questions (1=correct answer; 0=incorrect answer), but do not have item parameters to know which questions are harder or easier. Because I have data for 3rd and 6th graders and the exams used for each grade covered different math topics, I ran the following separate commands:

                        Code:
                        raschjmle Question_1-Question_30 if grade==3
                        raschjmle Question_1-Question_30 if grade==6

                        I'm not familiar with the outputs and I'm uncertain about what information I should use to build the scale. Please see below the output for 3rd grade:


                        What should I use from the output to build the scale?

                        Many thanks!



                        • #13
                          I believe the picture of the output is not displayed properly in my previous message. I attempted to simply copy and paste the output into the body of the message, but it looks too messy. Please find the output attached as a PDF here.
                          Attached Files



                          • #14
                            Patricia Alfonzo the first thing to know is that if the test blueprint was developed around different content, it would not make any sense to compare 3rd and 6th grade scores. It alters the substantive meaning of any scale if the domain from which items are sampled varies, although it is possible to link/vertically scale in some cases. Since there is such a large gap in the amount of instruction and content the children would have been exposed to, it wouldn't make sense to do that in this case; ideally for something like this you would include some fraction of items above and below the given student's grade level to serve as anchor items, allowing the calibration of items to be adjusted across the grade spans.

                            That said, it would be easier if you had copied the log file, since the PDF cuts off some of the information. Try something like:

                            Code:
                            * rename all variables to lowercase
                            rename *, l
                            * install -renvars- (Stata Journal 5-4, dm88_1)
                            net inst dm88_1.pkg, from("http://www.stata-journal.com/software/sj5-4") replace
                            * grade 3: tighter convergence criteria and fuller printed output
                            raschjmle question_1-question_30 if grade == 3, gc(0.000001) pc(0.0000001) pri(all)
                            * add a g3 suffix to the person-level variables so the grade 6 run does not overwrite them
                            renvars theta csem infit outfit stdinfit stdoutfit, postf(g3)
                            * grade 6
                            raschjmle question_1-question_30 if grade == 6, gc(0.000001) pc(0.0000001) pri(all)
                            renvars theta csem infit outfit stdinfit stdoutfit, postf(g6)
                            The difference here is that the convergence criterion for both the item and person parameters is set more stringently (e.g., it requires the change in the estimated log likelihood to be smaller). It will also print a bit more information than you currently see in the output. In the example above I use the program renvars to add a suffix to the person parameters that raschjmle adds to the data set (I'm not too sure whether the values would be overridden otherwise). The difficulty parameter gives you the location - in logits - at which the probability of a correct response is 50% for a student whose theta equals that value (e.g., for question_1 the probability of a correct response for a student with theta = -0.63 is 0.5; students with higher skill/ability/proficiency will have a greater probability of a correct response and, conversely, those with lower values of theta will have a lower probability of a correct response).
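
                            To make that concrete, here is a quick check using Stata's built-in invlogit() (my illustration only; the 0.50 ability value is made up):

                            Code:
                            * Rasch model: Pr(correct) = invlogit(theta - difficulty)
                            display invlogit(-0.63 - (-0.63))    // ability equal to difficulty -> .50
                            display invlogit( 0.50 - (-0.63))    // higher ability -> higher probability (about .76)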

                            The score table at the end:

                            Code:
                            SCORE TABLE
                            ==================================
                            Score Theta Std. Err
                            ----------------------------------
                            0.00 -4.87 1.84
                            1.00 -3.62 1.03
                            2.00 -2.88 0.75
                            3.00 -2.41 0.63
                            4.00 -2.07 0.56
                            5.00 -1.78 0.51
                            6.00 -1.54 0.48
                            7.00 -1.33 0.45
                            8.00 -1.13 0.43
                            9.00 -0.95 0.42
                            10.00 -0.77 0.41
                            11.00 -0.61 0.40
                            12.00 -0.45 0.40
                            13.00 -0.30 0.39
                            14.00 -0.14 0.39
                            15.00 0.01 0.39
                            16.00 0.16 0.39
                            17.00 0.31 0.39
                            18.00 0.46 0.39
                            19.00 0.62 0.40
                            20.00 0.78 0.41
                            21.00 0.95 0.42
                            22.00 1.13 0.43
                            23.00 1.33 0.45
                            24.00 1.54 0.47
                            25.00 1.78 0.51
                            26.00 2.06 0.55
                            27.00 2.40 0.62
                            28.00 2.86 0.74
                            29.00 3.60 1.02
                            30.00 4.84 1.84
                            ==================================

                            This table shows how the raw scores for third-grade students map onto values of theta, along with the corresponding standard errors of measurement. The adjustment parameter in raschjmle is used to identify the estimates in cases where there is perfect success/failure. It does this essentially by increasing (in the case of perfect failure) or decreasing (in the case of perfect success) the raw score, typically by a fraction of a point.
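
                            If the client ultimately wants a friendlier reporting scale, one option - a sketch only, where the 500/100 anchors are arbitrary and the thetag3/scaled_g3 names assume the renvars step above - is a simple linear rescaling of theta:

                            Code:
                            * put the grade 3 ability estimates on an arbitrary mean-500, SD-100 reporting scale
                            su thetag3
                            gen scaled_g3 = 500 + 100*(thetag3 - r(mean))/r(sd)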

                            Outfit (UMS), Infit (WMS), Standardized Outfit (Std. UMS), and Standardized Infit (Std. WMS) are all estimates of how well an individual parameter fits the data. It's a bit difficult to get into the meaning of each of the fit statistics in a succinct format. That said, Bond and Fox (2015) provide fairly decent coverage of the topic and the foundational work in this area can be found in Wright and Stone (1982) where the estimation of the fit statistics was defined (you can find the single page from the book here: http://www.rasch.org/rmt/rmt34e.htm, but their notational conventions with regards to indexing parameters can be a bit confusing at times). In short, infit and outfit values quantify the difference between the expected variances in the items as described by the model and what is observed. Values > 1 indicate greater observed variance than the model would predict, and values < 1 indicate less variance than would be expected. The standardized versions of these statistics are reported on a z-scale, so items 9, 11, 13, 14, 15, 23, and 25 for the third grade test might be worth additional review to determine what plausible reasons might exist for the additional variance.
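
                            On the person side, the saved fit variables can be screened with the same kind of rule of thumb (a sketch only; the |z| > 2 cut-off is just a convention, flag_fit_g3 is a placeholder name, and the g3 suffixes assume the renvars step above):

                            Code:
                            * flag grade 3 students whose standardized infit/outfit look unusual
                            gen byte flag_fit_g3 = (abs(stdinfitg3) > 2 | abs(stdoutfitg3) > 2) if !missing(stdinfitg3, stdoutfitg3)
                            tab flag_fit_g3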

                            All that said, this assumes that you've verified that those data conform to a unidimensional construct. If you find multiple factors, you can check whether a higher-order factor fits better than multiple individual factors as a way to justify treating the items as a unidimensional construct.

                            Some people aren't crazy about the Rasch model, so you could also fit a standard 1PL model; the difference is that a Rasch model constrains the discrimination parameter to equal 1 across all items, while a 1PL model only constrains the discrimination parameters to be equal to one another (e.g., the common value could be .87 instead of being fixed at 1 a priori). Now that you have Stata 14 it becomes much easier:

                            Code:
                            irt 1pl question_1-question_30
                            irtgraph icc
                            irtgraph tcc
                            This would fit the 1PL model and then draw the item characteristic curves and the test characteristic curve for you.
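
                            If you go that route, a one-line follow-up (my addition, not part of the commands above; theta_1pl is a placeholder name) stores each student's estimated ability so you can rescale or rank it afterwards:

                            Code:
                            * empirical Bayes means of the latent trait from the 1PL fit
                            predict theta_1pl, latent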

                            Hope this helps,
                            Billy

                            Bond, T. G., & Fox, C. M. (2015). Applying the Rasch Model: Fundamental Measurement in the Human Sciences (3rd ed.). New York, NY: Routledge.
                            Wright, B. D., & Stone, M. H. (1982). Rating Scale Analysis. Chicago, IL: MESA Press.



                            • #15
                              wbuchanan Many thanks for the information - it does help a lot!

