  • Forming one dummy for more than one item

    Hi people, I would like to know whether it is OK to form one dummy from more than one question, where each question has 5 categories ranging from "fully agree" to "totally disagree".
    Otherwise, how could you aggregate them? Is summing up and taking the mean OK, or should I use the cross-section weights supplied with the data set?
    And is it OK, for example, to turn job satisfaction (0-10, from unsatisfied to satisfied) into a dummy where 8-10 is one and the rest is 0? Or should I look at the distribution and use the median as the cut?
    Here is a copy of a given procedure. Is that OK?
    "For perceived harms, there is a five-point scale in which the highest category corresponds to the perception by the worker that a feature of working conditions is ‘very much’ an adverse factor at the workplace. For perceived hazards, the highest category among three possibilities is the one in which the respondent considers a feature at the workplace ‘a distinct hazard’. Responses to the questions about adverse working conditions can be aggregated by forming a dummy variable that equals one if there is at least one clearly adverse factor (HARM) and a dummy that equals one if there is at least one distinct hazard (HAZARD)."

  • #2
    What kind of analyses are you planning to do with this data?

    As a general rule, creating categories out of continuous variables is a bad idea: it discards information and also distorts reality. The same can be said of reducing the number of categories in a multi-category variable. But it does depend on what you're doing. Sometimes if you just want to present results of an opinion survey, it can be less burdensome on the audience to present total agreement (strongly agree + agree) and total disagreement (disagree + strongly disagree) rather than all four of those categories plus the "neither agree nor disagree" category. So for that purpose, you can choose a dividing line that has some substantive meaning for your audience and present descriptive summaries that way.

    But if you are going to use these data for testing hypotheses or estimating models, it is rarely a good idea to do this. I would go so far as to say it is never a good idea to do this from the start. After running an analysis, if it is clear that one category of a multi-category variable draws very few responses and, as a result, its presence is dragging down the precision of your model, and if there is another category with which it could be merged in a way that is meaningful, and where the outcome distributions in those two categories are fairly similar, then it might improve things. But that doesn't happen all that often, and you should only proceed in this way if you actually encounter such a situation. Information is scarce and precious; you should always conserve it where you can.

    • #3
      OK. This statement is from an already published work, but I am new to analysis and am thinking this through. I do not have the author's observation counts for these categories. In my analysis I want to use a probit model to explain workers' intention to quit. They were asked how often in a year they think about quitting their job, with 5 categories: 1- every day, 2- sometimes per week, 3- ...per month, 4- per year, 5- never. Out of 5000 observations, the first three answers together got 500 observations, the fourth category 1100, and the "never" option 3400. So I would, for example, code the first 4 categories as 1 and the "never" category as 0, because of the number of observations. Similarly for job satisfaction:
                                 |      Freq.     Percent        Cum.
      ---------------------------+-----------------------------------
      0 Completely unsatisfied   |         24        0.46        0.46
      1                          |         17        0.32        0.78
      2                          |         48        0.92        1.70
      3                          |         77        1.47        3.17
      4                          |        133        2.54        5.71
      5                          |        269        7.04       12.75
      6                          |        278        7.21       19.96
      7                          |        953       18.18       38.14
      8                          |      1,990       37.97       76.11
      9                          |        824       15.72       91.83
      10 Completely satisfied    |        428        8.17      100.00
      ---------------------------+-----------------------------------
      Total                      |      5,041      100.00

      For a biprobit model I would then code job satisfaction as 1 (satisfied) for values 8-10 and 0 for the rest. Is that OK? I want to estimate a biprobit model for these two outcomes, so I cannot use ordered logit/probit there; I will also run those separately.
      And how can I summarize several questions that describe similar things, e.g. working conditions like noise, measured with more than one item, each with 5 categories? Following this paper, I could create one dummy that takes the value one if at least one of the items is answered in the highest category?

      • #4
        I agree with Clyde as usual. I don't get this at all. Why throw away information? You have detail on a scale from 0 to 10. Keep it!

        • #5
          It is useful from a theoretical point of view, and it is why I can use a biprobit, because that model assumes a recursive structure. Would it be better to use just one item per working condition, one that stands in for the rest?

          • #6
            I am calling your bluff here: What theory here leads to any reduction to indicator variables? Can you tell me that a theory tells you that 8, or any other level or levels, is a threshold for action or response? Why measure on a 0 to 10 scale at all if those steps aren't definably different?

            I'd want the model to reflect the nature of the problem. Being interested in a method on other grounds is harder for me to defend.

            • #7
              So the theoretical approach is based on utility theory, in latent form. Job satisfaction is the non-pecuniary part of utility, and the wage is the other part. But in general it is a subjective variable: what does any single answer tell you in detail? If I have a satisfaction rating of 4 or 5, it may mean the same thing; I think you can't be so strict about numbers with survey questions. But then you lose information and cannot make calculations about uncertainty. I understand why aggregating is very problematic. But what should I do instead? I am writing my master's thesis, and this is the common approach in the economics literature, even if it is statistically incorrect.
              Last edited by Sebastian Bauer; 19 Jan 2017, 13:11.

              • #8
                "If I have a satisfaction rating of 4 or 5, it may mean the same thing; I think you can't be so strict about numbers with survey questions."
                But if, for example, you dichotomize this variable, with 4-5 in one group and 1-3 in the other, then you are saying that 3 and 4 are radically different responses, even though they are no more different than 4 and 5. In fact you are saying that a person who responds 1 is more like a person who responds 3 than a person who responds 4 is. This is how dichotomizing these variables distorts reality and throws away information. Even though the individual numbers are not precise, they are more precise than any arbitrary grouping. Only if you have a theory that justifies a specific way of grouping them does it make sense to do so.
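The loss of information from dichotomizing can be made concrete with a small simulation. This is a minimal sketch on made-up data, not the survey discussed in this thread: the 0-10 score is drawn uniformly, the outcome is assumed to fall linearly with the score plus noise, and the cut at 8 mirrors the one proposed earlier. All names and parameter values are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2017)  # fixed seed so the run is reproducible

# Hypothetical 0-10 satisfaction scores for 5000 respondents.
satisfaction = [rng.randint(0, 10) for _ in range(5000)]
# Assumed outcome: declines linearly with satisfaction, plus noise.
outcome = [10 - s + rng.gauss(0, 3) for s in satisfaction]
# Dichotomized version: 1 for scores 8-10, 0 otherwise.
satisfied = [1 if s >= 8 else 0 for s in satisfaction]

r_full = pearson(satisfaction, outcome)
r_dummy = pearson(satisfied, outcome)
print("full scale:", round(r_full, 3), "dummy:", round(r_dummy, 3))
```

Under these assumptions the dummy correlates more weakly with the outcome than the full scale does, because scores 0 through 7 are all treated as identical while 7 and 8 are treated as different.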

                What you should do instead is use the variables in the existing numeric form in your analyses. Instead of -probit- or -logistic- regressions use -oprobit- (resp. -ologit-) or even just plain old -regress-.

                It is true that for the purposes of decision making and taking actions one might have to find a cut-off: employees who report thinking about leaving more than some cut-off frequency might benefit from some kind of workplace intervention, similarly for those below some threshold level of job satisfaction. But utility theory demands that you first model the data as they are to build a decision model and then use decision analysis to discover the threshold for intervention. If you start out with a cut-point that you pull out of nowhere, it is likely to be quite suboptimal from the perspective of utility theory.

                • #9
                  OK, thanks Clyde. It is clear that as long as no thresholds have been established, the data have to be used as they are. So I will use separate ordered logit/probit models. I have also heard that there is a command "bioprobit" for bivariate ordered probit models; I think that would be a good way to go. Besides that, I still have questions about dichotomizing variables. If I have response categories like 1) 1: fully agree, 2: agree, 3: undetermined, 4: rather not agree, 5: fully disagree, or 2) 1: very probable, 2: probable, 3: mostly not probable, 4: not probable, then it should be OK to say that the first two categories are exactly the case given in the question (value 1) and the rest are not, right?

                  • #10
                    "Then it should be OK to say that the first two categories are exactly the case given in the question (value 1) and the rest are not, right?"
                    I don't understand what you're asking here.

                    If you are asking whether it is OK to combine categories 1 and 2, the answer is still that it depends on how you plan to use that. For displaying simple summary statistics that could be a reasonable approach. But to use in analysis, no.

                    • #11
                      OK, let's take one example: I want to use working conditions as independent variables in the ordered equation for job satisfaction. One of them is harm. It is measured as follows: "What is the case for your work? Answer for each statement how much you agree! ... F: At my work I am exposed to awkward noise or smell." The categories are 1: fully agree, 2: agree, 3: undetermined, 4: not agree, 5: fully not agree. How can I include that in my analysis, especially if there are several corresponding items? And if, e.g., there is a question about health status (1: very good, 2: good, 3: bad, ...), is it right to enter it in Stata as "i.health_status" as an independent variable in the analysis?

                      • #12
                        "if there is a question about health status (1: very good, 2: good, 3: bad, ...), is it right to enter it in Stata as 'i.health_status' as an independent variable in the analysis?"
                        It depends. If you are expecting a more or less linear effect of this predictor on your outcome, then I would just enter it as health_status without the i. prefix. But if you are expecting something very non-linear (say, perhaps, a U- or inverted U- relationship, or maybe you think it will just be flat for 1 through 3 and then jump up at 4 and jump higher at 5, etc.) then i.health_status is better because it has the flexibility to capture any kind of relationship. You might want to do some graphical exploration of these relationships before you start testing models.
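The difference between the two codings can be illustrated outside Stata with a small sketch on invented data. With a single predictor, entering the variable as a number fits one slope across all steps, while indicator coding (what the i. prefix does) amounts to fitting a separate mean per category. The three-level variable and the U-shaped outcome values below are hypothetical.

```python
def linear_fit(x, y):
    """Least-squares intercept and slope for y = a + b*x (one slope for all steps)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return my - b * mx, b

def category_means(x, y):
    """Fitted values under indicator coding with one predictor: a mean per category."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    return {k: sum(v) / len(v) for k, v in sorted(groups.items())}

def sse(actual, fitted):
    """Sum of squared residuals."""
    return sum((a - f) ** 2 for a, f in zip(actual, fitted))

# Hypothetical data: the outcome is U-shaped across the three categories.
health = [1, 1, 2, 2, 3, 3]
outcome = [6.0, 6.2, 3.0, 3.2, 6.1, 5.9]

a, b = linear_fit(health, outcome)
means = category_means(health, outcome)

sse_linear = sse(outcome, [a + b * h for h in health])
sse_indicator = sse(outcome, [means[h] for h in health])
print("slope:", round(b, 3), "category means:", means)
print("SSE linear:", round(sse_linear, 3), "SSE indicator:", round(sse_indicator, 3))
```

In this invented example the linear slope is close to zero and misses the dip in the middle category, whereas the category means track it. When the relationship really is roughly linear, the two codings give similar fits and the linear one uses fewer parameters.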
