  • 0 instead of "missing" values in numerical variables

    Hi All,

    This might be a simple question, but I wanted to clarify how to handle 0 values in a numerical variable. Let's assume there are 100 farmers in my sample: 60 who grow only crop A and 40 who grow only crop B. In my dataset, a farmer who grows crop A has a "missing" value for the yield of crop B, and vice versa. As an example, the table below shows 4 farmers from the sample.
    Farmer   Crop grown   Yield A   Yield B
    1        A            5         .
    2        A            10        .
    3        B            .         9
    4        A            8         .
    In this case, if I were to include the yield variables in a regression, a lot of data would be lost, given that respondents do not grow both crops. Would it be correct to replace "." with 0, given that these are not really missing values but rather cases where the question didn't apply to that farmer?
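
    To make the data loss concrete, one can count the complete cases in the four-farmer excerpt above. This is only a minimal sketch: the variable names farmer, crop, yieldA and yieldB are assumed for illustration and are not taken from the actual dataset.

```stata
clear
input farmer str1 crop yieldA yieldB
1 "A" 5  .
2 "A" 10 .
3 "B" .  9
4 "A" 8  .
end

* missing(x1, x2) is true if ANY argument is missing,
* so this counts farmers with both yields observed
count if !missing(yieldA, yieldB)
display "complete cases: " r(N)
```

    Because each farmer grows only one crop, no row has both yields observed. Since -regress- drops any observation with a missing value in a model variable, entering both yields as predictors would leave no observations at all.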

    Thanks!

  • #2
    Rodrigo:
    I would -stack- -YieldA- and -YieldB- before -regress-:


    Code:
    . set obs 3
    
    . g A=runiform()
    
    . g B=runiform()
    
    .  stack A B , into(Harvest) clear
    
    . label define _stack 1 "Yield A" 2 "Yield B"
    
    . label val _stack _stack
    
    . list
    
         +--------------------+
         |  _stack    Harvest |
         |--------------------|
      1. | Yield A   .5844658 |
      2. | Yield A   .3697791 |
      3. | Yield A   .8506309 |
      4. | Yield B   .3913819 |
      5. | Yield B   .1196613 |
         |--------------------|
      6. | Yield B   .7542434 |
         +--------------------+
    
    .
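
    A possible variant on the four-farmer table from the first post, offered as a sketch rather than as part of the approach above: build one yield column directly and keep the farmer identifier plus a crop indicator. The variable names, and the outcome y mentioned in the last comment, are hypothetical.

```stata
clear
input farmer str1 crop yieldA yieldB
1 "A" 5  .
2 "A" 10 .
3 "B" .  9
4 "A" 8  .
end

* One yield column: each farmer's yield for the crop actually grown
generate yield = cond(crop == "A", yieldA, yieldB)

* Indicator for crop B, so that a model can tell the two crops apart
generate byte cropB = (crop == "B")

list farmer crop yield cropB

* With a hypothetical outcome variable y (not in this thread), a model
* allowing crop-specific intercepts and slopes could be written as:
*     regress y c.yield##i.cropB
```

    Keeping the crop indicator leaves open whether the two crops should share one slope at all, which depends on what yield means for each crop.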

    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Hi Carlo,

      Thanks, that's a really good idea. My only concern is that these are two very different products. But given that the unit of measurement and the expected impact on the dependent variable are the same, I understand that wouldn't be an issue, right?

      Best regards,
      Rodrigo



      • #4
        What's yield? If it's total production then arguably missing means zero if and only if there was no crop at all.

        If it means kg or tonnes per hectare, replacement seems to imply that if farmers grew a crop they didn't in fact grow, the crop would have perished. I can't see that making sense in principle, let alone helping any analysis, with a line of zeros on each dimension of a scatter plot.

        I can imagine data on baseball stars lacking data on their cricket performance and vice versa, but that would be a case where if data did not exist there is no basis for imputing them.

        The deeper question is what you want to do with these data anyway.



        • #5
          I can't see that the yields of different crops can be compared meaningfully unless the measurement units make sense for the problem. Even weight might not be enough as a proxy for transport costs, as density and whether products are delicate can be drivers too.



          • #6
            Hi Nick,

            Thanks for your input. The yield variable is measured as production/area. My line of thinking was closer to your comment about baseball and cricket players. In this case, there are areas where only one type of crop is produced, while in other areas another crop is produced exclusively. So when farmers reached the question of which crop they grew, they reported the yield only for that crop. Therefore, there is only data for yield A or yield B, meaning that data does not exist for the other. That is why I thought of using 0s instead of ".". The objective is to evaluate the effect of a crop's yield on a dependent variable. It's a simple regression.

            Best regards,
            Rodrigo



            • #7
              Your zeros aren’t benign. Once in your data they would be taken utterly literally. I can’t see that helping any goal.
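
              A small illustration of "taken utterly literally", using the four-farmer table from the first post. The variable names are assumed, and the recoding is done on a copy (yieldB0) purely to show its effect, not as a recommendation.

```stata
clear
input farmer str1 crop yieldA yieldB
1 "A" 5  .
2 "A" 10 .
3 "B" .  9
4 "A" 8  .
end

* Mean yield of crop B among farmers who actually grew it
summarize yieldB
display "mean of observed B yields: " r(mean)

* Copy with missing recoded to zero (illustration only)
generate yieldB0 = cond(missing(yieldB), 0, yieldB)
summarize yieldB0
display "mean after recoding to 0:  " r(mean)
```

              The first mean is 9, based on the one farmer who grew crop B. After recoding it becomes 2.25 over four farmers, because the three crop A farmers are now counted as having harvested nothing per hectare of crop B. Any regression would read those zeros the same way.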
