Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to replace values for an unique pair?

    Hi Statalisters!

    I am trying to impute the missing unemployment data for US counties with their nearest non-missing pair. However, I am unable to execute a code that works. It would be really helpful if someone could advise how. I share a data and code example below.

    County Pairs
    // Missing_County Using_County
    //22051-->22057-->21.5mi
    //22071-->28045-->35.3mi
    //22075-->22057-->66.2mi
    //22087-->28045-->35.1mi
    //22089-->22093-->27.1mi
    //22095-->22093-->18.4mi
    //22103-->28045-->28.1mi

    use LF_Impute.dta, clear
    bys fips (year): gen fips_using = .
    // Replace `fips_using` based on the specified combinations
    replace fips_using = 22057 if fips == 22051
    replace fips_using = 28045 if fips == 22071
    replace fips_using = 22057 if fips == 22075
    replace fips_using = 28045 if fips == 22087
    replace fips_using = 22093 if fips == 22089
    replace fips_using = 22093 if fips == 22095
    replace fips_using = 28045 if fips == 22103


    // Define the variables to replace
    local vars "Total_Labor_Force Employed Unemployed Unemp_Rate"

    // Replace values for the specified years
    foreach var of local vars {
    replace `var' = `var' if fips == 22051 & fips_using == 22057 & inlist(year, 2005, 2006) & missing(`var')
    }

    input long fips int year long(Total_Labor_Force Employed Unemployed) double Unemp_Rate float fips_using
    22051 1990 227512 214685 12827 5.6 22057
    22051 1991 233370 217847 15523 6.7 22057
    22051 1992 232850 213654 19196 8.2 22057
    22051 1993 232414 213861 18553 8 22057
    22051 1994 236591 218609 17982 7.6 22057
    22051 1995 236627 221041 15586 6.6 22057
    22051 1996 238749 224843 13906 5.8 22057
    22051 1997 240726 228747 11979 5 22057
    22051 1998 242255 231298 10957 4.5 22057
    22051 1999 240788 230686 10102 4.2 22057
    22051 2000 231068 220742 10326 4.5 22057
    22051 2001 227806 217126 10680 4.7 22057
    22051 2002 224226 212908 11318 5 22057
    22051 2003 223626 212307 11319 5.1 22057
    22051 2004 223975 213509 10466 4.7 22057
    22051 2005 . . . . 22057
    22051 2006 . . . . 22057
    22051 2007 221067 213404 7663 3.5 22057
    22051 2008 215107 206047 9060 4.2 22057
    22051 2009 211466 198168 13298 6.3 22057
    22051 2010 216160 201126 15034 7 22057
    22051 2011 213834 199220 14614 6.8 22057
    22051 2012 213213 199607 13606 6.4 22057
    22051 2013 213738 200869 12869 6 22057
    22051 2014 217170 204407 12763 5.9 22057
    22051 2015 218574 206102 12472 5.7 22057
    22051 2016 216748 205292 11456 5.3 22057
    22051 2017 215341 205590 9751 4.5 22057
    22051 2018 213823 204547 9276 4.3 22057
    22051 2019 214317 205693 8624 4 22057
    22051 2020 209739 189702 20037 9.6 22057
    22053 1990 11972 11163 809 6.8 .
    22053 1991 12681 11608 1073 8.5 .
    22053 1992 12496 11196 1300 10.4 .
    22053 1993 11708 10729 979 8.4 .
    22053 1994 11849 10885 964 8.1 .
    22053 1995 11862 11048 814 6.9 .
    22053 1996 11663 10831 832 7.1 .
    22053 1997 11837 11065 772 6.5 .
    22053 1998 11754 11057 697 5.9 .
    22053 1999 11358 10596 762 6.7 .
    22053 2000 12739 12012 727 5.7 .
    22053 2001 12835 12089 746 5.8 .
    22053 2002 13086 12343 743 5.7 .
    22053 2003 13543 12768 775 5.7 .
    22053 2004 13484 12818 666 4.9 .
    22053 2005 14369 13473 896 6.2 .
    22053 2006 14526 14078 448 3.1 .
    22053 2007 14578 14090 488 3.3 .
    22053 2008 14884 14319 565 3.8 .
    22053 2009 14708 13856 852 5.8 .
    22053 2010 13106 12173 933 7.1 .
    22053 2011 12957 12070 887 6.8 .
    22053 2012 13037 12228 809 6.2 .
    22053 2013 13169 12385 784 6 .
    22053 2014 13514 12735 779 5.8 .
    22053 2015 13543 12709 834 6.2 .
    22053 2016 13592 12773 819 6 .
    22053 2017 13643 12979 664 4.9 .
    22053 2018 13534 12899 635 4.7 .
    22053 2019 13328 12725 603 4.5 .
    22053 2020 12745 11731 1014 8 .
    22055 1990 82613 79173 3440 4.2 .
    22055 1991 85404 81351 4053 4.7 .
    22055 1992 86000 80879 5121 6 .
    22055 1993 85061 81033 4028 4.7 .
    22055 1994 87516 83314 4202 4.8 .
    22055 1995 89219 85136 4083 4.6 .
    22055 1996 91639 87906 3733 4.1 .
    22055 1997 94983 91487 3496 3.7 .
    22055 1998 97263 93456 3807 3.9 .
    22055 1999 95162 91292 3870 4.1 .
    22055 2000 97017 93156 3861 4 .
    22055 2001 99530 95432 4098 4.1 .
    22055 2002 98500 94036 4464 4.5 .
    22055 2003 98028 93343 4685 4.8 .
    22055 2004 98168 93948 4220 4.3 .
    22055 2005 104618 99139 5479 5.2 .
    22055 2006 108532 105388 3144 2.9 .
    22055 2007 110014 106850 3164 2.9 .
    22055 2008 112131 108523 3608 3.2 .
    22055 2009 112171 106478 5693 5.1 .
    22055 2010 113598 106923 6675 5.9 .
    22055 2011 114387 107958 6429 5.6 .
    22055 2012 117238 111620 5618 4.8 .
    22055 2013 119001 113366 5635 4.7 .
    22055 2014 121951 116068 5883 4.8 .
    22055 2015 119603 112863 6740 5.6 .
    22055 2016 114488 107456 7032 6.1 .
    22055 2017 113287 107757 5530 4.9 .
    22055 2018 114218 109176 5042 4.4 .
    22055 2019 115164 110381 4783 4.2 .
    22055 2020 114314 105843 8471 7.4 .
    22057 1990 36643 35133 1510 4.1 .
    22057 1991 37099 35363 1736 4.7 .
    22057 1992 36481 34237 2244 6.2 .
    22057 1993 36516 34363 2153 5.9 .
    22057 1994 37554 35442 2112 5.6 .
    22057 1995 37683 35938 1745 4.6 .
    22057 1996 38900 37387 1513 3.9 .
    end
    [/CODE]

  • #2
    Consider rangestat from SSC.

    Code:
    replace fips_using = 0 if fips_using == . 
    
    rangestat wanted=Unemp_Rate, interval(fips fips_using fips_using) by(year)

    Comment


    • #3
      Your code will do nothing. There are only a few observations in the data set that meet the -if- condition, but even there, -replace `var' = `var'- evidently leaves everything as it was in the first place.

      I'm not entirely sure I understand what you want to do. But perhaps what you mean is that where one of the variables in local macro vars has a missing value, replace it with the value of vars found in the observation having the same year and whose fips value is the fips_using value of the current observation. If so, I believe the code you want is this:

      Code:
      // Define the variables to replace
      local vars Total_Labor_Force Employed Unemployed Unemp_Rate
      
      frame copy default imputers
      frame imputers {
          local mvars: subinstr local vars " " ", ", all
          drop if missing(`mvars', fips)
      }
      
      frlink m:1 fips_using year, frame(imputers fips year)
      foreach v of varlist `vars' {
          replace `v' = frval(imputers, `v') if missing(`v')
      }
      By the way, this code does not work in your example data, because it has no year 2005 and 2006 observations for fips 22057. But when suitably extended with fake data to cover that gap, it does this trick. I hope this trick is what you are looking for. If not, please post back with a clear explanation of what you want. Better still, show an example data set that includes the necessary data and then show what you want the result to look like.

      Added: Crossed with #2 which shows a different solution.

      Comment


      • #4
        Hi Nick,

        Thanks for responding. I don't think I am using the code in the right way perhaps, as the missing values still exist, can you please elaborate on this.
        use LF_Impute.dta, clear
        bys fips (year): gen fips_using = .
        // Replace `fips_using` based on the specified combinations
        replace fips_using = 22057 if fips == 22051
        replace fips_using = 28045 if fips == 22071
        replace fips_using = 22057 if fips == 22075
        replace fips_using = 28045 if fips == 22087
        replace fips_using = 22093 if fips == 22089
        replace fips_using = 22093 if fips == 22095
        replace fips_using = 28045 if fips == 22103
        ************************************************** ********************
        replace fips_using = 0 if fips_using == .

        rangestat wanted=Unemp_Rate, interval(fips fips_using fips_using) by(year)
        *

        }
        Click image for larger version

Name:	missing.png
Views:	1
Size:	127.5 KB
ID:	1768663

        Comment


        • #5
          It's up to you to replace your values if you wish. My code doesn't purport to do that.

          Comment


          • #6
            Hi Clyde,

            Let me try explaining the problem that I am trying to solve. I have 7 counties that have missing labor force data for the years 2005 and 2006 and I want to use their nearest non-missing neighbor to impute for the missing values. These county combinations are as follows:

            // Missing_County Using_County
            //22051-->22057-->21.5mi
            //22071-->28045-->35.3mi
            //22075-->22057-->66.2mi
            //22087-->28045-->35.1mi
            //22089-->22093-->27.1mi
            //22095-->22093-->18.4mi
            //22103-->28045-->28.1mi

            Thus I need a code that works for this purpose.

            Comment


            • #7
              Nick,

              It would be really helpful if you can suggest how can I use your code chunk with replace? I am quite confused about this.

              Comment


              • #8
                Re #7: Then the code in #3 will do that.

                Comment


                • #9
                  You should not be confused. If you want to replace your variable Unemp_Rate with wanted you go


                  Code:
                  replace Unemp_Rate = wanted if missing(Unemp_Rate)
                  To me it is a cardinal principle of data management never to change original data, but always to generate a new variable. (That's not to say I've never done it, but it is practice to avoid.)

                  Comment


                  • #10
                    Going about it your way is going to produce a weird numbers, unless the closest match is identical in size.

                    If you want to proceed, then use growth rates. That way, the imputed values match the level of the data.

                    There would be several ways to proceed. You could use the mean growth rates for the missing years. Or, you could find the series with the lowest RMSE and use those growth rates. Or use neighboring counties.

                    As you're basically making data up, you could also use interpolation. Unless there's a calendar year effect for the missings, it probably just as good as any approach (maybe not impute).

                    Code:
                    xtset fips year
                    g dTotal_Labor_Force = Total_Labor_Force/l.Total_Labor_Force
                    egen change = mean(dTotal_Labor_Force), by(year)
                    tabstat change , by(year)
                    bys fips: carryforward Total_Labor_Force, gen(Total_Labor_Force_Fix)
                    
                    replace Total_Labor_Force_Fix = round(change*l.Total_Labor_Force_Fix) if mi(Total_Labor_Force)
                    
                    bys fips: ipolate Total_Labor_Force year, gen(Total_Labor_Force_ipolate)
                    
                    g TLFmissing = mi(Total_Labor_Force)
                    "The data include missing values for seven counties in years 2005 and 2006. Missing values are imputed using the sample mean growth rates in those years for each variable. [Missing values are imputed using the mean growth rate for each variable for the county with the smallest RMSE for the available data.] (Missing values are imputed using the mean growth rate for variable for neighboring counties.)"

                    Estimate both with and without the imputed data to see if much changes (I doubt it, given the small number of missing). And, estimate with the missing data too, if feasible. And, include a dummy variable for any replaced values as another check (TFLmissing).
                    Last edited by George Ford; 02 Dec 2024, 16:18.

                    Comment


                    • #11

                      Actually, that's not quite right. You need to grow from the prior year.
                      Code:
                      replace Total_Labor_Force_Fix = round(change*l.Total_Labor_Force_Fix) if mi(Total_Labor_Force) & year==2005
                      replace Total_Labor_Force_Fix = round(change*l.Total_Labor_Force_Fix) if mi(Total_Labor_Force) & year==2006

                      Comment


                      • #12
                        Thank you all for the great suggestions, however, I couldn't get the desired results using any of these, hence I kind of used a manual replacement.

                        Comment


                        • #13
                          I ran that code on your dataex.

                          How are you going to explain manual replacement?

                          Comment


                          • #14
                            What I was trying to achieve is replace the missing year's data with the nearest non-missing neighbor. Here's what I did,

                            // Missing_County Using_County
                            //22051-->22057-->21.5mi
                            //22071-->28045-->35.3mi
                            //22075-->22057-->66.2mi
                            //22087-->28045-->35.1mi
                            //22089-->22093-->27.1mi
                            //22095-->22093-->18.4mi
                            //22103-->28045-->28.1mi
                            ************************************************** ********************
                            // Load the dataset
                            use LF_Impute.dta, clear

                            // Only way to do this by code would be to create merges of the dataset.

                            gen fips_missing = .
                            keep if fips==22057
                            keep if year==2005|year==2006
                            replace fips_missing = 22051
                            expand 2
                            replace fips_missing = 22075 if _n>2 // third and fourth observation
                            save Impute1.dta, replace

                            ************************************************** ********************
                            use LF_Impute.dta, clear

                            // Only way to do this by code would be to create merges of the dataset.

                            gen fips_missing = .
                            keep if fips==22093
                            keep if year==2005|year==2006
                            replace fips_missing = 22089
                            expand 2
                            replace fips_missing = 22095 if _n>2 // third and fourth observation
                            save Impute2.dta, replace


                            ************************************************** ********************
                            use LF_Impute.dta, clear

                            // Only way to do this by code would be to create merges of the dataset.

                            gen fips_missing = .
                            keep if fips==28045
                            keep if year==2005|year==2006
                            replace fips_missing = 22071
                            expand 2
                            replace fips_missing = 22087 if _n>2 // third and fourth observation
                            expand 2
                            replace fips_missing = 22103 if _n>4 // third and fourth observation
                            drop if _n>6
                            save Impute3.dta, replace
                            ***********************************************
                            clear all
                            use Impute1.dta, clear
                            append using Impute2.dta, force
                            append using Impute3.dta, force
                            ************************************************** ********************
                            // After the execution of this code chunk I now have stacked data from the using counties
                            // Now all we need to do is replace fips_missing with fips
                            drop fips
                            ren fips_missing fips
                            save Impute_Missing.dta, replace

                            ************************************************** ********************
                            use Impute_Missing.dta, clear
                            merge m:m fips year using LF_Impute.dta
                            drop if _merge==1
                            gen LF_Imputed_Status = "."
                            replace LF_Imputed_Status = "Imputed using Nearest Neighbor" if _merge==3
                            replace LF_Imputed_Status = "NA" if _merge==2
                            drop _merge
                            drop state county FIPS_ST FIPS_COUNTY

                            cd "D:\Academic\1-UNL\1-Research\1-Projects\1-US Studies\1-Hurricanes and Crime\1-Data\4-Economic Data\1-Unemp_LAUS"
                            sort fips year
                            save LF-Master_Imputed.dta, replace

                            Comment


                            • #15
                              what if the nearest neighbor is 20% the size of the missing county?

                              Comment

                              Working...
                              X