  • Predict upcoming year data by country

    Hello, I'm returning to Stata after a long hiatus and have forgotten much of what I knew. I have a data set of demographic information for every country for the past 10 years, and want to use it to predict the next year's data for one or two variables. I know this is panel data, but I am unsure how to regress the demographic info, predict the next year's data, and then enter it as a new variable (presumably by adding it to the previous year's value). A sample of my data is below. I used reshape long to get it into its current format:

    year = year between 2006 and 2015
    country_num = numerical country code
    sur = number of surviving infants for that country and year
    cbr = crude birth rate for that country and year

    country_num | year | sur | cbr

    2 | 2006 | 1044049 | 43.9
    2 | 2014 | 864078 | 34.83
    2 | 2015 | 844474 | 33.98
    4 | 2007 | 35421 | 12.12
    4 | 2013 | 34920 | 11.91
    4 | 2015 | 34839 | 11.87

    Thank you!

  • #2
    Something like this:

    Code:
    // CREATE NEW OBSERVATIONS FOR YEAR 2016
    expand 2 if year == 2015
    by country year, sort: replace year = 2016 if year == 2015 & _n == 2
    
    levelsof country, local(countries)
    
    foreach v of varlist sur cbr {
        replace `v' = . if year == 2016
        foreach c of local countries {
            regress `v' year if country == `c'
            predict `v'_hat
        }
    }
    Note: Not tested. Beware of typos, etc.

    A word of caution (well, a sentence, actually): variables like birth rates and surviving infants do not really lend themselves to interpolation/extrapolation by linear regression over more than very short periods of time.

    Your example data is shown in a not particularly useful way. For future posts, please read FAQ #12, especially the part about how to use -dataex- to better post data examples.

    • #3
      Thanks so much. I'm only planning to use this calculation to extrapolate one year ahead using linear regression. I added the dataex code; I'm using it for the first time, so it may not appear correctly. Right now that code is generating blank observations for the year 2016 (it generates them, but isn't filling them in). The output says the following:

      foreach v of varlist sur {
      replace `v' = . if year == 2016
      foreach c of local country {
      regress `v' year if country == `c'
      predict `v'_hat
      }
      }
      (232 real changes made, 232 to missing)
      no observations
      r(2000);


      Would very much appreciate any further tips!

      • #4
        Here is the corresponding section of data (one country's example) to go with the code above:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float sur int year
        1044049.4 2006
        1049075.8 2007
        1022607 2008
        994552.3 2009
        965808.3 2010
        937593.1 2011
        910824.6 2012
        886200.5 2013
        864078.7 2014
        844474.3 2015
        . 2016
        end

        • #5
          Thanks for using -dataex-. You basically got it right, although when you copied the output, you forgot to include the code delimiters that precede and follow what you did copy. Similarly, in posting your code, you should bind it in code delimiters.

          In any case, what you posted is serviceable. The difficulty here is that you have no country variable, and have not defined local macro country. Consequently when Stata sees -foreach c of local country {-, local country evaluates to an empty string, and there are no c's to loop over. So the loop is skipped entirely, which is why you get nothing.
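
          To illustrate the point about undefined macros, here is a minimal sketch. The macro name nosuchmacro is hypothetical and is deliberately never defined; the counter n_iter is only there to show how many times the loop body runs.

          ```stata
          * Looping over an undefined local macro: the macro expands to an
          * empty string, so the loop body is never executed.
          local n_iter = 0
          foreach c of local nosuchmacro {
              local ++n_iter
          }
          display "iterations: `n_iter'"
          ```

          Stata does not complain about the undefined macro; it simply has nothing to loop over, so the counter stays at 0.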

          If we expand your data set to include the country variable that you originally said was there, and add back the statement to define a local macro containing the values of that variable, then we can actually loop over them and get some results.

          That said, had you not made that erroneous modification to the code offered in #2, you would have encountered the error that I made in writing it. The -predict- command would have failed the second time through the loop because variable `v'_hat would already exist. I have corrected that in the code below: I have -predict- create a new variable, and then copy that variable to `v'_hat, but only for the current value of country.

          One more note: although it is perfectly legal to have a variable and a local macro with the same name, it's not a good practice. If you mistakenly type country when you meant `country', Stata will retrieve the value of variable country in the first observation and use that instead of whatever is in local macro country. By choosing macro names that do not exactly match (but are suggestive of) the variables they are associated with, you ensure that only the macro or only the variable can be referred to in the appropriate context: if you mix them up, Stata will throw an error and alert you to the problem.
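
          A small sketch of that name clash, using a made-up two-observation data set (the values 3, 5 and 99 are arbitrary and chosen only for illustration):

          ```stata
          clear
          input int country
          3
          5
          end

          * A local macro that shares its name with the variable
          local country 99

          * Without macro quotes, Stata reads the variable, and in an
          * expression that means its value in the first observation
          display country

          * With macro quotes, Stata substitutes the macro's contents
          display `country'
          ```

          The first -display- shows 3 and the second shows 99, with no warning that the two names refer to different things.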

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float sur int(year country)
          1044049.4 2006 1
          1049075.8 2007 1
            1022607 2008 1
           994552.3 2009 1
           965808.3 2010 1
           937593.1 2011 1
           910824.6 2012 1
           886200.5 2013 1
           864078.7 2014 1
           844474.3 2015 1
                  . 2016 1
          end
          
          levelsof country, local(countries)
          foreach v of varlist sur {
              replace `v' = . if year == 2016
              gen `v'_hat = .
              foreach c of local countries {
                  regress `v' year if country == `c'
                  predict temp
                  replace `v'_hat = temp if country == `c'
                  drop temp
              }
          }
          Last edited by Clyde Schechter; 05 Oct 2017, 09:14.

          • #6
            That's very helpful about the macro names, thank you! The country variable is in ID form; I'm unsure whether that's a problem given the syntax structure. I'm currently still getting missing values and the "no observations" error message, with blank values for the generated sur_hat variable and the 2016 rows of the sur variable. Would very much appreciate any advice!

            . expand 2 if year == 2015
            (264 observations created)

            . by country year, sort: replace year = 2016 if year == 2015 & _n == 2
            (264 real changes made)

            . levelsof country, local(countries)
            1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 3
            > 2 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
            > 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
            > 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 11
            > 1 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 1
            > 32 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152
            > 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173
            > 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 19
            > 4 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 2
            > 15 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235
            > 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256
            > 257 258 259 260 261 262 263 264

            . foreach v of varlist sur {
            2. replace `v' = . if year == 2016
            3. gen `v'_hat = .
            4. foreach c of local countries {
            5. regress `v' year if country == `c'
            6. predict temp
            7. replace `v'_hat = temp if country == `c'
            8. drop temp
            9. }
            10. }
            (232 real changes made, 232 to missing)
            (2904 missing values generated)
            no observations
            r(2000);

            .

            • #7
              Associated -dataex- output here:

              input float sur int(year country)
              . 2006 1
              . 2007 1
              . 2008 1
              . 2009 1
              . 2010 1
              . 2011 1
              . 2012 1
              . 2013 1
              . 2014 1
              . 2015 1
              . 2016 1
              1044049.4 2006 2
              1049075.8 2007 2
              1022607 2008 2
              994552.3 2009 2
              965808.3 2010 2
              937593.1 2011 2
              910824.6 2012 2
              886200.5 2013 2
              864078.7 2014 2
              844474.3 2015 2
              . 2016 2
              845226.2 2006 3
              870343.8 2007 3
              863638.1 2008 3
              855842.6 2009 3
              847206 2010 3
              838129.4 2011 3
              828811.5 2012 3
              819761.8 2013 3
              811120.4 2014 3
              803333.4 2015 3
              . 2016 3

              • #8
                So, the error message is self-explanatory. Stata has encountered a country where there are no observations to do the regression. Remember that a regression can only include those observations where all of the model variables have non-missing values. It is quite clear looking at country == 1 in your example data that sur is missing for all years for that country. My guess is that you will find other countries that have only missing values for sur (and perhaps for your other variables as well).
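
                One way to check that guess is to count the non-missing values of sur within each country. The sketch below uses a few rows taken from the example data; the helper variable names n_sur and first are hypothetical.

                ```stata
                clear
                input float sur int(year country)
                .        2014 1
                .        2015 1
                864078.7 2014 2
                844474.3 2015 2
                end

                * Number of non-missing values of sur within each country
                bysort country: egen n_sur = count(sur)

                * Tag one observation per country so each country is listed once
                egen byte first = tag(country)

                * Countries with no usable observations for the regression
                list country if first & n_sur == 0
                ```

                Here only country 1 is listed. Changing the condition to n_sur < 2 would also catch countries with a single non-missing value, which is too few for the regression.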

                So the question is whether this represents an error in the data management that created your data set, or whether that is actually OK and we need to modify the program to work around that limitation. Here is how you would work around it. But, please, do not use this code unless you are sure that the presence of only missing values for a variable for an entire country's data is not a reflection of incorrect data management up to this point. If that is the result of data management errors, you can't really trust any of the data. Where there is one mistake, there are usually others as well.

                Code:
                clear*
                input float sur int(year country)
                . 2006 1
                . 2007 1
                . 2008 1
                . 2009 1
                . 2010 1
                . 2011 1
                . 2012 1
                . 2013 1
                . 2014 1
                . 2015 1
                . 2016 1
                1044049.4 2006 2
                1049075.8 2007 2
                1022607 2008 2
                994552.3 2009 2
                965808.3 2010 2
                937593.1 2011 2
                910824.6 2012 2
                886200.5 2013 2
                864078.7 2014 2
                844474.3 2015 2
                . 2016 2
                845226.2 2006 3
                870343.8 2007 3
                863638.1 2008 3
                855842.6 2009 3
                847206 2010 3
                838129.4 2011 3
                828811.5 2012 3
                819761.8 2013 3
                811120.4 2014 3
                803333.4 2015 3
                . 2016 3
                end
                
                levelsof country, local(countries)
                foreach v of varlist sur {
                    replace `v' = . if year == 2016
                    gen `v'_hat = .
                    foreach c of local countries {
                        capture noisily regress `v' year if country == `c'
                        if c(rc) == 0 { // SUCCESSFUL REGRESSION; PROCEED
                            predict temp
                            replace `v'_hat = temp if country == `c'
                            drop temp
                        }
                        else if inlist(c(rc), 2000, 2001) { // NO OR INSUFFICIENT OBSERVATIONS
                            display "Insufficient or No Observations: country `c'"
                        }
                        else {    // UNANTICIPATED ERROR
                            display as error "Unanticipated Problem: country `c'"
                            error `c(rc)'
                        }
                    }
                }
                This code will attempt the regression for each country. If Stata finds no observations, or insufficient observations to do the regression, it will report back error code 2000 or 2001, respectively, identify the country that exhibited the problem, and then move on to the next country without halting. If an error condition arises other than insufficient or no observations, Stata will identify the offending country and then halt with whatever error message arose.
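
                To see the return-code mechanism in isolation, here is a minimal sketch on a made-up data set in which the outcome y is entirely missing (the variable names x and y are placeholders):

                ```stata
                clear
                set obs 5
                generate x = _n
                generate y = .

                * y has no non-missing values, so regress fails with "no observations".
                * -capture- keeps the do-file running; -noisily- still shows the message.
                capture noisily regress y x

                * c(rc) holds the return code left by the most recent -capture-
                display "return code: " c(rc)
                ```

                The -display- line reports 2000, which is the code the inlist() branch above tests for.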

                • #9
                  You're quite right that there is significant missing data, though this is unavoidable in my data set and I would like to predict the remaining countries' data regardless. This code worked for me, thank you so much!
