Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Construct annual rainfall panel data (across countries and years)

    Dear All,
    1. This is more a climatic problem than a Stata problem. I hope that I do not violate the regulation of this Forum.
    2. The purpose is to use Stata to obtain a complete panel data of rainfall across all countries and over years (say, from 1960-recent).
    3. According to Rabah Arezki, and Markus Brückner (2012, JIE), https://www.sciencedirect.com/scienc...747?via%3Dihub: Our data sources for the estimation of the above equations are as follows. The annual rainfall data are from Terrestrial Air Temperature and Precipitation: 1900–2006 Gridded Monthly Time Series, Version 1.01 (Matsuura and Willmott, 2007). These rainfall data come at a high resolution (0.5×0.5° latitude-longitude grid) and each rainfall observation in a given grid is constructed by interpolation of rainfall observed by all stations operating in that grid. The rainfall data are then aggregated to the country level by assigning grids to the geographic borders of countries.
    4. I am an applied economist, and not so familiar with the data. I wonder if anybody happens to know the data and can help me with the construction of the rainfall data. Thanks.
    Ho-Chuan (River) Huang
    Stata 17.0, MP(4)

  • #2
    A little work with my favorite search engine (DuckDuckGo) finds the following links, which may prove helpful.

    Matsura and Wilmott's home pages, including a link to their data:
    A link to a paper by Dell, Jones, and Olken that appears to use the same dataset (see the link to the online appendix for details about their dataset):
    Added in edit: it was not clear to me if your need for assistance referred to the interpolation part of the description, which the links suggest is already done by Matsura and Wilmott in preparing their data, or if you need assistance with the geographic aggregation, which is more a geographic information systems issue.
    Last edited by William Lisowski; 06 Apr 2018, 07:19.

    Comment


    • #3
      Dear William, Many thanks for your replies. I have visited the website (http://climate.geog.udel.edu/~climat.../download.html) you suggested and download two zipped files (along with README as below. Terrestrial Precipitation: 1900-2014 Gridded Monthly Time Series (V 4.01)

      README Documentation
      precip_2014.tar.gz (243MB) Terrestrial Monthly Precipitation series
      precip_cv2014.tar.gz (234MB) Cross validation of precip_2014.tar.gz
      I unzip the files (I think these files are what I wanted) and find files precip.1900 to precip.2014 (each for a particular year). There are information of longitude and latitude (along with monthly rainages?). I suppose those can be aggregated into annual rainfalls data over countries. Now, the problem is how? Can Stata do that? Any suggestions are highly appreciated.
      Ho-Chuan (River) Huang
      Stata 17.0, MP(4)

      Comment


      • #4
        To close the loop for those who may read this topic in the future, this discussion moved to the topic below, coincidentally started by another user of the same data.

        https://www.statalist.org/forums/for...data-from-noaa

        Comment


        • #5
          When I try http://climate.geog.udel.edu/~climat.../download.html, here's what I get:
          Code:
          Forbidden
          You don't have permission to access /~climate/html_pages/download.html on this server.
          Apache/2.2.15 (CentOS) Server at climate.geog.udel.edu Port 80
          The same for

          Code:
          http://climate.geog.udel.edu/~climate/
          How did you get access?

          Comment


          • #6
            Absolutely no problems from my end, Robert. Just now I clicked directly on the link in your post, for the first one, and copied-and-pasted for the second. Both Safari and Firefox on my Mac on my home WiFi, and for good measure, Safari on iOS over my cell connection. I'm at a loss as to what is going wrong for you.

            Comment


            • #7
              Thanks for checking, I had assumed that this was just an .edu domain restriction (this is pretty frequent for subscription things where you have to use a university IP address). Must be something with my router then because all computers in the household can't get there. I can get through from my iPhone so I set it as a hot spot and my Mac can now get through. Again, this points to my router (or Comcast!) but the problem persists, even after restarting it. Bizarre.

              Comment


              • #8
                Dear Robert, I assume that there is no problem with copyright. I can send your the file in case you need. Just let me know.
                Ho-Chuan (River) Huang
                Stata 17.0, MP(4)

                Comment


                • #9
                  Hi, Robert, I don't known whether you have to fill out the following form (http://climate.geog.udel.edu/cgi-bin...5030252.pl?add) in order to use the data.
                  Ho-Chuan (River) Huang
                  Stata 17.0, MP(4)

                  Comment


                  • #10
                    Nope, that's not it, it just looks like my IP is blocked. Anyways, I got the precip_2014.tar.gz file from the site and this is in a very simple format that's easy to work with in Stata. I do not know where you are going with this but the following should get you started. First, download a shapefile of the world's country boundaries from:
                    Code:
                    http://thematicmapping.org/downloads/world_borders.php
                    by clicking on "TM_WORLD_BORDERS-0.3.zip". The following code will convert the shapefile to Stata datasets and then use the grid coordinates from the first precipitation dataset ("precip.1900") and perform a spatial join to match each grid point to a country _ID.
                    Code:
                    * the shapefile for the world was downloaded from
                    * http://thematicmapping.org/downloads/world_borders.php
                    * and clicking on "TM_WORLD_BORDERS-0.3.zip"
                    shp2dta using "TM_WORLD_BORDERS-0.3/TM_WORLD_BORDERS-0.3.shp", ///
                        data("world_data.dta") coor("world_coor.dta") replace
                    
                    * import weather data one time period 
                    infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ///
                        using "precip_2014/precip.1900", clear
                        
                    * reduce to grid coordinates
                    keep lat lon
                    order lat lon
                    
                    * match grid coordinates to countries
                    geoinpoly lat lon using "world_coor.dta"
                    
                    * drop a few grid points that did not match any country
                    drop if mi(_ID)
                    
                    * show the matching grid points
                    scatter lat lon, msize(vtiny)
                    graph export "world_grid.png", width(800) replace
                    save "world_grid.dta", replace
                    and here's what the "world_grid.png" map looks like:

                    Click image for larger version

Name:	world_grid.png
Views:	1
Size:	679.2 KB
ID:	1438603


                    You do not need to repeat the spatial join for each year since the grid coordinates remain the same across years. You can use merge to match the country _ID for any year.

                    The following code will input annual precipitation datasets (monthly data points in wide layout) into Stata and match each grid point to a country _ID. It also converts the data to a long layout, one observation per grid point per month. I'm using filelist (from SSC) to make a Stata dataset of files to process. I'm using runby (from SSC) to process each annual dataset. In a long layout, there's a bit over 1 million observations per year and there's 115 years of data so if all years are processed, the final data will contain over 115 million observations. I'm just doing 1900 to 1910 in the example and that produces a dataset that has 11,239,536 observations and takes about 1 minute to run on my computer:
                    Code:
                    clear all
                    
                    * get a list of all the files in the "precip_2014" subdirectory
                    filelist , dir("precip_2014")
                    
                    * extract the year from the file name
                    gen year = real(subinstr(filename,"precip.","",1))
                    assert !mi(year)
                    sum year
                    
                    * select a range a years to process
                    keep if inrange(year, 1900, 1910)
                    
                    program input_each_year
                        // copy file path and year to local macros
                        local filename = dirname + "/" + filename
                        local thisyear = year
                        
                        dis "---------------------------- processing `filename'"
                        infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ///
                            using "`filename'", clear
                        gen int year = `thisyear'
                        
                        // merge each grid point to get the _ID of the matched country
                        merge 1:1 lat lon using "world_grid.dta", ///
                            assert(master match) keep(match) nogen
                        
                        // convert to a long layout
                        quietly reshape long p, i(lat lon) j(month)
                        order lat lon year month p
                        isid lat lon year month, sort
                        
                        // what's left at this point is considered results and accumulates
                    end
                    runby input_each_year, by(year) verbose status
                    
                    isid lat lon year month, sort
                    save "precip_monthly_grid_1900_to_1910.dta", replace

                    Comment


                    • #11
                      See this parallel thread for a solution that extracts data for select countries only.

                      Comment


                      • #12
                        Dear Robert,
                        1. I cannot appreciate enough for your kind help. The result is pretty much what I need.
                        2. A quick question is that, because I need "annual" rather than "monthly" observations, what should I do in "program input_each_year" (to save, probably, space and time). I know that I can do that by using -collapse- on the final results. Thanks in advance.
                        Ho-Chuan (River) Huang
                        Stata 17.0, MP(4)

                        Comment


                        • #13
                          That makes it even simpler. Just add the monthly precipitation data using rowtotal()to get an annual total for the grid point. You do not need to reshape anymore and that cuts down the overall number of observations by 12. The following processes all years at once and runs in about 5 minutes on my computer:

                          Code:
                          clear all
                          
                          * get a list of all the files in the "precip_2014" subdirectory
                          filelist , dir("precip_2014")
                          
                          * extract the year from the file name
                          gen year = real(subinstr(filename,"precip.","",1))
                          assert !mi(year)
                          sum year
                          
                          program input_each_year
                              // copy file path and year to local macros
                              local filename = dirname + "/" + filename
                              local thisyear = year
                              
                              dis "---------------------------- processing `filename'"
                              infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ///
                                  using "`filename'", clear
                              gen int year = `thisyear'
                              egen p = rowtotal(p1-p12)
                              drop p1-p12
                              
                              // merge each grid point to get the _ID of the matched country
                              merge 1:1 lat lon using "world_grid.dta", ///
                                  assert(master match) keep(match) nogen
                              
                              // convert to a long layout
                              order lat lon year p
                              isid lat lon year, sort
                              
                              // what's left at this point is considered results and accumulates
                          end
                          runby input_each_year, by(year) status
                          
                          isid lat lon year, sort
                          save "precip_annual_grid_1900_to_2014.dta", replace
                          Just for the fun of it, here a map of the world with the grid points at the extreme of precipitation totals. There's a fair amount of grid points that have registered zero precipitation in a given year. There's only one whose at the max, with a whopping 19.68 meters of rain in a year. You can look it up in Google Maps by searching for 25.25, 91.75. The following plots grid points with more than 10 meters of rain in a year:

                          Code:
                          use "precip_annual_grid_1900_to_2014.dta", clear
                          sum p
                          list if p == r(max)
                          gen low  = p == r(min)
                          gen high = p > 10000
                          
                          keep if low | high
                          
                          bysort lat lon: keep if _n == 1
                          append using "world_coor.dta"
                          
                          line _Y _X if mi(p), lwidth(vthin) lcolor(gray) cmissing(n) ///
                          || ///
                          scatter lat lon if low == 1, msize(tiny) mcolor(blue) ///
                          || ///
                          scatter lat lon if high == 1, msize(tiny) mcolor(red) ///
                          legend(off) aspect(.5)
                          graph export "min_max.png", width(1500) replace
                          and the map:
                          Click image for larger version

Name:	min_max2.png
Views:	1
Size:	539.3 KB
ID:	1438820

                          Last edited by Robert Picard; 11 Apr 2018, 10:23.

                          Comment


                          • #14
                            Dear Robert,
                            1. Many thanks again for the help in constructing "annual" data. The code works just fine. When I check the final data (precip_annual_grid_1900_to_2014.dta), say I keep if _ID==209 (U.S.), I find that there are 511,635 observations across `lat/lon' along with `year's. So, I suppose that I further need to do the following to obtain (one-for-each-year) annual observation for the whole U.S.:
                              Code:
                              collapse (sum) p, by(year)
                              and
                              Code:
                              . sum
                              	
                              	    Variable |        Obs        Mean    Std. Dev.       Min        Max
                              	-------------+---------------------------------------------------------
                              	        year |        115        1957    33.34167       1900       2014
                              	           p |        115     3051408    175251.9    2529683    3427025
                              Do you think these numbers (basic statistics) are reasonable?
                            2. The graph is interesting.
                            Ho-Chuan (River) Huang
                            Stata 17.0, MP(4)

                            Comment


                            • #15
                              I'll assume that you are asking this because you want to benchmark these data with data from other sources. I don't have experience with weather data but I found a national time series for the contiguous US from NOAA from this page. You choose the following options:
                              Code:
                              Parameter: Precipitation
                              Time Scale: 12-Month
                              Month: December
                              Start Year: 1900
                              End Year: 2014
                              Options > Display Base Period 1900 - 2014
                              You can't simply add up the precipitation value for each grid point because the data was matched to grid locations defined using raw geographic coordinates. The problem is that the area around each grid point is not constant due to the fact that lines of longitude converge toward the poles whereas lines of latitude are parallel to each other. The distance between 0.5 degrees of latitude is the same all over the world (arguments for geodist are lat1 lon1 lat2 lone):
                              Code:
                              . geodist 0 0 0.5 0, sphere
                              Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                              
                              . geodist 0 99 0.5 99, sphere
                              Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                              
                              . geodist 50 33 50.5 33, sphere
                              Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                              The distance between 0.5 degrees of longitude depends on the latitude and becomes 0 at the poles:
                              Code:
                              . geodist 0 0 0 0.5, sphere
                              Great-circle distance (haversine formula, radius of 6371km) = 55.597463 km
                              
                              . geodist 45 0 45 0.5, sphere
                              Great-circle distance (haversine formula, radius of 6371km) = 39.313281 km
                              
                              . geodist 90 0 90 0.5, sphere
                              Great-circle distance (haversine formula, radius of 6371km) = 3.404e-15 km
                              So before you can calculate an overall statistic for the US from all these grid points, you must estimate the area around each grid point (the area of the grid point's Voronoi diagram). Around the equator, this will be more or less a square box of 55.6*55.6 square km. On a sphere, the distance as you walk along a line of latitude is proportional to the cosine of the latitude. So the width of the box for a grid point at 45 degrees latitude would be (trig functions require angles in radians):
                              Code:
                              . scalar d2r = c(pi) / 180
                              
                              . dis 0.5 * d2r * cos(45 * d2r) * 6371
                              39.313343
                              which is very close to the "as the crow fly" distance calculated by geodist above.

                              Since this is an interesting exercise, here's how I would replicate NOAA's time series using the gridded dataset. First, you will need a shapefile that you can use to select grid points that fall within the 48 conterminous (contiguous) US states. You can get this from the Census using this link. Once unzipped, you can create the Stata datasets using:
                              Code:
                              * source: http://www2.census.gov/geo/tiger/GENZ2017/shp/cb_2017_us_state_500k.zip
                              shp2dta using "cb_2017_us_state_500k/cb_2017_us_state_500k.shp", ///
                                  data("us_data.dta") coor("us_coor.dta") replace
                                  
                              * create a subset that includes only the 48 contiguous us states
                              use "us_data.dta", clear
                              list NAME if inlist(_ID, 17,19,39,42,46,52,56,48)
                              drop if inlist(_ID, 17,19,39,42,46,52,56,48)
                              save "us_cont_data.dta", replace
                              
                              use "us_coor.dta", clear
                              gen long obs = _n
                              drop if inlist(_ID, 17,19,39,42,46,52,56,48)
                              save "us_cont_coor.dta", replace
                              The next step is to calculate the area around each grid point (Voronoi diagram). I don't know how to do this in spherical geometry but you can do an planar approximation using rectangles. The height of each rectangle is the same and corresponds to the distance between two points at the same longitude but .5 degrees of latitude apart. The width of each rectangle is proportional to the cosine of the latitude. As before, the grid points are extracted from one of the time periods.
                              Code:
                              * import weather data one time period 
                              infile lon lat p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 ///
                                  using "precip_2014/precip.1900", clear
                                  
                              * reduce to grid coordinates
                              keep lat lon
                              order lat lon
                              isid lat lon, sort
                              
                              * decimal degree to radians constant
                              scalar d2r = c(pi) / 180
                              
                              * distance between .5 degrees of latitude, constant longitude
                              geodist 0 0 .5 0, sphere
                              scalar ydist = r(distance)
                              
                              * distance between .5 degrees of longitude, constant latitude
                              gen xdist = 0.5 * d2r * cos(lat * d2r) * 6371
                              
                              * double-check using shortest path distance
                              gen lon1 = lon - .25
                              gen lon2 = lon + .25
                              geodist lat lon1 lat lon2, sphere gen(dcheck)
                              gen diff = abs(xdist-dcheck)
                              sum diff
                              
                              * the area of the grid point's Voronoi diagram
                              gen grid_area = xdist * ydist
                              sum grid_area
                              
                              keep lat lon grid_area
                              save "grid_area.dta", replace
                              Here are the descriptive statistics (grid_area is in squared km)
                              Code:
                              . gen diff = abs(xdist-dcheck)
                              
                              . sum diff
                              
                                  Variable |        Obs        Mean    Std. Dev.       Min        Max
                              -------------+---------------------------------------------------------
                                      diff |     85,794    .0000378    .0000223   6.03e-08    .000069
                              
                              . 
                              . * the area of the grid point's Voronoi diagram
                              . gen grid_area = xdist * ydist
                              
                              . sum grid_area
                              
                                  Variable |        Obs        Mean    Std. Dev.       Min        Max
                              -------------+---------------------------------------------------------
                                 grid_area |     85,794      1714.6    997.3284   13.48733   3091.049
                              Now you can match these grid points to the shapefile of the 48 conterminous states. Because the shapefile contains data on land area, we can compare the total area as calculated using grid_area with the area of each state from the shapefile.
                              Code:
                              use "grid_area.dta", clear
                              
                              * match grid coordinates to contiguous US states
                              geoinpoly lat lon using "us_cont_coor.dta"
                              
                              * drop grid points that did not match
                              drop if mi(_ID)
                              
                              * show the matching grid points
                              scatter lat lon, msize(vtiny)
                              graph export "us_cont_grid.png", width(800) replace
                              
                              save "us_cont_grid.dta", replace
                              
                              * get the state's area from the shapefile
                              merge m:1 _ID using "us_cont_data.dta", assert(match) keepusing(NAME ALAND) nogen
                              
                              * the land area is in square meters
                              gen alandkm2 = ALAND / 1000^2
                              
                              * calculate grid area per state
                              isid _ID lat lon, sort
                              by _ID: gen one = _n == 1
                              by _ID: egen grid_area_state = total(grid_area)
                              
                              * the difference in % by state
                              gen diff = abs(grid_area_state - alandkm2) / alandkm2 * 100
                              sum diff if one
                              
                              * overall
                              sum alandkm2 if one
                              dis %12.0gc r(sum)
                              
                              sum grid_area
                              dis %12.0gc r(sum)
                              The map of the grid points that matched one of the 48 conterminous state:

                              Click image for larger version

Name:	us_cont_grid.png
Views:	1
Size:	149.5 KB
ID:	1439098


                              and here are the results from the descriptive stats:
                              Code:
                              . sum diff if one
                              
                                  Variable |        Obs        Mean    Std. Dev.       Min        Max
                              -------------+---------------------------------------------------------
                                      diff |         48    6.659204    9.395654   .0035655   42.27459
                              
                              . 
                              . * overall
                              . sum alandkm2 if one
                              
                                  Variable |        Obs        Mean    Std. Dev.       Min        Max
                              -------------+---------------------------------------------------------
                                  alandkm2 |         48    159443.2      121115   2677.998   676641.9
                              
                              . dis %12.0gc r(sum)
                               7,653,274.2
                              
                              . 
                              . sum grid_area
                              
                                  Variable |        Obs        Mean    Std. Dev.       Min        Max
                              -------------+---------------------------------------------------------
                                 grid_area |      3,266    2378.367    180.2581   2038.089   2795.741
                              
                              . dis %12.0gc r(sum)
                               7,767,746.4
                              The big difference (max 42.27%) when calculated by state is due to Delaware, a very small state. Overall however, the total area are pretty close.

                              We can now extract from the full grid dataset the data for these US grid points and calculate average precipitation per year:
                              Code:
                              use "us_cont_grid.dta", clear
                              merge 1:m lat lon using "precip_annual_grid_1900_to_2014.dta", ///
                                  keep(match) nogen
                                  
                              * use the grid area to weigh the precip data
                              gen p_area = p * grid_area
                              
                              collapse (count) pN=p (sum) p_area grid_area, by(year)
                              
                              * mean precip in millimeters
                              gen p_mean = p_area / grid_area
                              
                              * mean precip in inches
                              gen p_mean_in = p_mean * 0.03937007874
                              
                              list if year >= 2010
                              and the results:
                              Code:
                              . list if year >= 2010
                              
                                   +---------------------------------------------------------+
                                   | year     pN     p_area   grid_a~a     p_mean   p_mean~n |
                                   |---------------------------------------------------------|
                              111. | 2010   3263   6.02e+09    7760198   775.2827   30.52294 |
                              112. | 2011   3263   5.80e+09    7760198   746.9092   29.40587 |
                              113. | 2012   3263   5.26e+09    7760198   678.1751   26.69981 |
                              114. | 2013   3263   5.99e+09    7760198   772.0376   30.39518 |
                              115. | 2014   3263   5.92e+09    7760198   762.9057   30.03566 |
                                   +---------------------------------------------------------+
                              The following are NOAA's estimate from the web page reference at the top of this post:
                              Code:
                              201001 - 201012    31.37
                              201101 - 201112    30.10
                              201201 - 201212    27.53
                              201301 - 201312    31.06
                              201401 - 201412    30.85
                              Pretty close from my point of view

                              Comment

                              Working...
                              X