  • Counting Variable per Month

    Need some help guys,

    . dataex

    ----------------------- copy starting from the next line -----------------------
    [CODE]
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str8 date str18 source str19 target str4 cameocode int numevents long numarts byte quadclass float goldstein str2 EventRootCode str19 target_cloned str3 target_country_only str6 MonthYear
    "19790101" "BUS" "USA" "040" 1 4 1 1 "04" "USA" "USA" "197901"
    "19790101" "CHN" "USA" "017" 1 2 1 0 "01" "USA" "USA" "197901"
    "19790101" "CHN" "USA" "036" 5 16 1 4 "03" "USA" "USA" "197901"
    "19790101" "CHN" "USA" "042" 3 9 1 1.9 "04" "USA" "USA" "197901"
    "19790101" "CHN" "USA" "051" 2 2 1 3.4 "05" "USA" "USA" "197901"
    "19790101" "COP" "USA" "190" 1 4 4 -10 "19" "USA" "USA" "197901"
    "19790101" "CUBGOV" "USA" "017" 1 5 1 0 "01" "USA" "USA" "197901"
    "19790101" "CUBGOV" "USA" "050" 1 5 1 3.5 "05" "USA" "USA" "197901"
    "19790101" "CUBGOV" "USA" "111" 1 5 3 -2 "11" "USA" "USA" "197901"
    "19790101" "CUB" "USA" "017" 1 4 1 0 "01" "USA" "USA" "197901"
    "19790101" "CUB" "USA" "050" 1 4 1 3.5 "05" "USA" "USA" "197901"
    "19790101" "CUB" "USA" "111" 1 4 3 -2 "11" "USA" "USA" "197901"
    "19790101" "EDU" "USA" "040" 1 5 1 1 "04" "USA" "USA" "197901"
    "19790101" "EDU" "USA" "182" 1 5 4 -9.5 "18" "USA" "USA" "197901"
    "19790101" "JUD" "USA" "020" 1 5 1 3 "02" "USA" "USA" "197901"
    "19790101" "JUD" "USA" "051" 1 2 1 3.4 "05" "USA" "USA" "197901"
    "19790101" "JUD" "USA" "120" 1 7 3 -4 "12" "USA" "USA" "197901"
    "19790101" "LEG" "USA" "042" 2 4 1 1.9 "04" "USA" "USA" "197901"
    "19790101" "LEG" "USA" "120" 1 7 3 -4 "12" "USA" "USA" "197901"
    "19790101" "MIL" "USA" "0334" 2 9 1 6 "03" "USA" "USA" "197901"
    "19790101" "USAJUD" "USA" "046" 1 2 1 7 "04" "USA" "USA" "197901"
    "19790101" "USAPRIJUD" "USA" "080" 1 4 2 5 "08" "USA" "USA" "197901"
    "19790101" "UZB" "USA" "040" 1 1 1 1 "04" "USA" "USA" "197901"
    "19790102" "BUS" "USA" "010" 1 2 1 0 "01" "USA" "USA" "197901"
    "19790102" "BUS" "USA" "013" 1 5 1 .4 "01" "USA" "USA" "197901"
    "19790102" "BUS" "USA" "050" 2 7 1 3.5 "05" "USA" "USA" "197901"
    "19790102" "CHN" "USA" "036" 1 2 1 4 "03" "USA" "USA" "197901"
    "19790102" "CHN" "USA" "051" 2 7 1 3.4 "05" "USA" "USA" "197901"
    "19790102" "CHN" "USA" "110" 1 5 3 -2 "11" "USA" "USA" "197901"
    "19790102" "COL" "USA" "036" 1 9 1 4 "03" "USA" "USA" "197901"
    "19790102" "COL" "USA" "042" 4 18 1 1.9 "04" "USA" "USA" "197901"
    "19790102" "COP" "USA" "036" 1 1 1 4 "03" "USA" "USA" "197901"
    "19790102" "COP" "USA" "043" 1 2 1 2.8 "04" "USA" "USA" "197901"
    "19790102" "COP" "USA" "190" 1 7 4 -10 "19" "USA" "USA" "197901"
    "19790102" "CUBGOVMIL" "USA" "111" 1 5 3 -2 "11" "USA" "USA" "197901"
    "19790102" "CUB" "USA" "043" 1 3 1 2.8 "04" "USA" "USA" "197901"
    "19790102" "CUB" "USA" "111" 2 4 3 -2 "11" "USA" "USA" "197901"
    "19790102" "CVLGOV" "USA" "036" 2 3 1 4 "03" "USA" "USA" "197901"
    "19790102" "DEUCVL" "USA" "030" 1 2 1 4 "03" "USA" "USA" "197901"

    Above is an example of my data.

    I want a count of each EventRootCode in my data per month, so I thought creating new variables would do it.

    egen ERC1=sum( EventRootCode==01),by(MonthYear)
    would create ERC1, the number of observations with EventRootCode 01 in each month, and I would do the same for every other EventRootCode.

    . egen ERC1=sum( EventRootCode==01),by(MonthYear)
    type mismatch
    r(109);

    That is the error I get.

    - How do I fix this? Both variables are string, so I thought this would not be a problem.

    - When I try to destring them to another data type I get:

    . destring EventRootCode,replace
    EventRootCode contains nonnumeric characters; no replace


    What am I doing wrong here? Thank you.



  • #2
    I didn't test, since the code delimiters were not used appropriately.

    But I gather you need to destring the variable beforehand.
    Best regards,

    Marcos



    • #3
      Code:
      destring MonthYear EventRootCode, replace
      levelsof EventRootCode, local(ERCs)
      foreach ERC in `ERCs' {
          egen ERC`ERC' = sum(EventRootCode == `ERC'), by(MonthYear)
      }
      Also, what Marcos is referring to: please use dataex as instructed, copying all the way to where it says "copy up to and including the previous line".
      If you didn't want that because your example would be too long, simply use, e.g.:
      Code:
      dataex in 1/20
      to generate an example with just the first 20 observations.



      • #4
        Here's your data example rewritten to focus on relevant variables, to include all needed code and to show properly between delimiters:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str2 EventRootCode str6 MonthYear
        "04" "197901"
        "01" "197901"
        "03" "197901"
        "04" "197901"
        "05" "197901"
        "19" "197901"
        "01" "197901"
        "05" "197901"
        "11" "197901"
        "01" "197901"
        "05" "197901"
        "11" "197901"
        "04" "197901"
        "18" "197901"
        "02" "197901"
        "05" "197901"
        "12" "197901"
        "04" "197901"
        "12" "197901"
        "03" "197901"
        "04" "197901"
        "08" "197901"
        "04" "197901"
        "01" "197901"
        "01" "197901"
        "05" "197901"
        "03" "197901"
        "05" "197901"
        "11" "197901"
        "03" "197901"
        "04" "197901"
        "03" "197901"
        "04" "197901"
        "19" "197901"
        "11" "197901"
        "04" "197901"
        "11" "197901"
        "03" "197901"
        "03" "197901"
        end
        Stata is already telling you the problem with your egen statement. I here use egen, total() rather than egen, sum() because the latter (equivalent) function sum() is undocumented as of Stata 9. The type mismatch arises because Stata expects quotation marks around literal strings.


        Code:
        egen ERC1 = total(EventRootCode == "01"),by(MonthYear)
        You could get there without repeating similar statements for each code.

        Code:
        bysort MonthYear EventRootCode : gen freq = _N 
        separate freq, by(EventRootCode)
        followed if necessary by mvencode to map missings to zeros.
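
        As a minimal sketch of that last step, on a small made-up dataset (the five observations below are hypothetical, not taken from the question): with a string by() variable, -separate- numbers the new variables freq1, freq2, ... in sorted order of the distinct codes, and each is nonmissing only in observations carrying that code. The wildcard freq?* is shorthand chosen here and assumes no other variable names start with freq.

        ```stata
        * hypothetical toy data: two months, three codes
        clear
        input str2 EventRootCode str6 MonthYear
        "01" "197901"
        "01" "197901"
        "04" "197901"
        "01" "197902"
        "19" "197902"
        end

        bysort MonthYear EventRootCode : gen freq = _N
        separate freq, by(EventRootCode)

        * freq?* matches freq1, freq2, ... but not freq itself
        mvencode freq?*, mv(0)
        list, sepby(MonthYear)
        ```

        Note that a zero produced this way means only that the observation carries a different code. It does not mean that the code never occurred in that month, which is where this differs from the egen, total() result.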

        I see no reason to destring here, excellent command though that is.
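
        If a loop over codes is still wanted, one way to write it without destring might look like the sketch below. The toy data and the variable names ERC01, ERC04, ... are assumptions made for illustration, and the naming only works if every code is a legal suffix for a variable name (digits only, as in the example).

        ```stata
        * hypothetical toy data: two months, three codes
        clear
        input str2 EventRootCode str6 MonthYear
        "01" "197901"
        "01" "197901"
        "04" "197901"
        "01" "197902"
        "19" "197902"
        end

        * levelsof collects the distinct string values;
        * foreach hands them over one at a time without the quotes
        levelsof EventRootCode, local(codes)
        foreach c of local codes {
            * the quotes around `c' make this a string comparison
            egen ERC`c' = total(EventRootCode == "`c'"), by(MonthYear)
        }
        list, sepby(MonthYear)
        ```

        Each new variable holds the monthly count in every observation of that month, including zeros for months in which the code does not occur.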

        Incidentally, your monthly date variable is not fit for almost any Stata purpose. This has arisen several times recently, so I wrote up a Tip for the Stata Journal, which is in press.

        On Statalist and elsewhere people sometimes try to work with monthly
        date variables with values like 195201 or 201805. You get the idea:
        195201 is January 1952 and 201805 is May 2018. The advantages of such a
        representation are twofold. People can quickly grasp the convention.
        Such dates sort correctly into chronological order. (For that to work,
        01 to 09, rather than 1 to 9, are essential codings for January to
        September.) Despite those advantages, such dates are useless for any
        other serious statistical or Stata purpose. The point of this Tip is to
        explain precisely why that is so and then what else you should do.

        There is no homunculus or other intelligence inside Stata that sees such
        a variable and thinks "Oh! A monthly date". Nor will trying to apply a
        monthly date format help. For more on why changing the display format
        does not work here, see Cox (2012). People trying that typically realise
        that they need something else. People not trying that often get stuck.

        To see the main problem, focus on what happens at the turn of each
        calendar year, say as 201712 (December 2017) turns into 201801 (January
        2018). You understood in reading that example that 201801 follows 201712
        immediately and that there were no months that would be 201713 to
        201799. On that point, and many others, you are better informed and
        smarter than Stata. Stata can see only a numeric gap of 89, as 201801
        - 201712 = 89. Hence such dates show, through any 12 month period,
        11 steps or gaps of 1 as you proceed from January to December and 1 big
        step or gap of 89 as you proceed from December to January. That big gap
        will mess up almost anything graphical or statistical using the date
        variable. In particular, graphs will look crazy and -tsset- or -xtset-
        in terms of the time variable will also flag (but not correct
        for) uneven spacing of your data, which will mess up any modeling or
        calculation using lags or leads that depends on either setting.
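
        The arithmetic can be checked directly. This small sketch compares
        the gap between the run-together values with the gap between the
        corresponding Stata monthly dates as returned by -ym()-:

        ```stata
        * run-together values: December 2017 to January 2018
        display 201801 - 201712
        * shows 89

        * Stata monthly dates for the same two months
        display ym(2018, 1) - ym(2017, 12)
        * shows 1
        ```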

        There are various ways to map such variables to Stata monthly dates as
        usually understood. Such dates have origin (that is, are 0) at January
        1960. Typically a monthly date format is applied so that people see
        monthly dates they can understand. Let's imagine a toy dataset with
        those two run-together monthly dates as values.

        Code:
        . clear 
        . input bad_month_date
        195201
        201805
        end
        The best way to process such a numeric variable is to split values into
        year and month components and then feed those as arguments to
        -ym()-, a function that expects numeric year and month
        values. The year is the value divided by 100 and rounded down and the
        month is the remainder on dividing by 100:

        Code:
        . gen good_month_date = ym(floor(bad_month_date/100), mod(bad_month_date, 100))
        
        . format good_month_date %tm
        
        . list 
        
             +---------------------+
             | bad_mo~e   good_m~e |
             |---------------------|
          1. |   195201     1952m1 |
          2. |   201805     2018m5 |
             +---------------------+
        Here we just applied the default monthly date format. You can apply many
        other formats, but see the help for -datetime display formats- for
        the details. For more on -floor()- or -mod()- if you want it,
        see their help files or Cox (2003, 2007).

        There are other ways to do that conversion. You might have thought of
        converting the date to string, extracting year and month components using
        -substr()- and then passing the results back through -real()-
        and then -ym()-. Those type conversions back and forth can be
        avoided, as just explained, by keeping the conversion as an entirely
        numeric operation. Note that at the time of writing (Stata 15.1) the
        nested function call -monthly(string(), "YM")- is not yet smart
        enough to parse run-together monthly dates.
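
        When the run-together dates are already held as a string variable,
        as with the str6 MonthYear in the question, the string route needs
        no conversion to string first. A minimal sketch, with three made-up
        values and a hypothetical new variable name mdate:

        ```stata
        * hypothetical toy data spanning a turn of year
        clear
        input str6 MonthYear
        "197901"
        "197912"
        "198001"
        end

        * characters 1-4 are the year, characters 5-6 the month
        gen mdate = ym(real(substr(MonthYear, 1, 4)), real(substr(MonthYear, 5, 2)))
        format mdate %tm
        list
        ```

        December 1979 and January 1980 now differ by 1, as they should.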

        Let's complete the circle and imagine that for some purpose, bizarre or
        otherwise, you need to export monthly dates in the form 195201 or 201805. A
        good reason for that would be if some other software understands (or
        even requires) such a format. If you have a variable like
        -bad_month_date- you already have what you need. If you have a variable
        like -good_month_date-, then push it through
        -string(good_month_date, "%tmCYN")-.
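
        A round trip on the toy dataset can serve as a check. This sketch
        is an illustration under one stated substitution: it uses the
        spelled-out codes CC, YY and NN listed in the help for -datetime
        display formats- rather than the shorter form above. The variable
        name exported is hypothetical.

        ```stata
        clear
        input bad_month_date
        195201
        201805
        end

        gen good_month_date = ym(floor(bad_month_date/100), mod(bad_month_date, 100))
        format good_month_date %tm

        * CC = century, YY = two-digit year, NN = two-digit month
        gen str6 exported = string(good_month_date, "%tmCCYYNN")

        * stops with an error if the round trip does not reproduce the input
        assert exported == string(bad_month_date, "%6.0f")
        list
        ```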

        It follows from this that daily dates that have the form 20180531 have
        the same advantages and disadvantages. In practice people seem to
        realise more quickly that they need to convert such dates to daily
        dates as understood by Stata. The example

        Code:
        . display %td daily("20180531", "YMD")
        31may2018
        shows that first conversion to string and then feeding the result to
        -daily()- will work fine. You need to follow with assignment of a
        daily date format.
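
        Applied to a string variable like the str8 date in the question,
        that might look as follows. The three values and the new variable
        names ddate and mdate are made up for illustration; -mofd()- then
        gives the monthly date of each daily date, which is another way to
        get a usable monthly variable.

        ```stata
        * hypothetical toy data
        clear
        input str8 date
        "19790101"
        "19790102"
        "19790201"
        end

        * string to Stata daily date, then assign a daily date format
        gen ddate = daily(date, "YMD")
        format ddate %td

        * monthly date containing each daily date
        gen mdate = mofd(ddate)
        format mdate %tm
        list
        ```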

        Cox, N. J. 2003.
        Stata tip 2: Building with floors and ceilings.
        Stata Journal 3: 446--447.

        Cox, N. J. 2007.
        Stata tip 43: Remainders, selections, sequences, extractions: Uses of
        the modulus.
        Stata Journal 7: 143--145.

        Cox, N. J. 2012.
        Stata tip 113: Changing a variable's format: What it does and does not
        mean.
        Stata Journal 12: 761--764.



        • #5
          Oh, and your error:
          Code:
          . destring EventRootCode,replace
          EventRootCode contains nonnumeric characters; no replace
          is most likely due to some values containing, e.g., "N/A" or some other text, which you should investigate before using destring.
          If these can simply be set to missing, use destring with either the generate(newvarname) force or the replace force options.
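
          A minimal sketch of that investigation, on made-up values (the "N/A" entry and the new variable name ERC_num are hypothetical): real() returns missing for anything it cannot read as a number, so tabulating on that condition shows the offending values before anything is forced.

          ```stata
          * hypothetical toy data with one nonnumeric value
          clear
          input str3 EventRootCode
          "01"
          "04"
          "N/A"
          "19"
          end

          * show the values that cannot be read as numbers
          tabulate EventRootCode if missing(real(EventRootCode))

          * keep the original; nonnumeric values become missing in the new variable
          destring EventRootCode, generate(ERC_num) force
          list
          ```

          Using generate() rather than replace keeps the original strings available for checking. Note that the numeric version drops leading zeros, so "01" becomes 1.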

          Edit: was typing this up as Nick was also working on it.
          Last edited by Jorrit Gosens; 15 Jun 2018, 06:36.



          • #6
            Marcos Almeida, thank you for the support.

            Jorrit Gosens, thank you for the support. Thank you for the idea of a loop, which will come in especially handy in later stages of my research, and it is good that you clarified the dataex command.

            Nick Cox, thank you. The code
            bysort MonthYear EventRootCode : gen freq = _N
            separate freq, by(EventRootCode)
            was particularly handy and does the job too!

            Thank you also for the refresher on monthly date variables, which was nice to read. For other readers: the Stata tips are full of nuggets of wisdom and also motivate you to think about the fundamentals.

            Lastly, to add to Nick's comments, I found the YouTube channel on formatting and managing dates and other time-series tasks handy.

            YouTube URL: https://www.youtube.com/watch?v=SOQvXICIRNY

