Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Question on stata post about "expanding datasets to all possible pairs"

    Dear all,

    I have a dataset with firms and individuals that are connected with these firms plus several descriptive variables. My original dataset is as big as 1,700,000 observations with 405,000 firms but I first try the code on a small sample of 30-40 obs. What I want is for each firm to construct all possible pairs of individuals that are connected with this firm (think of it as business network). I read on Stata site the post on title "How do I produce a dataset based on all possible pairs of identifiers within each group " (link posted) and seems that does exactly that.
    My issue is that the value of the variable constructed (id2 on the post, ID_pair to me) is slightly changed than the original (id or ID respectively). (e.g. instead of 81427461, ID_pair takes the value 81427464)

    Here is a data sample:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str10 DM_id_number long(ID firm_id) float(gfreq numid2 ID_pair)
    "P081427461"  81427461 1 6 1  39590112
    "P081427461"  81427461 1 6 2  74116440
    "P081427461"  81427461 1 6 3  81427464
    "P081427461"  81427461 1 6 4 206375584
    "P081427461"  81427461 1 6 5 223103504
    "P081427461"  81427461 1 6 6 247809792
    "P223103507" 223103507 1 6 1  39590112
    "P223103507" 223103507 1 6 2  74116440
    "P223103507" 223103507 1 6 3  81427464
    "P223103507" 223103507 1 6 4 206375584
    "P223103507" 223103507 1 6 5 223103504
    "P223103507" 223103507 1 6 6 247809792
    "P247809795" 247809795 1 6 1  39590112
    "P247809795" 247809795 1 6 2  74116440
    "P247809795" 247809795 1 6 3  81427464
    "P247809795" 247809795 1 6 4 206375584
    "P247809795" 247809795 1 6 5 223103504
    "P247809795" 247809795 1 6 6 247809792
    end
    label values firm_id firm_id
    label def firm_id 1 "DE2010198197", modify

    and here is my code:

    Code:
    sort firm_id
    by firm_id: gen gfreq = _N
    expand gfreq
    sort firm_id ID
    by firm_id ID: gen numid2=_n
    by firm_id: gen ID_pair = ID[gfreq*numid2]
    drop if ID == ID_pair
    drop if ID > ID_pair

    So, the questions are:
    - Does someone know why and how can be fixed?
    - Can this work with factor variables as well? (I tried but received missing values mostly)

    I need them because once I have constructed the person-to-person pairs I want to aggregate them to firm-to-firm pairs and match the postcodes for each firm. Here should mention that I also use the same code to construct directly the firm-to-firm pairs but need also to do it the other way because firm-to-firm pairs consider also the firms owned by the same person and I want to avoid that.

    Thanks in advance and for maintaining this forum that is super useful.
    How do I produce a dataset based on all possible pairs of identifiers within each group?
    Last edited by Panagiotis Vorgias; 11 Apr 2019, 03:14.

  • #2
    Hello again,

    I am coming back because I think the issue is confined to the generate command. I tried to simply generate the new variable like
    Code:
    gen ID2=ID
    and it keeps giving the same changed values. Thinking that maybe is something related to the type of the variable ID I went to excel and manually inserted the values, saved it and imported it to Stata. The result was the same. I also tried it to a different machine but in vain.
    It seems that the issue is only for big enough numbers. Up to 7-digit numbers it gives the right result.
    So, I thought to tell in case it helps someone to understand it.

    Thank you.

    Comment


    • #3
      Use clonevar instead of generate.

      Comment


      • #4
        Dear Nick, thank you so much for the help. I tried it but it returns me two errors: "clonevar may not be combined with by" and once I try without the "by" argument it returns "weights not allowed".
        I also tried (with the generate command) changing the type of the variable from double to long and it worked! However, it returns some missing values that I didn't understand the reason. Finally, I tried to do the same code but with variable as string and worked perfectly this time! I did some manual random checks and I have all the possible pairs.
        (with recast command to convert from double to long also didn't work).

        So, I thought to write this more or less arbitrary fix, in case it is useful for someone else in future and perhaps somebody can explain why and how this worked.

        Thanks again.
        Last edited by Panagiotis Vorgias; 22 Apr 2019, 12:25.

        Comment


        • #5
          ID variables are almost always better as strings. No issues with binary representations and exactness. I often stick a letter at the beginning of the string, like W so that any other program (like say Excel) don't try to read it as numeric if that is ever an issue.

          Comment


          • #6
            I don't really understand this, despite writing the original FAQ in 2005. I've forgotten the context long since and I don't remember the problem arising since. It seems that you want code not in the original FAQ. That's fine by me except that I don't understand why. What's clear is that with large integers you need a long to hold them exactly, or as Bryan Sayer suggests to switch to string identifiers.
            Last edited by Nick Cox; 24 Apr 2019, 02:10.

            Comment

            Working...
            X