  • Co-occurrence matrix for patent technological classes

    Dear,

    I have a list of patents, each identified by an application id (appln_id). Each patent has a number of technological classes, the so-called IPC4 codes. I would like to calculate a simple co-occurrence matrix, i.e. how many patents have, for example, both class A01N and class A61P. Below I include part of my sample of patents with the corresponding IPC4 codes, as well as the resulting co-occurrence matrix based on that sample. It was calculated manually, and as I have to repeat the exercise for a much larger sample of 300K patents, I am looking for a more efficient way to tackle the task. Hence, I would like to kindly ask for any suggestions as to how (or whether it is possible at all) to create such a co-occurrence matrix in Stata.
    Sample
    appln_id ipc4
    335751077 A01N
    458497114 A01N
    497 A61K
    1204 A61K
    58708 A61K
    159561 A61K
    16525572 A61K
    16684626 A61K
    16906855 A61K
    17420428 A61K
    55216987 A61K
    266933230 A61K
    335751077 A61K
    405325474 A61K
    417635173 A61K
    458497114 A61K
    458497114 A61L
    58708 A61P
    159561 A61P
    16684626 A61P
    16906855 A61P
    17420428 A61P
    266933230 A61P
    335751077 A61P
    417635173 A61P
    497 A61Q
    16684626 C07C
    17420428 C07C
    17420428 C07D
    335751077 C07D
    458497114 C07H
    2 C07K
    72 C07K
    1204 C07K
    159561 C07K
    16906855 C07K
    17420428 C07K
    458497114 C07K
    497 C11D
    55217042 C12M
    2 C12N
    72 C12N
    1204 C12N
    32352 C12N
    159561 C12N
    386134 C12N
    16906855 C12N
    55217042 C12N
    405325474 C12N
    458497114 C12N
    2 C12P
    159561 C12P
    16906855 C12P
    458497114 C12P
    159561 C12Q
    55217042 C12Q
    417635364 C12Q
    2 C12R
    159561 C12R
    2 G01N
    159561 G01N
    55217042 G01N
    Co-occurrence matrix
    ipc4 A01N A61K A61L A61P A61Q C07C C07D C07H C07K C11D C12M C12N C12P C12Q C12R G01N
    A01N 0 2 1 1 0 0 1 1 1 0 0 1 1 0 0 0
    A61K 2 2 1 8 1 2 2 1 5 1 0 5 3 1 1 1
    A61L 1 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0
    A61P 1 8 0 0 0 2 2 0 3 0 0 2 2 1 1 1
    A61Q 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
    C07C 0 2 0 2 0 0 1 0 1 0 0 0 0 0 0 0
    C07D 1 2 0 2 0 1 0 0 1 0 0 0 0 0 0 0
    C07H 1 1 1 0 0 0 0 0 1 0 0 1 1 0 0 0
    C07K 1 5 1 3 0 1 1 1 0 0 0 6 4 1 2 2
    C11D 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
    C12M 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1
    C12N 1 5 1 2 0 0 0 1 6 0 1 2 4 2 2 3
    C12P 1 3 1 2 0 0 0 1 4 0 0 4 0 1 2 2
    C12Q 0 1 0 1 0 0 0 0 1 0 1 2 1 1 1 2
    C12R 0 1 0 1 0 0 0 0 2 0 0 2 2 1 0 2
    G01N 0 1 0 1 0 0 0 0 2 0 1 3 2 2 2 0
    Best,
    Marcelina
    Last edited by Marcelina Grabowska; 03 Mar 2020, 02:59.

  • #2
    As explained in the Statalist FAQ, please use -dataex- to display example data in the future. Making your example usable required some surgery (a fixed version appears below in case someone else wants to work with it). Tabs in such material do not display well onscreen, as you can see from your example above.

    One question first: your co-occurrence matrix shows classes co-occurring with themselves (e.g., A61K x A61K occurs twice). That confuses me. How can a classification co-occur with itself? Isn't a classification something that happens only once for a given application? Take the class A61K: it occurs 14 times in your example data, but never more than once for a given appln_id.
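
    The "never more than once per application" point can be checked directly. The following is only a minimal sketch on six rows copied from the example; the choice of rows is mine, and the same commands should apply unchanged to the full data.

    ```stata
    * A few rows taken from the example data (applications 497 and 1204).
    clear
    input str20 appln_id str4 ipc4
    "497"    "A61K"
    "497"    "A61Q"
    "497"    "C11D"
    "1204"   "A61K"
    "1204"   "C07K"
    "1204"   "C12N"
    end
    * Stops with an error if any (appln_id, ipc4) combination is repeated.
    isid appln_id ipc4
    * The same information as a table of how often each combination occurs.
    duplicates report appln_id ipc4
    * Number of applications per class. If no combination is repeated, a join of
    * the data with itself on appln_id puts exactly these counts on the diagonal.
    tabulate ipc4
    ```

    If -isid- fails on the full data, the duplicated combinations would need to be inspected (and probably dropped) before counting pairs.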

    Ignoring that difficulty for the moment, here's something that I think comes close to what you want. Your data appear first below; the code follows. You'll need to check the resulting matrix.

    I'd also note that your problem can be understood as a social network analysis, with classes as nodes linked by their "participation" in an application. Under that conceptualization, I suspect that the user-written -nwcommands- package (-search nwcommands-) can handle your problem faster and more nicely than my do-it-yourself solution below.
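
    To make the network view concrete without relying on any particular package: if B is the application-by-class incidence matrix (1 if the application carries the class, 0 otherwise), then B'B holds the co-occurrence counts, with the number of applications per class on the diagonal. The sketch below uses Mata and six rows from the example; the names pid, cid, B and C are my own, and building B densely is an assumption that may not suit 300K applications if the number of classes is large.

    ```stata
    * A few rows taken from the example data (applications 2, 72 and 32352).
    clear
    input str20 appln_id str4 ipc4
    "2"      "C07K"
    "2"      "C12N"
    "2"      "C12P"
    "72"     "C07K"
    "72"     "C12N"
    "32352"  "C12N"
    end
    * Integer codes 1, 2, ... for applications and classes (classes in alphabetical order).
    egen long pid = group(appln_id)
    egen long cid = group(ipc4)
    mata:
    pid = st_data(., "pid")
    cid = st_data(., "cid")
    // Incidence matrix: one row per application, one column per class.
    B = J(max(pid), max(cid), 0)
    for (i = 1; i <= rows(pid); i++) B[pid[i], cid[i]] = 1
    // cross(B, B) computes B'B, the class-by-class co-occurrence counts.
    C = cross(B, B)
    st_matrix("C", C)
    end
    matrix list C
    ```

    Rows and columns of C follow the order of cid, so here they correspond to C07K, C12N and C12P.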

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str20 appln_id str4 ipc4
    "335751077" "A01N"
    "458497114" "A01N"
    "497"       "A61K"
    "1204"      "A61K"
    "58708"     "A61K"
    "159561"    "A61K"
    "16525572"  "A61K"
    "16684626"  "A61K"
    "16906855"  "A61K"
    "17420428"  "A61K"
    "55216987"  "A61K"
    "266933230" "A61K"
    "335751077" "A61K"
    "405325474" "A61K"
    "417635173" "A61K"
    "458497114" "A61K"
    "458497114" "A61L"
    "58708"     "A61P"
    "159561"    "A61P"
    "16684626"  "A61P"
    "16906855"  "A61P"
    "17420428"  "A61P"
    "266933230" "A61P"
    "335751077" "A61P"
    "417635173" "A61P"
    "497"       "A61Q"
    "16684626"  "C07C"
    "17420428"  "C07C"
    "17420428"  "C07D"
    "335751077" "C07D"
    "458497114" "C07H"
    "2"         "C07K"
    "72"        "C07K"
    "1204"      "C07K"
    "159561"    "C07K"
    "16906855"  "C07K"
    "17420428"  "C07K"
    "458497114" "C07K"
    "497"       "C11D"
    "55217042"  "C12M"
    "2"         "C12N"
    "72"        "C12N"
    "1204"      "C12N"
    "32352"     "C12N"
    "159561"    "C12N"
    "386134"    "C12N"
    "16906855"  "C12N"
    "55217042"  "C12N"
    "405325474" "C12N"
    "458497114" "C12N"
    "2"         "C12P"
    "159561"    "C12P"
    "16906855"  "C12P"
    "458497114" "C12P"
    "159561"    "C12Q"
    "55217042"  "C12Q"
    "417635364" "C12Q"
    "2"         "C12R"
    "159561"    "C12R"
    "2"         "G01N"
    "159561"    "G01N"
    "55217042"  "G01N"
    end
    //
    // Make pairs of observations based on co-occurrence in an application.
    rename ipc4 ipc4_1
    preserve
    rename ipc4_1 ipc4_2
    tempfile temp
    save `temp'
    restore
    //
    // Pairs become network "edges"
    joinby appln_id using `temp'
    //
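    // Count pairs: f = number of applications in which both classes appear.
    // Each class is also paired with itself, so the diagonal holds the number of applications per class.
    bysort ipc4_1 ipc4_2: gen f = _N
    // Keep only one observation per pair
    by ipc4_1 ipc4_2: keep if _n == 1
    drop appln_id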
    //
    // Desired shape
    reshape wide f, i(ipc4_1) j(ipc4_2) string
    sort ipc4_1
    recode f* (. = 0)
    //
    // Stata matrix.  Mata might be better.
    mkmat f*, matrix(F) rownames(ipc4_1)
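
    If self-pairs are not meaningful for your purpose, one possible variant sets the diagonal to zero and then checks that the result is symmetric. This is only a sketch under that assumption (nothing in the question settles what the diagonal should hold), shown on six rows from the example so that the result can be verified by hand.

    ```stata
    * A few rows taken from the example data (applications 2, 72 and 32352).
    clear
    input str20 appln_id str4 ipc4
    "2"      "C07K"
    "2"      "C12N"
    "2"      "C12P"
    "72"     "C07K"
    "72"     "C12N"
    "32352"  "C12N"
    end
    * Pair every class with every class of the same application, as in the code above.
    rename ipc4 ipc4_1
    tempfile temp
    preserve
    rename ipc4_1 ipc4_2
    save `temp'
    restore
    joinby appln_id using `temp'
    bysort ipc4_1 ipc4_2: gen f = _N
    * Assumption: a class does not co-occur with itself. The count is set to zero
    * rather than dropping the rows, so every class keeps its row and column.
    replace f = 0 if ipc4_1 == ipc4_2
    by ipc4_1 ipc4_2: keep if _n == 1
    drop appln_id
    reshape wide f, i(ipc4_1) j(ipc4_2) string
    sort ipc4_1
    recode f* (. = 0)
    mkmat f*, matrix(F) rownames(ipc4_1)
    matrix list F
    * A co-occurrence matrix must equal its transpose.
    mata: assert(st_matrix("F") == st_matrix("F")')
    ```

    In this subset, C07K and C12N appear together in applications 2 and 72, so that cell is 2; C12P appears only in application 2, so its off-diagonal cells are 1.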
    Last edited by Mike Lacy; 03 Mar 2020, 08:42.
