All possible combinations of countries

Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#1

05 May 2021, 13:08

Hi all,

I have 55 countries and would like to perform some operations on each possible combination of them. Specifically, I would like to find the products that these countries have in common. For the sake of simplicity, I do not include the code for that part: it is long and tedious, and I should be fine with it. If it is needed, please let me know.
In any case, the idea is the following:

PART1: start with a country, say country1, append another country, say country2, and check whether the two have products in common and how many. If the number of products in common is above a certain threshold t, keep the products common to country1 and country2. To this new data add country3, check whether and how many products are common to all three, and if the number is again above the threshold, keep the data for country1, country2 and country3, and so on. I don't know if it works, but I was thinking of something like:

Code:

local countries "FRANCE"
levelsof Country, local(paesi)
foreach i of local paesi {
    local countries `countries' `i'
    foreach j of local countries {
        * perform code for checking if there are products in common
        if threshold > t {
            * continue --> ?????
        }
    }
}

but I am stuck at the if condition.
Furthermore, if possible, I would like to do it for every possible combination of the 55 countries...which is quite a time consuming task I guess.
Tags: combination, data, interaction
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

05 May 2021, 13:58

Every possible combination of 55 countries — taken 2-at-a-time, then 3-at-a-time, ..., then 54-at-a-time, then all 55 — will be 2⁵⁵-55-1 = 36,028,797,018,963,912.

Last edited by William Lisowski; 05 May 2021, 14:05.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#3

05 May 2021, 14:01

...which is quite a time consuming task I guess.

There are approximately 3.6 × 10¹⁶ subsets of 55 countries. This is simply not viable.
Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#4

06 May 2021, 00:52

Indeed. What if I reduce the number of countries to 10 maybe?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

06 May 2021, 09:29

Every combination of 55 countries — taken 2-at-a-time, then 3-at-a-time, ..., up to 10-at-a-time, will be 37,060,382,766.

Source: type

Code:

sum Binomial[55,i] for i = 2 to 10

into the input box at https://www.wolframalpha.com.
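The same counts can also be checked without leaving Stata. The following is only a sketch: it builds each binomial coefficient from the previous one using scalar arithmetic, and the scalar names c and total are illustrative, not taken from the thread.

```stata
* Number of subsets of size 2, 3, ..., 10 drawn from 55 countries.
* C(55,k) is built from C(55,k-1); every intermediate value is an
* integer small enough to be held exactly in double precision.
scalar c = 1          // C(55,0)
scalar total = 0
forvalues k = 1/10 {
    scalar c = c * (55 - `k' + 1) / `k'
    if `k' >= 2 scalar total = total + c
}
display %18.0fc total

* All subsets of size 2 or more from 55 countries, and from 10 countries.
display %24.0fc (2^55 - 55 - 1)
display (2^10 - 10 - 1)
```

The first two displayed values should agree with the totals quoted in #5 and #2; the last is the count that applies if the data set itself is cut down to 10 countries.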
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

06 May 2021, 09:43

With 10 countries you would be dealing with just over 1,000 possible subsets, so that's workable.

You don't post example data, and you have not shown your code for determining how many products are in common, so there is no information about where and in what form this information is created and organized. So I'll respond in general terms only. I think the solution here is to use a frame, and to post the results to the new frame each time. (Frames were introduced in Stata version 16. If you are using an older version, you can do the same thing, a bit less efficiently, using a -postfile- instead.) The variables in the new frame would include:
a variable indicating which set of countries the information refers to (possibly generated by -egen, concat()-), let's call it country_combination

an integer or long variable specifying the number of common products, let's call it n_common_products

another variable that specifies what those common products are, again possibly generated by -egen, concat()-, let's call it product_list

So, before you begin the loop you use the appropriate command to create the frame (or postfile):
Code:

frame create results str2000 country_combination long n_common_products strL product_list
Note: You may want to put other variables in as well, I'm just giving the skeleton of the approach here. Note also that if you are using a postfile, you cannot have a strL there and will be limited to str2045.

Then inside the loop, assuming that your threshold number of products is contained in a local macro called threshold, and the actual number of common products for the combination considered in the current iteration of the loop is in a local macro called n_common_products, the current combination of countries is in a local macro called country_combination, and the list of common products for the current iteration is in a local macro called common_products:

Code:

if `n_common_products' > `threshold' {
    frame post results (`"`country_combination'"') (`n_common_products') (`"`common_products'"')
}

Once you are done with the loop, the results will be in the data set in frame results. Change to that frame and -compress- the data set. (This will make it easier to work with the long string variables.) And go with it. If you were using a -postfile- rather than a frame, don't forget to close the postfile before you try to use it. You might also want to -split- the country_combination or product_list variables, depending on how you plan to work with that information later.

Assuming you are using Stata 16 or later, I urge you to read the documentation about the -frame- commands if you are not already familiar with them. They are amazingly useful, particularly for situations like this where you need to compile new information that is not readily compatible with the organization of the source data.
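For readers on a version before 16, the -postfile- route mentioned above might look roughly like the sketch below. It is not code from the thread: the handle and file names come from -tempname- and -tempfile-, the macro contents are made-up placeholders standing in for whatever the loop computes, and the product list is stored as str2045 because a strL is not used here.

```stata
* Sketch only: all macro contents below are hypothetical placeholders.
tempname handle
tempfile results
postfile `handle' str2000 country_combination long n_common_products ///
    str2045 product_list using `results'

* In real use these would be set inside the loop over combinations.
local threshold 1
local country_combination "ITALY GERMANY"
local common_products "ACARBOSE ACETYLCYSTEINE"
local n_common_products : word count `common_products'

* Post one row only when the combination clears the threshold.
if `n_common_products' > `threshold' {
    post `handle' (`"`country_combination'"') (`n_common_products') (`"`common_products'"')
}

* The file must be closed before it can be read.
postclose `handle'
use `results', clear
compress
list
```

Counting words works here only because the placeholder product names contain no spaces; names such as "ACETYLSALICYLIC ACID" would need a different way of counting.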
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

06 May 2021, 11:56

Posts #5 and #6 expose the ambiguity in post #4.

Post #4 asks

What if I reduce the number of countries to 10 maybe?

Post #5 assumed what was meant was

What if I reduce the maximum number of countries in a combination to 10 maybe?

Post #6 assumed what was meant was

What if I reduce the number of countries in my dataset to 10 maybe?
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

06 May 2021, 12:05

William Lisowski is right. In #6 I assumed that the total number of countries in the data set is reduced to 10. If you just limit the number of countries in a combination to 10 but retain all 55 countries in the data set, you are still looking at almost 4 × 10¹⁰ combinations, and that is beyond feasibility with any code unless you have a supercomputer and access to something like a petabyte of memory. Even if you did have such equipment at your disposal, it is hard to see how you would be able to digest the results.

Last edited by Clyde Schechter; 06 May 2021, 12:08.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#9

06 May 2021, 12:20

Maybe you can take a step back and explain what it is you want to do with these common items? Maybe there is another way to tackle the problem, such as starting from the individual items instead of countries?
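One way to read "starting from the individual items" is to pair up countries within each molecule, which gives the number of shared molecules for every pair of countries in one pass (55 countries make only 1,485 pairs). The sketch below assumes a single data set with one row per country and molecule; the toy values and the variable names Country2 and n_common are illustrative.

```stata
* Toy data (hypothetical): one row per country-molecule pair.
clear
input str10 Country str25 Molecule
"ITALY"   "ACARBOSE"
"ITALY"   "ACETYLCYSTEINE"
"ITALY"   "ACETYLSALICYLIC ACID"
"GERMANY" "ACARBOSE"
"GERMANY" "ACETYLCYSTEINE"
"ECUADOR" "ACARBOSE"
end
duplicates drop

* Pair every country with every other country selling the same molecule.
tempfile cm
save `cm'
rename Country Country2
joinby Molecule using `cm'

* Keep each unordered pair once, then count shared molecules per pair.
keep if Country < Country2
contract Country Country2, freq(n_common)
list
```

Pairs that fall short of the threshold can then be ruled out before any larger combination is considered.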

Federico Nutarelli

Join Date: Sep 2018
Posts: 430

#10

11 May 2021, 03:39

Leonardo Guizzetti thank you all for your useful replies and sorry for the delay.

First of all, I am unfortunately using Stata 13.

You don't post example data, and you have not shown your code for determining how many products are in common, so there is no information about where and in what form this information is created and organized.

Indeed, the point is that I don't know how many products are in common, nor which ones. Let's say I have just three countries: FRANCE, ITALY, GERMANY. Each database looks like this (the example is from Italy, but the structure is the same for each country):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 quarter float id_mole_ITALY str43 Molecule str25 Country_ITALY
"2008Q4" 1 "ACARBOSE"             "ITALY"
"2009Q1" 1 "ACARBOSE"             "ITALY"
"2009Q2" 1 "ACARBOSE"             "ITALY"
"2009Q3" 1 "ACARBOSE"             "ITALY"
"2009Q4" 1 "ACARBOSE"             "ITALY"
"2010Q1" 1 "ACARBOSE"             "ITALY"
"2010Q2" 1 "ACARBOSE"             "ITALY"
"2010Q3" 1 "ACARBOSE"             "ITALY"
"2010Q4" 1 "ACARBOSE"             "ITALY"
"2011Q1" 1 "ACARBOSE"             "ITALY"
"2011Q2" 1 "ACARBOSE"             "ITALY"
"2011Q3" 1 "ACARBOSE"             "ITALY"
"2011Q4" 1 "ACARBOSE"             "ITALY"
"2012Q1" 1 "ACARBOSE"             "ITALY"
"2012Q2" 1 "ACARBOSE"             "ITALY"
"2012Q3" 1 "ACARBOSE"             "ITALY"
"2012Q4" 1 "ACARBOSE"             "ITALY"
"2013Q1" 1 "ACARBOSE"             "ITALY"
"2013Q2" 1 "ACARBOSE"             "ITALY"
"2013Q3" 1 "ACARBOSE"             "ITALY"
"2013Q4" 1 "ACARBOSE"             "ITALY"
"2014Q1" 1 "ACARBOSE"             "ITALY"
"2014Q2" 1 "ACARBOSE"             "ITALY"
"2014Q3" 1 "ACARBOSE"             "ITALY"
"2014Q4" 1 "ACARBOSE"             "ITALY"
"2015Q1" 1 "ACARBOSE"             "ITALY"
"2015Q2" 1 "ACARBOSE"             "ITALY"
"2015Q3" 1 "ACARBOSE"             "ITALY"
"2015Q4" 1 "ACARBOSE"             "ITALY"
"2016Q1" 1 "ACARBOSE"             "ITALY"
"2016Q2" 1 "ACARBOSE"             "ITALY"
"2016Q3" 1 "ACARBOSE"             "ITALY"
"2016Q4" 1 "ACARBOSE"             "ITALY"
"2017Q1" 1 "ACARBOSE"             "ITALY"
"2017Q2" 1 "ACARBOSE"             "ITALY"
"2017Q3" 1 "ACARBOSE"             "ITALY"
"2017Q4" 1 "ACARBOSE"             "ITALY"
"2018Q1" 1 "ACARBOSE"             "ITALY"
"2018Q2" 1 "ACARBOSE"             "ITALY"
"2018Q3" 1 "ACARBOSE"             "ITALY"
"2018Q4" 1 "ACARBOSE"             "ITALY"
"2019Q1" 1 "ACARBOSE"             "ITALY"
"2019Q2" 1 "ACARBOSE"             "ITALY"
"2019Q3" 1 "ACARBOSE"             "ITALY"
"2019Q4" 1 "ACARBOSE"             "ITALY"
"2020Q1" 1 "ACARBOSE"             "ITALY"
"2020Q2" 1 "ACARBOSE"             "ITALY"
"2020Q3" 1 "ACARBOSE"             "ITALY"
"2008Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2009Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2009Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2009Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2009Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2010Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2010Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2010Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2010Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2011Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2011Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2011Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2011Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2012Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2012Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2012Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2012Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2013Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2013Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2013Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2013Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2014Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2014Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2014Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2014Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2015Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2015Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2015Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2015Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2016Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2016Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2016Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2016Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2017Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2017Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2017Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2017Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2018Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2018Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2018Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2018Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2019Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2019Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2019Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2019Q4" 2 "ACETYLCYSTEINE"       "ITALY"
"2020Q1" 2 "ACETYLCYSTEINE"       "ITALY"
"2020Q2" 2 "ACETYLCYSTEINE"       "ITALY"
"2020Q3" 2 "ACETYLCYSTEINE"       "ITALY"
"2008Q4" 3 "ACETYLSALICYLIC ACID" "ITALY"
"2009Q1" 3 "ACETYLSALICYLIC ACID" "ITALY"
"2009Q2" 3 "ACETYLSALICYLIC ACID" "ITALY"
"2009Q3" 3 "ACETYLSALICYLIC ACID" "ITALY"
end

That is, each database is a quarterly panel in which each Molecule (product) is repeated over time. Each country has a different pool of molecules, but some of them are in common. Now, to answer Leonardo:

Maybe you can take a step back and explain what it is you want to do with these common items? Maybe there is another way to tackle the problem, such as starting from the individual items instead of countries?

What I would like to do, starting from these "country databases", is to intersect the largest possible number of countries such that a pool of at least 100 molecules emerges. Take the example of 4 countries rather than 3, say ECUADOR, ITALY, GERMANY, CONGO. As you can see from here, I'd expect ECUADOR not to have enough molecules in common with ITALY and FRANCE, but to have enough in common with CONGO. This is just an example. I don't know a priori which countries have the most molecules in common. Now what I would like to do is to check how many molecules are in common between, say:
1) ECUADOR-ITALY. Let's say they have 30 molecules in common. This is not enough --> then discard and go on;
2) ITALY-GERMANY --> they have 102 molecules in common --> then form the appended dataset ITA_GER;
3) take ITA_GER and intersect it with CONGO (since we already know that the ECUADOR and ITALY intersection does not produce enough molecules in common) --> they have 20 molecules in common. Then discard.

And so on. So, to reply to William and Clyde in a little more depth, I don't think I'll need all the possible combinations of 10 countries, since I do not re-use the ones that I discard in the first instance. Does this make sense?
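The steps above can be sketched in code that should also run in Stata 13, since it uses only -merge- and temporary files. This is a rough illustration under stated assumptions, not a tested solution: the toy data, the starting country, the order in which countries are tried and the threshold of 2 (standing in for 100) are all made up, and because countries are added one at a time in a fixed order, the result depends on that order and need not be the largest possible set.

```stata
* Toy data (hypothetical): one row per country-molecule pair.
clear
input str10 Country str25 Molecule
"ITALY"   "ACARBOSE"
"ITALY"   "ACETYLCYSTEINE"
"ITALY"   "ACETYLSALICYLIC ACID"
"GERMANY" "ACARBOSE"
"GERMANY" "ACETYLCYSTEINE"
"ECUADOR" "ACARBOSE"
end

local threshold 2      // stands in for the 100-molecule cutoff
local kept "ITALY"     // starting country

* The starting country's molecules form the initial pool.
preserve
keep if Country == "ITALY"
keep Molecule
duplicates drop
tempfile pool
save `pool'
restore

* Try each remaining country; keep it only if the pool stays large enough.
foreach c in GERMANY ECUADOR {
    preserve
    keep if Country == "`c'"
    keep Molecule
    duplicates drop
    merge 1:1 Molecule using `pool', keep(match) nogenerate
    if _N >= `threshold' {
        save `pool', replace
        local kept `kept' `c'
    }
    restore
}

display "`kept'"
use `pool', clear
list
```

With real data, each country's quarterly panel would first be reduced to one row per Molecule, which is what -duplicates drop- does here.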
Please, do not hesitate to ask me for further details.

Thank you again!

Last edited by Federico Nutarelli; 11 May 2021, 03:42.


Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#11

11 May 2021, 04:17

Clyde Schechter I also tagged you in the previous post but the tag seems not to appear
Oscar Ozfidan

Join Date: Sep 2018

Posts: 257
#12

12 May 2021, 16:15

Hi Federico,
As you can see, the numbers are staggering. However, I can run 55 instances of Stata simultaneously on my setup. So I am open to lending you my computing resources for up to a day or two, if you can reduce your problem to something that fits in that time frame with 55 instances running concurrently.
