Free site for storing data that will work with use command

Richard Williams

Join Date: Apr 2014

Posts: 4992
#1

30 Apr 2022, 09:08

I'm submitting a paper and including a listing of my Stata programs in an Appendix. The one problem is that the use command potentially gives away my identity. Is there any free site where I could put the data where a use command would work?

Alternatively, can you put data on Google Drive and make it available via a use command? If so, would it still be easy to identify who was providing the data?

As a sidelight, if I download a dataset for free from some site, am I potentially violating some law if I make it available on my website? Or, if the paper gets published and I am offered the opportunity to place my data on some depository, is it ok for me to do so?

I am super-big on replicability. I wouldn't make proprietary data available to everyone, but if it can be downloaded for free anyway I hope I wouldn't get anyone too upset. My paper provides the data citation and says where you can go to download the data yourself. But I want to make it as easy as possible for somebody to replicate my work if they so choose.

Perhaps it would be ok if I created an extract that contained only the variables needed for replication.

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#2

30 Apr 2022, 10:47

Hi Richard,

Two options that may preserve the anonymity of your accounts while sharing code and data: include a zip package (or individual files) as supplemental material with your article, or use OSF to the same effect. GitHub is very useful for this (and works with a verbatim -use- command), but the URL makes it obvious that the files come from your account. I doubt there is any solution that lets you include a verbatim -use- command and have it work without revealing your accounts. What I have commonly seen instead is for the code file to include a simple instruction telling the user to -cd- to the directory of downloaded files, after which everything in the script works with relative paths.
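
A minimal sketch of the -cd- plus relative paths approach, for illustration only. The path and both file names are hypothetical placeholders, not anything from Richard's project.

```stata
* Sketch of a replication do-file that reveals no account or machine names.
* The reader first changes to the folder holding the downloaded files, e.g.:
*     cd "path/to/downloaded/files"
* (hypothetical path; each user supplies their own)

clear all

* Relative paths from here on; both file names are hypothetical.
use "replication_extract.dta", clear
describe
do "analysis.do"
```

Because nothing in the file is an absolute path or a URL, the listing in the appendix says nothing about where the author keeps the data.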

As for the dataset, you may consider checking with the journal or authors about potential copyright issues in sharing an excerpt of the data. Many journals now require a statement of data sharing or otherwise have a policy regarding re-use/re-publication of materials. I don't imagine that it's an issue but you could simply reduce the dataset to just the variables needed as an easy compromise.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

30 Apr 2022, 11:07

To address your issue of author identifiability, you could put the data onto an external drive - say, a flash drive - plugged into your computer. Just don't name it "Soc73994".
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

30 Apr 2022, 12:06

As a sidelight, if I download a dataset for free from some site, am I potentially violating some law if I make it available on my website?

Maybe, maybe not. The website itself should let you know about any restrictions on the use of the data it is providing. They will often refer to this as terms for a license to use the data, which you should read so you know what you can and can't do with it.

The other thing you may need to think about is whether the data was acquired under the terms of some data-use agreement to which you or your institution are a party. If so, you need to adhere to whatever restrictions are set out there. Much of the data that I use in my research is restricted in this way. I sometimes find myself in a situation where I am perfectly happy to share my code with other researchers but am prohibited from providing them with crucial input data. It is my impression that as the scientific community has recently been striving to move towards transparency, more and more data sources are imposing more and more restrictions on the use of their data. :-(
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#5

30 Apr 2022, 12:11

I echo Leonardo Guizzetti's comments on GitHub, but if this were my problem, I'd likely still use GitHub and host the files under a pseudonymous account. I mean yeah, GitHub has access to your email and stuff, but presumably they can't know your true identity.

In fact, GitHub has helped me quite a lot (as a Ph.D. researcher, anyway), since it works directly with the -use- command, so I would likely go with that in the interest of transparency. Maybe I'm mistaken, but there's no reason my GitHub account has to be jgreathouse9, for example. I imagine I could just as easily rename it (or create a new one) to Sulla or Aemilius, if I so chose.
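
For illustration, a sketch of what a -use- command against a GitHub-hosted file might look like. USER, REPO, the branch name main, and the file name are all placeholders I am assuming here, not a real location. Note that -use- needs the raw file URL, not the address of the repository's web page.

```stata
* USER, REPO, "main", and the file name are placeholders, not real locations.
use "https://raw.githubusercontent.com/USER/REPO/main/replication_extract.dta", clear
describe
```

With a pseudonymous account, the only identifying piece of that URL would be the account name you chose.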

On the flash drive solution, this is how all my master do files look, with minor edits:

Code:

***********************************************************
version 17.0 /* Stata MP, 4 Core */
* Python 3.9.0
set more off
**** Change your working directories FIRST ****
gl MP "E:/Class/Prog Eval/SJ_scul/SCUL_Paper"
**** Change your working directories FIRST ****
***********************************************************
* Programmer: JAG // usually this is [REDACTED]
* Institution: Georgia State University
* Contact: [REDACTED]
* Created on : 2/19/2022
* Last Edited:
* Contents: I. Introduction
*              1. Purpose
*              1.1 Overview
*              1.2 Master Do files
*           4. Run do files
*              4.1: Clean Data
*              4.2: Perform Analysis
***********************************************************
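
A rough sketch of how the rest of a master do file might use that global, so that the single gl line is the only place a path appears. The project path, subfolder, and do-file names below are hypothetical and are not taken from the header above.

```stata
* Hypothetical project root; this is the one line a replicator edits.
global MP "path/to/project"

* Everything after this point is relative to the project root.
cd "$MP"

* Hypothetical do-file names, mirroring the "Clean Data" and
* "Perform Analysis" steps listed in the header.
do "Do/01_clean_data.do"
do "Do/02_perform_analysis.do"
```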

Many ways to skin a cat, I suppose. Personally, I'm big on transparency. Almost all the datasets I use in my work are public, although this can't always be the case. So I try to structure my projects so that they are as public as possible, too.

Clyde Schechter Yeah, different fields will be different. You're in epidemiology, right? So it makes sense that lots of your data will be private, especially given your specific research interests. Me, I'm a public policy person, so all I usually need is outcome data and some covariates, all of which can easily be obtained directly from within Stata or Python (if needed), and I proceed accordingly. It's the same with marketing research: when I've asked some folks in marketing for their data (for a synthetic control paper about Airbnb, I think it was), it quite naturally turned out that I would need agreements from Airbnb or other relevant parties, which unfortunately makes true transparency a challenge.

Last edited by Jared Greathouse; 30 Apr 2022, 12:16.
Richard Williams

Join Date: Apr 2014

Posts: 4992
#6

30 Apr 2022, 23:14

Thanks everyone! The study documentation does say I need written permission to share, but also expresses interest in sharing for replication purposes. What I've done for now is create a bare-bones extract with only the variables needed for replication. I've also included information on how to get the entire dataset and codebook. If the paper gets accepted I'll ask for permission to share the extract.
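
A bare-bones extract of this kind could be built along the following lines. This is only a sketch: the file names and the variable names (id, outcome, treat, age) are hypothetical stand-ins, not the actual study variables.

```stata
* Hypothetical file and variable names throughout.
use "full_study_data.dta", clear

* Keep only the variables the replication code actually uses.
keep id outcome treat age

* Shrink storage types where possible and label the file.
compress
label data "Replication extract: analysis variables only"

save "replication_extract.dta", replace
```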
