Moving towards GitHub: a discussion about version control, package archiving, etc.

haghish

Join Date: Aug 2014

Posts: 199
#1

04 Nov 2016, 06:43

Dear All

I have developed a new Stata package for installing Stata packages from GitHub. It supports:
installing previous versions (releases) of a package
installing package dependencies (with a particular version) for each of the released versions

The package is hosted on GitHub, and can be installed as follows:

Code:

net install github, replace from("https://raw.githubusercontent.com/haghish/github/master/")

Based on this package, I'm also writing a brief article about the benefits of using GitHub for archiving Stata projects. I would appreciate any comments or suggestions on how we can further improve the reproducibility of analyses that rely on user-written packages.

So far there have been several discussion threads about the problems SSC poses for reproducing research projects.
The first concern is package dependencies. SSC does not install a package's dependencies, so the user has to install all of the required packages manually. The github command installs package dependencies automatically.

For the sake of reproducibility, it is absolutely crucial to be able to install previous versions of a package, but SSC hosts only the latest version and does not archive earlier ones. GitHub solves this problem because a release can be created with a single mouse click. The github command can install any previous release from GitHub.

Another concern is dependencies between Stata ado packages. It is in everyone's interest to rely on other people's functions instead of spending time reinventing the wheel. However, software evolves over time, and the main question is how we can ensure that future updates of the dependencies will not cause trouble. The github command can also pin a particular version of each package dependency. For example, installing an older version of a package would also install the particular versions of the dependencies required by that version...
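
As a rough sketch of the workflow described above: the example below assumes that the github command offers install and query subcommands and a version() option. The repository name and version number are hypothetical placeholders, and the exact syntax should be checked with help github after installation.

```stata
* Sketch only: "someuser/somepackage" and "1.2.0" are hypothetical placeholders.

* List the releases that are available for a repository on GitHub
github query someuser/somepackage

* Install one specific release rather than the latest version.
* Per the description above, the dependencies recorded for that
* release would be installed along with it.
github install someuser/somepackage, version("1.2.0")
```

Recording the exact install line in a project's master do-file would let a reader rebuild the same software environment later.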

Do you have any other points or concerns in this regard? Any suggestions about what else could be added to the github command?

Last edited by haghish; 04 Nov 2016, 06:47.

——————————————
E. F. Haghish, IMBI, University of Freiburg
http://www.haghish.com/

2 likes
daniel klein

Join Date: Mar 2014

Posts: 3821
#2

04 Nov 2016, 08:13

I think the points you are making are in general valid and they are very important. I have a couple of related general concerns that I would like to share and discuss.

First, reproducibility, as desirable and necessary as it is as a basic concept in science, might have reasonable limits regarding software. I think we should not spend any effort on reproducing bugs. That is, if an older version produced wrong results, I would not want them to be reproduced. This is also the policy of StataCorp concerning version control, as far as I can tell. So, in my opinion, there should be some mechanism to make reproducing bugs as hard as possible, if not impossible. The very least that should be implemented is a warning that one is about to install an outdated piece of software if an old version is requested.

Second, you and many others may well disagree with me on the above. Who do you imagine gets to make the final call in such a situation? If I released software on GitHub that has a bug and I, as the author and probably copyright owner, decide to remove it from there, who is to say I cannot do this? This raises a serious problem that has little to do with the technical possibilities. Personally, I reinvent wheels all the time precisely because I do not trust others (and myself) not to change or remove their code. Naturally, your very welcome command cannot solve these issues.

Best
Daniel

Last edited by daniel klein; 04 Nov 2016, 08:21.
1 like
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#3

04 Nov 2016, 08:14

I would support this even if just to save Kit Baum the hassle that I'm sure the whole SSC business represents to him. I do have one question. One nice thing about the SSC framework is that you can keep an eye on how popular your commands are, e.g. I can use ssc hot, author(Wursten) to tell me that my timeit command has been downloaded 42 times, showing that I didn't fully waste my time in making it work. Is similar functionality possible with GitHub?
Sergio Correia

Join Date: Apr 2014

Posts: 420
#4

04 Nov 2016, 08:37

Originally posted by Jesse Wursten:

Is a similar functionality possible with github?

It's not (unless you count the internal "traffic stats" that only go back two weeks at a time). Of course, an intermediate tool like haghish's could keep track of that independently, but for things like "net from" you are a bit on your own.
haghish

Join Date: Aug 2014

Posts: 199
#5

04 Nov 2016, 08:46

daniel klein

I agree with both points. However, I'd add that version control is not merely about bugs. When a piece of software receives an update, it doesn't necessarily mean that a bug has been fixed. The update may add functionality, make the syntax friendlier, make the GUI more appealing, etc. So, long story short, for the sake of reproducibility, if we want to evaluate somebody's results, we must first and foremost test their code with the same software and versions they used. If a new version is released later and it becomes clear that the previous version had a particular bug or mistake, naturally that would be taken into account. But we must still have access to the version of the software that the author used. This is not optional; it is necessary.

But there is one point on which I disagree with you. It takes a single email to Kit Baum to remove your own package from SSC. I have done it before! So publishing on SSC does not take away the author's right to remove the package from SSC.

Last edited by haghish; 04 Nov 2016, 08:48.

haghish

Join Date: Aug 2014

Posts: 199
#6

04 Nov 2016, 09:04

Jesse Wursten
I could not agree more about saving Kit the trouble...

It is possible to get the traffic data from GitHub and add it to the github query command, which lists all of the previous releases of a package, or simply to write another subcommand.

The beauty of using GitHub is that you can fork the repository on GitHub and add this functionality yourself. This allows collaboration on the software. I personally do not think the download information is useful and instead tend to treat citation counts as the measure of a package's usefulness. But I would have no objection at all if somebody wrote a new subcommand that adds the traffic information.

Nick Cox

Join Date: Mar 2014

Posts: 35418
#7

04 Nov 2016, 09:24

I don't know much about GitHub. I had a look once and it seemed much more complicated than I wanted to learn and for what I want to do. I am very familiar with SSC, so I didn't want to spend time learning any stuff that didn't appeal in the first place. That's not a well-informed comment!

The ideal of being able to access all previous versions of something is better than being indifferent about version and reproducibility. But consider this example:

Today I announced an update to stripplot (SSC). The code starts

Code:

*! 2.5.2 NJC 23 September 2014
* 2.5.1 NJC 9 September 2014
* 2.5.0 NJC 14 August 2014
* 2.4.7 NJC 28 June 2012
* 2.4.6 NJC 30 August 2011
* 2.4.5 NJC 2 December 2010
* 2.4.4 NJC 10 March 2010
* 2.4.3 NJC 16 February 2010
* 2.4.2 NJC 4 February 2010
* 2.4.1 NJC 30 November 2009
* 2.4.0 NJC 21 April 2009
* 2.3.3 NJC 8 November 2007
* 2.3.2 NJC 2 November 2007
* 2.3.1 NJC 17 July 2007
* 2.3.0 NJC 21 June 2007
* 2.2.0 NJC 28 November 2005
* onewayplot 2.1.3 NJC 27 October 2004
* 2.1.2 NJC 11 August 2004
* 2.1.1 NJC 21 July 2004
* 2.1.0 NJC 13 February 2004
* 2.0.3 NJC 17 July 2003
* 2.0.2 NJC 7 July 2003
* 2.0.1 NJC 6 July 2003
* 2.0.0 NJC 3 July 2003
* 1.2.1 NJC 18 October 1999
* 1.1.0 NJC 27 April 1999
* 1.0.0 NJC 23 April 1999

I have to tell you that almost all of these versions are lost to history. I don't care about keeping them myself and mostly they aren't on SSC. If I were starting now and SSC didn't exist and it seemed a good idea to learn GitHub and really good habits, then I would consider the GitHub way. But in the 17.5 years of the history of this program, which started under another name, I don't recollect a single instance of anyone ever being bitten by the version thing or asking for a copy of a previous version.

That is a story, not an argument. But I have one more.

A few years ago, someone took a Stata program of mine, changed it, indeed in some ways improved it, but kept the same name, regardless of other versions in existence, and put it on GitHub, and then threw away the help file on the grounds that programmers can always just look at the code. That person then seemed surprised, indeed angry, when I reproached them for this, but did take the program down. That's a story of one person and one program, but I would be interested in what safeguards there are about protecting one's programs whenever authorship is perceived also as ownership.

More benignly put, there is a long tradition in the Stata community of always changing the program name if you adapt someone else's program, unless as I have done, the program is just given to someone else to maintain. I'd appreciate whether there is a GitHub angle on this, because my impression is that the philosophy is much more: Here's some code; Do improve it if you can.

The SSC philosophy is a mix of generous and possessive. Once my stuff is on SSC, you can download it freely, but you can't change the code on SSC yourself.

Perhaps most importantly, I should add that the Stata Technical Bulletin and Stata Journal have a 25 year history of archiving all versions ever published through them.
1 like
wbuchanan

Join Date: Mar 2014

Posts: 1361
#8

04 Nov 2016, 10:19

Nick Cox

GitHub handles the ownership/authorship issue fairly elegantly. For example, I started working on a project here in Kentucky that you can see in the GitHub repository located here. If you click through things a bit and look at the network graph, you can see that another user has "forked", or made a copy of, the repository under their account. When you view their copy of the source code here you'll see a little link under the title header that reads "forked from wbuchanan/kentuckyStateReportCards". So the location from which the source code was copied is permanently embedded in the other user's fork. There are ways to remove that history, but doing so would negatively impact some of the core functionality of Git and GitHub (e.g., automatically merging changes from a single author or multiple authors over time, etc...).
Robert Picard

Join Date: Mar 2014

Posts: 1536
#9

04 Nov 2016, 10:32

My current thinking on reproducibility is to install all user-written programs in a subdirectory called "ado" within the project directory and add this "ado" directory to the adopath. I even make sure that there are no conflicts with other versions installed in the standard way by including the following code in the master do-file:

Code:

* All user-written programs used in this project are located in the ado
* subdirectory. We make sure of this by removing the following system
* directory (these are restored at the end of the build).
adopath - PERSONAL
adopath - PLUS
adopath - OLDPLACE

The net effect is that the code will run as if this was a fresh install of Stata with no user-written program installed. Calls to user-written programs will generate an error unless there's a copy in the "ado" subdirectory.
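
A minimal sketch of the full pattern, including the restore step mentioned in the comment above. It assumes that the working directory is the project root and that the project's programs live in an "ado" subdirectory there; that path is an assumption for illustration only.

```stata
* Assumption: the current working directory is the project root
* and it contains an "ado" subdirectory with the project's programs.

* Put the project's own ado directory at the front of the search path
adopath ++ "`c(pwd)'/ado"

* Remove the usual locations for user-written programs
adopath - PERSONAL
adopath - PLUS
adopath - OLDPLACE

* ... run the project's do-files here ...

* Restore the removed directories, appended in their default order
adopath + PERSONAL
adopath + PLUS
adopath + OLDPLACE

* Drop the project directory from the search path again
adopath - "`c(pwd)'/ado"
```

Typing adopath with no arguments before and after the build is a quick way to confirm that the search path has been restored.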

I use project (from SSC) for all my projects. Each project is completely self-contained and portable. A project directory can be zipped and submitted to any journal in support of a published article and anyone who downloads the zip archive will be able to run the whole thing without any alterations whatsoever. With project, any user can do a replication build to confirm that the code generates exactly the same results as before. Everything is compared, datasets, log files, tables, figures, etc.

There are of course limits to reproducibility. Identical results are only guaranteed if run on exactly the same Stata version. It's a good idea to keep your old versions of Stata (e.g. rename your Stata directory to Stata13 before installing Stata 14).

The current beta of project automatically adds the "ado" subdirectory to the adopath and restores the previous adopath at the end of the build. The beta also includes a relax option that will be welcomed by anyone who uses project with big data. This option can be used to skip the checksum calculation on files over a certain size when checking for changes in dependencies.

For those interested, here's a zip archive of a demo project. The included "_READ_ME.txt" file explains how to install project using the version included in the "ado" subdirectory. The help file has not been updated yet to include descriptions of the new features. I'll eventually get my act together and post this new version to SSC. I'm holding off because I haven't quite decided on how to deal with dependencies within the "ado" subdirectory.
Nick Cox

Join Date: Mar 2014

Posts: 35418
#10

04 Nov 2016, 12:52

Billy: What you describe evidently works well for forking from GitHub parent to GitHub child, so to speak. I can't see that there is any protection when GitHub originals are clones of SSC or Stata Journal originals. In fact, it's hard to see that there could be, as GitHub can't know, presumably, about any other repository.

So, what I guess this boils down to is that there can't be safeguards here unless people behave responsibly. I'd certainly appreciate publicity for the difference in set-up between SSC and GitHub, to the effect that forking within GitHub preserves information on authorship but Stata programmers using GitHub should realise that program names should be changed if they base their programs on originals elsewhere.

In fact that's important for using any of these programs within Stata, which is the whole point, regardless of authorship or ownership.

Stata will allow user-written programs with the same name to be stored in different places, but it is not version checking which decides which will run; it's precedence along the adopath.
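
To see this precedence in practice, the following sketch uses standard Stata commands to check which copy of a program would run. It assumes stripplot is installed somewhere along the adopath; otherwise which and findfile will report an error.

```stata
* Show the current search path, in order of precedence
adopath

* Report which copy of the program Stata would actually run,
* together with any *! version lines at the top of that file
which stripplot

* List every copy of the file found along the adopath
findfile stripplot.ado, all
```

If findfile lists more than one copy, the first directory shown by adopath that contains the file is the one that wins, whatever the version lines say.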
wbuchanan

Join Date: Mar 2014

Posts: 1361
#11

04 Nov 2016, 13:03

Nick Cox

I think the GitHub model is a bit different from what you may be thinking about. One of its major purposes is to allow individuals to collaborate on a project using a toolset that facilitates the group work side of things. So in the case that you mentioned earlier, what would traditionally happen is that someone would first fork the repository. Then they would do some work independently, test things out, and submit a pull request. A pull request is basically analogous to emailing the original author to say that you have added a potentially useful feature and/or fixed a bug you identified and want to share it back. The original owner then has the opportunity to merge the pull request (accept the proposed change) into the project's repository.

Your point about awareness of source related to SSC and/or Stata Journal is definitely valid and would require some care/forethought before any systemic migration. It would take a bit of work, but there should be some way of setting up some kind of a "listener" that would grab changes from SSC/SJ and merge them with a GitHub repository.
Nick Cox

Join Date: Mar 2014

Posts: 35418
#12

04 Nov 2016, 13:20

I don't think I am asking for that. I just want a code of practice to keep distinct versions of programs utterly distinct.
1 like
Robert Picard

Join Date: Mar 2014

Posts: 1536
#13

04 Nov 2016, 14:19

Perhaps I'm old fashioned but I don't get Git. I don't usually collaborate with others on my projects and I'm perfectly happy with my workflow (including archiving) so Git has no appeal to me.

I've put a lot of work into some of my Stata programs and I don't particularly like the idea that anyone can fly in and "fork" my work and offer it to others as an improvement. I do not want my programs moved, aliased, or copied to any Git repositories. I do not want older versions to linger there either. I'm perfectly happy with users installing my programs from SSC and using them as they see fit on their computers, and I do not mind if they include a copy if they eventually publish their research project.

Just in case I did not make it clear enough in my previous post, I do not find the reproducibility and version control features of Git compelling as a delivery mechanism for user-written Stata programs. Having said this, I don't understand why we can't have both SSC and Git installation vectors.
3 likes
Nick Cox

Join Date: Mar 2014

Posts: 35418
#14

04 Nov 2016, 16:04

I agree with Robert, although contrary to his general pattern we have collaborated on some programs, chiefly my making small suggestions on the margins of his excellent ideas.

I'm reminded of what is a quite different matter, but there are some echoes: which text editor one uses. I have come across numerous different positions on this but find that my own personal preference for Vim is arbitrary but real and in no sense in competition or conflict with anybody else's.
haghish

Join Date: Aug 2014

Posts: 199
#15

06 Nov 2016, 08:59

Nick Cox

Nick, you raise several interesting points.
I personally did not put the previous versions of my packages (which I always keep in a separate zip file after submitting them to SSC) on GitHub, simply because it was too much work...

Your experience with somebody technically "stealing" your code is unfortunately unavoidable as long as the source is open and public. A project can be forked or cloned on GitHub, but it can also simply be downloaded! But we should also ask who actually steals code without giving any credit. Mostly learners. I don't think that should stop us from open-access sharing, because the benefit is so much greater than the potential cost. Besides, I often hear that "It's better to see the code stolen than simply ignored!"

Regarding the tradition of renaming others' packages after making changes, I know what we should blame! So far, there has been no way to collaborate on a package. If, for example, I wanted to add some functionality to a package of yours, I could email you and ask you to kindly program it, or program it myself and email it to you to see whether you are interested in including it in an update, or publish it myself. GitHub makes collaboration easy and I think it will solve many of the difficulties that Stata programmers experience.

The thing is that collaboration is vital. For example, in my next job I might be using another statistical package, which would make updating my own packages very costly for me. Encouraging the community to collaborate with one another will increase the lifetime of user-written packages...
