Moving towards GitHub: a discussion about version control, package archiving, etc.

Anders Alexandersson

Join Date: Apr 2014

Posts: 203
#31

08 Nov 2016, 07:44

haghish wrote:

any suggestion what else can be added to github command?

How do you easily uninstall Stata packages from GitHub? It seems that there is no github uninstall syntax or similar. That would be useful to have.
Sergio Correia

Join Date: Apr 2014

Posts: 420
#32

08 Nov 2016, 08:16

A few comments:
1. In the last year, I received 2-3 questions about reghdfe where keeping previous versions on github helped: someone writes a paper and submits it, then the referee sends it back asking for extra columns, but in the meantime I changed the code and now the *previous* columns have different t-statistics due to e.g. a better DoF algorithm. In those cases the authors were quite concerned, but pointing them to the previous versions helped in case the referees asked about the differences. That said, I think versioning is mostly useful for regression commands (ivreg2, rd, etc.) and not so much for data transformation commands.

2. This is a quick-and-dirty way of finding Stata projects on github: https://github.com/search?p=2&q=stat...utf8=%E2%9C%93 (basically, it searches for a .toc file)

3. I can see the repos of Michael Stepner, George Vega, Matthieu Gomez, Brian Quistorff, Thomas Grund, WillB, Haghish, etc. Interestingly, almost all of them are current grad students (I was one until May), so some of the differences might be generational.

4. I know of one project with multiple authors: https://github.com/gvegayon/parallel...s/contributors But in most cases there is only one author, and this is true of open source in general.

5. I also treat SSC as the stable release, and the github versions as the "dev" releases.

6. I use github a lot, both for public and private projects, but I definitely agree with the feeling that the mental model behind using git is quite confusing (that's why I use github desktop). Others also agree: [1] [2] [3]

7. About Haghish's command itself, I already shared some thoughts with him here (on github, coincidentally). I think this package is *very* useful, but i) it should be more general (wrap up github as well as net from and ssc), ii) it should be more robust to corruption of trk files, and iii) dependencies should be handled in a more robust way (what happens if I have two packages in the same github folder? Then I can't have two dependency.do files).
haghish

Join Date: Aug 2014

Posts: 201
#33

08 Nov 2016, 09:25

Anders Alexandersson

Using ado uninstall will do!

Sergio Correia
I agree with all of the above. The Stata.toc trick would also be a quick shortcut! I could try that. Thanks for that.

Regarding the 7th point, there are many technical questions about how github install should work. For example, what if the name of the .pkg file is not identical to the name of the repository? This is already solved. But if you put your packages in separate directories within a single repository, things will get messy. Every archive that GitHub makes will cover all of the script files in the repository, so in my opinion this would generally be bad practice... You won't be able to release a new version of one package only, and GitHub will treat all of the files as a single piece of software.

Again regarding the 7th point: basically, the "force" option of the net command can mess up the trk file. I used the force option "by default" for installing all of my packages and soon realized that I could not uninstall them... I believe that shouldn't happen with github install. Regarding dependencies, the best I could come up with was a do-file that is executed and, ideally, installs the required software at a particular version. But I am open to any suggestion to improve this procedure. If your concern is only the "dependency.do" file, I can alternatively read the information for installing the dependencies from the name.pkg file. But I fear that is not a good idea, because it makes writing the dependencies more complicated, and the usual SSC tradition is that only the required packages are named in the pkg file...

Could you explain what you mean by making it more general?
Sergio Correia

Join Date: Apr 2014

Posts: 420
#34

08 Nov 2016, 10:04

One thing most users dislike is having to use multiple packages for the same thing:

Install from SSC: ssc install foobar
Remove it: ado uninstall foobar
Install from github: net from ... (or github ...)

It doesn't make sense to use different commands for something like this. Further, under no circumstances should the trk files get corrupted (which is why every time I want to try out a github install instead of a SSC install I have to uninstall first).

Instead, picture this:

Code:

pacman install foobar // defaults to SSC
pacman install foobar, stable // same
pacman install foobar, from(ssc) // same
pacman install foobar, from("https://github.com/someuser/somerepo")
pacman install foobar, dev // Github
pacman install foobar, from(github) // same (this assumes there is an index somewhere that matches packages to repos)
pacman install foobar, version(3.2) // This loads a specific version from a github branch or release
pacman uninstall foobar // calls ado uninstall

(pacman and foobar are random names, but my point is that this feels more transparent and easy to learn)

Now, the above is just an idea, but my point is that a wrapper would be very useful and is something I could start recommending to everyone.
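To make the wrapper idea concrete, here is a minimal sketch of how such a command could dispatch to the existing tools. The name pacman is only the placeholder used above, and only the install/uninstall subcommands and the from() option are covered; the dev, stable and version() options and any package index are left out. It also assumes that from() is given a location that net install can read directly.

```stata
* Hypothetical wrapper; "pacman" is only the placeholder name from the post.
capture program drop pacman
program define pacman
    version 13

    * First word is the subcommand; the rest is parsed by -syntax-
    gettoken subcmd 0 : 0
    if !inlist("`subcmd'", "install", "uninstall") {
        display as error "pacman: unknown subcommand `subcmd'"
        exit 198
    }
    syntax anything(name=pkg) [, FROM(string) REPLACE]

    if "`subcmd'" == "uninstall" {
        * Same command regardless of where the package came from
        ado uninstall `pkg'
    }
    else if inlist("`from'", "", "ssc") {
        * Default source is SSC
        ssc install `pkg', `replace'
    }
    else {
        * Any other source is handed to -net install- unchanged
        net install `pkg', from("`from'") `replace'
    }
end
```

With this defined, pacman install foobar and pacman install foobar, from(ssc) both end up calling ssc install, and pacman uninstall foobar calls ado uninstall.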

---

About the "force", I still think that just uninstalling before installing is cleaner (and in any case you can have an "ensure" suboption that just checks if the ado exists and if not it installs it)
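A rough sketch of what such an "ensure" check could look like, written as a small standalone program. The name ensure is hypothetical, SSC is assumed as the source, and the check only works when the package name is also the name of a command that which can find.

```stata
* Hypothetical helper: install a package only if its command is missing.
capture program drop ensure
program define ensure
    version 13
    args pkg

    * -which- returns a nonzero code when the command cannot be found
    capture which `pkg'
    if _rc {
        ssc install `pkg'
    }
end
```

Calling it for a command that already exists does nothing, so it can be run repeatedly without touching the trk file.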

Best,
S
Anders Alexandersson

Join Date: Apr 2014

Posts: 203
#35

08 Nov 2016, 11:56

I agree with Sergio, and therefore retract my suggestion to add an uninstall option.
wbuchanan

Join Date: Mar 2014

Posts: 1362
#36

08 Nov 2016, 12:06

Sergio Correia & haghish,
There are other challenges as well. For example, how do you manage packages that have more files than the .pkg specification allows? The package for libhtml has to work around this by downloading/compiling the individual mata programs from the repository (a precompiled version of the library could also be used, but I'd rather give end users the source code and compile it for them so they have additional flexibility). Also, how do you manage the issues with the Windows file system? For example, if the JVM has already spun up and you're using one of the packages I've put together that uses some compiled Java code how do you uninstall the package? On Windows you'll get an error about the .jar being in use.

Also, I wish on a fairly regular basis that I was still in grad school. Even Jeff Pitblado (StataCorp) has some stuff on GitHub (Vim configuration files that are Stata specific).

In general, some more robust form of dependency management could be nice/useful. One of the challenges might be finding a continuous integration solution to test packages in an automated way for issues that could come up from dependency issues.
Sergio Correia

Join Date: Apr 2014

Posts: 420
#37

08 Nov 2016, 12:17

wbuchanan I also went with the download+compile route for my code.

For instance, the ftools package has an ftools compile subcommand that gets called if the mlib is not installed, if its version doesn't match the version stated in the ado, or if it has a different Stata version (I borrowed that from David Roodman's boottest).

In an ftools_dependencies.do file, I would then just do:

Code:

install moremata
ftools compile

One of the main reasons why I like a .do file instead of a requirements.txt file (like Python does) is that it allows me to execute arbitrary code, which might help in the JVM case you mention.
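As an illustration of the "arbitrary code" point, a dependencies do-file could contain checks like the ones below. This is only a sketch: the checks are made up for illustration and are not what ftools actually ships, and it assumes moremata is detected through its Mata library file lmoremata.mlib.

```stata
* Hypothetical dependencies do-file; the checks are illustrative only.

* moremata ships as a Mata library, so look for the .mlib file
* instead of asking -which- for an ado-file
capture findfile lmoremata.mlib
if _rc {
    ssc install moremata
}

* Arbitrary logic is possible, e.g. branching on the operating system
if "`c(os)'" == "Windows" {
    display as text "Note: a .jar already loaded by the JVM cannot be replaced until Stata restarts."
}
```

A plain list of package names, as in a requirements.txt, could not express either of these conditions.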

About the CI solution, I agree definitely, and I already use one for my Python projects. However, licencing is a problem with Stata, which is why I don't think it will happen.
wbuchanan

Join Date: Mar 2014

Posts: 1362
#38

09 Nov 2016, 02:43

Sergio Correia the issue with Java-based packages is that the classpath marks any installed .jar files as "in use" in the Windows OS. That ends up preventing the end user from deleting the file or making other modifications that would be saved. In Stata 14, once the JVM spins up it is active for the remainder of the session, so the only way to remove the file is to restart Stata and uninstall it then, or to delete it after closing Stata.

Nick Cox the other benefit is to prevent breaking changes in the syntax of user-written commands from affecting dependencies. For example, if in a future release of tuples you decided to get rid of some options to make the code base easier to maintain, it could cause breaks in the packages of other users that depend on tuples. Because there is no versioned system in place, the seemingly innocuous change that may not be related to a bug would require others to constantly track and test their code against any/all other releases. In that sense, this incentivizes duplication of effort or at least places it on a continuum of duplication of effort and being constantly vigilant of any changes that any other user programmers make to their source code to catch bugs earlier on.
I think everyone can agree that GitHub and other distributed VCS are not the exclusive solution for collaborative work, but using a versioned packaging system does provide more benefits than harms overall. For example, R's equivalent of the SSC - called CRAN - maintains a mirror of all the packages on GitHub. Why not take that type of approach? So a new version of the package gets pushed to SSC, and that new version is also recorded as a commit in some VCS, which is then mirrored elsewhere so that the prior version can be installed if need be? It doesn't need to be GitHub (although I think that would be the preference of folks advocating for this approach); it could use other tools or platforms like Subversion, BitBucket, Mercurial, CVS, GitLab, etc...

Personally, the biggest benefit that I've found from using a VCS is a combination of "saving my bacon" and creating a clear audit trail. When I was working on programming the accountability system for schools/districts in Mississippi I would commit any changes to the programs I was writing using Git. On a few occasions I introduced major bugs during the development process and was able to roll back to any previous version of the program needed to prevent disaster. It also provided a powerful mechanism that we were able to use to provide legislative research types with a full and complete change history of all the programs from their earliest stages until the final version used in production. Later, when I wanted to add features to the programs I could create a new branch and work on that branch to isolate and protect the version of the programs that were used in production. It definitely isn't the right tool for everyone, and using any VCS requires some form of behavioral change and learning that some people may/may not be comfortable with.
haghish

Join Date: Aug 2014

Posts: 201
#39

09 Nov 2016, 04:00

wbuchanan it seems that is something to discuss with Stata technical support to make uninstalling Java programs more convenient. I personally never came across that problem.

Sergio Correia It looks good! Such a general command would be really terrific. However, we will need a complete database that includes the address of each package on GitHub (something I mentioned in #22), which requires some manual work or the authors' cooperation. For example, if you want to install a package from GitHub using only its name, you need to know the username and the repository, or retrieve them from somewhere...

Nevertheless, I agree that knowing where the latest version of a package is would be very helpful. That is something I can check after getting the release date of a package from SSC and comparing it with GitHub. Getting the version from SSC is rather difficult since authors have different ways of specifying the package version. Some write it as comments at the top of the script files and I, for example, rely only on GitHub for the package versions. What I am trying to say is that in practice there will be some difficulties in comparing a package on GitHub and SSC. I will have to give it some thought...
Nick Cox

Join Date: Mar 2014

Posts: 35698
#40

09 Nov 2016, 06:28

wbuchanan (indeed anyone): I am most reluctant ever to change the syntax of any program I release publicly. I sometimes do that when other changes break the syntax. I sometimes do that when I think I have a much better syntax.

But, frankly, there has to be an agreement that user-programmers are not under contract to users who download their stuff never to cause them small problems as a side-effect. Sometimes you might have to change your habits or your do-files if a program changes its syntax. This happens with Stata too, on occasion.
Sebastian Kripfganz

Join Date: May 2014

Posts: 2594
#41

09 Nov 2016, 06:41

Indeed, it would be nice if everybody would follow the same convention for specifying the version of their ado-files:

Code:

*! version x.x.x ddmmmyyyy

In addition, it would be even better if Stata could introduce an option version() to be specified with program define that could then be returned as a local macro by the which command, say:

Code:

program define foo, version(1.2.3)
    // ...
end

Code:

. which foo
version 1.2.3

. return list

macros:
          r(version) : "1.2.3"

This would make it easy to write version-dependent code.
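Until something like that exists, the *! convention can be read by a user-written program. The sketch below is one way to do it, not an existing Stata feature: adoversion is a hypothetical name, and it assumes the *! version line is the very first line of the ado-file.

```stata
* Hypothetical helper: return the version stated on the first line of an
* ado-file that follows the "*! version x.x.x ddmmmyyyy" convention.
capture program drop adoversion
program define adoversion, rclass
    version 13
    args cmd

    * Locate the ado-file along the adopath (errors out if not found)
    quietly findfile `cmd'.ado

    * Read only the first line of the file
    tempname fh
    file open `fh' using `"`r(fn)'"', read text
    file read `fh' line
    file close `fh'

    * Capture the digits-and-dots token that follows "*! version"
    if regexm(`"`macval(line)'"', "^\*! *version +([0-9.]+)") {
        return local version = regexs(1)
    }
end
```

After adoversion foo, the result sits in r(version) and is empty if the first line does not follow the convention. Because r() results are overwritten by the next r-class command, it is safest to copy the value into a local macro straight away.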

https://www.kripfganz.de/stata/
Nick Cox

Join Date: Mar 2014

Posts: 35698
#42

09 Nov 2016, 06:53

Sebastian: This is an interesting direction, but r(version) defined when and only when a program is defined is easily overwritten -- and it's not always going to be evident which program it came from. And any command emitting r-class results might mess up other programs using and/or producing r-class results. It might make more sense to let which emit something much more distinctive and much less damaging.
wbuchanan

Join Date: Mar 2014

Posts: 1362
#43

09 Nov 2016, 08:24

Nick Cox I completely agree with the use-at-your-own-risk philosophy. Versioning just alleviates some of the pain points. All I was trying to get at is that we could probably come up with a solution that works with the SSC archive and requires no modifications to nearly everyone's workflow. I can appreciate the challenge of changing a workflow, and I think there are some benefits to exploring ways to make everyone happy with minimal - if any - disruption to anyone's current method for doing things.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#44

09 Nov 2016, 08:47

SSC's philosophy is more or less Stata's philosophy.

1. Old Stata code can only be found for certain in old versions of Stata, but not necessarily otherwise.

2. New Stata code supersedes old Stata code.

These principles don't rule out others of lesser importance, including

a. Allowing old behaviour in new versions when that is, on some views, desirable.

b. Giving new names to newer versions of Stata programs so that old code remains accessible.

When things go wrong, it often seems to be that programmers are not banging hard enough on their programs before public release. Naturally I am not immune to bugs any more than anybody else, but it seems to me that more testing before publication would lessen the need for public versioning.

As said by some others too, I typically don't miss not being able to access uncorrected or superseded previous versions of a program.
haghish

Join Date: Aug 2014

Posts: 201
#45

09 Nov 2016, 10:52

That is quite right, Nick. But suppose an update to a Stata command, for example regress, changes the output or the calculated numbers. The argument remains valid, i.e. we will need access to the old and buggy version that the author used in order to replicate the "false" results and evaluate the analysis as it was done. Then we can repeat it with the new software and show what went wrong. Otherwise we will simply conclude that the analysis was not reproducible and that the authors' claims could not be reproduced from the data, which is not necessarily correct.

So it simply follows the same principle. Analysis should be tested using identical software. Why? To ensure that the claims can be reproduced using the same data, code, and software. What if the software was buggy? Then we can still show that the authors' claims were reproducible, and then show the results with the new updates to reveal what went wrong with the buggy software. On top of that, reproducibility was never meant to be proof of a sound analysis. Using the authors' data and code with identical software, along with a dynamic document package, you can technically reproduce the analysis section of an article. Then it is time to check whether the analysis is doing what it is supposed to do, etc... In other words, the minimum criterion should be that we can independently obtain what the author has claimed, and then we can begin the review.

That said, authors cannot be forced to publish open-source software (although some journals are trying to force that), release their analysis code and data, archive their software, and make all versions publicly accessible. And I don't think this can happen through an intensive discussion either. It's very hard to convince people with years of experience that using version control can have a lot of benefits. It's a process, and it seems we are slowly making this change.