Moving towards GitHub: a discussion about version control, package archiving, etc.

Jesse Wursten

Join Date: Jan 2016

Posts: 915
#46

09 Nov 2016, 11:11

I personally feel Github solves an issue that doesn't really exist. In half a decade of intensive Stata use, I have yet to encounter the issue you mention, i.e. that I cannot reproduce certain results or that certain programs stop working due to a changed dependency.

That said, I think it is great if people publish their work through GitHub. I remember raising some issues on Sergio Correia's reghdfe command on GitHub, which were dealt with in a very clear and efficient manner. I do think this was easier than emailing the author, which is what it would have required if the command were only available on SSC.

I just do not feel it is worth it for most authors. I tried it this afternoon, but was quickly discouraged by the technicalities one must satisfy before it actually works. I uploaded the .ado and .sthlp files to the GitHub repository and tried to install them using the github command. This raised an error. I then tried using the net install command. This too raised an error. The explanation was straightforward after some searching: I had created neither a .pkg nor a .toc file. I've never in my life written one. I don't even know what they have to contain (I did start adapting the github.pkg file to mine, but eventually found it to be too much effort). If I submit to SSC, then Kit Baum's machinery takes care of all that and ensures that at least in that regard no errors occur.
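
For readers who have likewise never written these files, here is a minimal sketch of what the two files usually contain. The package name mycmd, the descriptions, and the date are hypothetical; the line prefixes (v, d, p, f) and the * comment lines follow the format described under help usersite. The two files are shown together here but are saved separately.

Code:

```text
* ---- file: stata.toc (table of contents of the site or repository) ----
v 3
d Hypothetical author: Stata materials
p mycmd one-line description of mycmd

* ---- file: mycmd.pkg (description and file list of one package) ----
v 3
d mycmd: one-line description of mycmd
d Distribution-Date: 20161109
f mycmd.ado
f mycmd.sthlp
```

Both files sit next to the .ado and .sthlp files. As far as I understand, net from reads stata.toc to list the available packages, and net install reads mycmd.pkg to learn which files to copy.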

On the other hand, after emailing the updated version to Kit, I discovered another bug. Fixing this meant sending another mail to Kit, which is less than optimal. On GitHub this would've been solved faster. Additionally, on GitHub the new version is available immediately, whereas Kit Baum understandably does not work on this 24/7.

The bottom line is that I think submitting to SSC remains the most convenient option for most programmers. For those working on more elaborate commands, the start-up costs of GitHub might be acceptable, especially if they are working on multiple projects (spreading the sunk cost of learning how to work with GitHub). Hence, I find them to be very complementary platforms and fail to see why we need to choose one or the other.
Nick Cox

Join Date: Mar 2014

Posts: 35417
#47

09 Nov 2016, 11:27

In principle, I am very positive about reproducible research. Like motherhood and apple pie, its virtues need no boost from me.

In practice, life is too short to spend more than a short time trying to find out why other people's results are not reproducible. Crudely put, it's really their problem not mine! Often such problems arise when researchers never documented exactly what they did anyway, so having access to numerous different versions of what they might have used is not as helpful as it might seem.

The pendulum is already swinging too far in many quarters. Soon, some sub-fields will be so obsessed with checking others' work that attempting original research will stop being first priority.

I have to say that I don't think the intransigence of experienced users is the major problem here, but I would say that.
Sergio Correia

Join Date: Apr 2014

Posts: 420
#48

09 Nov 2016, 11:50

My take from the feedback is to forget versioning (at least for now). I still think that two things are really valuable:
- Having one command to manage install/uninstall from ssc/github/folders
- Being able to list and run dependencies (dependencies.do or something like that)
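
A dependencies do-file of the kind suggested here could be sketched as follows. This is only an illustration under stated assumptions: the package list is made up, and the check relies on each package shipping an ado-file with the same name as the package, which is not true of every package.

Code:

```stata
* dependencies.do: install each required package only if it is missing.
* The package names below are illustrative, not taken from a real project.
foreach pkg in reghdfe estout {
    * which returns a non-zero code when no ado-file of that name is found
    capture which `pkg'
    if _rc {
        ssc install `pkg'
    }
}
```

Running do dependencies.do at the top of a project would then leave installed packages untouched and fetch only the missing ones from SSC.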
Sebastian Kripfganz

Join Date: May 2014

Posts: 2575
#49

09 Nov 2016, 15:46

Originally posted by Nick Cox

Sebastian: This is an interesting direction, but r(version) defined when and only when a program is defined is easily overwritten -- and it's not always going to be evident which program it came from. And any command emitting r-class results might mess up other programs using and/or producing r-class results. It might make more sense to let which emit something much more distinctive and much less damaging.

That's a valid objection. I obviously did not think it through to the end.

Originally posted by Jesse Wursten

I personally feel Github solves an issue that doesn't really exist. In half a decade of intensive Stata use, I have yet to encounter the issue you mention, i.e. that I cannot reproduce certain results or that certain programs stop working due to a changed dependency.

I actually often find empirical research not to be reproducible. Here is a recent discussion on Statalist about the non-reproducibility of the results in a classic dynamic panel data paper: Replicating Blundell and Bond (1998) using -xtdpd-.
The problem was actually not with Stata but with a different software used for the original results. It was discovered later that there was a bug in the original software that was corrected in a later version. My argument is that it is not necessary to access this original version. If it is documented that there was a bug that led to incorrect results, then that is everything we need to know. Why should we still try to reproduce the original results when we know that they are wrong due to a bug in the code? I actually find it dangerous if old versions with bugs are still hanging around, because people might accidentally continue to use them without being aware of the bug.

That said, I believe it is important that program changes that lead to different results (in particular due to a bug fix) are properly documented. Otherwise, you might endlessly try to reproduce the original results and wonder what you might have done wrong when it is actually not you who made the mistake.

https://www.kripfganz.de/stata/
haghish

Join Date: Aug 2014

Posts: 199
#50

10 Nov 2016, 03:53

Jesse Wursten your points are valid. Examples of Stata packages that rely on other packages are rather rare. Authors pack all of their ado files and publish them at once. However, we often develop smaller engines that can simply be used by others. In that case, releasing the smaller engines on their own allows other users to implement them in their own packages. For example, the MarkDoc package requires the Statax package, which is a syntax highlighter (both JavaScript and LaTeX). Statax was developed to provide a syntax highlighter for Stata, but releasing it as an independent engine allows others to use it if they need a syntax highlighter for a document, etc. I could simply have avoided publishing Statax and just included it in MarkDoc, but I thought the package could be useful on its own. That comes at a cost, though. Those who install MarkDoc from SSC must read the help files to realize that other packages are also needed. I made this "more clear" by raising errors when the dependencies are not installed, to guide the user to the other packages that should be installed. But why not install the dependencies in the first place? So I think part of the reason that dependencies are not so common among Stata packages is that all authors agree they do not make a good "first impression."

I agree with all of your points. I also agree that you should get used to making pkg and toc files (I mentioned it in #21). I can, for example, make a dialog box that generates these files. I can also develop a very bad option for the github command that installs the files of the repository even if there are no toc and pkg files (i.e. creates them on the fly). I call it a bad option because it is not disciplined enough and very casual. But it can work just fine for users who are not familiar with GitHub. Actually, most of the information stated in these two files can be obtained from the repository (description, release date, list of files, author, etc.). So yes, I can make this happen, although I assume experienced users won't favor it.

Sebastian Kripfganz Thanks for this. At least we agree that the problem is very real. Quite frankly, I don't think that GitHub is our savior. There are endless ways in which research might go wrong. As we are only discussing the process that begins with the computer (i.e. from the time the data is digitized or the program is being written), one might reasonably ask about the other aspects of the research that can mess up everything and are hardly traceable. Nevertheless, I think almost all people who have argued in this thread so far, whether pro- or anti-GitHub, agree that a disciplined practice is always better than a loose workflow and, if I generalize it, the more disciplined the better. And I think GitHub is a step forward, although I agree it can be a pain to learn.

That said, I taught it to my undergrad students (applied math students who also take a lot of courses in computer science) in an hour and we actually used it as a platform for posting assignments, etc. It wasn't very convenient for them but we managed. I think writing a useful tutorial about using GitHub would be very helpful, because when users install GitHub they will be very confused about how the software works... So that is a challenge in itself.

——————————————
E. F. Haghish, IMBI, University of Freiburg
[email protected]
http://www.haghish.com/
haghish

Join Date: Aug 2014

Posts: 199
#51

14 Nov 2016, 11:37

Thanks for all of your comments. Following the suggestions made by many of you, I have updated the github command to address the critiques mentioned in your posts.

Sergio Correia
github is now very careful about protecting the trk file, but the idea of the "wrapper" is not really doable, in my opinion... As suggested, I also added the uninstall subcommand.

Nick Cox
So as you suggested, github now allows for searching for Stata packages and repositories and the process is completely automated. The GitHub API had a lot of limitations, but it seems that searching for a package or keyword and installing them has become much easier.

Soon I will add another subcommand that lists all of the packages released after any particular date, which makes it possible to build an archive list of the packages.

It was also mentioned in the forum that some users are unaware that their modules require the toc and pkg files. Unfortunately, the majority of Stata modules hosted on GitHub suffer from this problem, so it seems that it needs to be addressed in a systematic way... Further suggestions are very welcome.

PS. Some users had the impression that the github command itself makes archives of users' modules. This is simply incorrect. The point is that authors make releases of their own software on GitHub whenever they fix a bug or add new functionality. The github command can access all of the previous releases of a package and allows the user to install an older version, if requested. But it does not archive software itself.

Sergio Correia

Join Date: Apr 2014

Posts: 420
#52

15 Nov 2016, 07:19

Further suggestions are very welcome.

Is there a template package on SSC somewhere? If not, something like that could exist on GitHub (just a sample package that does nothing but follows best practices).
Charlie Joyez

Join Date: Dec 2014

Posts: 418
#53

15 Nov 2016, 09:35

I've just read the entire conversation, and even if most has been said, I'd like to add some points:
(I won't enter the debate on the pros and cons of moving archives from SSC to GitHub.)
Despite my short experience in programming Stata, I know both systems, since I have made intensive use of nwcommands by T. Grund, which is on GitHub and not on SSC.

1) The github package is useful, because GitHub already hosts Stata packages that were not directly accessible from Stata (e.g. the nwcommands I mentioned).

2) However, making it easier to spread user-written commands across several storage places personally bothers me, for several reasons:
-Stata users (and also beginners who don't know much about the SSC architecture) will be forced to check both SSC and GitHub to see whether a command exists. And I can already imagine the case where ssc install prgname1 and github install prgname1 won't install the same program. Then you could run the same program name as your neighbor/colleague/student/teacher and not get the same output. Moreover, I can already see my students' faces when I tell them to use ssc install for this program, but github install for the second.

-From another perspective: I use Stata on a secured (i.e. "offline") server for data confidentiality. The administrator of this server by default pre-downloads the SSC archives, so that we are able to use user-written commands. I'm not sure they would agree to do the same for a GitHub archive, since it would be much less supervised, and the size would rapidly become an issue if all versions of all programs are available; they would simply not know which version to download. Lastly, I'm not sure they'll be happy to deal with a second source today, and perhaps a third later on.

Again, that's my story, but I thought it might be informative.
So concerning further suggestions:
-Make sure that there is no name conflict between ssc install and github install. (Suggestion: within github install prgname, run ssc install prgname; if it does return something, download the SSC version and warn the user of that; if it doesn't return anything, install from GitHub.) I know this amounts to giving SSC the priority, but it's the only way I see; others might have better ideas.

-Encourage authors on GitHub to keep the latest official version of their program on SSC, so that there is still a "reference" archive, and perhaps keep GitHub for collaboration until a new version is ready.
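
The SSC-first rule suggested above could be sketched as a small wrapper. Everything here is an assumption for illustration: safeinstall is a hypothetical program, not part of the github package, and it assumes that github install accepts a username/repository argument. Note that ssc describe also fails when there is no internet connection, so a real implementation would need to distinguish that case.

Code:

```stata
* Hypothetical wrapper giving SSC priority over GitHub.
* Usage: safeinstall prgname username/repository
capture program drop safeinstall
program define safeinstall
    args pkg repo
    * ssc describe returns a non-zero code when the package is not on SSC
    capture ssc describe `pkg'
    if _rc == 0 {
        display as text "note: `pkg' exists on SSC; installing the SSC version"
        ssc install `pkg'
    }
    else {
        github install `repo'
    }
end
```

The design choice is the one described in the suggestion: the name lookup on SSC decides the source, and the user is told whenever the SSC version is chosen.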

Best,
Charlie
Nick Cox

Join Date: Mar 2014

Posts: 35417
#54

15 Nov 2016, 09:41

Further to Charlie's thoughtful remarks:

It's long been true that user-programmers are well advised to check that any program they wish to make public has a new and different name.

Code:

search newprogname

is not guaranteed to know everything, but it should find programs in the usual places.
Charlie Joyez

Join Date: Dec 2014

Posts: 418
#55

15 Nov 2016, 09:50

Nick Cox And that's one more reason to keep "usual" places.

This also makes me wonder:
Does anything on GitHub prevent the creation of two programs with the same name?
How does the github package deal with that?

I think Kit Baum won't let me upload into SSC a program whose name already corresponds to an SSC program.
Nick Cox

Join Date: Mar 2014

Posts: 35417
#56

15 Nov 2016, 09:54

On the bottom line: Let's confirm absolutely:

There is no way that two programs can have the same name on SSC.

This is a feature to those who favour simplicity and a key limitation to those who want to keep multiple versions in sight.
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1361
#57

16 Nov 2016, 01:27

You could theoretically have an infinite number of repositories all with the same name, but that is only half of the information needed. You also need to know the username and the combination of username/repository should be unique since it forms the basis of the URL. It also provides a bit of flexibility in terms of installing a version where someone else has forked and added features needed to the program that might not be incorporated into the original author's main repository yet.
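
To illustrate how the username/repository pair forms the address, here is a sketch using the official net install command. It assumes the repository keeps its pkg and toc files at the top level and that the branch is called master; haghish/markdoc is used only as an example.

Code:

```stata
* The username and repository name together identify one package source.
local user "haghish"
local repo "markdoc"
* Assumed URL layout: raw.githubusercontent.com/<user>/<repo>/<branch>/
net install `repo', from("https://raw.githubusercontent.com/`user'/`repo'/master/") replace
```

Changing only the user macro to the account of someone who has forked the repository would install that fork instead, which is the flexibility described above.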
haghish

Join Date: Aug 2014

Posts: 199
#58

16 Nov 2016, 08:33

Charlie I am very familiar with that problem... I also think that if Stata users develop their software on GitHub and publish it on SSC (while ideally archiving all releases on GitHub), we have made a step forward. And I believe it is a big step forward. SSC is almost the "natural" way to install packages in Stata. I have tried to make the github command as close to ssc as possible, and I think the latest version has succeeded in that, to some extent.

Sergio Correia
Very good suggestion. I will make a repository that includes all of the basics for an installable repository, which others can fork or download. That said, I released a new version of the MarkDoc package today to solve this problem. The new version adds a build option, which generates the pkg and toc files automatically. Naturally, users would be required to write their Stata help files using the MarkDoc package as well. But MarkDoc now seems to be a complete tool not just for creating help files, but also for making the package ready to go on GitHub. I also added the option to MarkDoc's dialog box... I hope that helps! Yesterday, I added the github subcommand for listing all Stata packages on GitHub. It turned out that GitHub has been hosting plenty of Stata modules, but the majority were not installable, which makes the archived versions not installable as well... unfortunately.

Originally posted by Nick Cox

Further to Charlie's thoughtful remarks:

It's long been true that user-programmers are well advised to check that any program they wish to make public has a new and different name.

Code:

search newprogname

is not guaranteed to know everything, but it should find programs in the usual places.

Very good point, indeed. Luckily, yes! The newest version of the github package adds several new functionalities for searching GitHub within Stata. For example, you can search all the repositories' names, their descriptions, and the README.md file of the repository. Let's say I want to check if there is a Stata command on GitHub named log2html:

Code:

github search log2html, all

The all option also tells github to show repositories that are not installable (not having the pkg and toc files), which do not show up by default. This will return all the "Stata" repositories that have the keyword log2html in their repository name and description. Adding the option in(all) will also search the readme file.

However, I am aware that your point was more general, i.e. what if there is a package called myprog that has log2html.ado in it? You can also search for that on the GitHub website. For example, you can search all ado files that are named "markdoc" and are in Stata repositories:

Code:

markdoc extension:ado language:stata

It is just easier to do this in the web browser because the GitHub API does not allow you to search files unless you specify a particular repository. They have limited their API search engine... Therefore, to add this functionality, I would have to get the list of all Stata repositories (already available via the github command) and search each of them for that file name. If you think it would be useful, I can add a command that checks all of the Stata repositories for a particular filename. Using the available list of packages, this would be rather fast to execute, although building a fresh list of all Stata repositories (the command is available in the newest release of the github package) takes about 10 minutes.

You can build the complete list of Stata packages by executing:

Code:

github list stata, language(all) in(all) all save(archive) append

The list of packages can also be accessed in a data set that I update frequently, where I simply run this command and save the "archive.dta" file, which is available here:
https://raw.githubusercontent.com/ha...ta/archive.dta

Just to inform those who are following the discussion, the github hot command is a new subcommand for viewing popular Stata packages on GitHub. It allows the users to figure out how well a repository is doing, since it also calculates a hits score for each repository. That was also mentioned somewhere in the discussion.

Last edited by haghish; 16 Nov 2016, 08:37.

Nick Cox

Join Date: Mar 2014

Posts: 35417
#59

16 Nov 2016, 13:49

Haghish: Sorry, but your reply to my simple point was far too complicated for me to follow.

One of your problems here in this thread is, I guess, that people who don't use GitHub don't understand most of what you're saying. That is a fact, not a criticism; it's on all fours with the fact that I don't understand Chinese, which is manifestly not the fault of people who do, and no reason for them not to speak Chinese.

My point again: any Stata programmer wishing to put a Stata program in the public domain needs to know if the same name has been used anywhere in the Stataverse.

If you're telling me that I can use GitHub to do this, either wholly or partially, that's good for people already using GitHub, but I can already (try to) do this using search in Stata.

My interpretation is that

* if search can find Stata programs on GitHub, that's great, but then my using GitHub to do this is unnecessary.

* if not, that is not ideal, but I really won't try to learn GitHub just to find what's on GitHub, I don't even want to learn a new Stata package to do this because everything should be accessible via search. That's been the Stata standard for many years.

You seem to be imagining that lots of Stata programmers can just make a small jump and start learning GitHub and all will be fine, but I've heard nothing in this thread that tempts me in the slightest!

Last edited by Nick Cox; 16 Nov 2016, 13:57.
haghish

Join Date: Aug 2014

Posts: 199
#60

16 Nov 2016, 15:51

Yes, github search can search for Stata command names on GitHub, similar to the search or findit commands.

Most of my explanation was about ensuring that no similar "filename" is used in different packages on github, which complicated my explanation.
