Moving towards GitHub: a discussion about version control, package archiving, etc.

haghish

Join Date: Aug 2014

Posts: 199
#16

06 Nov 2016, 09:27

Robert Picard

Robert, the point is not that "the author should be happy." The point is that "the reviewer should be happy." And the reviewer can only be happy if he can openly access your code (packages/analytic code) without contacting you. I make it bold because it makes a lot of difference. We should not require additional information from the author to reproduce an analysis or claim. Therefore, we need all versions of a software package to be available, so that we can install them and redo an analysis using identical software. Many Stata programmers program in Mata and only include the Mata libraries on SSC. The question is: how can the community be assured that a package computes what it promises? Isn't that the whole point of open-source publishing?

Version control might not seem so crucial if the programs you write are small. The bigger the project and the more complicated it becomes, the greater the developers' need for version control software. I have often made bad mistakes while trying to add a new feature or debug the software. With version control software, it takes a mouse click to go back to any previously saved state instead of navigating through several directories without knowing where that "recent" bug was added.

There are other benefits in using GitHub, especially when it comes to teaching computational programming to students. GitHub encourages students to practice computational statistics, read others' programs, comment and document their own programs better, and write better programs, simply because they expect others to read their code. For that to take place, we should be there and serve as examples!

——————————————
E. F. Haghish, IMBI, University of Freiburg
[email protected]
http://www.haghish.com/
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#17

06 Nov 2016, 13:09

haghish (full name please as per the FAQ), your argument that Git-type version control is crucial to serious work is no more compelling than proclaiming that everybody should use a Mac because of Time Machine. Besides, do you really want to make the my-programs-are-bigger-than-your-programs argument?

All my code is accessible on SSC. A lot of it is written in Mata that will be compiled on the fly, with no libraries used. My code is there for anyone to peruse and I'll be happy to provide to anyone all the steps I use to certify that my code generates the correct results. If an author wants to withhold their source and only provide compiled libraries, that's their choice. Most of Stata's code is proprietary and not available for review. Since Stata is not open source, then your argument is that no one should trust the results generated by Stata!

None of what you propose has to do with replicability. It has to do with accessibility. The code is available from SSC and the user is ultimately responsible for keeping around the data they used, the code they wrote, and the version of the user-written programs they used.

You seem to have a perverted notion of collaboration. In real life, collaboration is consensual (please look up the meaning). You can't force yourself on someone. If rejected, you can't simply steal their work and proceed anyway. It doesn't matter that you think that your contribution is vital and that the world would be a better place with it. Similarly, you can't make copies of the paywalled Stata Journal articles you receive and put them on a GitHub repository and claim that this Robin Hood type act will benefit the research community.

Maybe you think that it's OK to take my code because it's available for free in human readable form. That's not what open source means. Here's the first sentence of the Wikipedia entry on open-source software (as of this moment for the version control obsessed):

Open-source software (OSS) is computer software with its source code made available with a license in which the copyright holder provides the rights to study, change, and distribute the software to anyone and for any purpose.

What's missing is the explicit licence from me to change and redistribute my code.

Since you appreciate the ability to review source code, may I suggest that you look at "ssc.ado" in your Stata installation: you'll find the following line:

Code:

qui net from http://fmwww.bc.edu/repec/bocode/`ltr'

The first paragraph of http://repec.org is:

RePEc (Research Papers in Economics) is a collaborative effort of hundreds of volunteers in 89 countries to enhance the dissemination of research in Economics and related sciences. The heart of the project is a decentralized bibliographic database of working papers, journal articles, books, books chapters and software components, all maintained by volunteers. The collected data are then used in various services that serve the collected metadata to users or enhance it.

Each of my programs on SSC counts as a publication and I can use the RePEc Author Service to manage them. So I'm a small pawn in an open and collaborative effort.

I think that SSC works well enough in making it super simple to install user-written programs parked in a RePEc repository and I see no compelling reason to move to a GitHub repository. At the risk of repeating myself, why can't you and other similarly inclined authors develop your own method of installing your wares and be satisfied with that? Why do you insist on co-opting the collective works on SSC?
2 likes
Comment
Sebastian Kripfganz

Join Date: May 2014

Posts: 2575
#18

06 Nov 2016, 14:46

Let me chip in my two cents. I agree that replicability / reproducibility is a cornerstone of academic work. But I also tend to agree with Daniel that it is not helpful if older versions with bugs are retained and continue to be used. It is, however, desirable that a documented version history exists (publicly or provided by the author upon request). If a bug in an old version was fixed such that earlier results cannot be reproduced, this is absolutely fine as long as the information is available that there was a bug. If I cannot reproduce earlier research, I am happy as long as I know the reason for it.

For my programs, everybody can find a brief version history with documented changes at the end of the main ado-file of the package. In addition, I keep old versions on my own computer that I could provide if necessary (which has never been the case so far).

Regarding open source code, I personally do not provide the Mata code but only compiled libraries because I actually do not want others to modify my code. The main reason is that somebody else who modifies my code will very likely not fully understand my programming logic and may introduce errors, but eventually I will be held responsible because I was the original author and subsequent users may not realize that somebody else has made some changes to the program.

Also, I usually do not even feel the desire to look into others' code. The time I need to understand the logic behind somebody else's code is probably more than sufficient to just reprogram a procedure by myself (as I have done in the past). That is still the best way to replicate others' work. If you obtain the same results with different code, the chance that both made the same mistake is tiny. If I replicate some work with the original code, the replication would just be subject to all the same errors that might have been in that code.

https://www.kripfganz.de/stata/
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#19

07 Nov 2016, 01:48

I have looked at and tried Git twice, a couple of years apart, and in both cases it did not work for me. At its core, reproducibility requires old-fashioned conscientious work. The bottleneck is not technical but human. Each of us needs a workflow that is thorough but at the same time comfortable enough that we actually do it, and keep doing it. That is the point where Git failed for me.

As to collaboration, I have had no problem with the current way of working. My first steps as a Stata user programmer were me emailing Nick and Stephen offering to add functionality to betafit. They were (and are) very nice and we had a short discussion, they improved my suggested improvements, and we sent it to SSC. I am sympathetic towards the ideas behind Git, but I don't think it would have worked that well in this case. The technical layer would get in the way of simple human interaction. As to my real work: I collaborate with people who will never use Git, so that is pretty much the end of the story. I don't like switching between workflows, so that is another reason why Git won't work for me.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
2 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35418
#20

07 Nov 2016, 03:24

We all write from our experiences, and in the face of our ignorance of other habits.

I am bemused at the idea that I need access to all previous versions of any command, even my own commands. I have worked with Stata for 25 years without feeling this need. That's not to say that there aren't lots of small puzzles when people are using different versions of a command. But most of the time on Statalist, this is because old and new versions are currently accessible in different places. The problem arises because naive users are not aware of the possibility of different versions. That's not going to be solved by telling people to learn how to use GitHub.

Similarly, any implication that you need GitHub to collaborate is absurd. People can send each other code and comments by email, which is simple and works well. What Maarten describes was a case in point. I'll bet that most jointly written programs in the Stata community arose this way and that's been so for a long time.

Prejudices aside, I am also interested in hard facts here:

1. Are there any Stata projects on GitHub in which two or more people have collaborated? Can they report jointly on their experiences?

2. Is there a way of easily finding out about all the Stata packages on GitHub?

There's an easy way to find out about everything on SSC:

Code:

foreach letter in _ `c(alpha)' {
    ssc desc `letter'
}
2 likes
Comment
haghish

Join Date: Aug 2014

Posts: 199
#21

08 Nov 2016, 05:02

Robert Picard
No accessibility, no reproducibility. Or else, what's the point of making the source and data available online?

Did I say I don't respect licenses and authors' efforts? I said the thief will find his way, whether on SSC or GitHub.

I stand by my idea of collaboration, because a collaboration on software requires traceable contributions of each author. I don't argue that collaboration via email or Dropbox is impossible. I am arguing that it is much more convenient to do it on a platform where every change is documented and you can roll back any committed change by any user. Does this mean I am forcing my contributions on others?! Besides, the original author can always accept or reject a contribution on GitHub, unless the other contributor is also a member of the project.

Again, I do not think it is OK to take your code without having a written agreement from you. Did I say anything that implied otherwise?

And well, did I say my programs are larger/better than yours?! I said "The bigger the project and the more complicated it becomes, the greater the developers' need for version control software".

If a software release on SSC counts as a publication, I can't understand why publishing "identical material" on other platforms does not count as a publication. Only because it doesn't appear on Google Scholar automatically? There are other alternatives, for example, https://arxiv.org/archive/stat. I am not trying to argue that publishing on SSC is not good and that we should not do it. Not at all. I think SSC is great, especially for beginners. They write ado-files and help files, and Kit takes care of adding the other required files. My argument is that versions should be archived and remain accessible, along with many other points that I shouldn't repeat.

——————————————
E. F. Haghish, IMBI, University of Freiburg
[email protected]
http://www.haghish.com/
Comment
haghish

Join Date: Aug 2014

Posts: 199
#22

08 Nov 2016, 05:23

Nick Cox

By collaboration I do not only mean "writing software together", but also bug fixes. I just received a bug fix from Mikko Rönkkö today for the Weaver package and have had very good experience with MarkDoc as well, where some users suggested bug fixes. So yes, I have seen the benefit in practice.

The second question is very important, though. SSC has already built a terrific archive and users can search for different packages. I can create an online form where developers "submit" the information about their packages. This would solve the problem if authors cooperate. For software that does not receive very frequent updates, SSC is still the way to go. I don't see what could possibly go wrong with releasing the latest version of a package on SSC! I argue in favor of GitHub for all the reasons I have previously mentioned, i.e. SSC is not enough!

——————————————
E. F. Haghish, IMBI, University of Freiburg
[email protected]
http://www.haghish.com/
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35418
#23

08 Nov 2016, 05:28

E.F.: Thanks for your clarifications and extra comments in #21.

As an over-arching remark, I think we should try to focus on technical agreements and disagreements without feeling that there is any personal element to the discussion.

I make a few further points:

1.

a collaboration on software requires traceable contributions of each author.

It really doesn't. If authors want to document that, they are always free to do so, but there is no such requirement beyond what co-authors impose on themselves.

On SSC and in the Stata Journal, all that anyone requires is that the authors put themselves in whatever order they desire. No-one at either place objects, naturally, if people want to comment: X did this, Y did that, as is standard in some journals.

My own convention, which is widely shared, is that the first author named in Stata help files is the lead author. That usually means historically the largest single contribution to code. It certainly means that the first author is now leading program development and maintenance and support.

2.

I think SSC is great, especially for beginners

I will add: for all levels of user-programmers.

3.

I am not hostile to GitHub. That would be asinine when I know so little about it. But I've taken a look and my first, and second, impressions are that it's much more complicated than I need. If you're right, then Stata user-programmers will migrate to GitHub quickly.

As yet, no one has even given a single example from their experience in response to my query in #20.

Do you, E.F., act as joint author of any of your Stata projects? That's the nub. If collaboration is easier through GitHub, that can be documented from experience.

EDIT: Thanks also for #22.

People send me problem reports and code suggestions for my programs published through SSC and the Stata Journal, so I am hearing nothing different. That's been standard in the Stata community at least since the Stata Technical Bulletin started in 1991.

It seems that there is no way of finding Stata projects on GitHub except in so far as people tell you and you tell us. That sounds a fragile process. My sense is that people using GitHub like the directness of being able to post themselves. They would be less likely to want a further level of paperwork. Do I misunderstand?

Last edited by Nick Cox; 08 Nov 2016, 05:39.
Comment
haghish

Join Date: Aug 2014

Posts: 199
#24

08 Nov 2016, 05:52

Originally posted by haghish

requires traceable contributions of each author

I really did not mean it in terms of author contributions. The reason I think it is important is merely for finding bugs. If changing the software results in a bug, you will always know exactly what has changed compared to the previous saved state. All of the information regarding the changes in the code will be preserved. That's what I meant by traceable contributions. Perhaps I didn't use the proper term.

I am currently collaborating on a couple of projects in our department where version control has been used from the beginning. And we also have Git installed on our server.

But I also agree with you that Git has a learning curve and it requires breaking a lot of habits. It's not going to be friendly the first week especially for Linux users, but I think it's worth it. The GitHub GUI that is available on Windows and Mac makes learning it much simpler.

——————————————
E. F. Haghish, IMBI, University of Freiburg
[email protected]
http://www.haghish.com/
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35418
#25

08 Nov 2016, 05:59

Thanks for this. I guess what this means is that my point 1 in #23 can be ignored, because you did not mean what you said, which often makes discussion more difficult.

I think I've said all I want to here, except that I want to emphasise a point made earlier. The STB and SJ have been archiving successive versions of programs published through them throughout their history, so that is a leading example of good practice.
1 like
Comment
haghish

Join Date: Aug 2014

Posts: 199
#26

08 Nov 2016, 06:23

Maarten Buis

It actually would. Let's imagine the package that you wanted to improve is on GitHub. You would:

1. First fork the package. The author would receive a notification that someone has forked the project.

2. Make the changes in your copy of the code.

3. Make a pull request, asking the author to "accept" the changes you made in his project and explaining why your changes matter or what bug you have fixed.

4. The author can click on your request and, within his browser, view all of the changes you have made (added code, removed code, etc.).

5. It'd be up to him whether he wants to accept your changes or not!

The same process happens over email, right? The only difference is that the whole history of the new change is documented. It will always remain clear what you changed in the program, and if your change causes other bugs, the author can always revert to the previous version and ignore your changes. The difference is the transparency provided by version control.

Sebastian Kripfganz
On GitHub you can access the history of bug fixes. This is something the author writes with each new version release. For example, you can see all of the versions of the Rcall package on GitHub (https://github.com/haghish/Rcall/releases). For each version, I have written what bug was fixed or what new feature was added. Usually you will only install the latest version, but all of the previous versions are automatically archived and you can download them (or install them using github install).

There are different occasions when you need to install older software:

1. Testing an analysis. In that case, having access to the previous version - although buggy - is crucial. Reproducibility does not mean correct results. It simply means reproducing what has been done. And there are many arguments about its benefit, but I think it makes things more transparent and transparency in research is generally a good thing! But in either case, you need to have access to archived packages.

2. If it is required by another package. This is something we should really discuss. It is not very common in Stata to require other packages. But doing so helps to rely on other people's works. The drawback is that they might change their programs which makes it unreliable because you need to ensure your package is working flawlessly with any change made to the dependencies. However, installing a particular version of the dependencies can solve this problem and for that, we need to have access to the archived releases.
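
As a rough sketch of installing one archived release rather than the latest one: the command below assumes the github package is already installed and that its install subcommand accepts a version() option, which should be confirmed in the package's help file. The tag "1.0.0" is only a placeholder; substitute a tag that actually appears on the releases page.

```stata
* Sketch only. Assumes the -github- package is installed and that
* -github install- accepts a version() option; check -help github-.
* "1.0.0" is a placeholder tag: substitute one listed on the releases page.
github install haghish/rcall, version("1.0.0")

* Confirm which copy Stata will run; -which- prints the file path and
* any "*!" version comments at the top of the ado-file.
which rcall
```

Recording the tag used in the analysis do-file would let a reviewer install the same release later.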

——————————————
E. F. Haghish, IMBI, University of Freiburg
[email protected]
http://www.haghish.com/
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1361
#27

08 Nov 2016, 06:27

Nick Cox I actually make the staff in my office use Git to avoid the crazy file system hell that can occur quickly/easily when there are no established standard practices in place. In addition to allowing me to direct one of the folks in my office to take on a specific part of the project, I was able to work concurrently on other components and leverage the merge/pull capabilities of Git to manage all of the changes to the same code base. What I found especially beneficial in my particular use case was the pedagogical utility that I was able to get from code reviews of the pull request. For example, the person in my office had never worked with Stata previously and was unaware of style conventions both in Stata and more broadly in other programming languages (e.g., white space usage, naming conventions, etc...). Using this tool I could quickly and easily comment on any of the proposed changes, provide guidance on appropriate changes, identify misused function calls, etc...

For those concerned with authorship credit and things of that nature, I would argue that GitHub helps to facilitate this as well. For example, you can include a file with any one of several open source licenses in your repository upon creation with trivial effort. This license is then included with your code as a legally binding method of controlling usage, attribution, etc..., but the package can be distributed from GitHub without needing to install the license text on an end user's machine if you choose not to. Conversely, while a similar file could be provided to SSC, the mechanism for distribution and viewing is not set up to accommodate this use case in the same way.

In terms of authorship, I think one of the more important/useful benefits is being able to know who to direct questions to. For example, Robert Grant did all of the leg work to create the Stata interface to Stan. I submitted a pull request a while ago to create a project page for the package. If someone had a question about the project page they could easily see that that was the contribution I had made (and it is an extremely small contribution) and ask me questions about things. By looking through the code base they could also see that I would not be the person to email with questions about the core functionality of the package. Another way to think of things in terms of more well known packages in Stata would be if someone had specific questions about the ivreg2 program that Kit Baum and a few others worked on. There may be parts of the codebase that were entirely authored by one of that team of individuals and being able to identify who made a specific contribution could be useful for cases where a bug is discovered, to ask questions/learn more about programming practices, etc...

I don't think GitHub and SSC need to be competitors or mutually exclusive and see things in a slightly different way. Rather than emailing Kit tons of updates as new bugs are discovered when working on a new program or adding functionality to an existing program, I treat the SSC more like a production server; in other words, only after I've done my best to identify and correct any and all major bugs would I submit something to SSC. GitHub, I tend to treat more like a development/test server; so I can put code there that is currently being worked on in order to distribute things and allow people to test things out and help find bugs where I can more effectively track any bugs, feature requests, etc...

In terms of search capabilities, GitHub uses one of the more sophisticated text search engines and could easily outperform other search functionality (in the case of calling things from Stata it would just be a matter of supplying the appropriate API request).
1 like
Comment
daniel klein

Join Date: Mar 2014

Posts: 3821
#28

08 Nov 2016, 06:53

This is something we should really discuss. It is not very common in Stata to require other packages. But doing so helps to rely on other people's works. The drawback is that they might change their programs which makes it unreliable because you need to ensure your package is working flawlessly with any change made to the dependencies. However, installing a particular version of the dependencies can solve this problem and for that, we need to have access to the archived releases.

Let us assume my foo package requires your bar package and has been tested with version 1 of the latter. Therefore I explicitly state that foo requires bar version 1. Later you find and fix a bug in bar and release version 2. Will I receive a notification then? Or am I expected to check once in a while whether any of the packages foo relies on has been fixed and then adapt my code? If the latter, then I would rather see my package break instead of continuing to give the wrong answer. But this can be done easily with SSC. All I need to do is include the lines

Code:

capture which bar
if (_rc) {
    ssc install bar , replace
}

in my program. That seems not to be more effort than stating the dependencies (maybe in a file format I first need to learn about). I should say that I would not like to see such lines in programs that I install, because I do not like some user-written software to decide what to install when and where on my machine without explicitly asking to do so. Even Stata asks my permission to update, but this might be another topic.
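
For what "break rather than keep giving the wrong answer" might look like inside foo, here is a minimal sketch. It assumes that bar.ado follows the common convention of a "*! version" comment on its first line, which Stata does not enforce; foo and bar are the hypothetical packages from the example above, and the macro names are illustrative.

```stata
* Sketch: stop with an error unless the installed -bar- declares version 1.
* Assumes the first line of bar.ado looks like "*! version 1.0.0 ...".
capture findfile bar.ado
local rc = _rc
local barfile `"`r(fn)'"'
if (`rc') {
    display as error "bar not found; install it with: ssc install bar"
    exit 601
}

* Read the first line of bar.ado, where the version comment usually sits.
tempname fh
file open `fh' using `"`barfile'"', read text
file read `fh' firstline
file close `fh'

* Refuse to continue if that line does not start with the tested version.
if (strpos(`"`macval(firstline)'"', "*! version 1") != 1) {
    display as error "foo was tested with bar version 1 only"
    exit 198
}
```

The prefix test is deliberately crude: it would also accept a line such as "*! version 10", so a real check would need to parse the version number. Unlike the ssc install lines, it installs nothing on the user's machine.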

The point is that the problem seems not to be a technical but a human one, as Maarten already mentioned. As long as authors keep changing the syntax and how their commands work, there seems to be no easy solution.

While I do get the basic concept of GitHub and can imagine it to be a great tool for developing software that starts as a collaboration, I fail to see how it improves the way we handle things now with regards to the problem discussed above.

Just to repeat: Thanks for providing a convenient program that helps us download stuff from GitHub, anyway.

Best
Daniel
1 like
Comment
Maarten Buis

Join Date: Mar 2014

Posts: 3426
#29

08 Nov 2016, 07:01

haghish The whole selling point would be that with collaboration you can do the same things as with email, but now you have version control. I am sympathetic towards that, which is why I have tried it a couple of times. However, the single most important thing about a workflow is that you continue to use it. The technical layer just got in the way. Then I realized I have never needed to get back to a previous version in a way that was impossible with my current way of working (just creating new folders with the new version number...), and I have never felt the urge to see the older versions of programs written by others. So for me the costs are higher than the benefits.

On a practical level, if part of the benefit is due to collaborating, then that benefit won't occur if the people with whom you want to collaborate will absolutely refuse to use GitHub. This is the case for my "real job". That makes it for me pretty much impossible to work with GitHub, as I really don't want multiple work flows.

Obviously, GitHub works for you. Great, continue using it and be happy. Obviously, you are enthusiastic about it and want to share it with the rest of the world. Great, you have done so. However, don't preach.

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
1 like
Comment
haghish

Join Date: Aug 2014

Posts: 199
#30

08 Nov 2016, 07:04

wbuchanan I can't agree more. The last point was brilliant! It would be very useful if we could search GitHub via Stata and view a list of packages and possibly install them the same way we do from SSC. The problem is that not all Stata repositories on GitHub have the name.pkg and stata.toc files, but that's something I can check for. Alternatively, I can define particular keywords that make the packages easy to find and organize. But yeah, now it seems doable! Thanks.

——————————————
E. F. Haghish, IMBI, University of Freiburg
[email protected]
http://www.haghish.com/
Comment
