Reproducible workflows write-a-thon

timtroendle · 17 April 2018 13:39

EDIT: This is now a write-a-thon for people wanting to share their experience with reproducible workflows. If you want to learn how to make your workflows reproducible, consider the learn-a-thon.

Let’s talk about reproducible workflows! Reproducible workflows allow others (including your future self) to reproduce all or parts of your work. Technically, it’s mainly about making others able to copy your working environment and run all tasks in the same manner as you did.

In this write-a-thon we collect experiences with workflow systems and compile them into a wiki entry. The structure of the wiki entry could be along these lines:

requirements for workflows in energy modelling
overview of tools in particular focusing on advantages and disadvantages for energy modelling
discussion of usefulness for energy modelling
link to examples
how-tos

In a follow-up session, we could even start building a cookiecutter-template.

tom_brown · 19 April 2018 10:02

@jonas.hoersch or I could do some evangelism for snakemake in this context. It has a steep learning curve, but everyone I know who got over the initial barrier loves it.

timtroendle · 19 April 2018 10:44

Yep, I also switched from make to snakemake and I am not regretting it. My original thought was to collect and document experience people had with different systems, but it really depends on the background of people joining. If the background is heavily snakemake centered we could also write a “snakemake in energy science” tutorial or software carpentry style lesson (and maybe use it already in a later session?).
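For readers who have used neither tool, the idea that make and snakemake share can be sketched in plain Python: a step is rerun only when its output is missing or older than one of its inputs. This is a minimal illustration, not snakemake's API; the names `needs_rebuild`, `run_step`, and `concatenate` are hypothetical and exist only for this example.

```python
from pathlib import Path


def needs_rebuild(output, inputs):
    """Return True if the output is missing or older than any input."""
    output = Path(output)
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime_ns
    return any(Path(i).stat().st_mtime_ns > out_mtime for i in inputs)


def run_step(output, inputs, action):
    """Run action(inputs, output) only when the output is out of date.

    Returns True if the step was executed, False if it was skipped.
    """
    if needs_rebuild(output, inputs):
        action(inputs, output)
        return True
    return False


def concatenate(inputs, output):
    """Hypothetical task: join all input files into one output file."""
    text = "".join(Path(i).read_text() for i in inputs)
    Path(output).write_text(text)
```

Calling `run_step("out.txt", ["a.txt", "b.txt"], concatenate)` twice in a row runs the task once and skips it the second time. Real workflow systems add a lot on top of this, such as wildcards, dependency graphs across many rules, and cluster execution.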

robbie.morrison · 26 April 2018 20:46

Hi Tim. The merits of docker containers might warrant a mention. I have not used containerization myself, but containers are occasionally mentioned as a way of distributing energy models. TEMOA has considered using them. I am not sure containers strictly fall within the rubric of reproducible workflows though. Robbie

timtroendle · 27 April 2018 07:41

Good point Robbie. I see replicating the computational environment as a necessary condition for reproducibility (or should I rather say replicability?) and certainly would like to discuss this here. Personally I am not too convinced of the added benefits of containers, but I’d love to discuss with anyone having used them effectively.
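As a lightweight starting point short of containers, the computational environment can at least be recorded next to the results. The following is a minimal sketch using only the Python standard library; the function names and the JSON layout are assumptions made for this example, and it records versions without recreating the environment.

```python
import json
import platform
from importlib import metadata


def snapshot_environment():
    """Collect interpreter, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installations that report no name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def write_snapshot(path):
    """Write the snapshot as JSON, e.g. alongside model results."""
    with open(path, "w") as f:
        json.dump(snapshot_environment(), f, indent=2)
```

Calling `write_snapshot("environment.json")` at the end of a model run leaves a record that others can compare against their own setup. Tools such as conda or pip can go further and rebuild an environment from a pinned specification.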

robbie.morrison · 27 April 2018 08:45

Hi Tim. Usage varies but “replicability” often means (wiktionary) to repeat the experiment or trial. In the context of numerical modeling, this implies independently reimplementing the code and collecting the data afresh. So let’s stick with the lower bar of “reproducibility” for this discussion. Robbie

ludwig.huelk · 22 May 2018 15:32

Yet another highly interesting Do-A-Thon for this meeting. Thanks @timtroendle

To add to @robbie.morrison’s comment, I recently heard these (simplified) definitions at a conference (without a citable source):

Repeatability -> same team / same experiment
Replicability -> different team / same experiment
Reproducibility -> different team / different experiment

Perhaps it is too simple and doesn’t fit the workflows and tools of computer models (experiments).
But perhaps we can adjust and specify this to our needs.

Unverified thoughts:

When you run your model on the same computer and get the same results, you have achieved Repeatability.
When your colleague runs the same model on your computer and gets the same results, you have achieved Replicability.
I’m not an expert on this, but if you put your “experiment” in a container (e.g. Docker) and include all data, it stays the same “experiment”. You can distribute the same experiment to different people (without giving access to your computer).
When someone downloads your model, prepares all data, and runs the same simulation (following your perfect documentation), you have achieved full Reproducibility.
But does this also mean that by using a container you cannot achieve Reproducibility, just Replicability?

Happy to hear your opinions on this.
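The simplest of these cases, getting the same results from the same model on the same computer, can be checked mechanically. The sketch below is an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a real model run, and results are compared via a SHA-256 fingerprint of their JSON serialisation.

```python
import hashlib
import json
import random


def toy_model(seed, n=5):
    """Hypothetical stand-in for a model run with a fixed random seed."""
    rng = random.Random(seed)
    return [round(rng.random(), 6) for _ in range(n)]


def result_fingerprint(result):
    """Hash a JSON-serialisable result so that runs can be compared."""
    payload = json.dumps(result, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_repeatable(seed):
    """Run the model twice and compare the fingerprints of both runs."""
    first = result_fingerprint(toy_model(seed))
    second = result_fingerprint(toy_model(seed))
    return first == second
```

The same fingerprint could be published with the results, so that a colleague or a different team can compare their own run against it.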

timtroendle · 23 May 2018 09:00

All good points, and there is a lot of discussion going on already on the terminology.

For this do-a-thon I would like to keep it pragmatic though. It should be about hands-on issues of being able to repeat the exact same analysis using the same code. You could say that this is a continuation of moving from closed source to open source: open source is nice, but doesn’t ensure that others can repeat your analysis.

robbie.morrison · 23 May 2018 10:45

Container images

It is worth noting that docker containers and similar products have a terrible reputation in open source law circles for two reasons: licensing and security.

First, license compliance is a legal nightmare because the entire container becomes a single entity under copyright law. And perhaps even a single program for software licensing purposes, given that the image is just one file that you “run”, albeit in a specialized environment. Moreover docker, for reasons of space, normally strips out every license file (as well as every documentation file) it can locate. In addition, there is currently no license‑hardened Linux distro which can be legitimately packed and shipped as a base for building container images. I imagine there will be civil litigation down this track.

Second, with regard to security, docker normally drags in material from all over the internet, much of it of unknown status. Hohndel (2018) provides several salient examples.

Finally, as @ludwig.huelk indicates, the use case for containerization in the context of reproducibility and related objectives remains somewhat unclear. What exact research need is being served by their use?

My recommendation: make a proper assessment of the merits and risks of creating and distributing container images before using the practice. HTH, Robbie.

References

Hohndel, Dirk (April 2018). Don’t ship that container: on the challenges of compliance of container images — Presentation. VMware. Licensed under a Creative Commons CC BY 4.0 International License (although neither the content nor the metadata says so). This version has been shrunk using ghostscript.

2018-hohndel-challenges-compliance-container-images-presentation.shrunk.pdf (1.7 MB)

timtroendle · 6 June 2018 14:47

This is now scheduled for tomorrow, but unfortunately I am unlikely to be able to attend. Is there anyone willing to host this write-a-thon?

simnh · 7 June 2018 06:08

@cswh and/or I could do this. If you have anything prepared, please share it. If not, it is no problem, we will come up with a method for writing and discussion.

timtroendle · 7 June 2018 06:25

@simnh Unfortunately, nothing to share really. Depending on the group, one thing that could be very worthwhile is for everyone to briefly share their way of working, e.g. by showcasing a current project. In my opinion, a very valuable outcome would be a how-to with some best practices.

simnh · 7 June 2018 06:35

Yeah, we thought of something similar!

simnh · 7 June 2018 13:59

We can collaborate in this document:

simnh · 7 June 2018 15:46

We only had a discussion and concluded that we need further discussion in the forum on a wiki entry about reproducible workflows. I think we should integrate results from the previous discussions at workshops and in the forum thread above.

This paper was mentioned in the discussion:

http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

robbie.morrison · 12 June 2018 07:14

Here is that same reference in full:

Sandve, Geir Kjetil, Anton Nekrutenko, James Taylor, and Eivind Hovig (24 October 2013). “Ten simple rules for reproducible computational research”. PLOS Computational Biology. 9 (10): e1003285. ISSN 1553-7358. doi:10.1371/journal.pcbi.1003285.

robbie.morrison · 17 June 2018 08:39

Software Heritage source code archive

Another part of the puzzle is provided by the Software Heritage project, which launched officially on 7 June 2018. Software Heritage is a large‑scale source code archive, led by French computer science research institute INRIA. The project runs the following website, wiki, and internal code repo:

Although long (1.5 hours), the Di Cosmo video (listed below) is well worth watching.

The project aims to collect “all the world’s software development history in a single graph”. That graph uses a generalization of Merkle trees to DAGs. And, like the git version control system, each file, sub‑directory, codebase revision, and release receives its own cryptographic hash. For example, here is the directory hash and associated URL for the abandoned deeco energy model I once maintained:

swh:1:dir:02f3429487debca77371aec67ac1d483d8ff3f96
above as URL

As of June 2018, the project contains 4.6 billion source files and over 5 TB of metadata. For fun, here is some assembler from the Apollo 11 moonshot:

swh:1:cnt:41ddb23118f92d7218099a5e7a990cf58f1d07fa
above as URL with highlighting

The Di Cosmo video also discusses the limitations of DOIs as persistent identifiers. Aside from being non‑free (the right to assign DOIs starts at several thousand euros), there are no guarantees that the digital object being referenced has not subsequently changed. The hashed references used by Software Heritage, of course, provide this assurance. Indeed the hash alone, possibly truncated, can be used to search for and recover any particular resource.
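To illustrate why a hashed reference gives this assurance, here is a minimal sketch of the git-style content hash. It assumes that the hex part of a content identifier (the `swh:1:cnt:` form shown above) is computed the same way as a git blob hash, namely SHA-1 over a short header plus the file bytes. The function names are hypothetical and this is not Software Heritage's own tooling.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the git-style blob hash: SHA-1 of 'blob <size>\\0' + content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def matches_identifier(data: bytes, identifier: str) -> bool:
    """Check content against an identifier of the form 'swh:1:cnt:<hex>'.

    Assumption: the hex part is the git-style blob hash of the content.
    """
    expected = identifier.rsplit(":", 1)[-1]
    return git_blob_sha1(data) == expected
```

Anyone holding a copy of a file can recompute the hash and compare it with the published identifier. If a single byte has changed, the comparison fails, which is a check that a DOI by itself cannot offer.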

The Software Heritage project is designed to work in a three‑way partnership with scientific data hosts (like Zenodo) and open access publishing repositories (like arXiv). Note that Zenodo currently offers only DOIs.

Energy modelers now need to investigate how best to integrate this service into their modeling workflows. But I have no doubt this service will become an indispensable asset for us within a short time.

Because we mostly code in established interpreted languages (like python), the issue of reproducing a suitable execution environment is not yet often problematic. But Software Heritage may be able to assist with this issue because most of the resulting dependencies will also be databased, hashed, and directly recoverable. Nonetheless, Software Heritage does not and will not support virtual machines, which are often used to emulate superseded platforms; that theme is one for another project.

References

Di Cosmo, Roberto (8 November 2016). Preserving software and data: ensuring availability and traceability — Presentation. Rocquencourt, France: INRIA and IRIF.

Di Cosmo, Roberto (8 November 2016). Preserving software: ensuring availability and traceability (MP4) (video). Duration 01:37:39.