Software Heritage source code archive
Another part of the puzzle is provided by the Software Heritage project, which launched officially on 7 June 2018. Software Heritage is a large-scale source code archive led by the French computer science research institute INRIA. The project runs its own website, wiki, and internal code repository.
Although long (1.5 hours), the Di Cosmo video (listed below) is well worth watching.
The project aims to collect “all the world’s software development history in a single graph”. That graph is a Merkle DAG, a generalization of Merkle trees to directed acyclic graphs. As in the Git version control system, each file, subdirectory, codebase revision, and release receives its own cryptographic hash. For example, here is the directory hash and associated URL for the abandoned deeco energy model I once maintained:
- swh:1:dir:02f3429487debca77371aec67ac1d483d8ff3f96
- above as URL
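The Git-style hashing described above can be sketched in a few lines of Python. This is only a sketch: it assumes that content (`cnt`) and directory (`dir`) identifiers follow Git's blob and tree object hashing, that is, SHA-1 over a short header plus the payload, which is my understanding of how these identifiers are derived. The function names are invented for illustration and are not part of any Software Heritage library, and only regular files and subdirectories are handled (no symlinks or executables).

```python
import hashlib


def git_object_hash(kind: str, payload: bytes) -> bytes:
    """Return the raw SHA-1 digest of a Git object: header plus payload."""
    header = f"{kind} {len(payload)}\0".encode()
    return hashlib.sha1(header + payload).digest()


def content_id(data: bytes) -> str:
    """Identifier for one file's content (Git blob hash)."""
    return "swh:1:cnt:" + git_object_hash("blob", data).hex()


def directory_hash(tree: dict) -> bytes:
    """Hash a directory given as {name: bytes (file) or dict (subdirectory)}.

    Each entry embeds the hash of its child, so a change to any file
    changes the hash of every directory above it (the Merkle property).
    """
    entries = []
    for name, child in tree.items():
        if isinstance(child, dict):
            # Git sorts subdirectories as if their name ended with "/".
            mode, digest, sort_key = b"40000", directory_hash(child), name + "/"
        else:
            mode, digest, sort_key = b"100644", git_object_hash("blob", child), name
        entry = mode + b" " + name.encode() + b"\0" + digest
        entries.append((sort_key.encode(), entry))
    payload = b"".join(entry for _, entry in sorted(entries))
    return git_object_hash("tree", payload)


def directory_id(tree: dict) -> str:
    """Identifier for a whole directory tree (Git tree hash)."""
    return "swh:1:dir:" + directory_hash(tree).hex()


# Editing one nested file yields a different identifier for the root directory.
before = {"README": b"deeco\n", "src": {"main.py": b"print('v1')\n"}}
after = {"README": b"deeco\n", "src": {"main.py": b"print('v2')\n"}}
print(directory_id(before) != directory_id(after))
```

The example trees are made up. The point is that the root identifier depends on every byte beneath it, which is what allows one short string to pin down an entire codebase.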
As of June 2018, the archive holds 4.6 billion source files and over 5 TB of metadata. For fun, it even includes some of the assembler from the Apollo 11 moonshot.
The Di Cosmo video also discusses the limitations of DOIs as persistent identifiers. Aside from DOIs being non-free (the right to assign them starts at several thousand euros), a DOI offers no guarantee that the digital object it references has not subsequently changed. The hashed references used by Software Heritage provide this assurance by construction, because each identifier is computed from the content itself. Indeed the hash alone, possibly truncated, can be used to search for and recover any particular resource.
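The integrity check and the truncated-hash lookup described above can be illustrated with a toy example. The sketch below uses a hypothetical in-memory store rather than the real archive, and it assumes the content hash is Git's blob hash (SHA-1 over a `blob <length>` header plus the bytes). `ContentStore`, `find`, and `verify` are names invented for this example and do not correspond to any Software Heritage API.

```python
import hashlib


def blob_hash(data: bytes) -> str:
    """Git-style blob hash: SHA-1 over a 'blob <length>' header plus the content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


class ContentStore:
    """Toy content-addressed store (hypothetical, not the archive's API)."""

    def __init__(self) -> None:
        self._objects = {}  # maps full hex hash -> content bytes

    def add(self, data: bytes) -> str:
        """Store the content under its own hash and return that hash."""
        digest = blob_hash(data)
        self._objects[digest] = data
        return digest

    def find(self, prefix: str) -> bytes:
        """Recover content from a full or truncated hash."""
        matches = [h for h in self._objects if h.startswith(prefix.lower())]
        if not matches:
            raise KeyError(f"no object matches {prefix!r}")
        if len(matches) > 1:
            raise KeyError(f"prefix {prefix!r} is ambiguous ({len(matches)} matches)")
        return self._objects[matches[0]]


def verify(data: bytes, expected_hash: str) -> bool:
    """True only if the bytes are exactly those the hash was computed from."""
    return blob_hash(data) == expected_hash


store = ContentStore()
full_hash = store.add(b"hello world\n")
recovered = store.find(full_hash[:8])           # lookup by truncated hash
print(verify(recovered, full_hash))             # unchanged content passes
print(verify(b"hello world!\n", full_hash))     # altered content fails
```

A DOI has no equivalent of the `verify` step, because nothing in the DOI string is derived from the object it points to. A truncated hash stays usable only while it remains unique, which is why `find` refuses ambiguous prefixes.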
The Software Heritage project is designed to work in a three‑way partnership with scientific data hosts (like Zenodo) and open access publishing repositories (like arXiv). Note that Zenodo currently offers only DOIs.
Energy modelers now need to investigate how best to integrate this service into their modeling workflows. But I have no doubt this service will become an indispensable asset for us within a short time.
Because we mostly code in established interpreted languages (like Python), reproducing a suitable execution environment is not yet a frequent problem. But Software Heritage may be able to assist here too, because most of the resulting dependencies will also be archived, hashed, and directly recoverable. Nonetheless, Software Heritage does not and will not support virtual machines, which are often used to emulate superseded platforms; that theme is one for another project.
References
Di Cosmo, Roberto (8 November 2016). Preserving software and data: ensuring availability and traceability — Presentation. Rocquencourt, France: INRIA and IRIF.
Di Cosmo, Roberto (8 November 2016). Preserving software: ensuring availability and traceability (MP4) (video). Duration 01:37:39.