Referencing archived software using Software Hash identifiers

This is a heads‑up about the Software Heritage project if you need to archive and then reference your software generally — or perhaps refer to a particular release, individual file, or specific few lines of code using a robust identifier.

Indeed, your favorite software repository (repo) might already be there, having been scraped earlier by Software Heritage as part of its routine activities.

Introduction

Software Heritage is a dedicated project, first developed by the French Institute for Research in Computer Science and Automation (Inria), and now constituted as a non‑profit organization based in France. The project seeks to archive all publicly available historical and contemporary software as a service to humanity, with a focus on human‑readable source code. The project is also designed to be permanent and enduring and operates several geographically diverse mirrors.

Software Heritage supports the Git, Mercurial, Subversion, and Bazaar versioning systems. And, as indicated, Software Heritage invests considerable effort scraping from known code hosting sites, including GitHub and GitLab.com. If your repo has not yet been harvested, you can add it manually, provided it is public and accessible via either HTTPS or SSH.

Identifying all your forked repos

Enter the name of your project (say “oemof” or “pypsa”) in this search bar:

Then count how many forked repos are known to Software Heritage (100 entries per screen). You may well find 300 or so forks. Look for your own fork to drill down into the details.
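The same search can probably be scripted. Here is a minimal sketch, assuming the archive's public web API offers an origin search endpoint at the URL pattern shown; the base URL, the endpoint path, and the `limit` parameter are my assumptions and should be checked against the current API documentation. The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the Software Heritage web API (verify before use).
API_ROOT = "https://archive.softwareheritage.org/api/1"


def origin_search_url(pattern, limit=100):
    """Build the (assumed) origin search URL for a project name."""
    return f"{API_ROOT}/origin/search/{quote(pattern, safe='')}/?limit={limit}"


def search_origins(pattern, limit=100):
    """Fetch one page of matching origins as parsed JSON (needs network access)."""
    with urlopen(origin_search_url(pattern, limit)) as response:
        return json.load(response)


print(origin_search_url("oemof"))
```

Calling `search_origins("oemof")` would then return one page of results, and the length of that page gives a first count of the known forks.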

Using SWHID persistent identifiers

A very useful feature is that each artifact receives a unique persistent identifier called a Software Heritage identifier or SWHID. This is somewhat like a DOI for documents but finer grained. For an overview:

An artifact, also known as an object, can be some kind of content, a directory, a revision, a release, or a snapshot. So‑called subparts of objects can be identified, such as a range of lines in a particular source code file.

Here is an example (from Di Cosmo 2020, slide 8):

Figure 1: An example of a Software Heritage Identifier (SWHID). The scheme is quite hierarchical as one drills down to the few lines of assembler code that, in this case, ignited the main rocket motor for the Apollo 11 moonshot (harvested from here).

The underlying presentation is this:

The different fields are separated by colons, as indicated in the figure. And the third field indicates the various classes of artifact that can be referred to. There are currently efforts to formalize the SWHID system as a public standard.
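To make the colon‑separated layout concrete, here is a minimal parsing sketch. It assumes the core form `swh:<schema_version>:<object_type>:<40 hex digits>`, the five type codes `cnt`, `dir`, `rev`, `rel`, and `snp`, and optional qualifiers such as `lines=9-15` appended after semicolons. The function and class names are hypothetical, and the hash in the usage line is a placeholder rather than a real archived object.

```python
import re
from typing import NamedTuple

# Assumed core syntax: swh:<schema_version>:<object_type>:<40 hex digits>
_CORE = re.compile(r"^swh:(\d+):(cnt|dir|rev|rel|snp):([0-9a-f]{40})$")

OBJECT_TYPES = {
    "cnt": "content",
    "dir": "directory",
    "rev": "revision",
    "rel": "release",
    "snp": "snapshot",
}


class Swhid(NamedTuple):
    schema_version: int
    object_type: str
    object_id: str
    qualifiers: dict


def parse_swhid(text):
    """Split a SWHID string into its core fields and any qualifiers."""
    core, *raw_qualifiers = text.split(";")
    match = _CORE.match(core)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {core!r}")
    qualifiers = {}
    for item in raw_qualifiers:
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed qualifier: {item!r}")
        qualifiers[key] = value
    return Swhid(int(match.group(1)), match.group(2), match.group(3), qualifiers)


# Placeholder hash for illustration only.
example = parse_swhid("swh:1:cnt:0123456789abcdef0123456789abcdef01234567;lines=9-15")
print(OBJECT_TYPES[example.object_type], example.qualifiers)
```

The sketch only checks the syntax. It does not contact the archive or confirm that the object exists.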

Archiving energy systems models

I am talking to Software Heritage about good practice when archiving an energy systems model — a model being a particular codebase instance plus the populating data and usually the raw and interpreted results, generated graphics, and any associated reference scenario. This last point makes things a bit more complicated because a reference scenario is normally a model in its own right. More soon, hopefully.

Additional sources

For in‑depth background, this hour‑long YouTube is worth watching:

Closure

I intend to develop this post with additional technical details when I get the chance.

Interesting project, thanks for sharing @robbie.morrison !

Would you recommend using this software archive over archiving e.g. GitHub repositories via Zenodo and referencing them using a DOI?

How persistent can one expect the project to be, i.e. will it be available in 10, 20 years?

Regarding unique identifiers, Software Heritage IDs are derived from a Merkle directed acyclic graph of artifacts and are thereby robust and fine-grained.

Regarding longevity, Software Heritage is now supported by UNESCO and official French science.


SWHID scheme as public standard

The Software Heritage project is embarking on a process to formalize the SWHID identifier scheme and also plans to submit the resulting public standard to the International Organization for Standardization for consideration as an ISO standard. I am contributing to that effort and we had our first kickoff meeting on 27 March 2023.

Kickoff meeting slidedeck and video record

The following records are publicly available:

The first 00:32 of the recorded meeting is well worth watching, where Roberto explains the basic ideas behind the SWHID scheme.

Intrinsic identifiers

A central concept is that of extrinsic versus intrinsic identifiers. In this context, extrinsic identifiers are necessarily maintained in a registry and issued using blocks of sequential numerical strings or similar; examples include the ISBN and DOI systems. Intrinsic identifiers, by contrast, are computed from an object’s own properties, often using collision‑protected cryptographic hashing; examples include the blockchain system and SWHIDs (see note 1). Intrinsic identifiers do not require that a persistent registry be maintained. Slide 06 from Di Cosmo (2023) summarizes these ideas:

As you can see, DOIs and SWHIDs have markedly different foundations and thereby divergent strategies for handling longevity. A central issue for SWHIDs is to develop a complete, precise, and unambiguous specification for computing identifiers (hence this current exercise).
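The contrast can be sketched in a few lines. This is a toy illustration under my own assumptions: the `Registry` class and its identifier format are hypothetical, and plain SHA‑256 merely stands in for whatever hashing a real scheme specifies.

```python
import hashlib
import itertools


class Registry:
    """Extrinsic scheme: identifiers are issued in sequence and only the
    registry's own table links each identifier to an object."""

    def __init__(self, prefix):
        self._prefix = prefix
        self._counter = itertools.count(1)
        self._table = {}

    def register(self, obj):
        ident = f"{self._prefix}/{next(self._counter):06d}"
        self._table[ident] = obj
        return ident

    def resolve(self, ident):
        # Fails if the registry (or this entry) is ever lost.
        return self._table[ident]


def intrinsic_id(obj):
    """Intrinsic scheme: anyone can recompute the identifier from the bytes."""
    return hashlib.sha256(obj).hexdigest()


source = b"print('hello')\n"
registry_a, registry_b = Registry("reg-a"), Registry("reg-b")

# Two registries issue unrelated identifiers for the same object ...
print(registry_a.register(source), registry_b.register(source))
# ... whereas independent parties always compute the same intrinsic one.
print(intrinsic_id(source) == intrinsic_id(bytes(source)))
```

The design consequence is that an intrinsic identifier can be verified against the object in hand, while an extrinsic one can only be looked up.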

Hierarchy of artifacts

The SWHID scheme is centered on source code, but can handle binary files too. It supports a hierarchy of artifacts as follows, as extracted from slide 09 in Di Cosmo (2023):

The nesting of snapshots, releases, revisions, directories, and contents is core. This diagram also captures the directed acyclic graph (DAG) on which the SWHID hashing scheme operates. Contents artifacts — typically source files — can be further referred to by line number by appending that information using a specified format (as indicated in the top posting).
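The following toy sketch shows why hashing over a DAG makes identifiers robust: each node's identifier covers the identifiers of its children, so a change in one file propagates upward while untouched siblings keep their identifiers. The serialization used here is invented for illustration and is not the one the SWHID scheme specifies; `node_id` and `build` are hypothetical names.

```python
import hashlib


def node_id(kind, payload, child_ids=()):
    """Toy Merkle identifier: hash the node kind, its payload, and child ids."""
    h = hashlib.sha1()
    h.update(kind.encode("ascii") + b"\0")
    h.update(len(payload).to_bytes(8, "big") + payload)
    for child in child_ids:
        h.update(bytes.fromhex(child))
    return h.hexdigest()


def build(readme_text):
    """Build a tiny content -> directory -> revision chain."""
    readme = node_id("cnt", readme_text)
    solver = node_id("cnt", b"def solve(): ...\n")
    directory = node_id("dir", b"README solver.py", [readme, solver])
    revision = node_id("rev", b"initial import", [directory])
    return {"readme": readme, "solver": solver, "dir": directory, "rev": revision}


before = build(b"Energy model v1\n")
after = build(b"Energy model v2\n")

for name in before:
    status = "unchanged" if before[name] == after[name] else "changed"
    print(f"{name}: {status}")
```

Only the edited file and its ancestors change, which is what lets unchanged files and directories be shared between forks and revisions.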

Other core concepts

The hashing algorithm used by Software Heritage is the same as that employed by the git version control system, denoted sha1_git. That algorithm will persist for major version 1 of the SWHID standard, but may change for later major versions as cryptographic technologies develop. The major version number is encoded in the second colon‑separated field in the SWHID identifier designated schema_version, as slide 08 from Di Cosmo (2023) shows:

File names are recorded in the enclosing directory object rather than in the content object itself. Identical contents stored under different filenames therefore share one content SWHID, while the directories that hold them receive different SWHIDs.
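For content objects, the sha1_git computation can be sketched as below. This assumes the git blob convention, in which a `blob <length>` header and a NUL byte are prepended to the raw bytes before applying SHA‑1. Python's `hashlib.sha1` is plain SHA‑1 without collision detection, so treat this as an illustration rather than a reference implementation. The function name is hypothetical, and directories, revisions, releases, and snapshots use their own serializations that are not shown here.

```python
import hashlib


def content_swhid(data):
    """Compute a version 1 content identifier using the git blob convention."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# Empty content yields git's well-known empty-blob hash.
print(content_swhid(b""))
# -> swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The digest should match what `git hash-object` reports for the same file, which gives an easy cross‑check.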

Metadata is not explicitly handled at present.

Governance

A governance model is part of the specification development process. Those contributing do so under the project’s contributor agreement, which currently stands at version 1.0 (beta).

Potential extensions

I was interested to know if the SWHID scheme could support datasets and so asked:

Is an extension to datasets on the radar? My use‑case is numerical analysis to support public policy development. A mix of codebase and database therefore.

And the reply was that this could be possible but that there would need to be a canonical representation of what constitutes data.

Closure

Citing SWHIDs variously for snapshots, releases, revisions, directories, source files, and even specific line numbers seems preferable to citing a DOI for a single tar file resident on Zenodo or similar and then requiring users to drill down manually to locate specific objects. Nor, with a DOI, can users move upwards or sideways for more context.

Notes

  1. The blockchain and SWHIDs differ in two respects. The blockchain is linear whereas the SWHID scheme operates on a directed acyclic graph. The underlying trust models also differ: blockchain relies on a distributed consensus algorithm, whereas SWHIDs do not. See 00:54 on the video.

Thank you for sharing this information. Is there a more effective approach than associating the model’s source code at Software Heritage by including a link to the Zenodo record in the repository’s README file, which directs users to access the data, results, and interpretations?


@alexkies: sorry for the delay in responding. Software Heritage is intended to support just the archiving and downstream study of source code. Equivalent projects have been attempting to do the same for public datasets, be these input or output data, by offering linked open data (LOD) infrastructures and allied cataloging systems. Software Heritage differs in that the project is monolithic: it collects everything within one single physical portal, albeit mirrored across the planet for security.

Stepping back, the particular suite of infrastructure sought will depend on the aim. One aim is repeatability, so that the original model can be rerun to produce the exact same results (at least to floating point precision). The other is reproducibility, so that an independent team can produce scientifically identical results, but not necessarily with the same software or even the exact same input data.

Neither objective, repeatability or reproducibility, is easily met under today’s workflows and infrastructures. Indeed, both remain major challenges.

For work within this community on LOD, see this recent YouTube: Publishing open, annotated, and FAIR data with the OEFamily

Version 1.0 of the SWHID specification was released today: specification v1.0

Note too that “SWHID” now stands for “software hash identifier” and not “Software Heritage identifier” as previously.

Version 1.1 is now current and out for comment until 2 November 2023: specification v1.1.

Note that the expansion of “SWHID” is now SoftWare Hash Identifier. It was previously Software Heritage Identifier, a reference to the underlying project. The new terminology is thought to be more descriptive.

The specification document is under an open documentation license, the Community Specification License 1.0 with SPDX identifier Community-Spec-1.0.