CC‑BY‑4.0 licensing pitch


I was recently asked to provide an elevator pitch for the explicit open licensing of structured data of public interest that has been or can be made public.

Single floor pitch

  • public licenses are required to create legal certainty for users, noting that much of this information is unlikely to be protected as intellectual property in any case (more later)

  • the Creative Commons CC‑BY‑4.0 license provides the best legal interoperability, and less strongly protected material, including that under US public domain provisions, is inbound‑compatible with it

  • the associated metadata should be released under Creative Commons CC0‑1.0 public domain waivers to minimize downstream obligations

Multi‑floor pitch

The primary reason for adding public licenses to published structured data is to give researchers and other users legal certainty. If such licensing is absent, then copyright in collections, database protection (within the EU and UK), and possible copyright in any underpinning data model (including any glossary, database schema, reference architecture, or ontology) can potentially “attach”. And unlike trademarks and patents, these forms of intellectual property do not require explicit registration and examination. More often than not (as indicated earlier), such rights are either strictly absent or not legally enforceable, but downstream users cannot know the prevailing legal status for sure. Depending on the jurisdiction, this lack of legal enforceability derives from either exceptions listed in statute (such as carve‑outs for science) or fair use defenses (such as transformative usage).

The primary reason for selecting the Creative Commons CC‑BY‑4.0 attribution license is that this license is the most legally interoperable of the public licenses in widespread use. Such legal interoperability is a prerequisite for collaborative research and open science and a necessary condition for the formation of a knowledge commons. A number of institutions (including Copernicus, EC JRC, IIASA) have opted for bespoke public licenses instead and have accordingly created legal silos. The associated metadata should also be licensed using the Creative Commons CC0‑1.0 public domain waiver, again to provide legal certainty and additionally to remove all downstream overhead, including the need to track attribution.
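As a rough illustration of the split between data and metadata licensing described above, the following Python sketch builds a small dataset descriptor that states both licenses explicitly. The field names (`data_license`, `metadata_license`, `attribution`), the `build_descriptor` helper, and the example dataset are all hypothetical and not taken from any particular metadata standard; only the SPDX identifiers and the Creative Commons deed URLs are real.

```python
import json

# SPDX identifiers for the two licenses named in this pitch.
DATA_LICENSE = "CC-BY-4.0"
METADATA_LICENSE = "CC0-1.0"


def build_descriptor(name, title, attribution):
    """Return a hypothetical dataset descriptor with explicit licensing.

    The structure is an assumption made for illustration only.
    """
    return {
        "name": name,
        "title": title,
        # The data itself carries an attribution requirement.
        "data_license": {
            "spdx": DATA_LICENSE,
            "url": "https://creativecommons.org/licenses/by/4.0/",
            "attribution": attribution,
        },
        # The metadata is waived into the public domain, so no
        # attribution needs to be tracked downstream.
        "metadata_license": {
            "spdx": METADATA_LICENSE,
            "url": "https://creativecommons.org/publicdomain/zero/1.0/",
        },
    }


descriptor = build_descriptor(
    name="example-power-plants",
    title="Example power plant asset list",
    attribution="Example Institute (2024)",
)
print(json.dumps(descriptor, indent=2))
```

Stating both licenses in machine-readable form is one way to give downstream users the legal certainty argued for here, whatever metadata schema is actually adopted.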

The key right granted by any open license is the right to “reuse” — in other words, a legal grant to copy and distribute said material in original or modified form (the latter classing as a “derivative work”) and for any purpose (including commercial applications). And for most open licenses, the need to track attribution is normally, but not necessarily, included as an obligation (the “BY” attribute for Creative Commons).

Additional background

Labastida and Margoni (2020:205) support the idea that such material is often not protected:

“It should be clarified that in many instances there will be no copyright or related rights on data.”

Notwithstanding, the likelihood of legal encumbrance varies by jurisdiction, with the United Kingdom being among the most likely to attract protection and the United States being relatively liberal in this regard. The United Kingdom uses intellectual effort as the threshold for copyright and non‑content substantial investment as the threshold for the related 96/9/EC database right, which additionally protects systematically arranged information.

We know this system works from the open source software revolution. The only real difference between data and software in licensing terms is that software licenses range from permissive to copyleft and the concept of legal interoperability is therefore somewhat more involved.

Some clarifications on usage. The phrases “published data” and “data that can be made public” are intended to take individual privacy (say GDPR protections) and commercial privacy (trade secrets) considerations off the table. The term “public license” refers to a license that is offered without the need to form a bilateral agreement. The term “natural rights” refers to intellectual property rights that attach without the need for registration, examination, and payment; it includes copyright and 96/9/EC database rights. The terms “publish” and “make public” are effectively synonyms. And the terms “collection” and “compilation” refer to the same concept.

After much thought, I settled on “structured data” as the best descriptor for the target material — this, for instance, includes lists of information (say power plant assets), time‑series (say historical solar farm outputs), and numerical facts (say parameters covering financial and technical performance). The “public interest” qualifier used here also raises the need to treat this as a separate class of data relative to both public policy considerations and intellectual property legislation.

Those scientific institutions that create bespoke licenses do so mostly because they do not trust users to behave responsibly. Hence open licenses, such as CC‑BY‑4.0, are as much a social model as a legal instrument. And again the open source software community has demonstrated conclusively that this social arrangement works in practice.

The current situation of both unlicensed material and the use of bespoke licenses can be viewed as “the tragedy of the absent knowledge commons”, to misquote Garrett Hardin. And we surely need an actively maintained and used knowledge commons if we are to achieve rapid and equitable decarbonization.


Labastida, Ignasi and Thomas Margoni (1 January 2020). “Licensing FAIR data for reuse”. Data Intelligence. 2 (1-2): 199–207. ISSN 2641-435X. doi:10.1162/dint_a_00042. Creative Commons CC‑BY‑4.0 license. Open access.

