Public licensing ontologies

robbie.morrison · 23 July 2021 12:59

Release 06 · 27 October 2021 · Finalized status

This topic reviews web‑based ontologies that support public licensing information.

Such ontologies can aid license identification and potentially assist with compliance assessments spanning multiple licenses. That said, the underlying matters of license interoperability and the tracking of legal metadata to expedite compliance processing are not the focus here.

This topic is written with data processing in mind, but many of the observations offered should also apply to source code and to written and diagrammatic content. The focus here is also on energy system analysis but, likewise, most of the remarks made could equally well apply to other domains.

Also summarized are a number of projects that list common public licenses or machine‑process them but that do not otherwise qualify as ontologies.

Two analysts separately asked me for background on this theme and it therefore seemed sensible to blog for a wider audience.

This posting will remain a wiki‑posting for about four weeks — so feel free to add corrections and extensions. I will duly bump the release number as required. Alternatively reply and comment below.

Context

This section provides background. Please skip over the material if you are familiar with the concepts being covered.

Ontologies

Ontologies are a strong theme in knowledge management at present. By defining a set of concepts and categories within a given domain and recording their important relationships, an ontology can support more formalized reasoning and analysis within that domain. In addition, ontologies often regularize the terminology, which may in turn improve the quality of discourse within that particular domain. It should be noted that agreed or controlled vocabularies alone are not sufficiently deep to qualify as ontologies.

Ontologies and their domain‑specific development are part of a wider research landscape that includes semantic triples, linked open data (LOD) and knowledge graphs (KG). These various strands, taken together, often fall under the rubric of the semantic web (Hitzler 2021). Ontologies in the context of the semantic web therefore rely heavily on W3C web standards.

As an aside, the domain of energy system modeling and analysis is also developing its own ontology, the Open Energy Ontology (OEO) (Booshehri et al 2021). More here under the oeo tag.

Public licensing

This posting is restricted to ontologies that cover public licensing. The specific motivation is that ontology‑oriented data processing should be able to recognize and consider any associated public license as part of the processing logic and react accordingly.

Public licenses are standard licenses that are nominated by the originator of the data, code, or content and duly define the obligations and permissions that apply to downstream users. Contact between the originator and the user is not required and no formal offer and acceptance of the terms is required. Legal jurisdictions differ as to whether a public license also qualifies as a legal contract or not.

Open licenses are a subset of public licenses that necessarily grant the user the unrestricted right to modify and republish material, albeit usually with legal conditions that specify and thereby constrain the downstream terms‑of‑use. As indicated, the discussion here is aimed at data but ontologies that cover the public licensing of software and content should not be fundamentally different.

It is presumed here that the material in question is under copyright and related rights protection — if not, any public license so applied can be legitimately disregarded. National intellectual property statutes and case law also provide various provisions for fair use, lawfully permitted use, and specified exceptions — such relaxations are jurisdiction-dependent and normally highly fact‑specific and are not considered here. Depending on the jurisdiction, related rights may include so‑called sui generis database protection and moral rights intended to protect the personal interests of the original creator.

The practice of dual licensing is occasionally encountered. The provision of material under two or more public licenses does not substantially alter the core discussion.

The ScanCode project maintains a database of known public licenses and also assigns them SPDX identifiers (more below). That database currently contains about 17 000 entries, which clearly represents serious license proliferation.

Ongoing modifications

Legal questions regarding repeated incremental modification by multiple parties are not traversed here to any degree — refer instead to Chestek (2017) for an in‑depth discussion related to source code development under United States law. Much of what Chestek writes should nonetheless broadly apply to collections of datasets under community curation.

One issue requiring more attention is the keeping track of contributors and their contributions as these ongoing modifications progress. That tracking may be designed to support provenance, honor licensing requirements, or both.

Another oft‑encountered form of modification is the combining of datasets under differing public licenses. If the material being mixed cannot be kept distinct and separate, then the various licenses involved must be legally compatible and the most legally onerous applied to the composite. Software is more complicated in this regard, due to matters like runtime linking and binary distribution, but the underpinning legal principles are not dissimilar.
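
As a minimal sketch of the compatibility check just described, the following Python fragment treats directed compatibility as a small hand‑maintained digraph and returns the licenses under which an inseparable mix could be released. The function name, the dictionary layout, and the decision to consider only direct edges are assumptions made for illustration. The two edges shown rest on the observation, made later in this posting, that no data‑capable license known to the author precludes material under CC0‑1.0. Nothing here constitutes a legal assessment.

```python
# Directed compatibility digraph (illustrative only, not legal advice).
# An edge A -> B means material under license A may be incorporated into
# a composite released under license B.
COMPATIBLE_INTO = {
    "CC0-1.0": {"CC-BY-4.0", "ODbL-1.0"},
    "CC-BY-4.0": set(),   # no outbound edges recorded in this sketch
    "ODbL-1.0": set(),    # no outbound edges recorded in this sketch
}


def outbound_candidates(inbound, graph=COMPATIBLE_INTO):
    """Return the licenses under which a mix of all inbound licenses
    could be released, using direct edges only (hypothetical helper)."""
    # Each inbound license can stay under itself or move along one edge.
    candidate_sets = [{lic} | graph.get(lic, set()) for lic in inbound]
    if not candidate_sets:
        return set()
    # The composite needs a license that every input can flow into.
    return set.intersection(*candidate_sets)


print(outbound_candidates(["CC0-1.0", "CC-BY-4.0"]))
```

In this sketch, mixing CC0‑1.0 and CC‑BY‑4.0 material leaves CC‑BY‑4.0 as the only candidate for the composite. An empty result means only that no suitable edge has been recorded, not that the licenses involved are necessarily incompatible.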

Legal jurisdictions

Legal jurisdictions vary greatly as to how much intellectual property protection, if any, is accorded a given dataset, collection of datasets, managed database, or semantic web architecture. The United States is relatively lax in this regard and the United Kingdom quite stringent. The United Kingdom, for instance, has Crown copyright, a sweat‑of‑the‑brow threshold for copyright in works and collections, sui generis database protection, and a limited concept of public domain.

This posting presumes that intellectual property protection naturally applies. The alternative approach requires case‑specific analysis to determine if such protections do indeed apply and is fraught with uncertainty. One advantage of public licensing is that it removes much of that uncertainty, albeit at some risk of inappropriate protection.

Web standards and rule-based languages

Web standards

The World Wide Web Consortium (W3C) develops standards for the world wide web. Some of those standards are relevant to web‑based ontologies and include:

Rule Interchange Format (RIF)
Web Ontology Language 2 (OWL2)
Resource Description Framework (RDF)
Resource Description Framework in Attributes (RDFa)

RIF is a rule processing scheme (Morgenstern et al 2012) and, as such, not particularly applicable to the processing of public licenses (more later). OWL2 is an ontology authoring language based on the concepts of classes and class hierarchies. RDF provides support for semantic triples and RDFa allows such triples to be embedded as attributes within web pages. Among other things, these triples allow certain relationships between different licenses mapped by IRIs to be recorded, with an important relationship being that of directed compatibility between two licenses.
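
To make the idea of a semantic triple concrete, here is a small sketch in plain Python, with no RDF library involved. Each triple is a subject, predicate, and object, with licenses identified by IRI. The predicate IRI under example.org is hypothetical and invented for this sketch, as are the names Triple and match. The single recorded relationship reflects the remark elsewhere in this posting that CC0‑1.0 material is not known to be precluded by any data‑capable license.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical predicate IRI, invented for this sketch only.
COMPATIBLE_INTO = "https://example.org/terms#compatibleInto"

CC0 = "https://creativecommons.org/publicdomain/zero/1.0/"
CC_BY = "https://creativecommons.org/licenses/by/4.0/"

# One directed relationship: CC0 material may flow into a CC BY work.
triples = [Triple(CC0, COMPATIBLE_INTO, CC_BY)]


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# The relationship is directed, so the reverse query finds nothing.
print(match(triples, subject=CC0, predicate=COMPATIBLE_INTO))
print(match(triples, subject=CC_BY, predicate=COMPATIBLE_INTO))
```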

Automated rule‑based processing

Another rule processing scheme, commenced in 2000, is RuleML, listed here for completeness.

Practically speaking, I rather think that the rule‑based processing of public licenses will be necessarily limited for two reasons. First, entire licenses rather than individual terms will doubtless remain the principal unit of legal analysis. And second, the useful part of the license compatibility digraph, for data at least, is unlikely to have more than about ten nodes, even when accounting for different versions of the same instrument. That suggests that hand‑analysis remains entirely feasible and in many respects preferable for reasons of acceptance (more later).

The converse is likely to apply to trusted data brokerage platforms such as the United Kingdom Icebreaker One Open Energy project. Individual parties sharing datasets will doubtless create a raft of instruments that may well benefit from automated rule‑based processing to determine compatibilities and generate the resulting downstream terms‑of‑use on the fly.

As a digression, the Creative Commons four rights and seven licenses approach represents rule‑based processing as a design criterion and not after‑the‑fact analysis.

Projects that catalog public licenses

This section outlines some projects that catalog and/or process public licenses but that fall short of providing fully‑fledged ontologies.

SPDX identifiers for public licenses

SPDX is a file format originally used to document open source software licenses (Joint Development Foundation 2020). The scheme now extends to content and data‑capable licenses. For instance, the Creative Commons CC BY 4.0 attribution license has been assigned the following SPDX identifier: CC‑BY‑4.0. The SPDX project maintains a list of common public licenses:

https://spdx.org/licenses/

The SPDX community has also been using these SPDX identifiers to generate IRIs for license classes and then arranging these various resources into an ontology (more later).

Open Knowledge Foundation

The United Kingdom‑based Open Knowledge Foundation (OKF) maintains a list of licenses that their Advisory Council considers conformant with the current version of the Open Definition (version 2.1 as of July 2021 and with 21 approved licenses):

https://opendefinition.org/licenses/

Data License Clearance Center

The Data License Clearance Center (DALICC) located in Austria does not offer an ontology as such. Rather, a DALICC project sought to encode and process various individual terms‑of‑use using machine logic to implement an automated clearance of rights (Pellegrini et al 2018). The project completed in 2018 but the website remains live at www.dalicc.net. (But be aware that some conclusions offered on that site may not be correct.)

As indicated earlier, most analysts working with open data are unlikely to encounter more than a handful of data‑capable licenses: possibly a maximum of three or so international licenses (CC0‑1.0, CC‑BY‑4.0, ODbL‑1.0) and three or so national licenses (OGL‑UK‑3.0, dl‑de/by‑2.0, Licence Ouverte), legacy licenses and future licenses notwithstanding. This implies that a manual assessment, recorded in a simple hand‑crafted digraph, should be quite sufficient. But (as noted earlier) non‑public licenses are probably more amenable to rule‑based processing. Moreover, in the context of data brokerage schemes, the next generation of consortium‑based licenses may well be designed afresh with automated analysis in mind.

Dublin Core

The Dublin Core standard for metadata has a license namespace. Recommended practice is to identify the license document with an IRI and, if that is not possible or feasible, to add a literal value that unambiguously identifies the license in question. The Creative Commons public domain waiver provides an example of such an IRI: https://creativecommons.org/publicdomain/zero/1.0/. Providing a controlled location for recording and processing license details does not, however, constitute an ontology.
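
The recommended practice just described might be sketched as follows. The predicate is the Dublin Core terms license property. The function name, the dictionary layout, and the dataset IRI under example.org are hypothetical and serve only to illustrate the IRI‑first, literal‑as‑fallback rule.

```python
# Dublin Core terms property used to point at a license document.
DCTERMS_LICENSE = "http://purl.org/dc/terms/license"


def license_statement(dataset_iri, license_iri=None, literal=None):
    """Build one license statement for a dataset (hypothetical helper).

    An IRI identifying the license document is preferred; a literal
    that unambiguously names the license is the fallback.
    """
    if license_iri:
        value, is_iri = license_iri, True
    elif literal:
        value, is_iri = literal, False
    else:
        raise ValueError("a license IRI or an identifying literal is required")
    return {
        "subject": dataset_iri,
        "predicate": DCTERMS_LICENSE,
        "object": value,
        "object_is_iri": is_iri,
    }


# The dataset IRI is a placeholder; the license IRI is the CC0 waiver.
statement = license_statement(
    "https://example.org/dataset/demo",
    license_iri="https://creativecommons.org/publicdomain/zero/1.0/",
)
print(statement)
```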

DCAT

The Data Catalog Vocabulary (DCAT) system is designed to support the decentralized publishing of data catalogs and to facilitate the federated searching of same. DCAT supports a list of around 34 data‑capable licenses at www.dcat-ap.de/def/licenses/ but falls short of providing an ontology with class relationships.

Associated projects

Three projects warrant mention, even if only obliquely related to the main theme. Several other nascent projects, such as the European GAIA‑X project (DE‑CIX Management et al 2020), could have been mentioned here as well.

FAIR data

The FAIR data principles provide a set of governing principles for findable, accessible, interoperable, and reusable data (Jacobsen et al 2020). FAIR can be and often is limited to explicitly consenting parties and, in that regard, is less ambitious than linked open data. Principle R1.1 requires that “(meta)data are released with a clear and accessible data usage license”. But the scheme is otherwise silent on what that license might be or whether it even need be a public license. Indeed, data consortiums founded on FAIR data principles often utilize explicit bespoke non‑disclosure contracts to prevent public disclosure.

Wikidata project

The Wikidata project also processes large amounts of data but uses the Creative Commons CC0‑1.0 public domain waiver exclusively (Vrandečić and Krötzsch 2014). Consequently, very little of the discussion presented here is applicable to that material. The author knows of no data‑capable license that precludes material under CC0‑1.0.

DBpedia Databus

The DBpedia Databus project provides an intelligent layer between analysts and the web at large in order to compensate for some of the shortcomings of the web — for instance, temporary or permanent link rot (Lehmann et al 2015). The Databus project plans to manage public licensing although it offers no such functionality as of July 2021.

Ontology projects that map public licenses

This section turns attention to known ontologies. The DBpedia Archivo database of web‑accessible ontologies was also consulted but provided no useful information (Frey et al 2021).

SPDX ontology of open source software licenses

As indicated earlier, the SPDX project began by databasing common public software license texts and assigning them unique but recognizable identifiers. Later, content and data‑capable licenses were included. That project has now grown to provide a fully‑fledged web‑based ontology for open source software licenses. IRIs for individual licenses can be generated as follows, in this particular example pointing to the widely‑used so‑called permissive MIT software license with SPDX identifier MIT:

https://spdx.org/licenses/MIT

The IRI just shown is of class License in the SPDX ontology. The SPDX model for open source licenses has an associated RDF/XML OWL ontology available at spdx.org/rdf/terms/spdx-ontology.owl.xml. A more human readable version of that ontology (and well worth studying) is available at spdx.org/rdf/terms/. The class of interest for open source licenses is spdx.org/rdf/terms#License. The model supports RDFa (see earlier) and thereby various forms of defined relationship between the IRI‑identified classes. The ontology is released under a CC0‑1.0 public domain waiver.
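
The IRI pattern shown above for the MIT license can be sketched in a few lines of Python. The function name and the light sanity check are assumptions made for this posting. The sketch does not consult the SPDX license list, so it cannot confirm that a given identifier has actually been assigned.

```python
# Base taken from the MIT example given in this posting.
SPDX_LICENSE_BASE = "https://spdx.org/licenses/"


def spdx_license_iri(identifier: str) -> str:
    """Build a license IRI from an SPDX identifier (hypothetical helper).

    Only rejects obviously malformed input; it does not verify that
    the identifier appears on the SPDX license list.
    """
    identifier = identifier.strip()
    if not identifier or any(ch.isspace() or ch == "/" for ch in identifier):
        raise ValueError(f"not a plausible SPDX identifier: {identifier!r}")
    return SPDX_LICENSE_BASE + identifier


print(spdx_license_iri("MIT"))
```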

As of July 2021, those relationships do not include interpretations of the license terms‑of‑use, obligations, or restrictions, nor for any directed interoperability between common licenses. Nor does the scheme currently cover data‑capable licenses. But one would imagine that extending the range of licenses supported to other types of information would be straightforward.

The project maintains a mailing list at spdx-tech@lists.spdx.org and a specification repo at github.com/spdx/spdx-spec for those wishing to become more involved.

Closure

Domain‑specific ontologies often inherit from more fundamental ontologies or aggregate from operational ontologies developed in allied areas. Ontologies that cover public licensing clearly fall into this latter camp.

Ontologies are intended to support sophisticated logic. In the case of public licensing though, user needs are likely to be better served by access to simple hand‑maintained digraphs describing cross‑license compatibilities than by resorting to rule‑based processing. The combinatorial space is actually quite limited with perhaps ten widely‑used licenses (nodes) and about five non‑trivial compatibility relationships (directed edges) to consider (as indicated earlier).

Moreover, the process of developing the underlying compatibility digraphs for data, code, and content will likely be contentious. Indeed some widely‑held views on inbound and outbound‑compatibilities may well fail to stand up to careful scrutiny. (My efforts elsewhere suggest that this will indeed be so.)

The only ontology under active development as of July 2021 is the SPDX ontology which covers open source licenses. At this juncture, the project does not extend to data‑capable public licenses but it is reasonable to expect that it could do so in the future.

While not traversed in any depth here, the creation, communication, and processing of legal metadata for specific datasets and similar will require conscientious attention and development, with a particular focus on the need for standardized structures and procedures. For instance, the EERAdata project is undertaking work in that context (Wierling et al 2021).

Automated license compliance within software stacks constitutes a major effort within the open source domain (Riehle and Harutyunyan 2019). This issue is less pronounced for data because the range of public licenses in play is much reduced, contributors do not tend to add bespoke conditions, the combination process is simpler, the legal metadata required is generally not supplied or retained, and the commercial stakes are often lower.

On the assumption that semantic web concepts are progressively adopted, the legal status of the knowledge graph itself — rather than just the contained objects — will also need resolution. In a European context, this graph would doubtless classify as a 96/9/EC database and thereby be subject to all the uncertainties contained in that legislation unless specifically licensed otherwise. Whether current public licenses are up to that particular task remains unexplored at the time of writing (to the author’s knowledge).

Suggestion

Does this community wish to engage with the SPDX ontology project to see if support for data‑capable licenses can be included within the SPDX ontology? That will require extending the data model behind the ontology in addition to adding the individual licenses.

Acknowledgments

The author wishes to thank Gary O’Neall, Malcolm Bain, and VS for their various inputs. Gary kindly supplied most of the background on the SPDX ontology project.

References

Booshehri, Meisam, Lukas Emele, Simon Flügel, Hannah Förster, Johannes Frey, Ulrich Frey, Martin Glauer, Janna Hastings, Christian Hofmann, Carsten Hoyer‑Klick, Ludwig Hülk, Anna Kleinau, Kevin Knosala, Leander Kotzur, Patrick Kuckertz, Till Mossakowski, Christoph Muschner, Fabian Neuhaus, Michaja Pehl, Martin Robinius, Vera Sehn, and Mirjam Stappel (1 September 2021). “Introducing the Open Energy Ontology: enhancing data interpretation and interfacing in energy systems analysis”. Energy and AI. 5: 100074. ISSN 2666‑5468. doi:10.1016/j.egyai.2021.100074. Open access.

Chestek, Pamela S (2017). “A theory of joint authorship for free and open source software projects”. Colorado Technology Law Journal. 16: 285–326. Open access.

DE‑CIX Management, Günter Eggers, Bernd Fondermann, Google Germany, Berthold Maier, Klaus Ottradovetz, Julius Pfrommer, Ronny Reinhardt, Hannes Rollin, Arne Schmieg, Sebastian Steinbuß, Philipp Trinius, Andreas Weiss, Christian Weiss, and Sabine Wilfling (June 2020). GAIA-X: technical architecture — Release June 2020. Berlin, Germany: Federal Ministry for Economic Affairs and Energy (BMWi).

Frey, Johannes, Denis Streitmatter, Fabian Götz, Sebastian Hellmann, and Natanael Arndt (10 September 2020). DBpedia Archivo: a web‑scale interface for ontology archiving under consumer‑oriented aspects. Leipzig, Germany: Institut für Angewandte Informatik (InfAI). YouTube video 00:10:38.

Hitzler, Pascal (February 2021). “A review of the semantic web field”. Communications of the ACM. 64 (2): 76–83. ISSN 0001‑0782. doi:10.1145/3397512. PDF download available.

Jacobsen, Annika, Ricardo de Miranda Azevedo, Nick Juty, Dominique Batista, Simon Coles, Ronald Cornet, Mélanie Courtot, Mercè Crosas, Michel Dumontier, Chris T Evelo, Carole Goble, Giancarlo Guizzardi, Karsten Kryger Hansen, Ali Hasnain, Kristina Hettne, Jaap Heringa, Rob WW Hooft, Melanie Imming, Keith G Jeffery, Rajaram Kaliyaperumal, Martijn G Kersloot, Christine R Kirkpatrick, Tobias Kuhn, Ignasi Labastida, and Barbara Magagna (January 2020). “FAIR principles: interpretations and implementation considerations”. Data Intelligence. 2 (1–2): 10–29. ISSN 2641‑435X. doi:10.1162/dint_r_00024. Licensed under Creative Commons CC‑BY‑4.0.

Joint Development Foundation (2020). OpenChain Specification — Version 2.1. Joint Development Foundation. This Specification is functionally identical to ISO/IEC 5230:2020. Licensed under Creative Commons CC‑BY‑4.0.

Kapitsaki, Georgia M, Frederik Kramer, and Nikolaos D Tselikas (1 September 2017). “Automating the license compatibility process in open source software with SPDX”. Journal of Systems and Software. 131: 386–401. ISSN 0164‑1212. doi:10.1016/j.jss.2016.06.064.

Lehmann, Jens, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer (2015). “DBpedia: a large‑scale, multilingual knowledge base extracted from Wikipedia”. Semantic Web. 6 (2): 167–195. ISSN 1570‑0844.

Morgenstern, Leora, Chris Welty, Harold Boley, and Gary Hallmark (11 December 2012). RIF Primer — W3C Working Group Note (2nd ed). Cambridge, Massachusetts, USA: W3C.

O’Neall, Gary (19 June 2014). Accessing SPDX licenses — Technical report SPDX‑TR‑2014‑2 — Version 1.0. San Francisco, California, USA: SPDX. Publication date from PDF metadata.

Pellegrini, Tassilo, Victor Mireles, Simon Steyskal, Oleksandra Panasiuk, Anna Fensel, and Sabrina Kirrane (April 2018). Automated rights clearance using semantic web technologies: the DALICC framework. ISBN 978‑3‑662‑55433‑3. doi:10.1007/978-3-662-55433-3_14. Chapter 14 in Semantic Applications, pages 203–218.

Riehle, Dirk and Nikolay Harutyunyan (2019). Chapter 5: Open-source license compliance in software supply chains. In Brian Fitzgerald, Audris Mockus, and Minghui Zhou (editors). Towards engineering free/libre open source software (FLOSS) ecosystems for impact and sustainability. Singapore: Springer. doi:10.1007/978-981-13-7099-1_5. Closed access. Preprint available.

ScanCode Project (ongoing). ScanCode Toolkit Documentation. A database of known code and data licenses with about 17 000 entries.

Vrandečić, Denny and Markus Krötzsch (October 2014). “Wikidata: a free collaborative knowledgebase”. Communications of the ACM. 57 (10): 78–85. ISSN 0001‑0782. doi:10.1145/2629489.

Wierling, August, Valeria Jana Schwanitz, Sebnem Altinci, Maria Balazińska, Michael J Barber, Mehmet Efe Biresselioglu, Christopher Burger-Scheidlin, Massimo Celino, Muhittin Hakan Demir, Richard Dennis, Nicolas Dintzner, Adel el Gammal, Carlos M Fernández-Peruchena, Winston Gilcrease, Pawel Gladysz, Carsten Hoyer-Klick, Kevin Josho, Mariusz Kruczek, David Lacroix, Malgorzata Markowska, Rafael Mayo-García, Robbie Morrison, Manfred Paier, Giuseppe Peronato, and Mahendranath Ramakrishnan (22 October 2021). “Advancing FAIR metadata standards for low carbon energy research”. Energies. 14 (20): 6692. ISSN 1996-1073. doi:10.3390/en14206692.

▢