Data portals everywhere and not a drop to link

robbie.morrison · 18 October 2023 10:18

Apologies for the oblique reference in the title to the famous Samuel Taylor Coleridge poem:

Water, water, every where,
And all the boards did shrink;
Water, water, every where,
Nor any drop to drink.

There have been several webinars over the last few weeks to promote data portals in the energy domain. Some are new projects, less than a year old, while others have been through any number of iterations. These projects aim to provide easy public access to energy‑related datasets with an emphasis on GUI‑based usability.

But they normally pay little apparent attention to genuine information reusability, and certainly give no real consideration to open and interoperable data licensing. Moreover, few, if any, provide API access, serve metadata, support sophisticated versioning, embed license notices (the applicable license is likely difficult to determine in most cases), or subscribe to semantic standards. Some portals actively ban other portals from using their data. Other portals complain but don’t have sufficient legal leverage to prevent this practice.

I am not going to name individual data portal projects in this posting, nor single out any primary data providers. Except to say such portals are offered variously by multilateral institutions, projects funded by public science, civil society organizations supported by philanthropic foundations, industry, including system operators, and statutory bodies established under law to disseminate mandatory reporting.

While providing ready public access to datasets has clear use‑cases, it falls far short of contributing to the data commons that would collectively benefit all. Indeed, there may well be perverse incentives for data portals to intentionally develop information silos and seek network effects, with their backers doubtless viewing the resulting engagement as success.

On the other hand, there are projects designed to provide genuinely reusable data infrastructure to meet the needs of energy system analysts — based on the paradigm of linked open data (LOD). And again, I am not going to cite individual examples. But these initiatives would seem the right places to leverage community participation to best advantage.

Linked open data also needs to be placed under suitable licensing to support legal interoperability. Practical considerations limit the choice of public license, in my view, to Creative Commons CC‑BY‑4.0, CC0‑1.0, or something inbound compatible.
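As a rough illustration of what that short list means in practice, a portal or pipeline could screen incoming datasets by their declared license before linking to them. This is only a sketch: the dataset names are invented, the `license` field is assumed to hold an SPDX-style identifier, and anything outside the two licenses named above is merely flagged for human review rather than judged compatible or incompatible.

```python
# Licenses named in the post as practical choices (SPDX-style identifiers).
ACCEPTED_LICENSES = {"CC-BY-4.0", "CC0-1.0"}


def screen_licenses(datasets):
    """Split dataset names into (accepted, needs_review) by declared license.

    A missing license or one outside ACCEPTED_LICENSES is not rejected
    outright; it is queued for manual review, because inbound compatibility
    is a legal question this sketch does not try to answer.
    """
    accepted, needs_review = [], []
    for dataset in datasets:
        declared = dataset.get("license")  # None when no notice is embedded
        if declared in ACCEPTED_LICENSES:
            accepted.append(dataset["name"])
        else:
            needs_review.append(dataset["name"])
    return accepted, needs_review


# Hypothetical catalogue entries, for illustration only.
catalogue = [
    {"name": "example_a", "license": "CC-BY-4.0"},
    {"name": "example_b", "license": "CC0-1.0"},
    {"name": "example_c", "license": "CC-BY-NC-4.0"},
    {"name": "example_d"},  # no license notice at all
]

accepted, needs_review = screen_licenses(catalogue)
```

Note that a dataset with no embedded license notice ends up in the review pile, which is the situation most of the portals described above would find themselves in.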

That collective activity also needs to be backed up by a consensus on open standards covering semantics, metadata, and review. I intentionally omitted technical interoperability from that list because I don’t believe this to be a problem (please correct me if not so).

If you are involved in, or can influence, the development of these various data portals, please explain these wider community objectives and values.

Otherwise we modelers, like the ancient mariner, may be destined to wander the earth, dead albatrosses in tow, unable to do the kind of systemic analysis and repeatable science so necessary, and eternally condemned to harass wedding guests, passers-by, and unsuspecting cybercitizens too.

Again, sorry for the lousy metaphors, R

Barton · 20 June 2024 22:05

Hi Robbie,

Thanks for your very insightful post!

I am currently working on a work package for integrating multiple models in a research project (refuel.ch). In the project, we need to collect the data from different partners and models, and, therefore, we are considering following the FAIR principles and making a data portal (or linked open data?) for the project (or even the energy modelling community in Switzerland or the world).

However, it seems many people and projects have been trying to build an open data platform, and there is no good solution so far. Do you have any suggestions for us if we have the chance to build a new data portal? Or maybe we should use an existing data portal and improve it? If yes, is there a platform you would recommend (e.g. OEP)?

Regards

Barton

robbie.morrison · 23 June 2024 06:46

Hi @Barton. Your question spans several dimensions. This is the way I like to look at things:

This diagram covers only information that is not encumbered by personal or commercial privacy concerns — as derived from human rights and intellectual property considerations, respectively.

And for clarity, this diagram also excludes content of the type used to train large language models (LLMs). Nowadays, things like the complete works of George Orwell are routinely described as AI data. Unlike the numerical analysis data considered here, this AI content is routinely under copyright and perhaps other intellectual property rights.

As the diagram suggests, metadata and standards are the glue that holds this entire enterprise together. Standardization can cover all the aspects shown in the diagram and may range from consensus‑based to formally constituted. Proprietary standards should be avoided.

Linked open data (LOD) is certainly a great design concept. Those who are best placed to generate and maintain specialized datasets do so. Primary data collection is not routinely duplicated. Workflows can be automated and repurposed to good effect. Communications upstream are facilitated. Data users have incentives to contribute back on multiple levels. Information integrity is facilitated.

Issues covering legal compatibility and open licensing are material. And there can be quite some gulf between the open science rhetoric from major research institutions and the inadequate public licensing that they provide in practice. This rift is particularly problematic in Europe where the Database Directive 96/9/EC applies automatically. More here on the necessary legal basis for a knowledge commons.

In terms of semantic alignment, I think the Open Energy Ontology (OEO) provides an excellent foundation for data semantics (noting that I am on the steering committee). You allude to the use case of deploying a single data repository to service the data needs of multiple framework applications (models), statistical methods (analysis), and perhaps even machine learning (AI) projects.

I know of one energy/climate think‑tank that uses well‑known energy system frameworks and recently tried to populate their various models from the one in‑house databank (sorry that I cannot provide names). They looked at the OEO but opted for their own simpler in‑house data model derived from a rough superset of the frameworks they utilize. This process also demonstrates that data semantics apply equally to software design and data collection. In the end, I believe their efforts were successful but are unfortunately limited to in‑house usage at this juncture. Incidentally, I attempted something perhaps similar back in 2002 crossing the TOP‑Energy (previously EUSEBIA) and deeco frameworks: doi:10.5281/zenodo.6619604.

The question of technical interoperability between data nodes is not as challenging, I think. A core issue is the satisfactory retention and processing of metadata and the creation and inclusion of new metadata. Note also that RLI are developing a metadata standard for energy modeling.
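To make the metadata point a little more concrete, here is a minimal sketch of a downstream node retaining the metadata it received and adding its own when it publishes a derived dataset. The field names (`name`, `license`, `sources`, `provenance`) and the dataset names are assumptions chosen for illustration; they are not taken from the OEMetadata standard or from any particular platform.

```python
from copy import deepcopy
from datetime import date
from typing import Optional


def derive_metadata(upstream: dict, name: str, step: str,
                    today: Optional[date] = None) -> dict:
    """Build metadata for a derived dataset without losing upstream metadata."""
    derived = deepcopy(upstream)  # retention: keep every upstream field, untouched
    derived["name"] = name
    # Inclusion of new metadata: record the upstream dataset as a source ...
    derived.setdefault("sources", []).append(
        {"name": upstream["name"], "license": upstream.get("license")}
    )
    # ... and log the processing step that produced the derived dataset.
    derived.setdefault("provenance", []).append(
        {"step": step, "date": (today or date.today()).isoformat()}
    )
    return derived


# Hypothetical upstream record, for illustration only.
upstream = {
    "name": "example_load_profiles",
    "license": "CC-BY-4.0",
    "sources": [{"name": "example_primary_survey", "license": "CC0-1.0"}],
}

derived = derive_metadata(
    upstream,
    name="example_load_profiles_hourly",
    step="resampled to hourly means",
    today=date(2024, 6, 23),
)
```

The deep copy matters here: the upstream record is left unchanged, so each node in the chain can be audited against what it actually received.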

Returning to your question about data platforms. And I try to be as impartial as possible about different initiatives. The Open Energy Platform (OEP) from the Reiner Lemoine Institute (RLI) is probably your best first port of call. The OEP was designed from the outset to support LOD and community curation and is specifically intended for systems modeling. I suppose Wikidata could be another option, but I imagine it would take a major effort to develop the necessary semantic architecture.

The Wikipedia page on open energy system databases is probably worth a look.

I would like to credit much of the development work at RLI to the vision of @ludwig.huelk. There are also some YouTube videos by him on the openmod channel (see URL at the bottom of this page) that may be of interest.

Some closing remarks. I think the push for a knowledge commons based on linked open data under appropriate public licensing is receding. Similarly, the expectations surrounding the processing of big data for social benefit have not materialized to any extent. And the European Commission is pushing for data to be treated as an economic good in its own right and retreating from the ideal of genuinely open information serving public interests (but the Commission are now perversely moving in exactly the opposite direction for source code and software).

A mention of data brokerage is also relevant. These services act as intermediaries and some run on open source software. But they are necessarily focused on closed data and bilateral agreements. And they normally regard open data licenses (such as CC‑BY‑4.0) as just another type of legal instrument. I disagree with that view. The principal characteristic of a knowledge commons is social, and data brokerage does not naturally offer support for broad communities and common property (that foundational difference is widely recognized for proprietary versus open source software but not so for non‑open versus open data).

I guess a knowledge commons can be semi‑private, in the sense of being made up of partners who agree to share but not publish. I’ve not come across this halfway house much in practice, but I imagine there are examples.

Finally, synthetic data, whether conventional or generated by AI, is often presented as a way to address matters of personal and commercial privacy. In the case of AI methods, there is a tacit assumption that the trained model cannot reveal household-specific or facility-specific information and that the underlying training data cannot be approximately regurgitated. See Chai and Chadney (2024) for a recent example. My concern is that synthetic data in general does not yet capture important correlations relative to external contexts (such as real‑time weather) and may provide spurious conclusions when used to study complex systems.

The reason I have covered so much territory is that the FAIR data concept requires policy on most of the issues raised in this post. Good luck with your projects and explorations, R.

Barton · 23 June 2024 21:38

Hi Robbie,

Thanks so much for your detailed and comprehensive feedback - it is super helpful!

There are many materials and references here, and I will need to spend some time looking into each of them. I am definitely interested in the “metadata standard for energy modeling” - it seems we do not have a good guideline for that.

Just for your reference, the Swiss energy research community has an open data platform (CROSSDat), but it is not as capable as the OEP. There is an ongoing project to improve the platform and to add an ontology for models and data, but there is still a long way to go.

The work on synthetic data using AI is also interesting since we are collecting smart meter data from a municipality for demand modelling in another project. It would be good to make it open data in some form.

Finally, thanks again for your input, and I wish you all the best.

Barton

robbie.morrison · 24 June 2024 06:51

Here is some information on the OEMetadata standard (the first is buried deep within a long video):

Hülk, Ludwig (16 April 2023). Publishing open, annotated, and FAIR data with the OEFamily. Berlin, Germany: Open Energy Modelling Initiative. YouTube video. Duration 01:52:49. OEMetadata section starts at 00:17:26 and continues for 29 minutes.
Hülk, Ludwig (8 November 2022). Open Energy Metadata (OEMetadata): publishing energy data enriched with ontology references — Poster. Germany: Reiner Lemoine Institut, Fraunhofer IEE, Öko-Institut, and Otto-von-Guericke-Universität Magdeburg. doi:10.5281/zenodo.8026862. Subtitle: Development of a state‑of‑the‑art metadata standard for energy, climate and modelling data.

Quoting from the poster:

The number of energy scenario and data publications is growing rapidly. However, the data utilized is often not well described. Therefore, accessible, reusable, and interpretable metadata is required. We believe the strength of the OEMetadata facilitates better reproducibility and productivity in energy research.

▢