Do-a-thon: Towards a common data standard for integrated assessment and energy systems modelling

stefan.pfenninger · 6 November 2019 13:30

Proposal for: Do-a-thon
by Daniel Huppmann (IIASA) and Stefan Pfenninger (ETHZ)

Session Title
Towards a common data standard for integrated assessment and energy systems modelling

Session Description

Previous discussions have made clear that a common standard, and potentially conversion tools between data formats, would make sense. Multiple efforts to develop such standards and tools are now starting up or already underway.

Jointly hosted by the Horizon 2020 projects SENTINEL and openENTRANCE, the aim of this session is to coordinate ongoing efforts on common (or at least inter-operable) data exchange formats for energy system and integrated assessment (i.e., human-earth-climate systems) models/frameworks.

Background : There are multiple ongoing projects in Europe aiming to develop the technical infrastructure (e.g., online databases) and required data standards (i.e., templates and formats) to facilitate integration and model linkage across different frameworks and tools. Each of these projects includes several (up to a dozen) research teams across Europe, working with different methodologies, focusing on different sectors, and modelling varying spatial and temporal scales. Within each project, the infrastructure and formats should enable efficient collaboration and data exchange while supporting the FAIR principles and open, collaborative science.

Aim : Compare currently used implementations of data exchange formats and determine the scope for harmonization and/or development of conversion tools across these projects.

Definition/scope of “data exchange format” : The discussion should encompass both the technical specifications and the application/implementation aspects, i.e.:

Which file type is used?
What is the schema structure?
Example: in a tabular format, what are required/optional columns?
What is the required scope?
For example, are aggregates required to be included in the dataset, or is there an expectation that a user computes aggregates herself?
What are the naming conventions (ontology) to describe the data?
Example: for the spatial dimension, which region identifiers are used?
Which metadata fields/tags are mandatory/optional?
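
To make the required/optional column question concrete, here is a minimal sketch of a header check for a tabular file. The column names, the split between required and optional, and the sample row are all placeholders invented for illustration, not a proposed standard.

```python
import csv
import io

# Hypothetical schema: these column names are placeholders, not an agreed standard.
REQUIRED_COLUMNS = {"model", "scenario", "region", "variable", "unit", "year", "value"}
OPTIONAL_COLUMNS = {"subannual", "comment"}


def check_columns(header):
    """Return (missing, unknown) column names for a table header."""
    present = set(header)
    missing = REQUIRED_COLUMNS - present
    unknown = present - REQUIRED_COLUMNS - OPTIONAL_COLUMNS
    return sorted(missing), sorted(unknown)


# Made-up sample table: the "unit" column is absent and "note" is not in the schema.
sample = io.StringIO(
    "model,scenario,region,variable,year,value,note\n"
    "demo-model,baseline,DEU,Final Energy,2030,8.5,illustrative\n"
)
reader = csv.DictReader(sample)
missing, unknown = check_columns(reader.fieldnames)
print("missing:", missing)
print("unknown:", unknown)
```

A check like this only covers the schema structure; naming conventions and metadata rules would need separate validation.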

The intended outcomes are:

A brief session summary which can inform further work within the projects working on this topic, including the Horizon 2020 projects openENTRANCE, SENTINEL, and Spine, as well as the OpenEnergyDatabase and Open Power System Data.
Establishment of an ongoing discussion forum for further exchange between interested parties working on related projects, similar to the Scientific Working Group on Data Protocols and Management of the IAMC.

Would you like to be responsible for this Session?
Yes

Do you need any special infrastructure for this Session?
A projector and, if possible, sufficient space for up to 30 participants to split into smaller groups of 4-5 people.

Do you have any recommendations who could be part of this Session?
As many representatives as possible from other ongoing or planned projects involved in developing or operating tools and platforms for sharing data and code related to energy modelling.

stefan.pfenninger · 6 November 2019 13:35

@ludwig.huelk would be great if you were to get involved here too!

ludwig.huelk · 6 November 2019 16:49

Thanks for the follow up on the working group at the EMP-E. Of course I’m in.
Maybe it would make sense to take out some of the suggested topics to be more focused.
It is already planned to have a separate session on the ontology.

robbie.morrison · 7 November 2019 21:46

This do‑a‑thon looks entirely sensible and the Berlin workshop is certainly a good opportunity to build upon previous work by this community on data and metadata standards (much of it pushed along by the OpenEnergy Platform and OPSD projects mentioned in the original post).

This do‑a‑thon is also timely given some significant EU‑funded modeling infrastructure projects are either just kicking off (SENTINEL and openENTRANCE) or part underway (Spine). The openENTRANCE project, as I understand it, is as much about the social dimension of cooperation as it is about the technical dimensions of interoperability. So I think the social context is at least obliquely significant to this discussion — for starters, just to consider which stakeholders should be involved.

The title mentions both “integrated assessment modeling” and “energy system modeling” and those two worlds have been disjoint for way too long, although I suspect that providing a common language (or semantics or ontology) that works across both domains is going to be rather challenging.

Some issues, minor and otherwise, that might also be traversed at the do‑a‑thon:

Data model sophistication: The data model underpinning this exercise can range from pedestrian data structures to high‑level abstractions with supporting semantics. Abstractions require implementation, so that choice is essentially progressive. A first question might therefore be: what level of sophistication is appropriate to this initiative at this juncture?

The Spine project offers a central generic abstract data model, lacking semantics, with bi‑directional translators to service both data sources and energy model instances. Another early question would then be how much of the Spine approach can and should be used here and whether it could even be a core component? Spine is based on an EAV/CR or entity-attribute-value/class-relation approach with the actual semantics left to each team to define. So Spine, if you like, has no explicit social dimension. And although Spine does not currently support object composition (to aid compilations of datasets as discussed shortly), it could be thus extended. Adoption of Spine is simply a question on my part and not an implicit suggestion.
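
For readers unfamiliar with the entity-attribute-value/class-relation idea, the following is a generic sketch only. It is not Spine's schema or API, and every entity, attribute, and relation name is hypothetical. It shows how such a structure stores facts without fixing what they mean.

```python
from collections import defaultdict

# Entity-attribute-value rows: (entity, attribute, value).
# All names and numbers are made up for illustration.
eav_rows = [
    ("plant_a", "class", "unit"),
    ("plant_a", "capacity_mw", 400),
    ("node_north", "class", "node"),
    ("node_north", "demand_mw", 250),
]

# Class-relation rows: (relation class, member entities).
relations = [
    ("unit__node", ("plant_a", "node_north")),
]


def entity_view(rows):
    """Pivot EAV rows into one attribute dictionary per entity."""
    view = defaultdict(dict)
    for entity, attribute, value in rows:
        view[entity][attribute] = value
    return dict(view)


entities = entity_view(eav_rows)
print(entities["plant_a"])
for relation_class, members in relations:
    print(relation_class, members)
```

Nothing in the structure says what `capacity_mw` or `unit__node` means; that agreement has to be reached outside the data model, which is the semantic gap discussed above.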

Standardized derived metrics: Described as computed aggregates above. Yes please!

Data license tracking: If each dataset (or more specifically, each legally separate work) was accompanied by standardized legal metadata, then users could filter on license compatibility and perhaps also machine‑generate license notices, with listings of contributors, when merging datasets (resulting in a new single work rather than a compilation of several existing works).
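
A rough sketch of what license filtering and notice generation could look like, assuming each dataset carries an SPDX license identifier and a contributor list. The dataset records and the accepted-license set are hypothetical; which licenses are actually compatible for a given merge is a legal question that the code simply takes as input.

```python
# Hypothetical dataset records; the "license" values are SPDX identifiers.
datasets = [
    {"name": "load_profiles", "license": "CC-BY-4.0", "contributors": ["Team A"]},
    {"name": "plant_register", "license": "CC0-1.0", "contributors": ["Team B", "Team C"]},
    {"name": "weather_cutout", "license": "CC-BY-NC-4.0", "contributors": ["Team D"]},
]

# Placeholder allow-list: which licenses a project accepts is assumed here,
# not derived from any compatibility analysis.
ACCEPTED = {"CC0-1.0", "CC-BY-4.0"}


def filter_compatible(records, accepted):
    """Keep only the records whose license is in the accepted set."""
    return [r for r in records if r["license"] in accepted]


def license_notice(records):
    """Build one notice line per dataset, listing license and contributors."""
    lines = []
    for r in records:
        who = ", ".join(r["contributors"])
        lines.append(f"{r['name']} ({r['license']}): {who}")
    return "\n".join(lines)


usable = filter_compatible(datasets, ACCEPTED)
notice = license_notice(usable)
print(notice)
```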

UML class diagrams: Also worth thinking about is the adoption of UML class diagrams for depicting the underlying data models and similar concepts. (I am a big fan of UML diagrams.)

Technical matters: Preferred practice for CSV data. Preferred character encoding (clearly either ASCII or UTF‑8). Preferred license for metadata (CC0‑1.0). Support for object composition.

Graph objects: Analysts normally think of scalars, timeseries, tabular data, and key‑value pairs as the primitives. But graph objects should perhaps also be supported as primitives — while noting these objects usually manifest as lists of various kinds.
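
As an illustration of a graph manifesting as a list, here is a minimal sketch that stores a network as an edge list and derives an adjacency view from it. The node names and capacities are invented, and treating the edges as undirected is an assumption of this example.

```python
from collections import defaultdict

# Hypothetical network: each edge is (from_node, to_node, attributes).
edges = [
    ("north", "central", {"capacity_mw": 1000}),
    ("central", "south", {"capacity_mw": 800}),
    ("north", "south", {"capacity_mw": 300}),
]


def adjacency(edge_list):
    """Build an undirected adjacency mapping from an edge list."""
    neighbours = defaultdict(set)
    for a, b, _attrs in edge_list:
        neighbours[a].add(b)
        neighbours[b].add(a)
    return {node: sorted(adj) for node, adj in neighbours.items()}


graph = adjacency(edges)
print(graph)
```

The edge list records each connection once, so it is the non-redundant form; the adjacency mapping is a derived view that can be rebuilt whenever needed.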

Workflow: I guess the organizers will not wish to stray into the area of scripted workflow, but that question is probably also obliquely relevant.

Process: Some thought about how best to converge to agreement would be useful.

robbie.morrison · 1 December 2019 21:38

Standardized reporting

I considered starting a new forum topic but decided to tack this on to the end of the current thread. In some senses, standardized reporting leads on relatively naturally from data structures. Perhaps it would be better if this discussion was indeed forked? But let’s see what others think first?

I recently talked to people active in Scientists for Future who see a need for standardized reporting from technically comparable models (I’ll start with the more usual term “model” here and later switch to “scenario + method” which I think is a better characterization). Standardized reporting would then allow, for instance, civil society organizations (CSO) advocating for rapid decarbonization to more easily develop and defend their policy positions by citing suitable scientific studies. These studies should at least be consistent in the sense of compatible definitions, sufficiently similar numerical paradigms, and standardized reporting — but need not necessarily cover the same set of scenarios. Moreover, representative metrics should be provided in an easily interpreted format to facilitate interpretation across these various studies and scenarios.

The theme of this posting is therefore to add the interests of downstream users to this discussion, while noting that such users, including NGO advocates, may well not be analysts or possess much specific domain knowledge.

With respect to the energy sector, key study attributes should be readily available on first inspection, including but not limited to:

year of carbon neutrality — typically 2050 but earlier target years are increasingly being proposed
scope — whether local, national, supra‑national, or global
sector coverage — are the heat and mobility sectors included, for instance
role of international trade in fuels — biofuels, natural gas, green (electrolysis) and non‑green hydrogen, and other e‑fuels
treatment of cross‑border electricity
technical definition of final demand (see this thread on the mailing list)
role of energy efficiency and assumptions on future developments and uptake
assumptions regarding future demand, behavioral change, and energy sufficiency
degree of spatial and topological disaggregation and temporal resolution
representation of power flow
consideration or otherwise of novel technologies like thermal storage, CCUS, BECCS, and direct air capture
rate constraints, if any, on uptake, the treatment of fleet vintage, and the role of early retirement and capital stranding more generally

The underpinning methodology needs communicating as well because the modeling paradigm adopted can have a profound influence on the nature of the results and hence the conclusions, and can thereby limit the comparison exercises under discussion. Methodological questions include but are not limited to:

use of perfect foresight
strict versus near optimality (typically under intertemporal optimization)
full horizon modeling versus time‑slicing
evolution from present versus greenfield design (with or without pre‑existing transmission corridors)
recursive dynamic (stepping through time) methods, including hybrid agent‑based
system dynamics
projected accounting methods

The final aspect is standardized aggregate metrics. Common definitions are necessary in order that each study performs the same arithmetic. Such metrics would include:

final energy
externally traded energy
installed capacity by technology type
generation or production by fuel type
breakdown of above by sector using consistent definitions for coverage
carbon dioxide equivalent and cost formation information for all the above

In addition, a set of standardized derived metrics should also be considered. These are often produced by dividing one aggregate metric by another to produce intensive quantities. Examples include CDM‑style financial and carbon additionalities for specific projects, or system energy efficiency metrics calculated by dividing final by primary aggregate energy fluxes for selected years.
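
A small sketch of one such derived metric, assuming system energy efficiency is defined as final energy divided by primary energy for a given year. The function name and the aggregate values are made up for illustration. A shared definition like this is what would let each study perform the same arithmetic.

```python
def system_efficiency(final_energy, primary_energy):
    """Ratio of final to primary energy; both must use the same unit."""
    if primary_energy <= 0:
        raise ValueError("primary energy must be positive")
    return final_energy / primary_energy


# Made-up aggregate values in PJ per year, keyed by model year.
aggregates = {
    2030: {"final": 600.0, "primary": 1000.0},
    2050: {"final": 540.0, "primary": 720.0},
}

efficiency = {
    year: system_efficiency(v["final"], v["primary"])
    for year, v in aggregates.items()
}
for year, eta in efficiency.items():
    print(year, round(eta, 2))
```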

The three broad strands above (scenario attributes, underpinning methodology, and aggregate and perhaps also derived metrics) lead to the question of a standardized worksheet. Whether such a worksheet could be usefully realized remains an open question, at least in my mind.

A related idea would be a recommendation to provide Sankey diagrams as described on Wikipedia. Discussion elsewhere in this community included forming a recommended set of colors for energy commodities and technologies (although I am not aware of the outcome).

The uses of standardized reporting should not be seen as a substitute for detailed comparative analysis in which researchers drill down to identify and understand the differences between several studies and/or sets of scenarios. One example of this exercise is the ESYS, BDI, dena (2019) report (listed below) which government policy analysts apparently found particularly useful.

In summary, the two coupled issues — technically comparable scenarios + methods and standardized reporting — introduced in the first paragraph, might well appear to fall outside the boundary of the original do‑a‑thon concept. But I would argue that the idea of semantic consistency is implied in the concept of data interchange and that standardized reporting is likewise implied in the idea of “aggregate” metrics. I would also guess that the integrated assessment modeling community is further along this path than the energy system modeling community and that the IAM experiences could prove valuable in this regard.

The organizers may need to decide, I would suggest, whether these ideas fall within the scope of the proposed do‑a‑thon or would be better traversed elsewhere. Equally, if the do‑a‑thon discussion is limited to a particular modeling paradigm (such as graph‑dynamical systems representations), then that paradigm should be explicitly stated at the outset.

It is also worth reinforcing — for those outside our community who are reading this posting — that the process of converging on agreed common practices is greatly facilitated when the projects involved are genuinely open. Once your project is open‑licensed and easily downloadable, you have every incentive to press for beneficial intra‑community practices and every reason to cooperate on the development of enabling protocols.

References

ESYS, BDI, dena (20 February 2019). Expertise bündeln, Politik gestalten — Energiewende jetzt!: Essenz der drei Grundsatzstudien zur Machbarkeit der Energiewende bis 2050 in Deutschland [Pooling expertise, shaping policy — Energiewende now: essence of the three fundamental studies on the feasibility of energy system transformation in Germany by 2050] (in German). Presented at Auditorium Friedrichstraße, Friedrichstraße 180, 10117 Berlin, Germany.

khaeru · 10 December 2019 13:14

Thanks @stefan.pfenninger @danielhuppmann for proposing this session. While I haven’t been involved in such discussions via OpenMod, I think this is an important topic and wanted to throw in my two cents.¹

General

I would strongly encourage this community to re-use and extend existing, non-energy-specific technologies as much as possible (I use this term to mean: data formats, specifications, and tools/code/software) rather than create new, bespoke/domain-specific ones.

This is for a few chief reasons:

Linking to ecosystems around existing technologies saves work in several ways:
- It avoids repeating discussions about how to design the tech to cover a broad set of use-cases.
- Tools developed for the existing formats provide turnkey functionality that can be applied to energy/model data.
- Tools developed for energy/model use-cases will attract interest from the broader ecosystem, so development and maintenance work can be shared.
Work saved on building tools can be instead invested in improving their robustness/scientific validity, i.e. pursuit of FAIR principles.
Special needs for data handling in energy modeling are, I think, often overestimated.

One technology that should be considered is SDMX, already in use by national statistical agencies and central banks, with a growing software ecosystem. In particular, its Information Model (IM) (PDF link; see also “Section 2” at [3]) is impressively thorough and universal (though this does mean it takes a little reading to digest).

After reading about SDMX, I have more recently worked with the IAMC data format [4]. I have not yet found any feature of the latter that cannot be handled, in a more robust yet also more flexible way, using SDMX.

Suggested points of discussion

The SDMX IM suggests an inversion of the question order from Stefan’s post above (starts with “Which file type…”).

Instead, the first questions should be:

Which concepts are to be captured in the (meta)data?
- Units of time, geography, technologies, energy carriers, model, scenario, physical quantities such as energy, power, mass, distance, etc.
- How is each concept measured? e.g.
  - Continuous or integer values with associated units.
  - Codes from certain lists
    - Energy-specific, shared lists
    - Particular modelers’ bespoke lists,
    - Lists from existing standards, e.g. ISO 3166-1 alpha-3.
  - Text, URLs, or other formats
Which concepts are used as:
- the actual values (‘measures’) of observations in a data set? e.g. energy consumed.
- dimensions for an observation? e.g. time period, region.
- metadata? e.g. model; data source.
For metadata concepts, may they be attached to
- an entire dataset,
- a group of observations,
- a single series, and/or
- a single observation?
Similarly, for concepts used as dimensions:
- Which measures may be associated with which dimensions?
  
  E.g. atmospheric CO₂ concentration may not have a REGION dimension.
- Which are fixed for whole series, groups, or datasets, and which vary by observation within a particular series?
  
  Here, the SDMX IM reminds us that a “data series” is not necessarily a “time series”; rather the latter is a case of the former, where the dimension at observation level is the time period.
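
The questions above can be sketched as a tiny data structure. This is an illustration only: the class and field names loosely echo the vocabulary used above (concept, dimension, measure, attribute) but are hypothetical and are not the SDMX API or its class hierarchy.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    id: str
    description: str


@dataclass
class DataStructure:
    """Which concepts act as measure, dimensions, and attributes."""
    measure: Concept
    dimensions: list
    # Concept id -> attachment level ("dataset", "group", "series", "observation").
    attributes: dict = field(default_factory=dict)

    def check_observation(self, key):
        """Raise if an observation key does not match the dimensions exactly."""
        expected = {d.id for d in self.dimensions}
        if set(key) != expected:
            raise ValueError(
                f"expected dimensions {sorted(expected)}, got {sorted(key)}"
            )


REGION = Concept("REGION", "Geographical area")
TIME_PERIOD = Concept("TIME_PERIOD", "Period the value refers to")
ENERGY = Concept("ENERGY", "Energy consumed")
CO2_CONC = Concept("CO2_CONC", "Atmospheric CO2 concentration")

# Energy use varies by region and time; CO2 concentration has no REGION dimension.
energy_dsd = DataStructure(ENERGY, [REGION, TIME_PERIOD], {"MODEL": "dataset"})
co2_dsd = DataStructure(CO2_CONC, [TIME_PERIOD], {"MODEL": "dataset"})

energy_dsd.check_observation({"REGION": "DEU", "TIME_PERIOD": "2030"})
try:
    co2_dsd.check_observation({"REGION": "DEU", "TIME_PERIOD": "2030"})
except ValueError as err:
    print("rejected:", err)
```

Here the MODEL attribute is attached at dataset level, while the rule that a CO₂ concentration carries no REGION dimension is enforced by the structure rather than by convention.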

Second, after these questions are answered, IAM/energy systems-specific (meta)data structures are defined by stating required or optional associations between concepts, dimensions, and measures—with some description of how this applies in the intended use-cases:

For instance: a REGION dimension in global-scope modelling can be linked to a COUNTRY concept which is denoted by codes from the ISO 3166-1 alpha-3 list (or alpha-2; these are choices to be made by the community).

Data for/from models with national scope can still have the REGION dimension, but it can be fixed at the level of the entire dataset to a certain value.
Or, a TIME_PERIOD dimension can be measured as a calendar year, or as periods of a certain length.

E.g. the code '2010' in one code list may mean “the period 2010-01-01–2014-12-31”, while in another code list it may mean “the period 2010-01-01–2019-12-31”; the specification of which code list is used in a data set makes the meaning unambiguous.
etc.
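
The TIME_PERIOD example can be sketched as follows, assuming two code lists that both contain the code '2010'. The code list identifiers `FIVE_YEAR` and `TEN_YEAR` are hypothetical names chosen for this illustration.

```python
from datetime import date

# Two hypothetical code lists for the same TIME_PERIOD concept.
CODE_LISTS = {
    "FIVE_YEAR": {"2010": (date(2010, 1, 1), date(2014, 12, 31))},
    "TEN_YEAR": {"2010": (date(2010, 1, 1), date(2019, 12, 31))},
}


def resolve_period(code_list_id, code):
    """Look up the (start, end) dates that a code denotes in a given code list."""
    return CODE_LISTS[code_list_id][code]


start5, end5 = resolve_period("FIVE_YEAR", "2010")
start10, end10 = resolve_period("TEN_YEAR", "2010")
print(end5.isoformat(), end10.isoformat())
```

The same code resolves to different periods, so a dataset has to name the code list it uses for the value '2010' to be interpretable.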

And then, the third/last question or topic is representation: how to store the data in one or more file formats. SDMX defines XML, CSV, and JSON representations for data, as well as XML representations for data structures; the ecosystem provides tools for validating the former against the latter.

Further points

Approaching the discussion in this way ensures the required data and metadata are defined before the representation/format; so it is obvious up-front if the representation has limitations (e.g. is unable to carry certain data/metadata).
For some concepts, the SDMX user community has already thought carefully about possible values.

For instance, here [5] is a discussion of an oft-used OBS_STATUS attribute, with 20 possibilities already identified such as “normal value”, “estimated value”, “imputed value”, “missing value; data cannot exist”, etc.
Modelers are trained to see “this model has regional/global scope; country/sub-national resolution” as a whole different category of thinking from “this model separates cars from SUVs; that model does not.”

Starting from an information model lets us see:
- These are each a concept/dimension.
- Each has a number of codes/possible values.
- These codes have specific meanings that can be listed.
- Different modelers will use different lists of codes/meanings for the same concept.
Discussion about the final point can be separated from identifying which measures have which dimensions.
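
Once code lists are explicit, translating between two lists for the same concept becomes a lookup. A minimal sketch, using a small excerpt of the ISO 3166-1 alpha-2 and alpha-3 codes; the observation records and the `recode` helper are hypothetical.

```python
# Partial mapping between two ISO 3166-1 code lists (alpha-2 -> alpha-3).
ALPHA2_TO_ALPHA3 = {"AT": "AUT", "CH": "CHE", "DE": "DEU", "FR": "FRA"}


def recode(observations, dimension, mapping):
    """Return copies of the observations with one dimension recoded."""
    unknown = sorted({o[dimension] for o in observations} - set(mapping))
    if unknown:
        raise KeyError(f"no mapping for codes: {unknown}")
    return [{**o, dimension: mapping[o[dimension]]} for o in observations]


# Made-up observations using alpha-2 region codes.
obs = [
    {"REGION": "DE", "TIME_PERIOD": "2030", "value": 1.0},
    {"REGION": "CH", "TIME_PERIOD": "2030", "value": 2.0},
]
recoded = recode(obs, "REGION", ALPHA2_TO_ALPHA3)
print([o["REGION"] for o in recoded])
```

Failing loudly on unmapped codes is a design choice of this sketch; bespoke modeler lists with aggregate regions would need an explicit many-to-one or one-to-many mapping instead.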

¹ FYI: I am a colleague of Daniel’s at IIASA, but these are solely my views as an IAM/energy/transport modeller, and in part as a coordinator of the iTEM consortium [6].

khaeru · 10 December 2019 13:15

I had to separate the following links because the forum software only allows me to use two at a time:

khaeru · 10 December 2019 13:18

Final two links:
5. https://www.sdmx.org/?sdmx_news=possible-ways-of-implementing-the-observation-status-concept-version-2-0
6. https://transportenergy.org

danielhuppmann · 10 December 2019 19:15

Thank you @khaeru for your comments and insights from a slightly-different-but-related field of modelling!

One minor comment: an alternative description of the IAMC data format ([4] above), which is more tangible (and more up-to-date) than the reference you provided: https://pyam-iamc.readthedocs.io/en/stable/data.html
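
For orientation, the sketch below reshapes a table laid out in the wide style commonly described for the IAMC format (identifier columns for model, scenario, region, variable, and unit, followed by one column per year) into long form. It uses only the standard library rather than pyam, the layout is an assumption based on that common description, and the sample row is made up.

```python
import csv
import io

# Assumed identifier columns; everything else is treated as a year column.
ID_COLUMNS = ["model", "scenario", "region", "variable", "unit"]

# Made-up sample data in wide layout.
wide = io.StringIO(
    "model,scenario,region,variable,unit,2030,2050\n"
    "demo-model,baseline,World,Primary Energy,EJ/yr,500,450\n"
)


def wide_to_long(handle):
    """Turn one-column-per-year rows into one row per (series, year)."""
    rows = []
    for record in csv.DictReader(handle):
        ids = {c: record[c] for c in ID_COLUMNS}
        for column, value in record.items():
            if column not in ID_COLUMNS and value != "":
                rows.append({**ids, "year": int(column), "value": float(value)})
    return rows


long_rows = wide_to_long(wide)
for row in long_rows:
    print(row["variable"], row["year"], row["value"])
```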

robbie.morrison · 16 December 2019 15:13

Hi @khaeru You mention that the special structural needs of energy system models might be overestimated. Probably true. But one requirement that is relatively domain‑specific is specifying component connectivity in a (preferably) non‑redundant way — usually as a graph object and articulated using some kind of list‑based data structure. The term “graph” is not defined in the SDMX glossary (SDMX 2018), nor mentioned in the SDMX data model (SDMX 2011). So I think that is one extension that should be considered in this exercise. Based on a cursory search too, the IAMC does not appear to cover graphs either.

One other point to keep in mind: an XML schema can be mapped to a UML class diagram but a UML class diagram with inheritance cannot be fully captured within XML. HTH R.

References (also given in earlier postings)

SDMX (October 2018). sdmx guidelines — SDMX glossary — Version 2.0. Statistical Data and Metadata Exchange (SDMX).

SDMX (July 2011). sdmx standards — Section 2: information model: UML conceptual design — Version 2.1. Statistical Data and Metadata Exchange (SDMX).