Do-a-thon: Towards a common data standard for integrated assessment and energy systems modelling

khaeru · 10 December 2019 13:14

Thanks @stefan.pfenninger @danielhuppmann for proposing this session. While I haven’t been involved in such discussions via OpenMod, I think this is an important topic and wanted to throw in my two cents.¹

General

I would strongly encourage this community to re-use and extend existing, non-energy-specific technologies as much as possible (I use this term to mean: data formats, specifications, and tools/code/software), rather than to create new, bespoke/domain-specific ones.

This is for a few chief reasons:

Linking to ecosystems around existing technologies saves work in several ways:
- It avoids repeating discussions about how to design the tech to cover a broad set of use-cases.
- Tools developed for the existing formats provide turnkey functionality that can be applied to energy/model data.
- Tools developed for energy/model use-cases will attract interest from the broader ecosystem, so development and maintenance work can be shared.
Work saved on building tools can be instead invested in improving their robustness/scientific validity, i.e. pursuit of FAIR principles.
Special needs for data handling in energy modeling are, I think, often overestimated.

One technology that should be considered is SDMX, already in use by national statistical agencies and central banks, with a growing software ecosystem. In particular, its Information Model (IM) (PDF link; see also “Section 2” at [3]) is impressively thorough and universal (though this does mean it takes a little reading to digest).

After reading about SDMX, I have more recently worked with the IAMC data format [4]. I have not yet found any feature of the latter that cannot be handled using SDMX, in a way that is more robust yet also more flexible.

Suggested points of discussion

The SDMX IM suggests an inversion of the question order from Stefan’s post above (starts with “Which file type…”).

Instead, the first questions should be:

Which concepts are to be captured in the (meta)data?
- Units of time, geography, technologies, energy carriers, model, scenario, physical quantities such as energy, power, mass, distance, etc.
- How is each concept measured? e.g.
  - Continuous or integer values with associated units.
  - Codes from certain lists
    - Energy-specific, shared lists
    - Particular modelers’ bespoke lists,
    - Lists from existing standards, e.g. ISO 3166-1 alpha-3.
  - Text, URLs, or other formats
Which concepts are used as:
- the actual values (‘measures’) of observations in a data set? e.g. energy consumed.
- dimensions for an observation? e.g. time period, region.
- metadata? e.g. model; data source.
For metadata concepts, may they be attached to
- an entire dataset,
- a group of observations,
- a single series, and/or
- a single observation?
Similarly, for concepts used as dimensions:
- Which measures may be associated with which dimensions?
  
  E.g. atmospheric CO₂ concentration may not have a REGION dimension.
- Which are fixed for whole series, groups, or datasets, and which vary by observation within a particular series?
  
  Here, the SDMX IM reminds us that a “data series” is not necessarily a “time series”; rather, the latter is a special case of the former, in which the dimension at observation level is the time period.
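
To make these questions concrete, here is a minimal Python sketch of how the answers could be written down. All class names and the two example structures are hypothetical illustrations of my own; they are not the SDMX IM classes, nor the API of any SDMX library.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Role(Enum):
    """How a concept is used within one data structure."""
    MEASURE = "measure"      # the actual observed value
    DIMENSION = "dimension"  # identifies an observation
    ATTRIBUTE = "attribute"  # metadata


class Attachment(Enum):
    """Level at which a metadata attribute may be attached."""
    DATASET = "dataset"
    GROUP = "group"
    SERIES = "series"
    OBSERVATION = "observation"


@dataclass(frozen=True)
class Component:
    concept_id: str
    role: Role
    attachment: Optional[Attachment] = None  # only used for attributes


@dataclass
class DataStructure:
    id: str
    components: List[Component] = field(default_factory=list)

    def ids(self, role: Role) -> List[str]:
        """Return the concept IDs used in the given role."""
        return [c.concept_id for c in self.components if c.role is role]


# Hypothetical structure: energy consumed, by region and time period
ENERGY = DataStructure("ENERGY_CONSUMED", [
    Component("ENERGY", Role.MEASURE),
    Component("REGION", Role.DIMENSION),
    Component("TIME_PERIOD", Role.DIMENSION),
    Component("MODEL", Role.ATTRIBUTE, Attachment.DATASET),
    Component("DATA_SOURCE", Role.ATTRIBUTE, Attachment.SERIES),
])

# Hypothetical structure: atmospheric CO2 concentration has no REGION
CO2 = DataStructure("CO2_CONCENTRATION", [
    Component("CONCENTRATION", Role.MEASURE),
    Component("TIME_PERIOD", Role.DIMENSION),
    Component("MODEL", Role.ATTRIBUTE, Attachment.DATASET),
])

print(ENERGY.ids(Role.DIMENSION))
print(CO2.ids(Role.DIMENSION))
```

The point of the sketch is only that "which measure has which dimensions" and "where may metadata be attached" become explicit, checkable statements rather than conventions buried in a file layout.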

Second, after these questions are answered, IAM/energy systems-specific (meta)data structures are defined by stating required or optional associations between concepts, dimensions, and measures—with some description of how this applies in the intended use-cases:

For instance: a REGION dimension in global-scope modelling can be linked to a COUNTRY concept which is denoted by codes from the ISO 3166-1 alpha-3 list (or alpha-2; these are choices to be made by the community).

Data for/from models with national scope can still have the REGION dimension, but it can be fixed at the level of the entire dataset to a certain value.
Or, a TIME_PERIOD dimension can be measured as a calendar year, or as periods of a certain length.

E.g. the code '2010' in one code list may mean “the period 2010-01-01–2014-12-31”, while in another code list it may mean “the period 2010-01-01–2019-12-31”; the specification of which code list is used in a data set makes the meaning unambiguous.
etc.
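
The two examples above (a dimension fixed for a whole dataset, and a code whose meaning depends on the code list) could be sketched roughly as follows. The code list IDs `PERIOD_5Y` and `PERIOD_10Y`, the dictionary layout, and the observation values are all made up for illustration; only the '2010' date ranges come from the text above.

```python
from datetime import date

# Hypothetical code lists: the same code "2010" has a different meaning in each
CODELISTS = {
    "PERIOD_5Y": {
        "2010": (date(2010, 1, 1), date(2014, 12, 31)),
        "2015": (date(2015, 1, 1), date(2019, 12, 31)),
    },
    "PERIOD_10Y": {
        "2010": (date(2010, 1, 1), date(2019, 12, 31)),
    },
}


def resolve_period(codelist_id, code):
    """Return the (start, end) dates that a code denotes in a given code list."""
    return CODELISTS[codelist_id][code]


# A national-scope dataset: REGION is fixed once for the whole dataset, and
# the dataset states which code list its TIME_PERIOD codes are drawn from.
dataset = {
    "codelists": {"TIME_PERIOD": "PERIOD_5Y"},
    "fixed": {"REGION": "AUT"},  # ISO 3166-1 alpha-3 code for Austria
    "observations": [
        {"TIME_PERIOD": "2010", "value": 1.0},  # placeholder values
        {"TIME_PERIOD": "2015", "value": 2.0},
    ],
}


def expand(ds):
    """Attach the dataset-level dimension values to every observation."""
    return [{**ds["fixed"], **obs} for obs in ds["observations"]]


for obs in expand(dataset):
    start, end = resolve_period(ds_list := dataset["codelists"]["TIME_PERIOD"],
                                obs["TIME_PERIOD"])
    print(obs["REGION"], obs["TIME_PERIOD"], ds_list, start, end)
```

Because the dataset names its code list, a consumer never has to guess whether '2010' means five or ten years.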

And then, the third/last question or topic is representation: how to store the data in one or more file formats. SDMX defines XML, CSV, and JSON representations for data, as well as XML representations for data structures; the ecosystem provides tools for validating the former against the latter.
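
To illustrate what "validating data against a data structure" means, here is a small standard-library Python sketch. It is not the SDMX-CSV format and does not use any real SDMX tool; the `STRUCTURE` layout, the column names, and the `validate` function are assumptions made for this example only.

```python
import csv
import io

# Hypothetical structure definition: allowed codes per dimension, plus the
# name of the column that carries the measure.
STRUCTURE = {
    "dimensions": {
        "REGION": {"AUT", "DEU"},
        "TIME_PERIOD": {"2010", "2020"},
    },
    "measure": "OBS_VALUE",
}


def validate(csv_text, structure):
    """Return a list of human-readable problems; empty if the data conform."""
    reader = csv.DictReader(io.StringIO(csv_text))
    expected = set(structure["dimensions"]) | {structure["measure"]}
    missing = expected - set(reader.fieldnames or [])
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    errors = []
    for n, row in enumerate(reader, start=1):
        # Every dimension value must be a code from the declared list
        for dim, codes in structure["dimensions"].items():
            if row[dim] not in codes:
                errors.append(f"row {n}: {row[dim]!r} is not a valid {dim} code")
        # The measure must be numeric
        try:
            float(row[structure["measure"]])
        except (TypeError, ValueError):
            errors.append(f"row {n}: measure is not a number")
    return errors


DATA = "REGION,TIME_PERIOD,OBS_VALUE\nAUT,2010,1.5\nXXX,2020,2.0\n"
for problem in validate(DATA, STRUCTURE):
    print(problem)
```

The same structure definition could in principle be checked against a JSON or XML representation of the data, which is why the structure is worth agreeing on before the file format.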

Further points

Approaching the discussion in this way ensures the required data and metadata are defined before the representation/format; so it is obvious up-front if the representation has limitations (e.g. is unable to carry certain data/metadata).
For some concepts, the SDMX user community has already thought carefully about possible values.

For instance, here [5] is a discussion of an oft-used OBS_STATUS attribute, with 20 possibilities already identified such as “normal value”, “estimated value”, “imputed value”, “missing value; data cannot exist”, etc.
Modelers are trained to see “this model has regional/global scope; country/sub-national resolution” as a whole different category of thinking from “this model separates cars from SUVs; that model does not.”

Starting from an information model lets us see:
- These are each a concept/dimension.
- Each has a number of codes/possible values.
- These codes have specific meanings that can be listed.
- Different modelers will use different lists of codes/meanings for the same concept.
Discussion about the final point can be separated from identifying which measures have which dimensions.
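
As a rough sketch of that separation, two modelers can share one concept while using different code lists, with an explicit mapping between them. The concept name `VEHICLE_TYPE`, both code lists, the mapping, and the numbers are invented for illustration.

```python
from collections import defaultdict

# Two hypothetical code lists for the same VEHICLE_TYPE concept
CODES_A = {
    "CAR": "Passenger cars, excluding SUVs",
    "SUV": "Sport utility vehicles",
}
CODES_B = {
    "LDV": "All light-duty passenger vehicles",
}

# Explicit mapping from the finer list (A) to the coarser list (B)
A_TO_B = {"CAR": "LDV", "SUV": "LDV"}


def aggregate(observations, mapping):
    """Sum values reported against list-A codes into list-B codes.

    Summation is only appropriate for additive quantities (e.g. energy
    consumed), not for intensities or shares.
    """
    totals = defaultdict(float)
    for code, value in observations.items():
        totals[mapping[code]] += value
    return dict(totals)


model_a_output = {"CAR": 3.0, "SUV": 1.0}  # placeholder values
print(aggregate(model_a_output, A_TO_B))
```

The structure (a measure with a VEHICLE_TYPE dimension) is identical for both modelers; only the code list differs, so the mapping can be debated on its own.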

¹ FYI: I am a colleague of Daniel’s at IIASA, but these are solely my views as an IAM/energy/transport modeller, and in part as a coordinator of the iTEM consortium [6].