Data standards for energy system modeling

robbie.morrison · 18 May 2021 17:32

I drafted this blog for use elsewhere and decided it might fit here as well.

Introduction

This topic reviews the intertwined questions of data architectures, data specifications, and data sources under the rubric of data standards. The context is policy‑oriented energy system modeling.

The issues surrounding data standards range from high‑level to specific. And please be warned in advance: most of the accompanying matters are poorly resolved, be they overarching objectives or implementation details. The view presented here is that analysts should aim for good practice and, whenever applicable, contribute to wider agendas, but that very often compromises are unavoidable and expedient solutions are necessary.

One important emerging theme is the development of data systems to create coherent datasets, facilitate information sharing between similarly conceived system models, and assist reproducibility.

Another important activity are communal efforts in improve data semantics and related high‑level agendas.

This posting attempts to indicate the type of data sources available and where data catalogs might be found. This posting does not seek to present or maintain a comprehensive list of information sources.

Genuine open data with suitable data‑capable open licenses should be favored whenever possible. More on license choice.

Data architectures

The term data architecture is used here to cover the structure of data without reference to other aspects. Data architectures range in purpose and sophistication. The following breakdown is applied:

Construct	Example	Comment	URL or citation
spreadsheets	—	ubiquitous but difficult to integrate
frictionless data packages	—	semi‑formal standard for simple packaging	Karev and Winfree (2020)
model‑specific nested datasets	Flomb/Calliope‑Italy	Calliope project data instance	Lombardi (2020)
model‑specific databases
data systems	PowerSystems.jl	a framework, other projects curate data
community backends	Open Energy Platform
linked open data (LOD)	LOD‑GEOSS	based on DBpedia Databus

The first two entries, ad‑hoc spreadsheets and frictionless data packages, are mostly used to collect, store, and exchange data. Model‑specific nested datasets and databases are used to support different modeling frameworks and are specific to each framework. The so‑called data systems attempt to provide more general solutions and can potentially interact with any number of semantically‑similar frameworks. Community backend solutions offer dedicated infrastructure that is specific to this domain. Linked open data projects are developing very broad functionality that operates across the web and accounts for the strengths and drawbacks of web‑sourced information.

This posting does not considered brokered data platforms like the United Kingdom Icebreaker One (IB1) Open Energy project or the Open Subsurface Data Universe (OSDU) being developed by oil majors and allied servicing companies. Brokered data platforms like these are useful when dealing with both data protected under personal privacy and data that is legally‑encumbered in some way and needs to remain so.

Spreadsheets and frictionless data packages

Spreadsheets are widely used for general data management. More recently, online spreadsheets and file hosting services have become popular. Spreadsheets tend not to be used by energy system modelers as they are more difficult than other data architectures to place under version control and to process using scripts written in languages like Python, Julia, and R.

Frictionless data packages (FDP) are a concept promoted by the London‑based Open Knowledge Foundation. The FDP concept combines CSV files for collected data and JSON files for the accompanying metadata. CSV stands for comma‑separated values and stores numerical information as character encoded text strings. As such, CSV files are human readable, space‑inefficient, and widely portable. CSV files and FDP packages are more amenable to version control and scripted processing than spreadsheets. FDP packages are preferred over CSV files however due to their better support for metadata, including type information.

Some data portals also serve SQLite files alongside spreadsheets and FDPs. One example is the OPSD portal .

Framework-specific solutions

The term “framework” is used by system modelers to distinguish individual models from the application software that supports them. Under this usage, the application software is a “model framework” and a particular model becomes a “framework instance”. Model frameworks include Calliope, oemof, and PyPSA — to cite three that are written in the Python language and share relatively similar underlying designs.

The data requirements for most models are naturally tree structured. Some modeling frameworks therefore deploy data files located within a predefined hierarchy of subdirectories and files. These sets of files are often placed within git repositories, including public git portals like GitHub. The datasets from specific studies can then be zipped and archived on sites like zenodo to support scientific transparency. One such example from Calliope (screenshot below) is provided by Lombardi (2020). Other projects instead interface to a fully‑fledged relational database or similar.

Diagram: Zenodo interface to the Flomb/Calliope‑Italy study dataset (Lombardi 2020).

Data systems

The admittedly novel term “data system” is used here to describe prepackaged data and supporting software which is coherent, internally consistent, and complete within its stated scope. Data systems embrace the notion of a digital twin in that regard. In contrast to CSV solutions, data systems often prioritize the efficient storage and rapid recovery of information. They are designed to interact with suitable model frameworks via a readily customizable translation layer or overloaded APIs called by the framework itself. Data systems can also normally perform integrity checks, receive and store results, and offer standardized reporting. Some systems can additionally manage key parts of the data processing pipeline and thereby promote repeatability. These systems, taken overall, should improve both reproducibility and productivity. Examples include PowerSystems.jl, PowerGenome, and the Spine Toolbox.

Community backend services

Community portals are portals that cater specifically to the needs of a particular research domain and normally offer API access. In the case of energy system modeling, the Open Energy Platform (OEP) provides such features. In addition, the OEP can manage scenarios for use across any number of suitable model frameworks. Cross‑framework comparisons based on identical scenario specifications are likely to play an important role in terms of model validation going forward.

Linked open data

Linked open data (LOD) is an internet architecture whereby a smart “bus” interfaces between the user and various web API‑exposed data sources. Work in this area is currently at a research stage only. But the techniques being explored have the prospect of revolutionizing data handling and can support both automatic data updating and manual rollback. One project is LOD‑GEOSS based on fundamental work by the DBpedia Databus project.

Data specifications

The term “data specifications” refers collectively to the various standards, norms, protocols, guidelines, and conventions that a particular research community has adopted in relation to collecting, exchanging, processing, and verifying numerical and non‑numerical information. Data specifications fall into two camps. High‑level specifications cover concepts like semantics, metadata, collection protocols, and information relationships. While low‑level specifications include prosaic matters like data representations, encodings, and timestamping and specialist formats for geodata and such like.

Semantics

Semantics refers to the meaning of data. Sometimes that meaning remains implicit and sometimes it is subject to formal treatments and ontologies. The Open Energy Ontology (OEO) is currently under development by energy system modelers to service the needs of the energy modeling domain (Booshehri et al 2021).

Metadata, collection standards

Metadata provides information on the circumstances of collection, downstream usage, and legal context. Data collection standards inform metadata and the two concepts are tightly coupled. Metadata is sometimes described as data about data. Metadata practices, like semantics, are not well resolved in the energy analysis domain. The EERAdata metadata project is currently advancing these issues (Wierling et al under review).

Relationships between objects

Extending these concepts, the relationships between objects within a given domain, collectively create a labeled directed graph with objects mapped to nodes and relationships to edges. Such objects need not be tangible. One kind of relationship central to energy system modeling is that of physical connectivity. Each project tends to handle this information using its own framework‑specific practices. Depending on the design of the framework, other forms of connectivity can exist, for instance authority relationships between operators and their assets within hybrid agent‑based models.

One widely used electricity‑specific standard for information exchange is the UML‑based CIM model (Wollenberg et al 2016). The CIM naturally includes physical connectivity. CIM is proprietary and a copy of the documentation is priced at €320.

Object–relationship information can also be captured as web‑based RDF triples. This scheme compels that each object has a distinct IRI address — clearly not an appropriate requirement for project‑level usage. Under RDF, objects may also be values, thereby representing primary information in some form. The concept of RDF triples does prompt wider questions regarding the role of the semantic web in the context of energy system analysis.

Collectively

The more sophisticated architectures mentioned earlier, namely framework‑specific solutions, data systems, and LOD architectures, also naturally embed these various structural concepts, albeit with varying degrees of formalism and rigor.

Furthermore, projects like the OEO and the EERAdata metadata initiative provide the underpinnings for future model inter‑comparison projects (MIPs), as used for some time by the integrated assessment community.

And stepping back again, an informing ontology together with dataset‑specific metadata and (probably local) semantic triples covering physical connectivity and other relationships merge to form a larger canvas onto which automated reasoning can be applied. Such reasoning may include rule‑based data validation, simple inference, and semantic search methods. These activities relate to the data itself and are separate from the numerical modeling.

Low-level protocols

Geodata and geographic information more generally require special consideration. Hinz and Bill (2018) provide background. Other low‑level protocols are too specific to be covered in this posting.

Open licensing

Legal characteristics also form part of a data specification. Public licenses provide data users with a set of permissions and obligations without the need to form bilateral agreements with data providers or their brokers. Open licenses are a subset of public licenses that additionally meet community expectations on the meaning of free use. More on license choice. In the absence of other considerations, collected or processed data and object attributes should be licensed Creative Commons CC‑BY‑4.0 and metadata and semantic triples should be licensed CC0‑1.0.

Sources of data

Sources of numerical data for energy system modeling have focused historically on the electricity sector. This is largely because energy system models tend to start with the electricity sector and later bridge to other sectors. But also because a given electricity subsystem, in contrast to other subsystems, requires detailed engineering analysis to adequately resolve its design and operation.

As indicated, this posting does not attempt to provide a complete and current listing. Rather it gives a flavor of what exists and indicates where one might seek further information.

As the table below indicates, there are quite a range of primary and derived sources of data at hand. Clearly, sources that have been community‑curated are normally preferable to those that have not.

Construct	Example	Comment	URL or citation
primary sources	ENTSO‑E Transparency Platform	statutory reporting	Hirth et al (2018)
subject‑specific data repositories	JRC ENSPRESO database		Ruiz et al (2019)
community‑curated data portals	OPSD, Open Energy Outlook
general data portals	WRI Power Explorer
data system stock libraries	PowerGenome
proprietary databases	S&P Global Platts	legally‑encumbered
citizen science projects	mapping PV installations in UK		Stowell et al (2020)
AI‑based projects	predictive mapping of global power system		Arderne et al (2020)

Primary sources

Primary sources are often mandated under statutory reporting, with the European ENTSO‑E Transparency Platform being one such portal. One problem with information made public under statutory reporting, in Europe at least, is that the underpinning legislation is often silent on public licensing (Hirth 2020). The Energy Information Administration (EIA) provides primary information in the United States.

Subject-specific data repositories

Subject‑specific data repositories cover specific areas like wind energy. One such example is the European Commission JRC ENSPRESO dataset (Ruiz et al 2019).

Community-curated data portals

Community‑curated data portals are exactly that: data portals stocked and maintained by energy system modelers. Examples include the Open Power System Data (OPSD) portal in Europe (Wiese et al 2019) and the Open Energy Outlook database (DeCarolis 2020) and PUDL projects in the United States.

Data portals more generally

There are a number of data portals which mostly assemble and publish primary data from other sources. Examples include energydata.info from the World Bank Group, OpenEI from the US Department of Energy, and Power Explorer and ClimateWatch from the World Resources Institute (WRI). Launched in 2021, ClimateWatch covers electricity, stationary energy, and transport (Grinspan and Worker 2021).

Data system stock libraries

Data systems often bundle stock libraries to service the more widespread needs of their userbases. These system will also output selected data in common character‑encoded interchange formats for the purposes of archiving and exchange. Any upstream data dependencies will need to be established by each user. Examples include PowerGenome and the stock library for PowerSystems.jl known as PowerSystemCaseBuilder.jl.

Proprietary databases

Several companies sell access to commercial databases, controlled by contract. This information is normally both legally‑encumbered and expensive and therefore the least attractive choice for policy‑oriented energy system analysis. These databases also tend to be less well resolved for the global south.

Citizen science projects

There are a number of citizen science projects, often based on OpenStreetMap, that collect fine‑grained information. Examples include information on electricity transmission assets within Europe and solar‑PV installations within the United Kingdom (Stowell et al 2020). As of 2021, the latter project utilizes 300 volunteers and has mapped 250k installations.

AI-based projects

Information projects combining aerial imagery and artificial intelligence (AI) techniques represent an emerging field. Examples include the machine recognition of power lines (Arderne et al 2020) and of renewables installations.

Data catalogs

The following catalogs of relevance to energy system analysis may be of help — otherwise the various data portals can be inspected in turn:

openmod wiki: open data sources for energy modeling
Open Energy Platform: individual databases
Wikipedia: energy system data databases
European Open Data Portal: catalogs page

Validation and provenance

Validation, provenance tracking, and other forms of quality assurance remain vitally important. These processes have only mostly been undertaken informally to date. Looking forward, such features can be naturally built into both data systems and linked open data systems. Moreover as indicated, these two data architectures can potentially support both recall‑and‑reissue and rollback functionalities, thereby relieving researchers of tedious and typically error‑prone hand maintenance.

Closure

Data management for energy system modeling is now a sophisticated area in its own right. That said, many high‑level matters remain works‑in‑progress — including data semantics and metadata standards. Those handling data for policy‑oriented analysis will need to navigate their way around the landscape indicated in this posting and make suitable and ofttimes difficult choices and compromises.

One theme here is the degree of formalism required. Often less formalism seems sensible but, as the domain matures, more formalism and rigor are indicated. Another parallel theme is the natural fit between data models and ontologies, on the one hand, and model framework design, including object‑oriented analysis, on the other.

Much of the effort in obtaining data for energy system analysis is simply replicating what the European Commission calls “privately held information [of] public interest”. It remains an open question as to whether strengthening statutory reporting provisions might offer a better approach, with suitable anonymization techniques applied where needed.

As indicated, this posting does not attempt to catalog data sources. The energy system field is too specific, too diverse, and too volatile to make the listing of sources here remotely useful or feasible.

Useful links

openmod forum: “open data” category
Open Energy Platform: landing page
Wikipedia: energy open system databases

References

Arderne, Christopher, C Zorn, C Nicolas, and EE Koks (15 January 2020). “Predictive mapping of the global power system using open data”. Scientific Data. 7 (1): 19. ISSN 2052‑4463. doi:10.1038/s41597‑019‑0347‑4. Creative Commons CC‑BY‑4.0 license.

Booshehri, Meisam, Lukas Emele, Simon Flügel, Hannah Förster, Johannes Frey, Ulrich Frey, Martin Glauer, Janna Hastings, Christian Hofmann, Carsten Hoyer‑Klick, Ludwig Hülk, Anna Kleinau, Kevin Knosala, Leander Kotzur, Patrick Kuckertz, Till Mossakowski, Christoph Muschner, Fabian Neuhaus, Michaja Pehl, Martin Robinius, Vera Sehn, and Mirjam Stappel (1 September 2021). “Introducing the Open Energy Ontology: enhancing data interpretation and interfacing in energy systems analysis”. Energy and AI. 5: 100074. ISSN 2666-5468. doi:10.1016/j.egyai.2021.100074. Creative Commons CC‑BY‑4.0 license.

DeCarolis, Joe (24 December 2020). An Open Energy Outlook for the United States powered by TEMOA. Raleigh, North Carolina, USA: NC State University. YouTube video 15:16. Creative Commons CC‑BY‑4.0 license.

Grinspan, Delfina and Jesse Worker (March 2021). Implementing open data strategies for climate action: suggestions and lessons learned for government and civil society stakeholders — Working paper. Washington DC, USA: World Resources Institute. doi:10.46830/wriwp.19.00093. Creative Commons CC‑BY‑4.0 license.

Hinz, Matthias and Ralf Bill (2018). Mapping the landscape of open geodata. 21th AGILE Conference on Geographic Information Science.

Hirth, Lion, Jonathan Mühlenpfordt, and Marisa Bulkeley (1 September 2018). “The ENTSO-E Transparency Platform: a review of Europe’s most ambitious electricity data platform”. Applied Energy. 225: 1054–1067. ISSN 0306-2619. doi:10.1016/j.apenergy.2018.04.048. Creative Commons CC-BY-4.0 license.

Hirth, Lion (1 January 2020). “Open data for electricity modeling: legal aspects”. Energy Strategy Reviews. 27: 100433. ISSN 2211-467X. doi:10.1016/j.esr.2019.100433. Creative Commons CC‑BY‑4.0 license.

Karev, Evgeny and Lilly Winfree (8 October 2020). Announcing the new frictionless framework. Open Knowledge Foundation blog.

Lombardi, Francesco (22 June 2020). Flomb/Calliope-Italy: v0.2.2 — Direct and indirect electrification options. Europe: Zenodo. doi:10.5281/zenodo.3903089. Zipped dataset. DOI resolves to the latest version. Creative Commons CC‑BY‑4.0 license.

Reder, Klara, Carsten Pape, Mirjam Stappel, Hannah Förster, Lukas Emele, Christian Winger, Ludwig Hülk, Christian Hofmann, Editha Kötter, Martin Glauer, and Till Mossakowski (2019). Scenario data on the Open Energy Platform (SzenarienDB on the OEP): a web-platform to improve transparency and reproducibility of energy system analyses — Poster. Kassel, Germany: Fraunhofer Institute for Energy Economics and Energy System Technology (IEE).

Reder, Klara, Mirjam Stappel, Christian Hofmann, Hannah Förster, Lukas Emele, Ludwig Hülk, and Martin Glauer (24 January 2020). “Identification of user requirements for an energy scenario database”. International Journal of Sustainable Energy Planning and Management. 25: 95–108. ISSN 2246-2929. doi:10.5278/ijsepm.3327.

Ruiz, Pablo, Wouter Nijs, Dalius Tarvydas, Alessandra Sgobbi, Andreas Zucker, Roberto Pilli, Klas Jonsson, Andrea Camia, Christian Thiel, Carsten Hoyer-Klick, Francesco Dalla Longa, Tom Kober, Jake Badger, Patrick Volker, Berien S Elbersen, Andre Brosowski, and Daniela Thrän (1 November 2019). “ENSPRESO — an open, EU‑28 wide, transparent and coherent database of wind, solar and biomass energy potentials”. Energy Strategy Reviews. 26: 100379. ISSN 2211-467X. doi:10.1016/j.esr.2019.100379. Creative Commons CC‑BY‑4.0 license.

Stowell, Dan, Jack Kelly, Damien Tanner, Jamie Taylor, Ethan Jones, James Geddes, and Ed Chalstrey (13 November 2020). “A harmonised, high-coverage, open dataset of solar photovoltaic installations in the UK”. Scientific Data. 7 (1): 394. ISSN 2052-4463. doi:10.1038/s41597-020-00739-0. Creative Commons CC‑BY‑4.0 license.

Wierling, August, Valeria Jana Schwanitz, Sebnem Altinci, Maria Balazińska, Michael J Barber, Mehmet Efe Biresselioglu, Christopher Burger-Scheidlin, Massimo Celino, Muhittin Hakan Demir, Richard Dennis, Nicolas Dintzner, Adel el Gammal, Carlos M Fernández-Peruchena, Winston Gilcrease, Pawel Gladysz, Carsten Hoyer-Klick, Kevin Josho, Mariusz Kruczek, David Lacroix, Malgorzata Markowska, Rafael Mayo-García, Robbie Morrison, Manfred Paier, Giuseppe Peronato, Mahendranath Ramakrishnan, Janeita Reid, Alessandro Sciullo, Berfu Solak, Demet Suna, Wolfgang Süß, Astrid Unger, Maria Luisa Fernandez Vanoni, and Nikola Vasiljevic (under review). “Advancing FAIR metadata standards for low carbon energy research”. Energy Strategy Reviews. Code: ESR‑D‑21‑00036.

Wiese, Frauke, Ingmar Schlecht, Wolf-Dieter Bunke, Clemens Gerbaulet, Lion Hirth, Martin Jahn, Friedrich Kunz, Casimir Lorenz, Jonathan Mühlenpfordt, Juliane Reimann, and Wolf‑Peter Schill (15 February 2019). “Open Power System Data: frictionless data for electricity system modelling”. Applied Energy. 236: 401–409. ISSN 0306-2619. doi:10.1016/j.apenergy.2018.11.097. Postprint.

Wollenberg, Bruce, Jay Britton, Ed Dobrowolski, Robin Podmore, Jim Resek, John Scheidt, Jerry Russell, Terry Saxton, and Chavdar Ivanov (February 2016). A brief history: the Common Information Model. IEEE Power and Energy Society eNews Update.

▢

robbie.morrison · 2 September 2021 15:13

The Linux Foundation is developing the following project to support metadata management. I have not been through the details but is seems useful to record the URL here for future reference:

Egeria (ongoing). Egeria: open metadata and governance. Linux Foundation. San Francisco, California, USA.
https://egeria-project.org