Ithaka questionnaire on community data usage

Overview

Danielle Cooper, Manager – Collaborations and Research at Ithaka S+R, New York, USA, recently spoke at the Open Science Conference 2021 run by the German National Library of Economics (ZBW), Hamburg, Germany. Ithaka S+R is a United States non‑profit organization helping academia adapt to new opportunities and contexts, which may include assisting a transition to open science methods.

During her OSC2021 presentation, Danielle asked for data communities to reach out and share their experiences. I duly contacted her and this questionnaire was the outcome.

The resulting webpage will be published under a Creative Commons CC‑BY‑4.0 license.

Example publication

Here is a previous emergent data community spotlight webpage:

Editing process and close‑off

Now closed after 17 days

Those interested can work up the answers in the wikipost below. I have taken the liberty of a first‑cut attempt. When the answers are roughly complete, @robbie.morrison will copy‑edit them and return the text to Danielle for addition to their website. Alternatively, discuss any issues and propose alternative wordings by replying to the wikipost. Please provide useful commit messages when editing the wikipost!

When the feedback process is complete, the active topic will be hidden from public view and the final text transferred to a new persistent topic in public view.

I imagine about two weeks will be needed to complete the community responses, but the exact close‑off will be dictated by the level of activity. Alternatively, if you don’t think we should submit as a community, please make your views known. We should also talk about whether this exercise is useful or not.

Preamble text from Danielle:

Emergent Data Community Spotlight: An interview about Energy Modelling with the Open Energy Modelling Initiative

Fostering data and code sharing among scholars is an important component of fostering a culture of open research - but how can this work be done most effectively? At Ithaka S+R we are exploring the crucial contextual elements that optimize research data sharing. We’ve found that data communities, which are formal or informal groups of scholars who share a certain type of data with each other, regardless of disciplinary boundaries, provide important clues to understanding how research data sharing works.

Identifying and supporting scholarly communities that are just beginning to develop an active, sustainable data sharing culture is an important strategy for those who wish to support data sharing. In order to understand how data communities can be built from the ground up, we are interviewing experts who are at the forefront of growing emergent data communities in a variety of research areas. We’ve highlighted promising developments in the areas of spinal cord injuries, literary sound recordings, and zooarchaeology.

Today we share about the Open Energy Modelling Initiative (openmod), which has about 600 listed participants, with most of them being full‑time researchers or analysts. Openmod has a strongly collective ethos and so this interview was conducted with Robbie Morrison serving as a facilitator with input on the responses provided by the entire community through their public forum.

1. What does the energy modelling community look like? What kinds of researchers are involved in this work (e.g. disciplinary and organizational affiliations), how do they collaborate, and what kinds of formal structures have been established to organize them?

The community arose in Berlin, Germany in September 2014. Most people involved are completing their higher education or classify as early‑stage researchers. A few are mid‑stage researchers and beyond. And some work for consultancies, companies, start‑ups, or government agencies.

Geographically, the community started in the German‑speaking DACH world, later spread to the United States, and is now making inroads into the United Kingdom. Other participants are sprinkled throughout the planet, including the Russian Federation, India, and the Global South. Aside from the first workshop, the working language has always been English.

The community has no formal structures. Its ethos derives from open source software development. By common understanding, those running the various online services or twice‑annual physical workshops are accorded complete dominion. The mailing list is the principal place for making community decisions.

Much of the discussion that follows below centers on European Union law — in part, because Europe provides a more restrictive legal context for data than that found in the United States. But this focus is equally a reflection of our roots.

2. How and what kinds of data are typically incorporated into energy modelling?

Modelers do not generally deal with personal information — as defined under EU law. If such information is required for numerical models, it can normally be anonymized from real data, generated using estimated statistics, or otherwise synthesized — the key issue is that the information remains representative but need not be exact.
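As a rough illustration of the synthesis option mentioned above, the following Python sketch generates hourly household demand profiles from estimated statistics rather than from metered records. The function name, the mean and spread values, and the evening‑peak load shape are all assumptions chosen for illustration; they are not drawn from any openmod project or real dataset.

```python
import random

def synthesize_daily_profile(mean_kwh_per_day, spread, rng):
    """Return 24 hourly demand values (kWh) for one hypothetical household.

    mean_kwh_per_day and spread are estimated statistics (assumed values),
    not measurements of any real household.
    """
    # Draw a daily total around the estimated mean; clamp to stay positive.
    daily_total = max(0.1, rng.gauss(mean_kwh_per_day, spread))
    # Hypothetical load shape: low overnight, moderate by day, evening peak.
    weights = [1.0] * 6 + [2.0] * 11 + [4.0] * 5 + [1.5] * 2
    # Perturb each hour so that no two synthetic households look identical.
    noisy = [w * rng.uniform(0.8, 1.2) for w in weights]
    # Rescale so the 24 hourly values sum to the drawn daily total.
    scale = daily_total / sum(noisy)
    return [w * scale for w in noisy]

rng = random.Random(42)  # fixed seed so the synthetic set is reproducible
profiles = [synthesize_daily_profile(9.0, 2.0, rng) for _ in range(100)]
```

The resulting profiles are representative in aggregate but correspond to no actual household, which is the property the modeling requires.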

Energy system models require general information about component technologies, such as windfarms, coal‑fired electricity generation, and high‑voltage transmission lines, together with their engineering and cost characteristics. Cost information is necessarily estimated in most cases because this information is normally commercially sensitive. Notwithstanding, the European Commission, as well as other governing agencies around the world, could collect cost and performance information under a public interest rationale and make key metrics available in generic form. Future costs and performance projections, sometimes also subject to technological learning, are necessarily speculative.

Energy system models require specific details about the system being modeled, including the location, age, and connectivity of all represented assets. That includes information about the networks under investigation, usually the electricity grid but perhaps also gas and district heat infrastructure. Current and potential future demand profiles are needed. Location‑specific resource potentials are needed too, including solar and wind assessments and land availability. Information concerning the built environment and mobility may also be needed, depending on the scope of the model. Some models may also require historical market clearance information or information on how households and firms may take short‑run and long‑run decisions.

The bulk of models capture national and supra‑national systems but some research groups investigate municipal systems, islanded microgrids, and standalone systems. Most research questions provide natural boundaries.

Some of the information indicated above is subject to statutory reporting. But the processes for assembling and publishing that information are often archaic and error prone, leading to poor quality disclosure. Projects within the openmod community assemble and curate this information so it can be more readily utilized by modelers and analysts. One such project is the OPSD portal.

Information on future climate patterns is sometimes required but this information can be readily sourced from the climate science literature and is not legally encumbered.

Most of the modeling within the community is intended to inform public policy options for our rapid trajectory to net‑zero carbon. Research either concentrates on methodologies or seeks to provide policy‑relevant results and insights.

3. What infrastructure is currently available to facilitate the sharing of this data among researchers?

Within the orbit of the openmod, the Open Energy Platform (OEP) is the primary resource. This platform is specifically designed to handle the needs of energy system modeling and, in particular, scenario analysis. Energy system modeling differs from other forms of computational science in that testable outcomes are not possible and a range of speculative scenarios — each with their own explicit objectives, constraints, and assumptions — must instead be analyzed and traded‑off against one another.

In addition, there are initiatives specifically aimed at allowing data to be transferred between different modeling projects in order to facilitate cross‑model comparisons. Each model has necessarily evolved its own data interface and internal semantics.

4. Why is open data sharing important to energy modelling? What are the typical positions on this issue among stakeholders engaged with energy modelling?

We adopt the European Commission description for open data (EU Directive 2019/1024, recital 16):

Open data as a concept is generally understood to denote data in an open format that can be freely used, re‑used and shared by anyone for any purpose

Data sharing reduces duplicated work, improves data quality and coverage, and facilitates cross‑model comparisons — that last point being necessary for strengthening confidence in both the direct results and in subsequent interpretations.

Conversely, data without appropriate open licensing may well be legally encumbered and this lack of certainty makes it unsuitable for open modelling.

5. What challenges or barriers to widespread data sharing are unique to research involving energy modelling?

Our primary challenge is the lack of open licensing, particularly on public sector information and information published under statutory reporting. European Union legislation on the terms of use of public sector information is unclear and contradictory and legislation on energy sector disclosure is silent on licensing. These defects need fixing at the level of the European Parliament. The best that researchers can do until then is to push relentlessly for Creative Commons CC‑BY‑4.0 licensing on all such information.

That means that suitable open licensing is key. In most cases, such licenses do not grant binding permissions but rather confer certainty, particularly given the presence of Directive 96/9/EC database protection within the European Economic Area (EEA), under which one cannot know whether a data extraction from a public portal was insignificant or not.

The power exchanges that run the wholesale electricity markets are particularly resistant to providing disclosed information in any kind of usable form — and deploy techniques like serving data that cannot be highlighted and copied to evade recovery. This is certainly against the spirit of the legislation, even if technically compliant.

Another emerging problem is the proliferation of national open data licenses — such as the recent German Government dl‑de/by‑2‑0. Such licenses could well lead to legally siloed data when not inbound compatible with the CC‑BY‑4.0 license, even if only on some trivial legal point.

Data lacking CC‑BY‑4.0 licensing (or CC0‑1.0 waivers or something inbound compatible) is particularly problematic in the United Kingdom because the threshold for copyright is effort‑based and addressable collections of data may also attract database protection. The situation in the United States is considerably better because datasets and databases are unlikely to be intellectual property. Europe falls somewhere in between.
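A data pipeline can make this concern concrete by refusing to ingest datasets whose license is not on an explicit allowlist. The sketch below is a minimal illustration in Python: the function name and the dataset records are hypothetical, and the allowlist holds only the two instruments named above. Whether any other license is inbound compatible with CC‑BY‑4.0 is a legal question that code cannot settle, so everything else is simply flagged for review.

```python
# The two instruments named in the text, written as SPDX identifiers.
ACCEPTED = {"CC-BY-4.0", "CC0-1.0"}

def triage_by_license(datasets):
    """Split dataset records into (usable, needs_review) by declared license.

    Each record is a dict with hypothetical keys "name" and "license".
    A missing license is treated as needing review, never as permission.
    """
    usable, needs_review = [], []
    for record in datasets:
        if record.get("license") in ACCEPTED:
            usable.append(record["name"])
        else:
            needs_review.append(record["name"])
    return usable, needs_review

# Made-up example records for illustration only.
catalog = [
    {"name": "plant-list", "license": "CC-BY-4.0"},
    {"name": "grid-topology", "license": "CC0-1.0"},
    {"name": "federal-statistics", "license": "dl-de/by-2-0"},
    {"name": "exchange-prices", "license": None},
]
usable, needs_review = triage_by_license(catalog)
```

Treating an absent license as a reason for review, rather than as implicit permission, mirrors the point that unlicensed data may well be legally encumbered.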

6. What are the most important supports needed in order to cultivate a thriving data community among energy modelers?

Recognition by science funding organizations of several necessities would help. First, the need to require suitable licenses on all appropriate outputs. Second, support for ongoing maintenance, once the underlying data projects have completed. Third, to provide stable online archiving for non‑deliverable artifacts such as project websites, wikis, public mailing lists, and code repositories.

But beyond that, most solutions have to come from within the modeling community.

7. How is openmod working to address the open data sharing needs of the energy modelling community? Who else is doing important work in this area and what else is on the horizon?

For the openmod, the concept of genuinely open data was central from day one. But maturity has brought forward two vitally important related agendas:

  • a community ontology — a shared worldview
  • agreement on collection protocols and metadata — the latter being data about data

Both initiatives are interconnected, both involve deep buy‑in from within the community, and both will take significant effort to work through and bed‑in. The Open Energy Ontology is addressing the first and the EERAdata initiative is pursuing the second. The EERA and openmod communities have begun to work together on the latter.

Open is not the only paradigm for energy system modeling. Another is the closed consortium that effectively remains only within the reach of government ministries, multilateral agencies, and allied research teams. How that paradigm evolves in an increasingly open world remains to be seen. In any case, there is virtually no crossover between these two realms at present. A third paradigm is the single‑institution closed project — and again one whose future looks doubtful.

An upcoming challenge is the tracking of both data provenance and data versioning at scale — taken together these represent active research questions in computer science and are certainly not unique to the domain of energy analysis.
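One common building block for both provenance and versioning is content addressing, where a dataset version is identified by a hash of its contents and each derived version records the identifier of its parent. The Python sketch below illustrates that idea under simplifying assumptions: the record layout, field names, and sample rows are invented for this example and do not describe any system used within the community.

```python
import hashlib
import json

def version_id(rows):
    """Return a SHA-256 hex digest identifying the exact content of rows."""
    # Canonical JSON (sorted keys, fixed separators) so equal data hashes equally.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def make_version(rows, parent=None, note=""):
    """Bundle rows with a provenance record pointing at the parent version."""
    return {
        "id": version_id(rows),
        "parent": parent["id"] if parent else None,
        "note": note,
        "rows": rows,
    }

def lineage(version, store):
    """Walk parent links back to the first version and return the list of ids."""
    chain = []
    while version is not None:
        chain.append(version["id"])
        version = store.get(version["parent"])
    return chain

# Hypothetical data: a correction applied to a copied dataset.
original = make_version([{"plant": "A", "capacity_mw": 100}], note="as published")
corrected = make_version([{"plant": "A", "capacity_mw": 120}],
                         parent=original, note="capacity corrected")
store = {v["id"]: v for v in (original, corrected)}
```

Because the identifier changes whenever the data changes, a modified copy becomes visibly distinct from its source, and the parent link records where a correction originated. Doing this at scale, across many independent projects, is where the open research questions lie.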

The prospect of supporting and using linked open data (LOD) is now surfacing. Some in our community are working with the DBpedia Databus project to explore the possibilities that semi‑smart knowledge graph systems can offer.

Returning to the present, another issue is dataset forking and fragmentation. Under this process, researchers grab whatever data they need for the issues at hand, modify it to suit their needs, and perhaps later publish as a static archive to support transparency and reproducibility. But any corrections and improvements are not propagated back upstream for wider uptake and benefit. LOD clearly has the potential to assist here.

Finally, members from within the openmod community make written and oral submissions to European Union public consultations on law reform and science policy. Making one’s voice heard in such processes is an important and necessary activity.

Hi @robbie.morrison - Thanks for your work on this! Really nicely written, and nice to see a summary of all the pieces that openmod is connected to/working on! A few minor comments:

#2, 2nd para, 4th sentence: Notwithstanding, the European Commission**, and other governing agencies around the world,** could collect…

#4 I think it might also be worth noting that data encumbered by license and other restrictions can, in some cases, make it impossible (or at least legally questionable) to carry out certain modelling without infringing copyright.

#5 - might be worth acknowledging the North American/US context, even if it’s only one sentence reminding readers of the European context.

A good point, although most public licenses are underpinned by intellectual property rights because, if there is no IP, there is nothing to license. (Note the legal action surrounding the Happy Birthday song, also split between EU and US governing law.) Notwithstanding, civil law concepts such as misappropriation can apply to published data not otherwise protected as intellectual property. Davidson (2008) covers these and I’ll check what he has to say.

We are also starting to see data brokerage systems being developed, founded on non‑disclosure and voluntary sign‑on. These schemes embed their own internal legal systems covering accreditation, oversight, dispute resolution, and sanction. But that data is not public in any sense and should only be used for scientific research or public policy analysis as a last resort. Two examples are the Icebreaker One Open Energy platform and the Open Subsurface Data Universe (OSDU). The Open Energy platform should go beta live toward the end of 2021, while the OSDU is scheduled for launch at the end of this month, March 2021. Subsurface data includes drill well logs, seismic investigations, and similar artifacts produced during hydrocarbon exploration and recovery. The OSDU project plans to rebrand in due course and cover above ground assets such as renewables potentials and windfarm SCADA traffic. Use of the qualifier “open” in the context of data brokerage is highly questionable.

I need also to read through the current draft and check that the public/private data boundary is clear and consistent.

Thanks @tniet for your other remarks, I’ll fold them in in due course. R

References

Davidson, Mark J (January 2008). The legal protection of databases. Cambridge, United Kingdom: Cambridge University Press. ISBN 978-0-521-04945-0. Paperback edition.

Just to note that before submitting, I added a paragraph on dataset forking and fragmentation as a community issue, as follows:

By way of background, please see the following post which exemplifies the problem: