Data processing and pipelining software

FloKo · 13 September 2023 12:44

Hey all, I’m working on energy data ELT processes (extract, load and transform). This means I work on data pipelines that automatically download and clean datasets. Right now I’m using the tools Dagster and dbt to build pipelines for data that is relevant to the energy sector (mainly German data). Is anyone else here also working on this topic? If so, I would be interested in getting to know you, exchanging ideas about the topic and maybe finding ways to cooperate and work together.
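For readers new to the pattern, here is a minimal sketch of the three ELT steps using only the Python standard library. It is not the Dagster/dbt setup itself: the inline CSV, the table names `raw_load` and `clean_load`, and the function names are all made up for illustration, and SQLite stands in for whatever warehouse a real pipeline would use.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract; a real pipeline would download this file.
RAW_CSV = """timestamp,load_mw
2023-09-13T00:00,41250
2023-09-13T01:00,
2023-09-13T02:00,39800
"""


def extract() -> list[dict]:
    """Extract: read the raw source as-is, without cleaning."""
    return list(csv.DictReader(io.StringIO(RAW_CSV)))


def load(conn: sqlite3.Connection, rows: list[dict]) -> None:
    """Load: store the untouched raw rows in the database."""
    conn.execute("CREATE TABLE raw_load (timestamp TEXT, load_mw TEXT)")
    conn.executemany(
        "INSERT INTO raw_load VALUES (:timestamp, :load_mw)", rows
    )


def transform(conn: sqlite3.Connection) -> None:
    """Transform: clean inside the database (the step dbt models cover)."""
    conn.execute(
        """
        CREATE TABLE clean_load AS
        SELECT timestamp, CAST(load_mw AS REAL) AS load_mw
        FROM raw_load
        WHERE load_mw <> ''
        """
    )


def run_pipeline() -> list[tuple]:
    conn = sqlite3.connect(":memory:")
    load(conn, extract())
    transform(conn)
    return conn.execute(
        "SELECT timestamp, load_mw FROM clean_load ORDER BY timestamp"
    ).fetchall()
```

Calling `run_pipeline()` returns the two rows that have a load value; the row with the missing value stays in `raw_load` but is dropped from `clean_load`.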

Best,
Florian

robbie.morrison · 13 September 2023 13:18

Hi @FloKo, particularly from an architectural perspective, work undertaken by the Open Energy Platform might be relevant. Some of the underlying projects were discussed at the last openmod workshop in Vienna as video 00:02:50 and video 01:52:50 (but mostly just the first half for that second video). That work might be a bit high‑level for your needs, but always useful to understand the wider game plan and align where sensible.

This preprint might be of interest and a bit more practical (but I could not see if and where it had been formally published):

Fleischer, Christian Etienne (2022). “A data processing approach with built-in spatial resolution reduction methods to construct energy system models”. Open Research Europe. Version 2. Creative Commons CC‑BY‑4.0 license.

The article contains a good summary of data sources too.

Finally, in passing, I plan to distill those various YouTubes mentioned above and other material down into a narrated 40 minute video to make the material a bit more accessible — but that is a couple of months away. HTH R

ludwig.huelk · 13 September 2023 21:18

I just came across this paper comparing different workflow management tools.
Dagster is not included but it seems like a good choice for collaborative data processing.
Have you considered including the (brand new) Energy Databus in the process to organize the input sources?

Citation: Aleyna Dilan Kiran, Mehmet Can Ay and Jens Allmer. Criteria for the Evaluation of Workflow Management Systems for Scientific Data Analysis. Journal of Bioinformatics and Systems Biology. 6 (2023): 121-133

JuhaKiviluoma · 15 September 2023 14:19

This may also be too high level for you, but we’re also working on a data management tool that can help to build repeatable data pipelines. We use some of Dagster underneath.

iagw · 18 September 2023 19:18

Hi Florian, that’s an interesting query - thank you for raising it. We normally use Python code tooling in the research group, as it makes some sense in terms of scaling workloads onto the University’s High Performance Computing cluster if needed. However, it is something we are looking to automate more of too, particularly for repeatability and API development. So, happy to have a chat if that suits sometime (or anyone else too), best wishes, Grant, i.a.g.wilson@bham.ac.uk

johannes.hampp · 25 September 2023 09:57

We’re building all our pipelines based on Python scripts + snakemake for workflow management.

FloKo · 25 September 2023 10:23

Do you have these pipelines in an open source repository? I also considered snakemake at an earlier stage of development; it looks like a nice tool.

johannes.hampp · 25 September 2023 10:37

Yes, plenty, e.g. PyPSA-EUR is using snakemake:

or this repo which is less extensive:

zaneselvans · 25 September 2023 17:31

Hi @FloKo!

We use Dagster + a bunch of Python and pandas to build our PUDL database and Apache Parquet outputs if you’re interested in checking out another example. It’s all US data (from FERC, EIA, EPA, etc.). The main repository is here:

We’ve got about 300 assets in the DAG right now, covering almost 30 years of data for some datasets. Is your system open source? We’d love to collaborate with others working on the same kind of data in other contexts but with similar tooling.

We found that using the various APIs and web-accessible data as the starting point for our ETL was extremely brittle, since many of the US agencies are not great at curating their data, so formats and locations and structures would change without notice, old versions of the data would become unavailable making analyses unreproducible, etc. So instead we take periodic snapshots of the data with this tool and archive it on Zenodo where it gets a DOI and a version and can be accessed programmatically. We synchronize the Zenodo archives to a publicly accessible GCS bucket and treat Zenodo as cold storage.
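To illustrate the snapshot idea in general terms, here is a rough sketch of a versioned, checksummed local archive. It is not our archiving tool and does not call the Zenodo API: the functions `snapshot` and `read_snapshot`, the directory layout, and the file names `data.bin` and `manifest.json` are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def snapshot(raw_bytes: bytes, dataset: str, version: str, archive_dir: Path) -> dict:
    """Store one immutable, versioned copy of the raw data with its checksum."""
    target = archive_dir / dataset / version
    # exist_ok=False: an existing version must never be overwritten.
    target.mkdir(parents=True, exist_ok=False)
    (target / "data.bin").write_bytes(raw_bytes)
    manifest = {
        "dataset": dataset,
        "version": version,
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
    }
    (target / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return manifest


def read_snapshot(dataset: str, version: str, archive_dir: Path) -> bytes:
    """Read a pinned version back and verify it against the stored checksum."""
    target = archive_dir / dataset / version
    manifest = json.loads((target / "manifest.json").read_text())
    data = (target / "data.bin").read_bytes()
    if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
        raise ValueError(f"checksum mismatch for {dataset} {version}")
    return data
```

The ETL then starts from `read_snapshot(...)` with a pinned version rather than from the live source, so a format change upstream only shows up when a new snapshot is taken.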

robbie.morrison · 27 September 2023 07:49

Just adding this publication, which will quite possibly be selected as a lightning talk at the upcoming San Francisco openmod workshop:

de Chalendar, Jacques A and Sally M Benson (15 December 2021). “A physics-informed data reconciliation framework for real-time electricity and emissions tracking”. Applied Energy. 304: 117761. ISSN 0306-2619. doi:10.1016/j.apenergy.2021.117761. PDF available on arXiv.

The method applies the least adjustment necessary to obtain an internally consistent set of energy flows that complies with the first law of thermodynamics (conservation of energy).
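As a toy illustration of the "least adjustment" idea only (not the paper's framework, which is considerably more elaborate), the sketch below applies the smallest unweighted least-squares change that makes a single linear balance hold. The function name `reconcile`, the one-constraint setup and the example numbers are assumptions made for illustration.

```python
def reconcile(measured: list[float], coeffs: list[float], target: float = 0.0) -> list[float]:
    """Smallest least-squares adjustment so that sum(c * x) equals target.

    This is the orthogonal projection of the measurements onto the
    hyperplane defined by one balance equation.
    """
    norm_sq = sum(c * c for c in coeffs)
    if norm_sq == 0:
        raise ValueError("balance equation has no non-zero coefficients")
    # How far the raw measurements are from satisfying the balance.
    residual = sum(c * x for c, x in zip(coeffs, measured)) - target
    # Spread the residual across the flows in proportion to their coefficients.
    return [x - c * residual / norm_sq for c, x in zip(coeffs, measured)]


# Made-up flows: generation + imports - exports - demand should equal 0.
raw_flows = [100.0, 20.0, 10.0, 105.0]
balance = [1.0, 1.0, -1.0, -1.0]
adjusted = reconcile(raw_flows, balance)
```

Here the raw flows are out of balance by 5 units, and each of the four flows is shifted by 1.25 in the direction that closes the gap. A more realistic version would weight the adjustments by measurement uncertainty and handle several balance equations at once.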

FloKo · 27 September 2023 07:49

Hi, I’ve seen your repo before and especially liked the idea of publishing data in the end with datasette. We also work on an open source repo here.

As already mentioned, it focuses on data from Germany, often from the GIS area. But some pipelines, e.g. from OpenStreetMap, are also applicable to other regions.

If this is interesting for you, we can have a chat sometime.

ludwig.huelk · 21 February 2024 12:19

Just came across this project:

ludwig.huelk · 21 February 2024 12:47

I just compiled an overview of the collected projects:
Feel free to improve and enhance:

https://etherpad.wikimedia.org/p/data-pipelines