Hey all, I’m working on ELT (extract, load, transform) processes for energy data. This means I work on data pipelines that automatically download and clean datasets. Right now I’m using Dagster and dbt to build pipelines for data relevant to the energy sector (mainly German data). Is anyone here also working on this topic? If so, I would be interested to get to know you, exchange ideas on the topic and maybe find ways to cooperate and work together.
Hi @FloKo, particularly from an architectural perspective, work undertaken by the Open Energy Platform might be relevant. Some of the underlying projects were discussed at the last openmod workshop in Vienna as video 00:02:50 and video 01:52:50 (but mostly just the first half for that second video). That work might be a bit high‑level for your needs, but always useful to understand the wider game plan and align where sensible.
This preprint might be of interest and a bit more practical (but I could not see if and where it had been formally published):
The article contains a good summary of data sources too.
Finally, in passing, I plan to distill those various YouTubes mentioned above and other material down into a narrated 40 minute video to make the material a bit more accessible — but that is a couple of months away. HTH R
I just came across this paper comparing different workflow management tools. Dagster is not included but it seems like a good choice for collaborative data processing.
Have you considered including the (brand new) Energy Databus in the process to organize the input sources?
Citation: Aleyna Dilan Kiran, Mehmet Can Ay and Jens Allmer. Criteria for the Evaluation of Workflow Management Systems for Scientific Data Analysis. Journal of Bioinformatics and Systems Biology. 6 (2023): 121-133.
Hi Florian, that’s an interesting query - thank you for raising it. We normally use Python tooling in the research group, as it makes some sense in terms of scaling workloads onto the University’s High Performance Computing cluster if needed. However, it is something we are looking to automate more of too, particularly for repeatability and API development. So, happy to have a chat if that suits sometime (or anyone else too), best wishes, Grant, email@example.com
We use Dagster + a bunch of Python and pandas to build our PUDL database and Apache Parquet outputs, if you’re interested in checking out another example. It’s all US data (from FERC, EIA, EPA, etc.). The main repository is here:
We’ve got about 300 assets in the DAG right now, covering almost 30 years of data for some datasets. Is your system open source? We’d love to collaborate with others working on the same kind of data in other contexts but with similar tooling.
We found that using the various APIs and web-accessible data as the starting point for our ETL was extremely brittle. Many of the US agencies are not great at curating their data, so formats, locations and structures would change without notice, and old versions of the data would become unavailable, making analyses unreproducible. So instead we take periodic snapshots of the data with this tool and archive them on Zenodo, where each snapshot gets a DOI and a version and can be accessed programmatically. We synchronize the Zenodo archives to a publicly accessible GCS bucket and treat Zenodo as cold storage.
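For anyone who wants to try the snapshot idea on a small scale first, here is a minimal sketch in plain Python. It is not the archiving tool mentioned above and it does not touch the Zenodo API; it only illustrates the underlying pattern of freezing a set of downloaded files under a version label with checksums, so a pipeline can later confirm it is reading exactly the archived bytes. The function names, the `manifest.json` filename and the manifest layout are all assumptions made up for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path

MANIFEST_NAME = "manifest.json"  # hypothetical filename, chosen for this sketch


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(snapshot_dir: Path, version: str) -> Path:
    """Record a version label and a checksum for every file in the snapshot."""
    files = {
        p.name: sha256_of(p)
        for p in sorted(snapshot_dir.iterdir())
        if p.is_file() and p.name != MANIFEST_NAME
    }
    manifest_path = snapshot_dir / MANIFEST_NAME
    manifest_path.write_text(
        json.dumps({"version": version, "files": files}, indent=2, sort_keys=True)
    )
    return manifest_path


def verify_snapshot(snapshot_dir: Path) -> list[str]:
    """Return the names of files that are missing or no longer match the manifest."""
    manifest = json.loads((snapshot_dir / MANIFEST_NAME).read_text())
    changed = []
    for name, expected in manifest["files"].items():
        candidate = snapshot_dir / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            changed.append(name)
    return changed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        snapshot = Path(tmp)
        # Stand-in for a file downloaded from an agency website.
        (snapshot / "load.csv").write_text("hour,mw\n0,41000\n")
        write_manifest(snapshot, version="2024-01-01")
        print("changed files:", verify_snapshot(snapshot))
```

An ETL step would call `verify_snapshot` before reading any inputs and refuse to run if the returned list is non-empty. A DOI-issuing archive adds public, citable versioning on top of this, which a local manifest alone cannot provide.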