Merge pull request 'docs(dataprep): Add documentation about data sources and structure' (#4) from docs/dataprep into main

Reviewed-on: #4
This commit was merged in pull request #4.
2024-08-31 13:23:47 +00:00
2 changed files with 54 additions and 1 deletions

@@ -1,2 +1,43 @@
# grid_application
Code for my application to Avacon AG for the role of Data Scientist. This is a web app containing different data science use-cases related to power grids and electricity generation.
## Data sources
The data in this application was randomly generated and has already been uploaded to an Azure SQL Database. To be transparent about how this was done, the scripts and files are included in this repository.
The scripts for general preprocessing and for database interaction are located in the `data_preparation` directory. The raw data and the preprocessed data file that was ultimately uploaded to the database are in the `data` directory.
All sources for this data are publicly available. The following resources were used for the different kinds of information:
- German surnames: most frequent German surnames from [Wiktionary](https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Namen/die_h%C3%A4ufigsten_Nachnamen_Deutschlands)
- German given names: Most frequent [male](https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Namen/die_h%C3%A4ufigsten_m%C3%A4nnlichen_Vornamen_Deutschlands) and [female](https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Namen/die_h%C3%A4ufigsten_weiblichen_Vornamen_Deutschlands) given names in Germany from Wiktionary
- Street names: street names from the Hanseatic city of Rostock, made available as open data [here](https://geo.sv.rostock.de/download/opendata/adressenliste/adressenliste.json)
- Zip codes: from [opendatasoft](https://public.opendatasoft.com/explore/dataset/georef-germany-postleitzahl/table/?dataChart=eyJxdWVyaWVzIjpbeyJjb25maWciOnsiZGF0YXNldCI6Imdlb3JlZi1nZXJtYW55LXBvc3RsZWl0emFobCIsIm9wdGlvbnMiOnt9fSwiY2hhcnRzIjpbeyJhbGlnbk1vbnRoIjp0cnVlLCJ0eXBlIjoiY29sdW1uIiwiZnVuYyI6IkNPVU5UIiwic2NpZW50aWZpY0Rpc3BsYXkiOnRydWUsImNvbG9yIjoiI0ZGNTE1QSJ9XSwieEF4aXMiOiJwbHpfbmFtZSIsIm1heHBvaW50cyI6NTAsInNvcnQiOiIifV0sInRpbWVzY2FsZSI6IiIsImRpc3BsYXlMZWdlbmQiOnRydWUsImFsaWduTW9udGgiOnRydWV9&location=6,51.3294,10.45412&basemap=jawg.light)
- Additional information for each zip code, such as city name, longitude and latitude: retrieved using this public [API](https://github.com/digitalfabrik/gemeindeverzeichnis-django)
- Rough bounding box information for Avacon Netz service area: [netzgebiete.avacon.de](https://netzgebiete.avacon.de/rcmap/Content/Map/Detail.aspx?keep=dzjernxQf/whawjGMPFQgA==)
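
How the bounding box is applied is not spelled out here. One plausible use is to keep only the zip codes whose coordinates fall inside the service area. The following is a minimal sketch under that assumption; `BoundingBox`, `filter_zip_codes`, the record keys and the coordinates are hypothetical placeholders, not values taken from this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned latitude/longitude box (hypothetical helper)."""

    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        # A point is inside if both coordinates lie within the limits.
        return (
            self.min_lat <= lat <= self.max_lat
            and self.min_lon <= lon <= self.max_lon
        )


# Placeholder limits for illustration only, NOT the real service area.
EXAMPLE_BOX = BoundingBox(min_lat=51.0, max_lat=54.0, min_lon=8.0, max_lon=12.0)


def filter_zip_codes(records, box):
    """Keep the records whose 'lat'/'lon' values fall inside the box."""
    return [r for r in records if box.contains(r["lat"], r["lon"])]
```

A rectangular box is only a rough approximation of a real service area, which matches the "rough" wording above.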
## Data structure
The above data is used to randomly generate a user-specified number of customers. Currently, 1000 customers have been generated. Customer information includes:
- Given name and surname
- Street name, house number, zip code and city
- Two meter IDs per customer: one for a natural gas meter, one for an electricity meter
- Between 1 and 10 meter readings per customer (the number is also chosen randomly), each of which includes:
  - The date on which the reading was taken
  - The value that was read from the meter
- For simplicity, I assumed that electricity and gas meter readings always occur in pairs (i.e. there is no customer who reads *just* the electricity meter or *just* the natural gas meter)
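
The generation rules listed above can be sketched roughly as follows. This is an illustration, not the code from `data_preparation`: the function names, dictionary keys, date range, house number range and value ranges are all assumptions made for the example.

```python
import random
from datetime import date, timedelta


def generate_readings(rng, start=date(2023, 1, 1), max_days=365):
    """Create 1 to 10 paired electricity/gas readings (ranges are assumed)."""
    readings = []
    for _ in range(rng.randint(1, 10)):
        readings.append(
            {
                "date": start + timedelta(days=rng.randint(0, max_days)),
                # Readings always come in pairs: one value per meter type.
                "electricity": round(rng.uniform(0, 10_000), 1),
                "gas": round(rng.uniform(0, 10_000), 1),
            }
        )
    return sorted(readings, key=lambda r: r["date"])


def generate_customer(rng, given_names, surnames, streets, zips):
    """Pick one random entry from each source list to build a customer."""
    zip_code, city = rng.choice(zips)
    return {
        "given_name": rng.choice(given_names),
        "surname": rng.choice(surnames),
        "street": rng.choice(streets),
        "house_number": rng.randint(1, 200),
        "zip": zip_code,
        "city": city,
        "readings": generate_readings(rng),
    }
```

Passing a seeded `random.Random` instance, for example `random.Random(42)`, makes the generated data reproducible between runs.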
The customers, meters and address data are generated and uploaded to the SQL database. The ERD of the database looks like this:
<p align="center">
<img src="./assets/db_diag.svg" alt="An ERD of the avacon customer database" width="50%">
</p>
Customers have a first name and a last name and reference other tables only by their gas and electricity meter IDs. I preferred this to referencing addresses because one address can contain multiple households (and meters), so meter IDs seemed the more natural choice.
Meters have a signature, which also works like an ID. It is a string in the format `W.XXX.YYY.Z`, where `W`, `X`, `Y` and `Z` are digits from 1 to 9. The `MeterType` has the value `GAS` for gas meters and `ELT` for electricity meters. Each meter is located at a certain address and is therefore linked to the `Addresses` table by an `AddressID`.
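
To make the signature format concrete, here is a small sketch that generates and validates strings of the form `W.XXX.YYY.Z`. The helper names `random_signature` and `is_valid_signature` are hypothetical and may not match the scripts in `data_preparation`.

```python
import random
import re

# One digit, two groups of three digits, one digit; every digit is 1-9.
SIGNATURE_PATTERN = re.compile(r"[1-9]\.[1-9]{3}\.[1-9]{3}\.[1-9]")


def random_signature(rng):
    """Build a random signature from eight digits between 1 and 9."""
    d = [str(rng.randint(1, 9)) for _ in range(8)]
    return f"{d[0]}.{''.join(d[1:4])}.{''.join(d[4:7])}.{d[7]}"


def is_valid_signature(signature):
    """Return True if the whole string matches the signature format."""
    return SIGNATURE_PATTERN.fullmatch(signature) is not None
```

Because the digit 0 is excluded, a string such as `0.123.456.7` is rejected by `is_valid_signature`.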
The `Addresses` table contains street name, house number, city, zip and geo information.
Finally, the `Readings` table stores the meter values read by the customers. Each reading is taken by exactly one customer from exactly one meter and contains the date and the value that was read off the meter.
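
The table relationships described above could look roughly like the following schema. This is a sketch only: it uses an in-memory SQLite database as a stand-in for the Azure SQL Database, and apart from `Addresses`, `Readings`, `MeterType` and `AddressID`, all table names, column names and key choices (such as using the signature as the primary key of the meters table) are assumptions that may differ from the actual ERD.

```python
import sqlite3

# Hypothetical schema; most names are assumed, see the note above.
SCHEMA = """
CREATE TABLE Addresses (
    AddressID INTEGER PRIMARY KEY,
    Street TEXT NOT NULL,
    HouseNumber TEXT NOT NULL,
    City TEXT NOT NULL,
    Zip TEXT NOT NULL,
    Latitude REAL,
    Longitude REAL
);
CREATE TABLE Meters (
    Signature TEXT PRIMARY KEY,
    MeterType TEXT NOT NULL CHECK (MeterType IN ('GAS', 'ELT')),
    AddressID INTEGER NOT NULL REFERENCES Addresses(AddressID)
);
CREATE TABLE Customers (
    CustomerID INTEGER PRIMARY KEY,
    FirstName TEXT NOT NULL,
    LastName TEXT NOT NULL,
    GasMeterID TEXT NOT NULL REFERENCES Meters(Signature),
    ElectricityMeterID TEXT NOT NULL REFERENCES Meters(Signature)
);
CREATE TABLE Readings (
    ReadingID INTEGER PRIMARY KEY,
    CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID),
    MeterID TEXT NOT NULL REFERENCES Meters(Signature),
    ReadingDate TEXT NOT NULL,
    ReadingValue REAL NOT NULL
);
"""


def create_database():
    """Create the sketch schema in a fresh in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(SCHEMA)
    return conn
```

The `CHECK` constraint restricts `MeterType` to the two values named above, so inserting any other meter type fails.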

assets/db_diag.svg Normal file

File diff suppressed because one or more lines are too long
