In order to compare the (not yet implemented) SQL query generated by the LLM with an actual query, another text field was added that passes the query to `pyodbc`, which connects to our database. The resulting rows are stored in a `pandas` dataframe and displayed as a table in Plotly Dash. The SQL functionality is implemented in the `sql_utils.py` module. Additionally, some minor updates were made to the overall behavior and layout of the app.
grid_application
Code for my application to Avacon AG for the role of Data Scientist. This is a web app containing different data science use-cases related to power grids and electricity generation.
Data sources
In this application, the data was randomly generated and has been uploaded into an Azure SQL Database for you already. In order to be transparent about how this was done, the scripts and files are included in this repository.
The scripts for general preprocessing and for database interaction are both located in the `data_preparation` directory. The raw data and the preprocessed data file that was ultimately uploaded to the database are found in the `data` directory.
All sources for this data are publicly available. Here is a list of the resources used for each kind of information:
- German surnames: Most frequent German Surnames from Wiktionary
- German given names: Most frequent male and female given names in Germany from Wiktionary
- Street names: Street names from the Hanseatic city of Rostock, made available as open data here
- Zip codes: from opendatasoft
- Additional information for each zip, such as city name, longitude, latitude etc. using this public API
- Rough bounding box information for Avacon Netz service area: netzgebiete.avacon.de
Data structure
The above data is used to randomly generate a user-specified number of customers. Currently, 1000 customers have been generated. Customer information includes:
- Given name and surname
- Street name, house number, zip code and city
- Two meter IDs per customer: one for a natural gas meter, one for an electricity meter
- Between 1 and 10 meter readings per customer (the number is also chosen randomly), each of which includes:
  - The date on which the reading was obtained
  - The value that was read from the meter
- For simplicity, I assumed that electricity and gas meter readings always occur in pairs (i.e. no customer reads only the electricity meter or only the natural gas meter)
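The generation rules above can be illustrated with a minimal sketch. It is not the script from the repository: the name pools, the value ranges, the date range and all function and key names are hypothetical and chosen only for illustration. It shows the two rules stated above, namely 1 to 10 readings per customer and readings that always come as a gas/electricity pair.

```python
import random
from datetime import date, timedelta

# Hypothetical sample pools; the real project draws from the public lists above.
GIVEN_NAMES = ["Anna", "Lukas", "Maria", "Jonas"]
SURNAMES = ["Müller", "Schmidt", "Schneider", "Fischer"]


def random_readings(rng, n):
    """Return n paired readings: one date with a value for each meter type."""
    start = date(2020, 1, 1)  # assumed start of the reading period
    readings = []
    for _ in range(n):
        readings.append({
            "date": start + timedelta(days=rng.randrange(1000)),
            # Assumed value ranges, not taken from the project.
            "electricity_value": round(rng.uniform(0, 10000), 1),
            "gas_value": round(rng.uniform(0, 10000), 1),
        })
    return sorted(readings, key=lambda r: r["date"])


def generate_customers(count, seed=None):
    """Generate `count` customers, each with 1 to 10 paired readings."""
    rng = random.Random(seed)
    customers = []
    for _ in range(count):
        customers.append({
            "given_name": rng.choice(GIVEN_NAMES),
            "surname": rng.choice(SURNAMES),
            "readings": random_readings(rng, rng.randint(1, 10)),
        })
    return customers


customers = generate_customers(1000, seed=42)
```

Passing a fixed `seed` makes the generated data reproducible, which is convenient when the same file has to be regenerated before an upload.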
The customers, meters and address data are generated and uploaded to the SQL database. The ERD of the database looks like this:
Customers have a first name and a last name and reference other tables only by their gas and electricity meter IDs. I preferred this to referencing addresses because several households (and meters) can share one address, so meter IDs seemed the more natural choice.
Meters have a signature, which also works like an ID. It is a string in the format W.XXX.YYY.Z, where W, X, Y and Z are digits from 1 to 9. The MeterType has the value GAS for gas meters and ELT for electricity meters. Each meter is located at a certain address and is therefore linked to the Addresses table by an AddressID.
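As a small illustration of the signature format, the following sketch generates and validates such strings. It assumes that every digit position is drawn independently from 1 to 9; the function names are hypothetical and not taken from the repository.

```python
import random
import re

# One digit, two groups of three digits, one digit; every digit is 1-9.
SIGNATURE_PATTERN = re.compile(r"[1-9]\.[1-9]{3}\.[1-9]{3}\.[1-9]")


def random_signature(rng):
    """Build a signature such as '4.172.935.6' from eight random digits."""
    d = [str(rng.randint(1, 9)) for _ in range(8)]
    return f"{d[0]}.{''.join(d[1:4])}.{''.join(d[4:7])}.{d[7]}"


def is_valid_signature(text):
    """Return True if the whole string matches the W.XXX.YYY.Z format."""
    return SIGNATURE_PATTERN.fullmatch(text) is not None


rng = random.Random(0)
sample_signatures = [random_signature(rng) for _ in range(5)]
```

Because the signature doubles as an ID, a real generator would also have to reject duplicates, for example by collecting the signatures in a set.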
The Addresses table contains street name, house number, city, zip and geo information.
Finally, the Readings table stores the meter values read by the customers. Each reading is taken by exactly one customer from exactly one meter and contains the date and the value that was read off the meter.
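The table relationships described above can be sketched as a small schema. This is only an approximation: it uses an in-memory SQLite database as a stand-in for the Azure SQL Database, and apart from the table names, MeterType and AddressID, all column names are assumptions. Using the signature as the primary key of Meters is also an assumption. The inserted rows are made-up placeholders.

```python
import sqlite3

SCHEMA = """
CREATE TABLE Addresses (
    AddressID INTEGER PRIMARY KEY,
    Street TEXT, HouseNumber TEXT, City TEXT, Zip TEXT,
    Latitude REAL, Longitude REAL
);
CREATE TABLE Meters (
    Signature TEXT PRIMARY KEY,
    MeterType TEXT CHECK (MeterType IN ('GAS', 'ELT')),
    AddressID INTEGER REFERENCES Addresses(AddressID)
);
CREATE TABLE Customers (
    CustomerID INTEGER PRIMARY KEY,
    FirstName TEXT, LastName TEXT,
    GasMeterID TEXT REFERENCES Meters(Signature),
    ElectricityMeterID TEXT REFERENCES Meters(Signature)
);
CREATE TABLE Readings (
    ReadingID INTEGER PRIMARY KEY,
    CustomerID INTEGER REFERENCES Customers(CustomerID),
    MeterID TEXT REFERENCES Meters(Signature),
    ReadingDate TEXT,
    ReadingValue REAL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Placeholder rows: one address, one meter pair, one customer, one reading pair.
conn.execute("INSERT INTO Addresses VALUES (1, 'Example Street', '1', 'Example City', '00000', 0.0, 0.0)")
conn.executemany("INSERT INTO Meters VALUES (?, ?, ?)",
                 [("1.111.111.1", "GAS", 1), ("2.222.222.2", "ELT", 1)])
conn.execute("INSERT INTO Customers VALUES (1, 'Erika', 'Mustermann', '1.111.111.1', '2.222.222.2')")
conn.executemany("INSERT INTO Readings (CustomerID, MeterID, ReadingDate, ReadingValue) VALUES (?, ?, ?, ?)",
                 [(1, "1.111.111.1", "2023-01-31", 120.5), (1, "2.222.222.2", "2023-01-31", 340.0)])

# Join each reading to its customer and meter.
rows = conn.execute("""
    SELECT c.LastName, m.MeterType, r.ReadingDate, r.ReadingValue
    FROM Readings r
    JOIN Customers c ON c.CustomerID = r.CustomerID
    JOIN Meters m ON m.Signature = r.MeterID
    ORDER BY m.MeterType
""").fetchall()
```

Both meters share the same AddressID here, which mirrors the design choice above: the address belongs to the meter, and the customer reaches it through the meter rather than directly.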