# Avacon Data Science project: Chat with your data application

Code for a self-implemented web app for my application at Avacon Netz for the role of Data Scientist. This is a "chat with your data" app using generated data of customers and their gas and electricity meter readings.

## Introduction

With this small project, I hope to underline my enthusiasm and curiosity for the position, but also to give an insight into my coding style and my technical skills. I chose the topic "chat with your data" because I found it fitting for the area of data and platform management. A simple natural-language interface through which colleagues without SQL skills or complex user interfaces can also retrieve relevant data is certainly very helpful and can improve the efficiency and adoption of data tools within the company.

I implemented this app entirely myself, from generating the dataset from publicly available sources to the final deployment on the Microsoft Azure cloud. All details are contained in this repository. The application is available at the following link: [Link](https://avc-app-ahbhc8hagheua3bx.germanywestcentral-01.azurewebsites.net/)

Username and password will be provided with my application documents. The same applies to the environment variables required to run the code locally.

## Running the Application

The app is deployed on an Azure App Service instance (see link above), but it can also be run locally. To do this, several environment variables need to be set for the APIs and authentication to work properly. These will be provided with my application documents.

To run this app, install Poetry (see the [official documentation](https://python-poetry.org/docs/) for details). Then, simply run the following commands in a shell of your choice:

```
poetry install
poetry shell
```

Now, you should be in a shell with all the required packages installed. This code connects to Azure SQL using `pyodbc`, so the Microsoft ODBC driver (version 18) must be installed as well; again, follow the [official documentation](https://learn.microsoft.com/en-us/sql/connect/odbc/download-odbc-driver-for-sql-server?view=sql-server-ver16). Once this is done, make sure the mentioned environment variables have been exported and then run:

```
cd app
gunicorn app:server -b 0.0.0.0:8000
```

The server should then start and can be accessed in a browser at `0.0.0.0:8000`.

## General Structure of the app

The main structure of the `Dash` app starts with a text input field where the question prompt can be entered. Once the submit button is clicked, the user message, together with a long and somewhat optimized system prompt, is sent via the OpenAI API to a generic GPT-4o model. The model is prompted to return its answer as a `JSON`-encoded string containing a summary in natural language and a SQL query. The query is run on the Azure SQL Database using `pyodbc`. Both the summary and the query itself are shown to the user in an output text field. The query result is read into a `pandas` dataframe, which is then displayed as an interactive table (a sketch of this pipeline is shown below).

Below this main section, there is a "control field" that can be used to manually input SQL queries for comparison. It is also possible to copy/paste the SQL output of the model into this field to check its result.
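At its core, the pipeline is: prompt → GPT-4o → JSON with summary and SQL → `pyodbc` → `pandas`. The following is a minimal sketch of that flow, not the actual app code: the environment-variable names are hypothetical placeholders (the real ones come with the application documents), and the system prompt is heavily abbreviated compared to the tuned one used in the app.

```python
import json
import os

import pandas as pd
import pyodbc
from openai import OpenAI

# Hypothetical environment-variable names; the real ones are
# distributed with the application documents.
CONN_STR = (
    "DRIVER={ODBC Driver 18 for SQL Server};"
    f"SERVER={os.environ['SQL_SERVER']};"
    f"DATABASE={os.environ['SQL_DATABASE']};"
    f"UID={os.environ['SQL_USER']};"
    f"PWD={os.environ['SQL_PASSWORD']};"
    "Encrypt=yes;"
)

# Strongly abbreviated stand-in for the longer, optimized system prompt.
SYSTEM_PROMPT = (
    "You translate questions about the customer/meter database into T-SQL. "
    'Answer as JSON: {"summary": "<natural language>", "query": "<SQL>"}'
)

def answer_question(question: str) -> tuple[str, str, pd.DataFrame]:
    """Send the user question to GPT-4o and run the returned SQL query."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force a JSON-encoded answer
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    answer = json.loads(response.choices[0].message.content)
    with pyodbc.connect(CONN_STR) as conn:
        # The resulting dataframe is what the app renders as a table.
        df = pd.read_sql(answer["query"], conn)
    return answer["summary"], answer["query"], df
```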
The questions that can be asked naturally depend on the data, which is described in detail in the following sections. Additionally, some example prompts are provided directly in the web application.

## Data sources

In this application, the data was randomly generated and has already been uploaded to an Azure SQL Database for you. To be transparent about how this was done, the scripts and files are included in this repository. The scripts for general preprocessing as well as database interaction are both located in the `data_preparation` directory. The raw data and the preprocessed data file that was ultimately uploaded to the database are found in the `data` directory. All sources for this data are publicly available. Here is a list of the sources used for the different pieces of information:

- German surnames: the most frequent German surnames from [Wiktionary](https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Namen/die_h%C3%A4ufigsten_Nachnamen_Deutschlands)
- German given names: the most frequent [male](https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Namen/die_h%C3%A4ufigsten_m%C3%A4nnlichen_Vornamen_Deutschlands) and [female](https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Namen/die_h%C3%A4ufigsten_weiblichen_Vornamen_Deutschlands) given names in Germany from Wiktionary
- Street names: street names from the Hanseatic city of Rostock, made available as open data [here](https://geo.sv.rostock.de/download/opendata/adressenliste/adressenliste.json)
- Zip codes: from [opendatasoft](https://public.opendatasoft.com/explore/dataset/georef-germany-postleitzahl/table/?dataChart=eyJxdWVyaWVzIjpbeyJjb25maWciOnsiZGF0YXNldCI6Imdlb3JlZi1nZXJtYW55LXBvc3RsZWl0emFobCIsIm9wdGlvbnMiOnt9fSwiY2hhcnRzIjpbeyJhbGlnbk1vbnRoIjp0cnVlLCJ0eXBlIjoiY29sdW1uIiwiZnVuYyI6IkNPVU5UIiwic2NpZW50aWZpY0Rpc3BsYXkiOnRydWUsImNvbG9yIjoiI0ZGNTE1QSJ9XSwieEF4aXMiOiJwbHpfbmFtZSIsIm1heHBvaW50cyI6NTAsInNvcnQiOiIifV0sInRpbWVzY2FsZSI6IiIsImRpc3BsYXlMZWdlbmQiOnRydWUsImFsaWduTW9udGgiOnRydWV9&location=6,51.3294,10.45412&basemap=jawg.light)
- Additional information for each zip code, such as city name, longitude, latitude, etc., from this public [API](https://github.com/digitalfabrik/gemeindeverzeichnis-django)
- Rough bounding-box information for the Avacon Netz service area: [netzgebiete.avacon.de](https://netzgebiete.avacon.de/rcmap/Content/Map/Detail.aspx?keep=dzjernxQf/whawjGMPFQgA==)

## Data structure

The above data is used to randomly generate a user-specified number of customers; currently, 1000 customers have been generated. Customer information includes:

- Given name and surname
- Street name, house number, zip code and city
- Two meter IDs per customer: one for a natural gas meter, one for an electricity meter
- Between 1 and 10 meter readings per customer (the number is also chosen randomly), each of which includes:
  - The date on which the reading was obtained
  - The value that was read from the meter
- For simplicity, I assumed that electricity and gas meter readings always occur in pairs (i.e. there is no customer that reads *just* electricity meter values or *just* natural gas meter values)

The customers, meters and address data are generated and uploaded to the SQL database; a minimal sketch of the generation step is shown below, followed by the ERD of the database.
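To make the description above concrete, here is a minimal sketch of how such a generation step could look. It is illustrative only, not the actual code from `data_preparation`: the small name and address pools stand in for the public sources listed above, and all field names are assumptions.

```python
import random
import uuid
from datetime import date, timedelta

# Illustrative stand-ins for the name/address pools loaded from the
# public sources listed above.
GIVEN_NAMES = ["Hans", "Anna"]
SURNAMES = ["Müller", "Schmidt"]
STREETS = ["Lange Straße", "Am Kabutzenhof"]
ZIPS = {"18055": "Rostock", "18057": "Rostock"}

def generate_customer(customer_id: int) -> dict:
    """Build one random customer with a gas and an electricity meter."""
    zip_code = random.choice(list(ZIPS))
    customer = {
        "id": customer_id,
        "given_name": random.choice(GIVEN_NAMES),
        "surname": random.choice(SURNAMES),
        "street": random.choice(STREETS),
        "house_number": random.randint(1, 120),
        "zip": zip_code,
        "city": ZIPS[zip_code],
        "gas_meter_id": str(uuid.uuid4()),
        "electricity_meter_id": str(uuid.uuid4()),
        "readings": [],
    }
    # Between 1 and 10 reading dates; gas and electricity are always
    # read together, so each date yields a pair of readings.
    for _ in range(random.randint(1, 10)):
        reading_date = date(2023, 1, 1) + timedelta(days=random.randint(0, 365))
        customer["readings"].append(
            {
                "date": reading_date,
                "gas_value": round(random.uniform(0, 20_000), 1),
                "electricity_value": round(random.uniform(0, 50_000), 1),
            }
        )
    return customer

customers = [generate_customer(i) for i in range(1000)]
```

The ERD of the database looks like this: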