Following Part 1, we will now guide you through a real-world use case, forecasting power consumption in France, using our methodology for tackling data science problems. Please note that for any question you may come up with in the course of this article, we will be happy to reply through [email protected].
For this purpose we collected data from RTE, the French electricity transmission system operator. They launched a challenge on www.datascience.net to forecast the daily power consumption in France. In the following paragraphs we will present the different steps required to implement a well-suited model. Let us note that we will also take advantage of this opportunity to highlight the many challenges we may face throughout a time series analysis.
1. Description of the data
The data is composed of a set of columns including information ranging from the date and time when the data was collected to the breakdown of consumption across the different energy sources. Additionally we can note that:
- We downloaded the historical data from 2012 to 2016.
- The time slot between valid observations is 30 minutes.
- The national and regional French power consumption data is available from the RTE eco2mix portal: http://www.rte-france.com/fr/eco2mix/eco2mix-telechargement
1.1. Performance score
For each forecasted day, we will use the Mean Absolute Percentage Error (MAPE) to measure the accuracy of our predictions. It is defined by:
\(MAPE=\frac{100\%}{n}\times\sum_{i=1}^n \frac{|C_i-C_i^*|}{C_i}\)
where \(C_i\) is the actual consumption, \(C_i^*\) the forecasted value, and \(n\) the number of forecasted points.
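For reference, a minimal NumPy implementation of this metric could look like the sketch below (the function name is ours; the MAPE implementation actually used during modeling is described later in this article):

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

# A forecast that is off by 5% on every point gives a MAPE of 5%
print(mape([100, 200, 400], [105, 210, 420]))  # 5.0
```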
1.2. Goal definition
As we intend to give a detailed approach to and understanding of the problem, we set ourselves the following objective: forecast one day of power consumption, namely 48 predictions ahead of the last observed data point.
1.3. Technical information
1.3.1. Development platform
Axionable’s core data science team always works in Jupyter Notebooks with Python 3 for development and testing purposes. Afterwards, production code can be written according to the client’s technology or architecture.
The following image presents our working directory for this project.
1.3.2. Hardware specifications
We initially started developing this part of the project, following our methodology, by performing the data exploration and data cleaning on a personal computer. However, when we started modeling and optimizing parameters, we were quickly limited by long computation times due to the high granularity of the data (more than 87,000 valid observations). Indeed, as we wanted to use the whole dataset to train our model, an enormous number of computations had to be done in memory, which tends to slow down the whole process.
Later, after getting more accustomed to the context, we arrived at a critical point where we started questioning the necessity of using the whole dataset: is it really relevant to use this huge amount of information in order to forecast only one day of power consumption? That question will be answered later.
In this context, let us list the two hardware configurations that we used:
- Personal computer:
  - Model: MacBook Pro
  - RAM: 8 GB
  - Processing: 2 cores at 2.6 GHz
- Cloud computing:
  - VM: AWS m4.4xlarge
  - RAM: 64 GB
  - Processing: 16 cores at 2.4 GHz
2. Data importing and data cleaning
The data source is stored in the “./data” folder and is composed of several Excel files, each file containing the information for a whole year. Here is a glimpse of the operations that we are going to perform:
- Load data into a Pandas DataFrame
- Format the names of the DataFrame’s columns
- Filter null data
- Create a datetime index
2.1. Load data into dataframe
As we aim for our code to be production-ready, we implemented a loop to automatically read all the Excel files contained in the “./data” folder. That way we do not have to worry about reading individual files or appending them into a single DataFrame one by one.
directory = "data" df_data = pandas.DataFrame() for filename in os.listdir(directory): if filename.endswith(".xlsx"): excel_file = "{}/{}".format(directory, filename) print("Appending file '{}'".format(excel_file)) df_data = df_data.append(pandas.read_excel(io=excel_file)) df_data.head() 2.2. Format the names of the DataFrame's columns
It is recommended to format the column names of Excel tables to remove spaces and special characters, as is often the case in French documents. Moreover, having no spaces or special characters allows us to use Python’s autocompletion feature.
```python
import re
from unidecode import unidecode

# Format column names
def change_column_name(name):
    name = re.sub(r"(\.|°)", "", name)   # Remove dots and degree signs
    name = unidecode(name)               # Remove accented/special characters
    name = re.sub(r"\s+", " ", name)     # Collapse consecutive whitespace
    name = name.lower()                  # Lower-case the names
    name = re.sub(r"(\s|-)", "_", name)  # Replace spaces and dashes with "_"
    return name

df_data.columns = [change_column_name(_) for _ in df_data.columns]
df_data.head()
```
For this cleaning step, we would rather use the re (regular expression) module instead of the standard Python string functions. As a matter of fact, it enables us to create very powerful regular expressions to format strings, even if it is sometimes slower than the standard functions.
2.3. Filter null data
As explained in the introduction, we will only focus on univariate time series forecasting. Consequently we are not considering the other variables such as fioul, charbon, gaz (fuel oil, coal, gas), etc. Besides, we notice that the raw data frame contains one row every 15 minutes, with “NaN” values every 30 minutes (i.e. every other row).
In this context, we will remove the null rows and keep only the France perimeter:
```python
target_columns = ["perimetre", "nature", "date", "heures", "consumption"]
df_data = df_data[target_columns]

# Keep only the France perimeter
idx_france = df_data.perimetre == "France"
df_data = df_data[idx_france]

# Identify the NaN values (a value is NaN if it is not equal to itself)
idx_nans = df_data.consumption != df_data.consumption
df_data[idx_nans]

# Drop the NaN rows
df_data = df_data[~idx_nans]
```
A nice trick to check whether there are NaN values in a DataFrame is to compare a Series with itself. Indeed, in such a self-comparison with the equality operator, only null values evaluate to False, because NaN is never equal to NaN (hence the inequality test above selects exactly the NaN rows).
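As a quick illustration of this trick on a toy Series (purely for demonstration):

```python
import numpy as np
import pandas

s = pandas.Series([1.0, np.nan, 3.0])
print(s == s)    # [True, False, True]  -> False only at the NaN position
print(s != s)    # [False, True, False] -> the mask used above to select NaNs
print(s.isna())  # pandas' built-in equivalent
```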
2.4. Create a datetime index
To obtain a valid datetime at minute granularity, we need to combine the “date” and “heures” columns. An index sort is also applied in order to avoid plotting problems.
```python
from datetime import datetime

def create_index(row):
    # Combine the date and time columns into a single datetime value
    new_date = datetime.combine(row.date, row.heures)
    return new_date

df_data["temps"] = df_data.apply(create_index, axis=1)
df_data.set_index("temps", inplace=True)
df_data.sort_index(inplace=True)
```

3. Data exploration
Once we consider our data clean, we proceed to the data exploration phase, which aims at answering at least the following questions:
- Is there any trend and/or seasonality in the TS?
- Do we need to do further data cleaning?
  - Sometimes, plotting the data helps to detect other problems such as outliers, human errors in the data input, etc.
- Do we need to do data aggregation?
- Is the data consistent with reality?
  - This is a crucial step in any data science problem: our analysis and results must conform with reality.
As you can imagine, we will plot the data at different time granularities in order to detect behaviours such as trend or seasonality. At Axionable, we prefer using the plotly library to make our plots interactive. In addition, let us mention that we are accustomed to creating a function for every action that can be automated, in order to improve code readability. Moreover, those functions may be reused in another context without the need to redevelop them from scratch.
Below is the first plotting function, which draws a time series given the data points and some basic layout properties.
```python
import plotly.graph_objs as go
from plotly.offline import iplot

def plot_time_serie(s, title=None, width=950, height=400):
    """Plot a time series.

    The input data must be a pandas.Series.
    """
    datadis = list()
    datadis.append(
        go.Scatter(
            x=s.index,
            y=s.values
        )
    )
    layout = go.Layout(
        title=title,
        showlegend=False,
        width=width,
        height=height
    )
    fig = go.Figure(data=datadis, layout=layout)
    iplot(fig)
    return None
```
3.1. Plot interactive time series using Plotly
Here, we will plot the whole historical data and hopefully determine at first glance key indicators for the next modeling phase.
```python
# Plot the TS at 30' granularity
plot_time_serie(df_data.consumption,
                title="Power consumption from 2012 to 2016 @30' granularity")
```
Considering this plot, we can infer that:
There is a clear yearly seasonality combined with a trend in the data. Moreover, if we zoom in, we can also see that this tendency is coupled with weekly and daily seasonalities. These remarks make up the main problem that we faced: the SARIMA model that we will use takes into account only one seasonality parameter, which raises a major question: which seasonality is the most suited for our objective?
It would be interesting to see whether there are common consumption behaviours at the week/year/day level and to corroborate them against reality. For instance, we should find lower energy consumption during the holiday season.
3.2. Plot seasonality and trend profiles at different granularities
How can we determine the evolution profile of a selected season in a time series?
Box plots are really useful for giving fast insight into data trends at different levels of granularity.
Below, we define a function that directly plots the box plots from a DataFrame, given the target column and the column holding the labels of the set of box plots.
```python
def plot_box_plot(df, target, granularity_level, title=None):
    # One box per distinct label found in the granularity column
    list_labels = df[granularity_level].unique()
    datadis = list()
    for label_name in list_labels:
        idx_label = df[granularity_level] == label_name
        datadis.append(
            go.Box(
                y=df[target][idx_label],
                boxmean=True,
                boxpoints='all',
                name=label_name
            )
        )
    layout = go.Layout(
        title=title
    )
    fig = go.Figure(data=datadis, layout=layout)
    iplot(fig)
    return None
```
First of all, we will plot the weekly consumption profile and analyse it. For this purpose, data aggregation at day granularity is naturally required, which is why we group the data using pandas’ groupby function.
```python
# Create a DataFrame at day granularity
# (df_jour is the consumption aggregated by day and load_week_python returns
#  the week identifier of a date; both helpers are not shown here)
df_day = pandas.DataFrame(df_jour)
df_day = df_day[df_day.index > "2012-01-01"]
df_day["week"] = df_day.index.map(load_week_python)
df_day["consumption_week"] = (df_day
                              .groupby("week")["consumption"]
                              .transform(sum))

# Weight of each day within its week, in percent
df_day["week_weight"] = df_day.consumption / df_day.consumption_week
df_day["week_weight"] *= 100

# Add the day name
df_day["day_name"] = df_day.index.map(lambda x: x.strftime("%A"))

plot_box_plot(df_day, "week_weight", "day_name",
              title="Weekly power consumption profile")
```
This plot can be read from two different perspectives. If we consider the whole figure, we can see that the overall energy consumption is greater on working days than on weekends. On the other hand, the box plots taken independently show that daily consumption does not vary much from one week to another (less than 1% difference between the first and third quartiles).
In the same way, we proceed to plot the yearly and daily profile…
The plots above play a significant role in our understanding of the problem and led us to the following conclusions:
- The power consumption drops during the summer holidays, and that tendency is confirmed over the 5 years of historical data.
- The peaks in the daily consumption profile are highly correlated with French lunch and dinner times.
- There are clear weekly and daily profiles, which would have been very useful if we had missing values in our dataset: imputing such values according to the profiles seen above would have been a well-suited method.
From now on, since this dataset is in line with reality, we may assert that no further data cleaning is needed. Also, considering the objective we defined, combined with the insights gained during exploration, we may assert that a daily seasonality is the most appropriate (explicitly, every 48 observations).
4. Modeling – Choosing our prediction model
Reassured by all the work done in the previous stage, particularly the box-plots which highlighted the seasonal effects as well as a clear trend, we decided to opt for a SARIMA model. However, we can only apply this model to a time series (TS) which is stationary.
The next paragraphs discuss how we deal with non-stationarity in practice, as well as the core of the modeling part: the optimization of the parameters.
To set things up, let us keep in mind that the SARIMA model has six parameters to optimize, (p, d, q) and (P, D, Q), plus the seasonal period S. We refer you back to our previous article for a basic explanation of each of them and a refresher on how to choose the optimal ones in some cases.
4.1. Pre selection of the SARIMA parameters
In line with our explanations, we will first infer the differencing parameters d and D before starting the modeling.
In that regard, we can implement a function that, for given d and D values, will plot the new TS and its corresponding ACF and PACF plots.
In practice, we look for the following patterns:
- The ACF and PACF plots should both decrease rapidly towards 0 (exponentially or in a sinusoidal manner).
- A significant spike at a certain lag of the ACF without any repetition afterwards (it gives a hint about the value of the parameter q).
- A significant spike at a certain lag of the PACF without any repetition afterwards (it indicates an ideal value of the parameter p).
Accordingly, we will choose differencing orders that make the new TS stationary.
Below is the code chunk that we developed to automate plotting the TS at different differencing orders. We derived from these plots that the best parameters were 1 and 2 for d and D respectively.
```python
import colorlover as cl
from scipy.special import binom
from statsmodels.tsa.stattools import acf, pacf

def plot_acf_pacf(s, nb_lags, d=0, D=0, seasonality=0):
    """Plot the ACF and PACF functions, after the differencing steps."""
    # Changing the default color palette
    cl.scales["3"].keys()
    col = cl.scales["3"]["qual"]["Paired"]

    # Apply the seasonal differencing steps, if necessary
    if D > 0:
        s_diff = s
        for i in range(1, D + 1):
            s_diff = s_diff + pow(-1, i) * binom(D, i) * s.shift(i * seasonality)
        s = s_diff[(D * seasonality):]

    # Apply the regular differencing steps, if necessary
    if d > 0:
        s_diff = s
        for i in range(1, d + 1):
            s_diff = s_diff + pow(-1, i) * binom(d, i) * s.shift(i)
        s = s_diff[d:]

    plot_time_serie(s)

    lag_acf = acf(s, nlags=nb_lags)
    lag_pacf = pacf(s, nlags=nb_lags)
    x_axis = [*range(nb_lags + 2)]

    # plot_lag_curves is a Plotly plotting helper (not shown in this article)
    plot_lag_curves(x=x_axis, y=lag_acf,
                    title="Autocorrelation Function", T=len(s))
    plot_lag_curves(x=x_axis, y=lag_pacf,
                    title="Partial Autocorrelation Function", T=len(s))
    return None
```
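As a usage sketch, note that for the common case d = 1 and D = 1 the binomial differencing performed inside the function reduces to pandas’ built-in diff (the number of lags below is only illustrative):

```python
# Seasonal differencing (D=1, S=48) followed by regular differencing (d=1)
s_diff = df_data.consumption.diff(48).dropna()
s_diff = s_diff.diff(1).dropna()

# Equivalent differencing performed inside the function defined above
plot_acf_pacf(df_data.consumption, nb_lags=100, d=1, D=1, seasonality=48)
```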
4.2. Hyperparameters
Now we can focus on tuning the other parameters (p, q, P, Q). Naturally, this implies creating grids of possible values, fitting the corresponding models and evaluating them in order to choose the best one according to our metric (MAPE).
We do this using GridSearchCV from the scikit-learn module. It has the advantage of allowing us to run operations in parallel as well as to validate the results through cross-validation techniques.
Important note:
Since SARIMA is not part of scikit-learn, we had to create a custom class wrapping our model so that it complies with the GridSearchCV interface. The same issue appears in the automation part of projects, where scikit-learn pipelines are often used for that purpose.
We implemented the MAPE error function and added it to our class in order to evaluate the performance.
Cross-validation for time series is very different from what is done for most data science problems, because the sequential character of time has to be preserved.
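To make this more concrete, below is a minimal sketch of what such a wrapper might look like, assuming statsmodels’ SARIMAX as the underlying implementation; the class name, defaults and grid values are illustrative, not the exact code we used:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from statsmodels.tsa.statespace.sarimax import SARIMAX

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (see section 1.1)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

class SarimaEstimator(BaseEstimator):
    """Wrap a SARIMA model so that GridSearchCV can clone, fit and score it."""

    def __init__(self, p=1, d=1, q=1, P=1, D=1, Q=1, S=48):
        self.p, self.d, self.q = p, d, q
        self.P, self.D, self.Q, self.S = P, D, Q, S

    def fit(self, X, y=None):
        # X is the consumption series itself (univariate setting)
        self.results_ = SARIMAX(
            X,
            order=(self.p, self.d, self.q),
            seasonal_order=(self.P, self.D, self.Q, self.S)
        ).fit(disp=False)
        return self

    def predict(self, X):
        # Forecast as many points as there are in the evaluation fold
        return self.results_.forecast(steps=len(X))

    def score(self, X, y=None):
        # GridSearchCV maximizes the score, hence the negative MAPE
        return -mape(X, self.predict(X))

# Grid search with a time-series-aware cross-validation scheme:
# each fold trains on the past and evaluates on the future only.
param_grid = {"p": [0, 1, 2], "q": [0, 1, 2], "P": [0, 1], "Q": [0, 1]}
grid = GridSearchCV(
    SarimaEstimator(),
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    n_jobs=-1
)
# grid.fit(df_data.consumption)  # uncomment to run on the cleaned series
```

The essential point is the cv=TimeSeriesSplit(...) argument: unlike a random K-fold, it keeps the chronological order, so the model is never evaluated on data that precedes its training window.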
This marks the end of our modeling part; we can now turn to the evaluation of our model on real data. Inevitably, this involves comparing the forecasts to the observed values. Having trained our model on the date range from 01/01/2015 to 01/30/2015, we will now perform a forecast for the first day of February.
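As an illustration, the evaluation step could look like the following sketch with statsmodels’ SARIMAX; the orders below are placeholders for the values selected by the grid search, and the mape helper is the one sketched in section 1.1:

```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Training window: January 2015; evaluation: the 48 half-hours of 1 February 2015
train = df_data.consumption.loc["2015-01-01":"2015-01-30"]
actual = df_data.consumption.loc["2015-02-01"]

# Illustrative orders; substitute the parameters selected in section 4
results = SARIMAX(train, order=(1, 1, 1),
                  seasonal_order=(1, 1, 1, 48)).fit(disp=False)
forecast = results.forecast(steps=48)

print("MAPE on 1 February 2015: {:.2f}%".format(mape(actual.values, forecast)))
```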
Below are the results observed for the forecast of the first day.
5. Conclusions
5.1. Data Science on a production environment
One of the key suggestions we can make is to always adapt the solution to the production environment. In many cases, the best-performing model cannot be deployed because of processing/memory limitations or the time it would take to fit it.
In our experience at Axionable, in order to choose the most appropriate model one should inevitably answer these questions:
- Scheduling
  - At which frequency should we run our code?
- Production environment
  - Do we need to do parallel processing?
  - What are the memory and computing limitations?
- Scalability
  - Will the input data grow over time?
  - Can our code be extended to parallel processing, if needed?
  - Does our code need to be implemented in a distributed way?
We then need to trade off between these constraints and choose the most suitable model for the target project.
5.2. Next steps
Part of modeling is the search for alternatives to solve a particular problem. In that regard, we list below some steps that we could have taken, but for simplicity’s sake we chose not to tackle them for now.
- Compare SARIMA with other univariate forecasting methods:
  - Exponential Smoothing
  - Recurrent Neural Networks (LSTM)
- Extend the SARIMA forecast by adding exogenous variables, that is, represent it like a generalised linear model (see the sketch after this list).
- As a long-term goal, write an article on multivariate time series prediction.
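Regarding the exogenous-variable extension, statsmodels’ SARIMAX already accepts external regressors; the sketch below uses synthetic consumption and temperature series purely for illustration:

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic stand-ins for consumption and temperature (10 days at 30' granularity)
rng = np.random.default_rng(0)
t = np.arange(480)
consumption = 55000 + 5000 * np.sin(2 * np.pi * t / 48) + rng.normal(0, 500, 480)
temperature = 10 + 5 * np.sin(2 * np.pi * t / 48) + rng.normal(0, 1, 480)

# SARIMA with an exogenous regressor (SARIMAX); orders are illustrative
results = SARIMAX(
    consumption,
    exog=temperature.reshape(-1, 1),
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 48)
).fit(disp=False)

# Out-of-sample forecasts require future values of the exogenous variable
future_temperature = 10 + 5 * np.sin(2 * np.pi * np.arange(480, 528) / 48)
forecast = results.forecast(steps=48, exog=future_temperature.reshape(-1, 1))
```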