In this Part 1, we will introduce the building blocks as well as Axionable’s methodology for data science projects. In the next article, we will go through a real-world use case of time series prediction: forecasting energy consumption in France.

Because they draw on several different fields, data science projects can be broken down into many parts. With that in mind, we present below our end-to-end methodology for solving such problems.

Our methodology aims to deal with the challenges and changes that may occur over the lifecycle of a project. In most cases, it involves 7 steps:

- **Goals’ definition:** In close collaboration with the client, we clearly define the goals and the scope of the project so that we can iterate on them.
- **Data Acquisition and Cleaning:** Given the objective defined previously, we identify every data source that may be relevant and design extraction pipelines to retrieve the data needed. Then, because anomalies are inevitably present in any dataset, we clean the data so that it fits the analysis that will be done later.
- **Data Exploration:** The objective of this stage is to gain more advanced insights into the dataset and an idea of the operations that we may perform to improve its relevance.
- **Modeling and Evaluation:** This step may be considered the core of data science: it consists of applying diverse machine learning techniques and evaluating all the contending models.
- **Documentation:** Part of our job is to make our work as understandable as possible. To that end, we provide a detailed notebook with comments on the major steps that we followed.
- **Automation:** Machine learning models may need to be recoded to make them more efficient, for example using a distributed framework such as Spark.
- **Operations and optimization:** This final step involves monitoring performance through well-chosen KPIs and retraining the models whenever changes occur.

Note that in this series we will follow these steps up to Modeling and Evaluation. If you need more details about any other part, you may contact us at datascience@axionable.com.

Time series (TS) can be seen as sequences of data points measured over successive intervals of time. What sets them apart from most other fields of Machine Learning is the dependence on time and the seasonal behaviors that may appear in their evolution. Indeed, the observations are no longer independent, and some sets of variations, along with an increasing or decreasing trend, recur over a particular timeframe.

In practice we distinguish 2 kinds of TS. A TS is univariate when only the values of a single variable are observed, and multivariate when more than one variable is considered. We will focus on univariate time series for simplicity. Most of the approaches presented can then be applied, with some adjustments, to multivariate TS.

Over the last few years, the field has attracted a lot of attention thanks to the multiplicity of its applications. In a context where business decisions are increasingly guided by insights from data analysis, TS enable companies to plan ahead and perform 3 main tasks:

- **Forecast:** They allow the prediction of the future based on past events.
- **Prevention:** They permit the control of the processes producing the series.
- **Description:** They help understand the inherent structure and the mechanism generating the series (overall trend, cyclic patterns, etc.).

In line with these applications, TS analysis is used in many different sectors. Among its use cases, we may cite:

- **Meteorology:** Prediction of weather variables such as temperature, precipitation, wind, etc.
- **Economy & Finance:** Explanation and prediction of economic factors, financial indexes, exchange rates, etc.
- **Marketing:** Keeping track of the key performance indicators of businesses such as sales, incomes/expenses, etc.
- **Telecommunications & Industry:** Control of energy variables, efficient logs, sentiment and behavior analysis, etc.
- **Web:** Web traffic sources, clicks and logs, etc.

In practice, although most data is collected continuously, we usually work with discrete TS where consecutive observations are equally spaced in time. The discretization is done either by keeping only the values measured at a given time step or by aggregating the continuous measurements over a specified period.
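The two ways of discretizing a series can be sketched in a few lines of plain Python. This is only an illustration: the helper names `keep_every` and `aggregate` and the sample readings are hypothetical, not part of the dataset used later.

```python
from statistics import mean

def keep_every(values, step):
    """Sampling: keep one measurement every `step` points."""
    return values[::step]

def aggregate(values, step, func=mean):
    """Aggregation: merge each block of `step` consecutive points into one value."""
    return [func(values[i:i + step]) for i in range(0, len(values), step)]

# Hypothetical half-hourly readings (illustrative numbers, not real data)
half_hourly = [50, 52, 55, 53, 60, 62, 58, 56]

hourly_sampled = keep_every(half_hourly, 2)  # first reading of each hour
hourly_mean = aggregate(half_hourly, 2)      # mean over each hour
print(hourly_sampled)
print(hourly_mean)
```

Sampling keeps the original measurements but discards information between them, whereas aggregation smooths the series; which one is appropriate depends on the question being asked.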

In order to make the definitions as concrete as possible, we will present the 3 main characteristics of TS using a dataset that is freely available on the internet: the trend, the seasonality and cyclical patterns.

The data in question is downloaded from RTE, the French electricity transmission system operator, and consists of records of daily power consumption in France from 2012 to 2016. The time interval between two measurements is 30 minutes, which gives us a sizeable dataset with more than 150,000 records.

Below is an interactive plot to let you visualize the data. Note that we generally use a Python package named *Plotly* to generate such visualisations. Its main advantage is that we can zoom in and out to get either a granular view or a high-level one.

**Trend & seasonality:**

If we look at the observations year by year, we see a typical tendency repeating itself: a decreasing pattern in the first half of the year followed by an increasing one in the second half. That is what we call a trend, which can be informally defined as the long-term increases or decreases present in a dataset. In addition, we note that the trend seems to be dictated by each half of the year. That kind of fixed influence, tied to a defined calendar period, is what is called seasonality.
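To make the two notions more tangible, here is a minimal sketch of how a trend and a seasonal profile can be separated with a centered moving average. It runs on a synthetic series rather than on the RTE data, and the function names `centered_moving_average` and `seasonal_profile` are hypothetical helpers written for this illustration.

```python
from statistics import mean

def centered_moving_average(series, window):
    """Estimate the trend as the mean of `window` points centred on each index.

    `window` must be odd; the first and last `window // 2` points get no estimate.
    """
    half = window // 2
    return [mean(series[i - half:i + half + 1])
            for i in range(half, len(series) - half)]

def seasonal_profile(series, trend, period):
    """Average the detrended values separately for each position in the period."""
    half = (len(series) - len(trend)) // 2  # offset between series and trend
    detrended = [series[i + half] - t for i, t in enumerate(trend)]
    return [mean(detrended[(pos - half) % period::period]) for pos in range(period)]

# Synthetic series: linear trend plus a 7-step pattern that sums to zero
pattern = [3, 1, -2, -4, 0, 1, 1]
series = [0.5 * i + pattern[i % 7] for i in range(35)]

trend = centered_moving_average(series, window=7)
season = seasonal_profile(series, trend, period=7)
print(trend[:3])
print(season)
```

Averaging over exactly one full period cancels the repeating pattern, so what remains is the slow drift; subtracting that drift and averaging by position in the period recovers the repeating pattern.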

**Cyclical patterns:**

Cyclical patterns are, like seasonality, sets of rises and falls in the series; the only difference is that these patterns do not have a fixed length.

Before diving deep into the analysis, it is important to pin down the core notions in order to understand the underlying mechanisms of time series. The reason is that modeling first requires gaining as much insight as possible into the situation; typically, from one data science problem to another, models need to be optimized according to a set of values called *hyperparameters*. In fact, *hyperparameter* tuning is among the key differentiators between good models and state-of-the-art models.

Thus we will define stationarity, differencing and SARIMA to give you a set of tools to build a well-suited model. Note that we do not aim to provide a theoretical course; we focus on a practical understanding so that you can deal with those parameters while modeling.

**Stationarity & differencing:**

Basically, a series is stationary when its statistical properties do not depend on the time at which it is observed. That means that the correlation between 2 observations does not depend on their position in time but only on the lag between them. Its importance lies in the fact that the parameters of stationary models are stable in time. Naturally, that assumption implies that the mean and the variance should be constant regardless of the chosen period of time. Consequently, TS with trend or seasonality are not stationary: those factors directly affect the mean and variance of the series.
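A quick, informal way to see this is to compare the mean and variance over successive windows of a series. The sketch below does so on two synthetic series; `window_stats` is a hypothetical helper, and this is a rough visual-style check, not a formal statistical test of stationarity.

```python
from statistics import mean, pvariance

def window_stats(series, n_windows):
    """Split the series into equal consecutive windows; return (mean, variance) of each."""
    size = len(series) // n_windows
    windows = [series[k * size:(k + 1) * size] for k in range(n_windows)]
    return [(mean(w), pvariance(w)) for w in windows]

# Fluctuations around a constant level
stable = [1, -1, 2, -2] * 10
# The same fluctuations with a linear trend added on top
trending = [0.5 * i + x for i, x in enumerate(stable)]

stable_stats = window_stats(stable, 2)
trending_stats = window_stats(trending, 2)
print(stable_stats)
print(trending_stats)
```

For the first series both windows share the same mean and variance, while for the second the mean shifts from one window to the next, which is exactly what a trend does to a series.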

As a result, the series has to be transformed so that it meets these theoretical requirements. We will do so by applying a common technique named differencing.

Differencing is nothing more than computing the differences between consecutive observations (or, in the seasonal case, between observations separated by one seasonal period). In most cases this helps eliminate the trend and seasonality of a time series.
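As a minimal sketch, differencing can be written as a one-line function. The helper name `difference`, its `lag` parameter and the synthetic series are assumptions made for this illustration.

```python
def difference(series, lag=1):
    """Return y[t] - y[t - lag] for every valid t; the result is `lag` points shorter."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Synthetic series: linear trend plus a pattern repeating every 4 steps
pattern = [5, -1, -3, -1]
series = [2 * t + pattern[t % 4] for t in range(12)]

first_diff = difference(series)            # consecutive differences
seasonal_diff = difference(series, lag=4)  # differences one period apart
print(first_diff)
print(seasonal_diff)
```

On this series, consecutive differencing removes the trend but leaves a repeating pattern, whereas differencing at the seasonal lag leaves a constant value, because both the repeating pattern and the changing level cancel out.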

As a first step, we need to choose a notation to represent differencing. In the table below, two notations are shown: the backshift notation and the linear notation. For the backshift notation, we define the operator \(B\) by \(By_t=y_{t-1}\).
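For reference, the standard correspondences between the two notations can be summarised as follows, with \(m\) denoting the seasonal period. These are the usual textbook identities that follow from the definition of \(B\).

| Differencing | Backshift notation | Linear notation |
|---|---|---|
| First order | \(y'_t=(1-B)y_t\) | \(y'_t=y_t-y_{t-1}\) |
| Second order | \(y'_t=(1-B)^2y_t\) | \(y'_t=y_t-2y_{t-1}+y_{t-2}\) |
| Seasonal, first order | \(y'_t=(1-B^m)y_t\) | \(y'_t=y_t-y_{t-m}\) |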

For its compactness and readability, we recommend using the backshift notation. To illustrate it, below is the representation of a second-order differencing combined with a first-order seasonal differencing of \(m\) steps:

\(y'_t=(1-B)^2(1-B^m)y_t\)

\(y'_t=(1-2B+B^2-B^m+2B^{m+1}-B^{m+2})y_t\)

\(y'_t=y_t-2y_{t-1}+y_{t-2}-y_{t-m}+2y_{t-m-1}-y_{t-m-2}\)
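The expansion above can be checked numerically: applying one seasonal difference followed by two consecutive differences should give the same values as the expanded linear formula. The sketch below does this on arbitrary integers; the helpers `difference` and `expanded` and the choice \(m=4\) are assumptions made for the illustration.

```python
def difference(series, lag=1):
    """Return y[t] - y[t - lag] for every valid t."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def expanded(series, m):
    """Linear form: y_t - 2y_{t-1} + y_{t-2} - y_{t-m} + 2y_{t-m-1} - y_{t-m-2}."""
    y = series
    return [y[t] - 2 * y[t - 1] + y[t - 2]
            - y[t - m] + 2 * y[t - m - 1] - y[t - m - 2]
            for t in range(m + 2, len(y))]

m = 4
series = [t ** 3 % 17 for t in range(20)]  # arbitrary integer values

# (1 - B)^2 (1 - B^m): one seasonal difference, then two consecutive differences
step_by_step = difference(difference(difference(series, lag=m)))
linear_form = expanded(series, m)
print(step_by_step == linear_form)
```

Both computations lose the first \(m+2\) points, since the formula needs observations as far back as \(y_{t-m-2}\).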

**SARIMA Models:**

Time series analysis can be seen as the search for the closest characterization of an observed set of values. To perform this task we rely on various mathematical models, among them SARIMA models. SARIMA stands for Seasonal AutoRegressive Integrated Moving Average. In this section too, we limit ourselves to a basic description of the model parameters.

More precisely, here is a description of the main components of the SARIMA:

- The seasonal part holds the parameters that deal with seasonal behaviors.
- The autoregressive part considers past values of the time series, p being the AR order, i.e. the number of steps back in the past.
- The integrated part is the differencing needed to make the time series stationary, d being the order of differencing.
- The moving average part takes into consideration the past errors instead of the past values, q being the MA order.
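In backshift notation, these parts are commonly combined into a single equation, usually written ARIMA\((p,d,q)(P,D,Q)_m\). The upper-case orders \(P\), \(D\), \(Q\), the seasonal period \(m\), the coefficient symbols and the error term \(\varepsilon_t\) are notation introduced here for illustration, and the constant term is omitted:

\(\phi_p(B)\,\Phi_P(B^m)\,(1-B)^d\,(1-B^m)^D\,y_t=\theta_q(B)\,\Theta_Q(B^m)\,\varepsilon_t\)

with \(\phi_p(B)=1-\phi_1B-\dots-\phi_pB^p\) and \(\theta_q(B)=1+\theta_1B+\dots+\theta_qB^q\) for the non-seasonal AR and MA parts, and \(\Phi_P\), \(\Theta_Q\) defined in the same way for the seasonal part. The factors \((1-B)^d\) and \((1-B^m)^D\) are the consecutive and seasonal differencing described earlier.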

Throughout this article we introduced the basic concepts that are essential to time series modelling. In the next part, we will tackle the practical side of TS analysis and give you full access to our code as a starting point for further analysis.

- “Forecasting: principles and practice | OTexts.” https://www.otexts.org/fpp (accessed 3 Jan. 2018)
