This is a complete time series analysis of a store chain. The goal was to determine:
- the products sold at each store of the chain and their promotions on a given date.
- which product family has the most and the least sales.
- which store has the most and the least sales.
- which city, state, and store type have the most and the least sales.
- the trend and seasonality of the time series and how the data changes over time (daily, weekly, monthly, etc.).
- the relationship between sales and holidays (local, regional) and the oil price, and how strongly they affect sales.
- the relationship between sales and their time lags.
- how many shoppers visit each day (transactions) and how that affects sales.
Before beginning the analysis, we need to know the skewness of the data:
The sales are highly positively skewed: most of the distribution is concentrated on the left, with a long tail on the right.
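As a minimal sketch of this check, assuming the sales are available as a pandas Series (the log-normal data below is a synthetic stand-in, not the competition data), the sample skewness can be read from `Series.skew()`; a positive value means the tail is on the right. The `log1p` transform is shown only as a common way to reduce such skew, not as a step taken in this analysis.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for the sales column (assumption).
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000), name="sales")

skew_raw = sales.skew()            # > 0 means the tail is on the right
skew_log = np.log1p(sales).skew()  # skewness after a log(1 + x) transform

print(f"skewness raw: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```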
Yearly average sales by: day, week, month
Daily:
Weekly:
Monthly:
Weekday average sales:
Sales increase every year, which indicates a trend component.
Sales are 0 on one day of every year, as shown in the "Daily Avg" plot, which means the stores are closed on that day.
Sales are highest on the weekend and generally lowest on Thursdays.
Sales by product family:
The best-selling product families are: ['GROCERY I', 'BEVERAGES', 'PRODUCE', 'CLEANING', 'DAIRY']
The worst-selling product families are: ['MAGAZINES', 'HARDWARE', 'HOME APPLIANCES', 'BABY CARE', 'BOOKS']
Sales by store:
The stores with the highest sales are: [44, 45, 47, 3, 49]
The stores with the lowest sales are: [35, 30, 32, 22, 52]
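Rankings like the ones above can be produced with a group-by. The sketch below is illustrative only: the tiny table is made up, the column names `store_nbr`, `family`, and `sales` are assumptions about the training data, `rank_by_sales` is a hypothetical helper, and ranking by total (rather than average) sales is also an assumption.

```python
import pandas as pd

# Tiny made-up table (assumption) with the columns the ranking needs.
train = pd.DataFrame({
    "store_nbr": [1, 1, 2, 2, 3, 3],
    "family": ["GROCERY I", "BOOKS", "GROCERY I", "BOOKS", "DAIRY", "BOOKS"],
    "sales": [120.0, 1.0, 90.0, 0.0, 40.0, 2.0],
})


def rank_by_sales(df, key, n=5):
    """Return the n best and n worst values of `key` by total sales."""
    totals = df.groupby(key)["sales"].sum().sort_values(ascending=False)
    return totals.head(n).index.tolist(), totals.tail(n).index.tolist()


best_families, worst_families = rank_by_sales(train, "family", n=1)
best_stores, worst_stores = rank_by_sales(train, "store_nbr", n=1)
print(best_families, worst_families, best_stores, worst_stores)
```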
To see the trend, I chose a rolling window of 365 days: since the series has daily observations, this smooths over any short-term changes within the year so that only the long-term changes remain.
The sales have been increasing over the years.
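The 365-day smoothing can be sketched with a centered rolling mean. The series below is synthetic (an assumed upward trend plus a weekly cycle and noise), and the `center` and `min_periods` settings are one reasonable choice rather than necessarily the ones used in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (assumption): upward trend + weekly cycle + noise.
dates = pd.date_range("2013-01-01", "2017-08-15", freq="D")
rng = np.random.default_rng(0)
t = np.arange(len(dates))
average_sales = pd.Series(
    200 + 0.1 * t + 30 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 10, len(dates)),
    index=dates,
    name="average_sales",
)

# 365-day centered moving average; min_periods lets the edges use half windows.
trend = average_sales.rolling(window=365, center=True, min_periods=183).mean()

print(trend.iloc[[0, len(trend) // 2, -1]])
```

Because the window spans a full year, the weekly cycle averages out and only the slow upward movement remains in `trend`.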
The periodogram is used to determine the seasonality. It covers 10 candidate seasonalities, given here with their frequency in cycles per year: Annual (1), Semiannual (2), Quarterly (4), Bimonthly (6), Monthly (12), Biweekly (26), Weekly (52), Semiweekly (104), Daily (365), and Time of day.
The periodogram suggests a strong weekly seasonality.
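To illustrate how a periodogram exposes a weekly cycle, here is a minimal NumPy sketch. It computes the periodogram directly from the FFT instead of using a plotting helper, and the input is a synthetic series (an assumption) with a strong weekly and a weaker annual component.

```python
import numpy as np

# Synthetic daily series (assumption): 4 years with a strong weekly cycle
# and a weaker annual cycle, standing in for average daily sales.
rng = np.random.default_rng(0)
n_days = 4 * 365
t = np.arange(n_days)
y = (
    20 * np.sin(2 * np.pi * t / 7)
    + 5 * np.sin(2 * np.pi * t / 365)
    + rng.normal(0, 2, n_days)
)

# Periodogram via the FFT: power at each frequency of the de-meaned series.
spectrum = np.abs(np.fft.rfft(y - y.mean())) ** 2 / n_days
# With a sample spacing of 1/365 year, frequencies come out in cycles per year.
freqs = np.fft.rfftfreq(n_days, d=1 / 365)

peak = freqs[np.argmax(spectrum)]
print(f"strongest frequency: {peak:.1f} cycles per year")
```

A peak near 52 cycles per year corresponds to the weekly seasonality.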
A lag plot of a time series shows its values plotted against its lags. Serial dependence in a time series will often become apparent by looking at a lag plot.
Using plot_pacf to inspect the partial autocorrelation for the first 12 lags only:
plot_pacf shows a strong correlation at lags 1, 3, 5, 6, 7, 8, and 9, so these lags are used in training.
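The selected lags can be turned into training features by shifting the series. This is a sketch under assumptions: `make_lags` is a hypothetical helper name, and the short counting series stands in for the real daily sales.

```python
import pandas as pd


def make_lags(series: pd.Series, lags) -> pd.DataFrame:
    """Return one column per lag, each the series shifted by that many steps."""
    return pd.concat({f"y_lag_{k}": series.shift(k) for k in lags}, axis=1)


# Lags with a strong partial autocorrelation, as listed above.
selected_lags = [1, 3, 5, 6, 7, 8, 9]

# Small stand-in series (assumption); the real input would be daily sales.
y = pd.Series(
    range(20),
    index=pd.date_range("2017-01-01", periods=20, freq="D"),
    dtype="float64",
)

# The first max(selected_lags) rows contain missing values and are dropped.
X_lags = make_lags(y, selected_lags).dropna()
y_aligned = y.loc[X_lags.index]
print(X_lags.head())
```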
After removing the transferred holidays so that they do not distort the result, holidays showed a strong correlation with sales:
Comparing average sales on holidays vs. on workdays:
Sales are significantly higher on holidays.
The oil price is negatively correlated with sales: the lower the oil price, the more purchasing power customers have.
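A correlation like this can be measured once sales and oil price share a daily index. The sketch below uses synthetic data (an assumption) in which sales fall as the oil price rises. It also assumes the oil series has gaps on weekends and forward-fills them before computing the Pearson correlation.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2016-01-01", periods=200, freq="D")
rng = np.random.default_rng(0)

# Synthetic oil price as a random walk (assumption), missing on weekends.
oil = pd.Series(50 + np.cumsum(rng.normal(0, 1, len(dates))), index=dates, name="oil_price")
oil[oil.index.dayofweek >= 5] = np.nan

# Synthetic sales (assumption) that fall as the last known oil price rises.
sales = pd.Series(
    500 - 3 * oil.ffill() + rng.normal(0, 5, len(dates)),
    index=dates,
    name="sales",
)

df = pd.concat([sales, oil], axis=1)
df["oil_price"] = df["oil_price"].ffill()  # carry the last quote over gaps

corr = df["sales"].corr(df["oil_price"])   # Pearson correlation
print(f"correlation between sales and oil price: {corr:.2f}")
```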
- The city with the highest sales is 'Quito', with over 500 and 9.4% of total sales; the city with the lowest sales is 'Puyo', with below 100 and only 1.2%.
- The store types ranked by sales are: A (over 700, 38% of sales), B (over 300, 19% of sales), E (300, 17% of sales), C (over 200, 10.7% of sales).
- The state with the highest sales is 'pichincha', with 13.2% of the sales and over 500; the other states are close to one another, ranging from 8.7% ('tungurahua') down to 1.8% ('pastaza').
- The store cluster with the highest sales is 5, with over 1000 and 16% of the sales.
Linear regression excels at extrapolating trends, but can't learn interactions. CatBoost excels at learning interactions, but can't extrapolate trends. In the code that follows, I create "hybrid" forecasters that combine complementary learning algorithms and let the strengths of one make up for the weaknesses of the other.
It's possible to use one algorithm for some of the components and another algorithm for the rest. This way we can always choose the best algorithm for each component. To do this, we use one algorithm to fit the original series and then the second algorithm to fit the residual series.
In detail, the process is this:
1. Train and predict with the first model:
model_1.fit(X_train_1, y_train)
y_pred_1 = model_1.predict(X_train_1)
2. Train and predict with the second model on the residuals:
model_2.fit(X_train_2, y_train - y_pred_1)
y_pred_2 = model_2.predict(X_train_2)
3. Add the two to get the overall predictions:
y_pred = y_pred_1 + y_pred_2
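The steps above can be wrapped in a small class. This is a sketch under stated assumptions, not the notebook's implementation: `BoostedHybrid` is a hypothetical name, the data is synthetic, and scikit-learn's `GradientBoostingRegressor` stands in for CatBoost so that the example needs only scikit-learn. Any regressor with `fit` and `predict` can be passed as either model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression


class BoostedHybrid:
    """Fit model_1 on the target, then model_2 on model_1's residuals."""

    def __init__(self, model_1, model_2):
        self.model_1 = model_1
        self.model_2 = model_2

    def fit(self, X_1, X_2, y):
        self.model_1.fit(X_1, y)
        residuals = y - self.model_1.predict(X_1)
        self.model_2.fit(X_2, residuals)
        return self

    def predict(self, X_1, X_2):
        # Overall prediction = first model + second model's residual estimate.
        return self.model_1.predict(X_1) + self.model_2.predict(X_2)


# Synthetic data (assumption): linear trend plus a weekend effect.
rng = np.random.default_rng(0)
t = np.arange(400)
dow = t % 7
y = 0.5 * t + np.where(dow >= 5, 40.0, 0.0) + rng.normal(0, 2, t.size)

X_1 = t.reshape(-1, 1)    # trend feature for the linear model
X_2 = dow.reshape(-1, 1)  # day-of-week feature for the tree model
train, test = slice(0, 300), slice(300, 400)

hybrid = BoostedHybrid(LinearRegression(), GradientBoostingRegressor(random_state=0))
hybrid.fit(X_1[train], X_2[train], y[train])
y_pred = hybrid.predict(X_1[test], X_2[test])

# Baseline: the linear model alone extrapolates the trend but misses weekends.
linear_only = LinearRegression().fit(X_1[train], y[train]).predict(X_1[test])
mae_hybrid = np.mean(np.abs(y[test] - y_pred))
mae_linear = np.mean(np.abs(y[test] - linear_only))
print(f"MAE hybrid: {mae_hybrid:.2f}, MAE linear only: {mae_linear:.2f}")
```

On this toy series the linear model carries the trend into the test period while the second model adds back the weekend effect it learned from the residuals.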
Data Link:
https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview