Store-chain-Analysis-and-Forecasting

Description

This is a complete time-series analysis of a store chain. The goal was to determine:

The products at each store of the chain and whether they were on promotion on a given date.
Which product family has the most and least sales.
Which store has the most and least sales.
Which city, state, and store type have the most and least sales.
The trend and seasonality of the time series, and how the data changes over time (daily, weekly, monthly, etc.).
The relationship between sales and holidays (local, regional) and the oil price, and how strongly each affects sales.
The relationship between sales and their time lags.
How many shoppers visit each day (transactions) and how that affects sales.

Data Analysis

Before beginning the analysis, we need to know the skewness of the data:
Sales are highly positively skewed: most of the distribution is concentrated on the left, with a long tail on the right.
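As a minimal sketch of how this skewness could be checked with pandas: the data below is synthetic, and the name `sales` is an illustrative assumption rather than the notebook's own variable.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sales column (assumption): many small values
# and a few very large ones, which produces a long right tail.
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10_000), name="sales")

# Sample skewness: a value above 0 means the tail is on the right side.
skewness = sales.skew()

# In a right-skewed distribution the mean is pulled above the median.
mean_sales = sales.mean()
median_sales = sales.median()

print(f"skewness={skewness:.2f}, mean={mean_sales:.2f}, median={median_sales:.2f}")
```

A positive skewness together with a mean above the median matches the "tail on the right" description.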
Yearly average sales by day, week, and month, plus weekday average sales (figures: Daily, Weekly, Monthly, Weekday).

Sales increase every year, which indicates a trend component.
Sales fall to 0 on the same day each year, as shown in the "Daily Avg" plot, which means the stores are closed on that day.
Sales are highest on the weekend and generally lowest on Thursdays.

Product families by sales:
The best-selling product families are: ['GROCERY I', 'BEVERAGES', 'PRODUCE', 'CLEANING', 'DAIRY']
The worst-selling product families are: ['MAGAZINES', 'HARDWARE', 'HOME APPLIANCES', 'BABY CARE', 'BOOKS']
Stores by sales:
The best-selling stores are: [44, 45, 47, 3, 49]
The worst-selling stores are: [35, 30, 32, 22, 52]

Determine the trend

Because the series has daily observations, I chose a 365-day moving-average window to smooth over short-term changes within the year so that only the long-term changes remain.
Sales have been increasing over the years.
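A 365-day moving average of this kind could be sketched as follows. The series is synthetic, and the `center=True` and `min_periods=183` settings are assumptions chosen for illustration, not values confirmed by the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (assumption): slow upward trend plus a weekly cycle.
dates = pd.date_range("2013-01-01", periods=4 * 365, freq="D")
t = np.arange(len(dates))
daily_sales = pd.Series(
    100 + 0.05 * t + 10 * np.sin(2 * np.pi * t / 7),
    index=dates,
    name="sales",
)

# Centered 365-day moving average: the weekly wiggles average out within
# each window, leaving only the long-term movement.
trend = daily_sales.rolling(window=365, center=True, min_periods=183).mean()

print(trend.iloc[[200, -200]])
```

Plotting `trend` on top of `daily_sales` would show the smoothed curve rising over the years.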

Determine seasonality

A periodogram is used to determine the seasonality. It covers 10 candidate seasonalities, labeled by cycles per year: Annual (1), Semiannual (2), Quarterly (4), Bimonthly (6), Monthly (12), Biweekly (26), Weekly (52), Semiweekly (104), Daily (365), and Time of day.
The periodogram suggests a strong weekly seasonality.
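To illustrate how a periodogram exposes a weekly cycle, here is a small sketch that uses NumPy's FFT directly rather than a library periodogram function. The data is synthetic and all names are illustrative assumptions.

```python
import numpy as np

# Synthetic daily series (assumption): a 7-day cycle plus noise.
# 1092 days is a whole number of weeks, so the weekly frequency
# falls exactly on one FFT bin.
n_days = 3 * 364
t = np.arange(n_days)
rng = np.random.default_rng(0)
sales = 100 + 20 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 5, n_days)

# Remove the mean so the zero-frequency bin does not dominate.
x = sales - sales.mean()

# Raw periodogram: squared FFT magnitude at each frequency.
power = np.abs(np.fft.rfft(x)) ** 2 / n_days
freqs_per_day = np.fft.rfftfreq(n_days, d=1.0)  # cycles per day
cycles_per_year = freqs_per_day * 365           # same axis, in cycles per year

# Strongest non-zero frequency and its period in days.
peak = int(np.argmax(power[1:])) + 1
dominant_period_days = 1.0 / freqs_per_day[peak]

print(f"period={dominant_period_days:.2f} days, "
      f"{cycles_per_year[peak]:.1f} cycles/year")
```

The peak sits at roughly 52 cycles per year, which is the "Weekly (52)" label in the list above.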

A lag plot of a time series shows its values plotted against its lags. Serial dependence in a time series will often become apparent by looking at a lag plot.
Using plot_pacf to view the partial autocorrelation for the first 12 lags only:
plot_pacf shows a strong correlation at lags 1, 3, 5, 6, 7, 8, and 9, so these lags are used in training.
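One way the selected lags might be turned into training features is with `Series.shift`. This is a sketch on a toy series; the helper `make_lags` and the column names are hypothetical, not taken from the notebook.

```python
import pandas as pd

# Lags reported as significant in the text above.
LAGS = [1, 3, 5, 6, 7, 8, 9]


def make_lags(series: pd.Series, lags) -> pd.DataFrame:
    """Return one column per lag: column lag_k holds the value k steps earlier."""
    return pd.concat({f"lag_{k}": series.shift(k) for k in lags}, axis=1)


# Toy daily series (assumption) with values 0..19 so the shifts are easy to read.
sales = pd.Series(
    range(20),
    dtype=float,
    index=pd.date_range("2017-01-01", periods=20, freq="D"),
    name="sales",
)

# The first 9 rows lack a full lag history and are dropped.
X_lags = make_lags(sales, LAGS).dropna()
y = sales.loc[X_lags.index]

print(X_lags.head(3))
```

With many stores and product families in one table, the shift would presumably need to be applied per series (for example inside a `groupby`) so that lags do not leak across stores.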

Holidays

After removing the transferred holidays so that they do not mislead the analysis, holidays showed a strong correlation with sales:
Comparing average sales on holidays vs. workdays:
Sales are significantly higher on holidays.
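The comparison could be sketched like this. The tiny tables are made up for illustration, and the column names (`date`, `type`, `transferred`, `sales`) are assumptions modeled on the Kaggle holidays file, not confirmed from the notebook.

```python
import pandas as pd

# Toy holiday calendar (assumption). A transferred holiday is not actually
# celebrated on its calendar date, so it is dropped before comparing.
holidays = pd.DataFrame({
    "date": pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-05"]),
    "type": ["Holiday", "Transfer", "Holiday"],
    "transferred": [True, False, False],
})

# Toy daily total sales (assumption).
daily = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=6, freq="D"),
    "sales": [100.0, 300.0, 110.0, 120.0, 280.0, 90.0],
})

# Keep only holidays that were observed on their listed date.
observed = holidays[~holidays["transferred"]]

# Label each day, then compare the average sales of the two groups.
is_holiday = daily["date"].isin(observed["date"])
daily["day_type"] = is_holiday.map({True: "holiday", False: "workday"})
avg_sales = daily.groupby("day_type")["sales"].mean()

print(avg_sales)
```

Here 2017-01-01 counts as a workday because its holiday was transferred, which is the kind of distortion the filtering step avoids.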

Oil prices

The oil price has a negative correlation with sales: the lower the oil price, the more purchasing power customers have.
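A correlation of this kind could be measured roughly as follows. The numbers are synthetic, and filling the missing oil prices by interpolation is an assumption for the sketch, not necessarily what the notebook does.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2017-01-01", periods=10, freq="D")

# Toy oil prices (assumption) with gaps, e.g. days without a quoted price.
oil = pd.Series(
    [50, np.nan, 52, 54, np.nan, 56, 58, 60, np.nan, 62],
    index=dates,
    dtype=float,
)

# Toy daily sales (assumption) that fall as the oil price rises.
sales = pd.Series(
    [200, 198, 195, 190, 192, 185, 180, 178, 176, 170],
    index=dates,
    dtype=float,
)

# Fill the gaps so every sales day has an oil price to pair with.
oil_filled = oil.interpolate(limit_direction="both")

# Pearson correlation: a negative value means sales drop when oil rises.
corr = sales.corr(oil_filled)
print(f"correlation: {corr:.2f}")
```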

Stores

The plots show:

The best-selling city is 'Quito', with over 500 in sales and 9.4% of total sales; the lowest-selling city is 'Puyo', with below 100 in sales and only 1.2%.
Store types in order of sales: A (over 700, 38% of sales), B (over 300, 19% of sales), E (300, 17% of sales), C (over 200, 10.7% of sales).
The highest-selling state is 'pichincha', with 13.2% of sales and over 500; the other states are close to each other, ranging from 8.7% ('tungurahua') down to 1.8% ('pastaza').
The highest-selling store cluster is 5, with over 1000 in sales and 16% of the total.

Modeling

Linear regression excels at extrapolating trends but can't learn interactions. CatBoost excels at learning interactions but can't extrapolate trends. In the notebook, I create "hybrid" forecasters that combine complementary learning algorithms and let the strengths of one make up for the weaknesses of the other.
It's possible to use one algorithm for some of the components and another algorithm for the rest, so we can choose the best algorithm for each component. To do this, we use one algorithm to fit the original series and then the second algorithm to fit the residual series.
In detail, the process is this:
1- Train and predict with the first model:
model_1.fit(X_train_1, y_train)
y_pred_1 = model_1.predict(X_train_1)
2- Train and predict with the second model on the residuals:
model_2.fit(X_train_2, y_train - y_pred_1)
y_pred_2 = model_2.predict(X_train_2)
Then add the two to get the overall predictions:
y_pred = y_pred_1 + y_pred_2
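The steps above can be wrapped in a small class. This is a sketch under stated assumptions: the `HybridForecaster` class is hypothetical, the data is synthetic, and scikit-learn's `GradientBoostingRegressor` stands in for CatBoost so the example runs without extra dependencies.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression


class HybridForecaster:
    """Fit model_1 on the target, then model_2 on model_1's residuals."""

    def __init__(self, model_1, model_2):
        self.model_1 = model_1
        self.model_2 = model_2

    def fit(self, X_1, X_2, y):
        self.model_1.fit(X_1, y)
        residuals = y - self.model_1.predict(X_1)
        self.model_2.fit(X_2, residuals)
        return self

    def predict(self, X_1, X_2):
        # Overall prediction = trend part + residual part.
        return self.model_1.predict(X_1) + self.model_2.predict(X_2)


# Synthetic series (assumption): linear trend plus a day-of-week pattern.
n = 400
t = np.arange(n)
weekly = np.array([0, 1, 2, 3, 4, 12, 15], dtype=float)
y = 50 + 0.3 * t + weekly[t % 7]

X_1 = t.reshape(-1, 1).astype(float)        # time index, for the linear model
X_2 = (t % 7).reshape(-1, 1).astype(float)  # day of week, for the tree model

train, test = slice(0, 350), slice(350, n)

hybrid = HybridForecaster(
    LinearRegression(),
    GradientBoostingRegressor(random_state=0),  # stand-in for CatBoost
)
hybrid.fit(X_1[train], X_2[train], y[train])
y_pred = hybrid.predict(X_1[test], X_2[test])

# Baseline: the linear trend model alone, for comparison.
trend_only = LinearRegression().fit(X_1[train], y[train]).predict(X_1[test])

hybrid_mae = np.mean(np.abs(y[test] - y_pred))
trend_mae = np.mean(np.abs(y[test] - trend_only))
print(f"hybrid MAE={hybrid_mae:.3f}, trend-only MAE={trend_mae:.3f}")
```

On the held-out days the linear model extrapolates the trend past the training range, while the tree model adds back the weekly pattern it learned from the residuals. Note that forecasting uses the test-period features for both models, not the training features.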

Forecasting the next 16 days in sales:

Data Link:
https://www.kaggle.com/competitions/store-sales-time-series-forecasting/overview


Abdelrhman-Sadek/Store-chain-Analysis-and-Forecasting
