Data Analytics Sales Prediction Model
Avinash Dangwani
PhD Scholar
Department of Engineering (Computers)
Pacific University
Airport Road, Debari, Udaipur (Rajasthan)
Dr. Chandansingh Rawat
Associate Professor
Dept of Electronics & Telecommunication Engineering
VESIT HAMC Collectors Colony,
Chembur Mumbai
Abstract
At the start of every financial year, companies propose an advertisement budget to improve their sales. Estimating the advertisement budget is not an easy task, as it involves financial parameters. Managers are therefore interested in a prediction model that expresses sales as a function of advertisement expenses.
This paper develops a sales prediction model using simple linear regression. The model is built on a training dataset, and the regression parameters are estimated by the method of Ordinary Least Squares (OLS) using Python. The regression model is validated to ensure goodness of fit before it is used in practical applications. The use of a single predictor variable is the limitation of this model. In future, multiple predictor variables can be handled using a multiple linear regression model in Python.
Keywords:
Simple linear regression, Ordinary Least Squares (OLS), Training & validation data, Sum squared regression (SSR), Total sum of squares (SST).
Introduction
This paper develops a sales prediction model using simple linear regression. Sales prediction has two main families of methods: (1) qualitative methods and (2) quantitative methods [3]. Some of the qualitative methods are the Expert's Opinion Method, Sales Force Composite Method, Survey of Buyer's Expectations, Historical Analogy Method, Jury of Executive Opinions and Leading Indicators Method.
Some of the quantitative methods are Test Marketing, Time Series Analysis, the Moving Average Method, the Exponential Smoothing Method, Regression Analysis and Econometric Models.
This paper explores sales prediction using regression analysis because of its lower time complexity compared to some other algorithms. Furthermore, regression models can be trained easily and efficiently even on systems with relatively low computational power. Building a regression model is an iterative process, and several iterations may be required before finalizing the appropriate model [2]. The rest of the paper is organized in the following sections.
Simple Linear Regression
Simple linear regression (SLR) is a statistical technique that models the association between a dependent variable (outcome variable) and a single independent variable (predictor or feature variable).
The functional form of SLR is as follows
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ (1)
Where
$Y_i$ = value of the $i$th observation of the dependent variable
$X_i$ = value of the $i$th observation of the independent variable
$\varepsilon_i$ = random error (residual) in predicting the value of $Y_i$
$\beta_0$ and $\beta_1$ = regression parameters
Ordinary Least Squares (OLS) Method
Equation (1) can be rewritten as
$\varepsilon_i = Y_i - \beta_0 - \beta_1 X_i$ (2)
The regression parameters $\beta_0$ and $\beta_1$ are estimated by minimizing the sum of squared errors (SSE)
$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$ (3)
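The estimated values of the regression parameters are obtained by taking the partial derivatives of SSE with respect to $\beta_0$ and $\beta_1$, setting them to zero, and solving the resulting equations for the regression parameters. The estimated parameters are given by
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$ (4)
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$ (5)
Where $\hat{\beta}_1$ and $\hat{\beta}_0$ are the estimated values of the regression parameters $\beta_1$ and $\beta_0$, and $\bar{X}$, $\bar{Y}$ are the mean values of X and Y.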
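The closed-form OLS estimates can be sketched in a few lines of plain Python: the slope is the sum of cross-deviations of X and Y divided by the sum of squared deviations of X, and the intercept makes the fitted line pass through the point of means. The function name `ols_fit` and the sample values are hypothetical and are used for illustration only; they are not the data analysed in this paper.

```python
from statistics import mean


def ols_fit(x, y):
    """Return (beta0_hat, beta1_hat) for a simple linear regression of y on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    if s_xx == 0:
        raise ValueError("x must not be constant")
    beta1_hat = s_xy / s_xx
    # Intercept: forces the fitted line through (x_bar, y_bar)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Hypothetical illustrative values (NOT the Schonfeld data): exactly y = 1 + 2x
ad_growth = [1.0, 2.0, 3.0, 4.0, 5.0]
rev_growth = [3.0, 5.0, 7.0, 9.0, 11.0]
beta0_hat, beta1_hat = ols_fit(ad_growth, rev_growth)
print(beta0_hat, beta1_hat)
```

Because the illustrative points lie exactly on a line, the estimates recover its intercept and slope; with real data the residuals are non-zero and the estimates only minimize SSE.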
Sample data is taken from Advertising Ratios & Budgets, an annual report by Schonfeld & Associates, Inc. [6], which covers over 2,400 companies and 320 industries with information on fiscal 2018 and 2019 advertising spending and revenue.
For the OLS analysis, data for a total of 52 sample companies is taken from 12 different industries.
Table - 1 shows the sample percentage revenue growth and percentage advertisement growth for Electromedical & Electrotherapeutic Apparatus.
Table – 1:Data Source[6]: June 2020 Sample data of Advertising Ratios & Budgets from Schonfeld-Associates-Inc-v417 of Market Research.com
We will develop a simple regression model to understand and predict percentage sales revenue growth from the percentage advertisement growth.
The OLS model takes two parameters, Y and X. In our example, percentage advertisement growth is X and percentage sales revenue growth is Y. We split the data set into two sets, a training set and a validation set. The training set is used to train the algorithm to predict the output. The validation set is used to test accuracy and efficiency.
Python is used as the tool for building the regression model for sales prediction. The statsmodels library is used in Python for building statistical models. The OLS (Ordinary Least Squares) API available in statsmodels.api is used to estimate the parameters of the simple linear regression model.
The data is divided into two subsets, a training data set and a validation data set. The proportion of the training data set is usually between 70% and 80% of the data, and the remaining data is used for validation. We have taken train_size = 0.8, which implies that 80% of the data is used for training the model and the remaining 20% is used for validating it. The records selected for the training and validation sets are randomly sampled using a Python function that returns four variables: the training and validation subsets of X and of Y.
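The split described above can be sketched as follows. The text does not name the function used, so scikit-learn's `train_test_split` is an assumption here; the variable names, the `random_state` seed and the ten sample rows are hypothetical placeholders and are not the data analysed in this paper.

```python
from sklearn.model_selection import train_test_split

# Hypothetical placeholder values (NOT the Schonfeld data): 10 companies
ad_growth = [[2.0], [5.5], [-1.0], [8.0], [3.2], [0.5], [12.0], [6.1], [4.4], [9.3]]
rev_growth = [6.0, 7.1, 5.8, 7.9, 6.4, 6.2, 8.3, 7.0, 6.9, 7.5]

# train_size=0.8 keeps 80% of the rows for training; random_state is an
# arbitrary seed chosen here only to make the random split reproducible.
train_X, valid_X, train_y, valid_y = train_test_split(
    ad_growth, rev_growth, train_size=0.8, random_state=42
)

print(len(train_X), len(valid_X))  # 8 training rows, 2 validation rows
```

The four returned objects are the training and validation parts of X followed by the training and validation parts of Y, in that order.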
Regression parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated from equations (4) and (5), using Python functions as the tool.
Linear regression calculates an equation that minimizes the distance between the fitted line and all of the observed data points. Technically, ordinary least squares (OLS) regression minimizes the sum of the squared residuals. In general, a model fits the data well if the differences between the observed values and the model's predicted values are small and unbiased.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is defined as
$R^2 = 1 - \frac{SSR}{SST}$ (6)
$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$ (7)
The sum squared regression (SSR) is the sum of the squared residuals. A residual is the difference between the observed value and the estimated value, as shown below in Fig - 1. SSR is the same quantity as the SSE of equation (3), evaluated at the estimated parameters.
Fig – 1: Residuals as function of Actual value & estimated value
$SSR = \sum_{i=1}^{4} e_i^2 = e_1^2 + e_2^2 + e_3^2 + e_4^2$ (8)
= sum of squared variations of y with respect to the estimated value, where $e_i = y_i - \hat{y}_i$
The total sum of squares (SST) is the sum of the squared distances of the data from the mean (central tendency), as shown below in Fig - 2.
Fig – 2: Residuals as function of Actual value & Mean value
$SST = \sum_{i=1}^{4} (y_i - \bar{y})^2 = (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + (y_3 - \bar{y})^2 + (y_4 - \bar{y})^2$ (9)
$R^2 = 1 - \frac{SSR}{SST} = \frac{SST - SSR}{SST}$ (10)
The above equation indicates that, for a given SST, $R^2$ is directly proportional to the difference between the sum of squared variations in y with respect to the mean (SST) and the sum of squared variations in y with respect to the estimated value (SSR).
Poor fit:
A small $R^2$ value indicates that SSR is large and close to SST, i.e. the variation in y with respect to the estimated value is almost as large as the variation in y with respect to the mean. The fitted line then explains little more than the mean does, which is not a good fit.
Good fit:
A large $R^2$ value indicates that SSR is very small relative to SST (the actual values of y are close to the estimated values of y), i.e. the variation in y with respect to the estimated value is much smaller than the variation in y with respect to the mean, which is a good fit.
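The two cases above can be illustrated with a short sketch that computes R-squared as one minus the ratio of SSR (squared residuals about the estimates) to SST (squared deviations about the mean). The function name `r_squared` and all values are hypothetical and chosen only to make the contrast visible.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - SSR / SST, with SSR the sum of squared residuals."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal in length")
    y_bar = sum(actual) / len(actual)
    # SSR: squared variation of y with respect to the estimated values
    ssr = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    # SST: squared variation of y with respect to its mean
    sst = sum((y - y_bar) ** 2 for y in actual)
    if sst == 0:
        raise ValueError("actual values must not all be identical")
    return 1 - ssr / sst


# Hypothetical illustrative values
actual = [2.0, 4.0, 6.0, 8.0]
good_fit = [2.5, 3.5, 6.5, 7.5]  # estimates close to the actual values
poor_fit = [5.0, 5.0, 5.0, 5.0]  # predicts the mean everywhere, so SSR = SST

print(r_squared(actual, good_fit))
print(r_squared(actual, poor_fit))
```

The first call gives a value close to 1 because SSR is small relative to SST, while the second gives 0 because the predictions are no better than the mean.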
Results & Model Diagnostics
Using Python as the tool, the parameters of the regression model are estimated from the 80% training data set.
The estimated model can be written as
$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$ (11)
Rev Grw(%) = 6.101 + 0.160 * (Ad Grw(%))
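The model estimates that each additional percentage point of advertisement growth is associated with a 0.160 percentage-point increase in revenue growth, on top of the baseline growth of 6.101% given by the intercept. For example, if the sales revenue was 2 million in 2018, then according to our model one extra percentage point of advertisement growth adds 0.16% of 2 million, i.e. 0.0032 million, to the estimated 2019 sales revenue, which is a rise of 3200/- in revenue.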
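The arithmetic of this example can be checked with a small sketch that uses the reported coefficients 6.101 and 0.160. The function names are hypothetical. The sketch separates the full predicted growth rate (intercept plus slope term) from the revenue change attributable to extra advertisement growth alone (slope term only).

```python
BETA0_HAT = 6.101  # intercept reported above, in percent
BETA1_HAT = 0.160  # slope reported above, percent per percent of ad growth


def predicted_revenue_growth(ad_growth_pct):
    """Predicted revenue growth (%) for a given advertisement growth (%)."""
    return BETA0_HAT + BETA1_HAT * ad_growth_pct


def marginal_revenue_change(base_revenue, extra_ad_growth_pct=1.0):
    """Revenue change attributable to extra ad growth (slope term only)."""
    return base_revenue * BETA1_HAT * extra_ad_growth_pct / 100.0


# Full predicted growth rate at 1% advertisement growth: 6.101 + 0.160
print(round(predicted_revenue_growth(1.0), 3))
# Extra revenue on a base of 2 million from one more point of ad growth
print(round(marginal_revenue_change(2_000_000), 2))
```

The second figure corresponds to the rise of 3200/- quoted in the text; the first shows that the total predicted growth also includes the intercept.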
Before using a regression model in practical applications, it should be validated and tested for goodness of fit. We use the coefficient of determination (R-squared) to determine goodness of fit. Using Python as the tool, the following value of $R^2$ is calculated:
$R^2 = 0.208$
According to Cohen (1992) [9], an R-squared value of 0.12 (12%) or below indicates a low effect size, a value between 0.13 (13%) and 0.25 (25%) indicates a medium effect size, and a value of 0.26 (26%) or above indicates a high effect size. Our model explains 20.8% of the variance in the validation set, which falls in the medium range, so it is a reasonably good fit.
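The classification used above can be written as a small helper. The function name `cohen_r2_label` is hypothetical, and the cut-offs are an assumption about how to handle values that fall between the quoted bands: anything below 0.13 is treated as low and anything from 0.13 up to but excluding 0.26 as medium.

```python
def cohen_r2_label(r2):
    """Map an R-squared value to the low/medium/high bands quoted from [9]."""
    if r2 > 1.0:
        raise ValueError("R-squared cannot exceed 1")
    if r2 < 0.13:   # 0.12 or below (assumed to include everything under 0.13)
        return "low"
    if r2 < 0.26:   # 0.13 to 0.25
        return "medium"
    return "high"   # 0.26 or above


print(cohen_r2_label(0.208))  # medium
```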
Conclusion
The simple linear regression model using the ordinary least squares (OLS) method shows a functional relationship between the outcome variable (sales revenue growth in %) and the feature (advertisement growth in %). The model validation is investigated using the $R^2$ technique to ensure goodness of fit. An R-squared as low as 10% is generally accepted for studies in the arts, humanities and social sciences, because human behavior cannot be accurately predicted; a low R-squared is therefore often not a problem in these fields. Various other factors also affect the value of R-squared. Therefore, to extend the scope of this research, social science characteristics such as age, gender, motivation towards the product and festive season should be included as control variables in the analysis.
References:
[1] Dr. R. Nageswara Rao, Core Python Programming, second edition, Dreamtech Press.
[2] Manaranjan Pradhan & U Dinesh Kumar, Machine Learning Using Python, first reprint edition, Wiley publications.
[3] Sales prediction types, available online at URL: https://www.economicsdiscussion.net/sales/sales-forecasting-methods/32270
[4] Advantages of linear regression model, available online at URL: https://iq.opengenus.org/advantages-and-disadvantages-of-linear-regression/
[5] Will Koehrsen, Prediction Engineering: How to Set Up Your Machine Learning Problem, available online at URL: https://towardsdatascience.com/prediction-engineering-how-to-set-up-your-machine-learning-problem-b3b8f622683b
[6] Data source of Advertising Ratios & Budgets, available online at URL: https://www.marketresearch.com/Schonfeld-Associates-Inc-v417/Advertising-Ratios-Budgets-13373044/
[7] https://www.keboola.com/blog/linear-regression-machine-learning
[8] https://internal.ncl.ac.uk/ask/numeracy-maths-statistics/statistics/regression-and-correlation/coefficient-of-determination-r-squared.html
[9] Cohen's conventions for small, medium and large R2 values, available online at URL: http://core.ecu.edu/psyc/wuenschk/docs30/EffectSizeConventions.pdf
[10] Ferenc Moksony, Small is beautiful. The use and interpretation of R2 in social research, pages 6 & 7.