This post was originally published at ODSC. Check out their curated blog posts!
The holiday shopping season is in full swing! The economy is relatively strong compared to a few years back, so retail sales are likely to be strong, especially for Amazon. Other retailers like Target and Wal-Mart are also running Black Friday and holiday sales to attract customers. However, Amazon has consistently shown it can outwit these retail giants with greater selection, customer service, and sophisticated pricing. In the last post I showed you how impressive its growth since 1996 has been. In fact, Amazon’s 2015 second quarter revenue is more than 800 times its 1997 Q2 revenue!
In this post I will show you how to take the previously scraped web data and create a time series. In case you missed it, check out the last post to get the data. I will show you how to decompose Amazon’s quarterly revenue and then make a simple forecast for the Q4 holiday sales season.
Recall that the all.df data frame was organized with 4 columns. It contains 80 rows representing quarterly revenue starting in 1996. As a refresher, calling the head function shows the first few rows of the data in the table below.
head(all.df)
ID | string_revenue | revenue | period |
---|---|---|---|
1 | N.A. | NA | Q_1_1996 |
2 | $0 | 0 | Q_2_1996 |
3 | N.A. | NA | Q_3_1996 |
4 | $8.4 million | 8400000 | Q_4_1996 |
5 | $16 million | 16000000 | Q_1_1997 |
6 | $27.9 million | 27900000 | Q_2_1997 |
The first 6 rows of all.df show raw and cleaned Amazon quarterly revenue.
When you are starting out with forecasting I suggest the forecast package. It contains many standard yet accurate forecasting methods. I will show you how to use two methods to understand a time series. After loading the package, change the initial NA values to 0 using the is.na function in the second code line. Of course you could handle NAs differently, but these occur early in the time series so I just switched them to zero.
library(forecast)
all.df[is.na(all.df)] <-0
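As a rough illustration of the NA handling above, here is a minimal sketch on a made-up data frame. The name toy.df and its values are assumptions for illustration only, not the real all.df. It compares replacing every NA in the data frame with replacing NAs in the revenue column alone.

```r
# Toy data frame (made-up values) that mimics the shape of all.df
toy.df <- data.frame(
  ID = 1:4,
  revenue = c(NA, 0, NA, 8400000),
  period = c("Q_1_1996", "Q_2_1996", "Q_3_1996", "Q_4_1996"),
  stringsAsFactors = FALSE
)

# Option 1: replace every NA anywhere in the data frame, as in the post
filled.all <- toy.df
filled.all[is.na(filled.all)] <- 0

# Option 2: only touch the numeric revenue column
filled.col <- toy.df
filled.col$revenue[is.na(filled.col$revenue)] <- 0

# Both approaches give the same revenue vector for this toy data
identical(filled.all$revenue, filled.col$revenue)
```

The column-only version is a little more defensive if other columns could contain NAs that should not become 0.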
After changing the NA values to 0 you can convert the revenue column to a time series object. The time series object not only captures the revenue values but also the meta-information associated with them. In all.df the meta-information is the periodicity. The repeating pattern of Amazon’s revenue needs to be captured as a time series so the forecast package can work its magic.
Using the ts function, pass in the numeric revenue vector called revenue. Within the function specify the frequency. Since our data is quarterly, set frequency=4. If your data is daily change this parameter to 365, and use 52 for weekly. Make sure this input matches the inherent periodicity of your data! The last parameter, start=1996, simply tells the ts function where the series begins.
data.ts<-ts(all.df$revenue, frequency=4, start=1996)
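To see what the frequency and start arguments do, the sketch below builds a small quarterly series from made-up numbers (toy.revenue is hypothetical, not Amazon data) and reads the meta-information back out of the object.

```r
# Made-up quarterly values covering two years
toy.revenue <- c(10, 12, 11, 20, 14, 16, 15, 28)

# start = c(1996, 1) spells out "first quarter of 1996";
# start = 1996 is shorthand for the same starting point
toy.ts <- ts(toy.revenue, frequency = 4, start = c(1996, 1))

frequency(toy.ts)  # observations per year
start(toy.ts)      # year and quarter of the first value
end(toy.ts)        # year and quarter of the last value
```

If the series began mid-year, the two-element form of start would be the way to say so, for example c(1996, 3) for the third quarter.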
I always examine the time series object in the console to make sure it is organized the way I expected. I have been known to make mistakes with my frequency and start inputs! Call the data.ts object in the console. The screenshot below shows Amazon’s quarterly revenue is now organized from a linear vector into rows representing years and columns representing quarters.
data.ts
Amazon’s quarterly revenue represented as a time series object with annual rows and quarterly columns.
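Beyond eyeballing the printed object, a couple of base R helpers can confirm the layout. This is a small sketch on made-up values (toy.revenue and wrong.ts are illustrative names), showing how a wrong frequency changes where the series appears to end.

```r
# Made-up quarterly values covering two years
toy.revenue <- c(10, 12, 11, 20, 14, 16, 15, 28)

# Correct: quarterly data with four observations per year
toy.ts <- ts(toy.revenue, frequency = 4, start = 1996)
print(toy.ts)   # years as rows, quarters as columns
cycle(toy.ts)   # quarter number (1 to 4) of each observation

# Mistake: the same values declared as monthly data
wrong.ts <- ts(toy.revenue, frequency = 12, start = 1996)
end(toy.ts)     # last value falls in the fourth quarter of 1997
end(wrong.ts)   # last value is treated as the eighth month of 1996
```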
Time Series Decomposition
Within the stats package there is a function called decompose. This function will deconstruct a time series object into three parts: the trend, seasonality and random components of the original time series.
components<-decompose(data.ts)
Calculating Trend
First, using a moving average, the function calculates the trend of the time series. This is the overall upward, downward or flat relationship the revenue has to time. Given Amazon’s growth and success, I expect the trend to move upward in an exponential fashion. Plus, in the previous post I observed exponential growth when reviewing only Q2 revenues.
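The moving average step can be sketched by hand. The example below uses a made-up three-year quarterly series (toy.ts is hypothetical) and a centered moving average with half weights on the two end quarters, which is the filter decompose applies when the frequency is even. It then checks the result against the trend component decompose returns.

```r
# Made-up quarterly series with an upward drift and a Q4 spike
toy.ts <- ts(c(10, 12, 11, 20,
               14, 16, 15, 28,
               19, 22, 20, 37), frequency = 4, start = 1996)

# Centered moving average spanning one year:
# half weight on the two outer quarters, full weight on the middle three
ma.weights <- c(0.5, 1, 1, 1, 0.5) / 4
manual.trend <- stats::filter(toy.ts, ma.weights, sides = 2)

# The first two and last two quarters have no trend value (NA)
manual.trend

# Compare with the trend produced by decompose()
all.equal(as.numeric(manual.trend), as.numeric(decompose(toy.ts)$trend))
```

The call is written as stats::filter because other packages export functions named filter that could mask it.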
Calculating Seasonality
Next, decompose will use averages to understand the seasonality. In this case, the average of all Q1 values (minus the trend) is calculated. This process is repeated for Q2 and so on. Once there are four quarterly averages the values are centered. The seasonality represents the repeating pattern within the time series. For instance, the seasonality values may catch the fact that every Q4 Amazon’s sales jump by $1B+ compared to Q3. In the last post, visually examining the line chart showed a repeating peak. So I would expect the seasonality in this decomposition to be strong and look like a saw tooth. Every year we should expect a Q4 peak, with a comparative reduction in Q1. Keep in mind the periodicity may impact the seasonality, so be sure to understand whether your data is in weeks or months, not just quarters.
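The averaging and centering described above can be reproduced in a few lines. This is a sketch on a made-up series (toy.ts and the intermediate names are illustrative), assuming the series starts in Q1 as the Amazon data does. The result is compared with the figure element that decompose returns.

```r
# Made-up quarterly series with an upward drift and a Q4 spike
toy.ts <- ts(c(10, 12, 11, 20,
               14, 16, 15, 28,
               19, 22, 20, 37), frequency = 4, start = 1996)

# Step 1: remove the moving-average trend
trend <- stats::filter(toy.ts, c(0.5, 1, 1, 1, 0.5) / 4, sides = 2)
detrended <- as.numeric(toy.ts) - as.numeric(trend)

# Step 2: average the detrended values quarter by quarter, skipping NA ends
quarter <- as.numeric(cycle(toy.ts))
quarter.means <- tapply(detrended, quarter, mean, na.rm = TRUE)

# Step 3: center the four averages so they sum to zero
seasonal.figure <- as.numeric(quarter.means - mean(quarter.means))
seasonal.figure

# Compare with the seasonal figure from decompose()
all.equal(seasonal.figure, decompose(toy.ts)$figure)
```

In this toy series the fourth value of seasonal.figure is the largest, which is the saw tooth peak in miniature.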
Accounting for “random”
The leftover values not accounted for in either the trend or the seasonality are the error terms. The error terms are called random in this method. However, the values may not be true random noise. A forecaster could further model the time series to account for events like significant competitor sales or snow storms forcing more shoppers to be online versus at brick-and-mortar stores.
Putting it all together
In this basic example I am using additive modeling. An additive model assumes the difference between periods is the same each year once trend is accounted for. So the difference between Q1 and Q2 is roughly the same each year. The starting points for Q1 and Q2 in subsequent years change because of trend, but the impact of the holiday shopping season is the same each Q4.
A decomposed additive model uses the simple equation below. The quarterly revenue at time period “t” is the sum of the trend at time t, the seasonality at time t and the error at time t.
Y[t] = T[t] + S[t] + e[t]
To make the equation real, at time period t = 10, which is Q2 1998, the value is made of the trend moving average of $129,140,000, plus the seasonal Q2 impact of -$836,072,105, plus the leftover value of $822,912,105. Adding it all up, Q2 1998 is $115,980,000.
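The arithmetic is easy to check in R. The first part of the sketch below uses the three Q2 1998 figures quoted above, taking the seasonal figure to be -$836,072,105. The second part uses a made-up series (toy.ts is hypothetical) to show that the three components returned by decompose add back up to the original values wherever the trend is defined.

```r
# Figures quoted in the post for Q2 1998 (t = 10)
trend.t10    <- 129140000
seasonal.t10 <- -836072105
random.t10   <- 822912105

# Y[t] = T[t] + S[t] + e[t], which should equal 115,980,000
revenue.t10 <- trend.t10 + seasonal.t10 + random.t10
revenue.t10

# The same identity on a made-up quarterly series
toy.ts <- ts(c(10, 12, 11, 20,
               14, 16, 15, 28,
               19, 22, 20, 37), frequency = 4, start = 1996)
parts <- decompose(toy.ts)  # additive is the default type
rebuilt <- parts$trend + parts$seasonal + parts$random

# Trend and random are NA at both ends, so compare the defined quarters only
defined <- !is.na(parts$random)
all.equal(as.numeric(rebuilt)[defined], as.numeric(toy.ts)[defined])
```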
Once the previous code decomposes the time series you can reference each individual component with the $ operator.
components$seasonal
components$trend
components$random
I like to visually examine data as much as possible. For me it is the best way to draw insights and make conclusions. Luckily it is easy to plot the components by calling either autoplot or plot on the components object.
plot(components)
Amazon’s quarterly revenue shows strong upward trend & seasonality from the holiday shopping season.
As expected, there is a clear upward trend in the data. Additionally, I expected to see strong seasonality represented in a repeating pattern. This is characterized by the “saw tooth” of the seasonal section in the above plot. Interestingly, there is still some repeating pattern in the random section. I suspect the decomposition struggled with even larger Q4 spikes starting around 2010. Before 2010 the “random” values work to diminish the Q4 peak. As I examine the plot above I come away thinking that Amazon is growing exponentially and that the Q4 peaks are becoming more pronounced.
With time series decomposition it is easy to remove the seasonality effects in the data. To remove the effect of seasonality from Amazon’s quarterly revenue, simply subtract it from the original time series object. In this case subtract components$seasonal from data.ts. The resulting plot leaves behind the trend with the random unexplained variance.
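A minimal sketch of that subtraction, using a made-up quarterly series in place of data.ts (toy.ts, parts and adjusted.ts are illustrative names):

```r
# Made-up quarterly series with an upward drift and a Q4 spike
toy.ts <- ts(c(10, 12, 11, 20,
               14, 16, 15, 28,
               19, 22, 20, 37), frequency = 4, start = 1996)

# Additive decomposition, then remove the seasonal component
parts <- decompose(toy.ts)
adjusted.ts <- toy.ts - parts$seasonal

# The adjusted series keeps the quarterly layout but the Q4 spike is damped
adjusted.ts
```

Calling plot(adjusted.ts) would then draw the seasonally adjusted line. With the real data the same two steps would use data.ts and components$seasonal.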
Now that you have the data set, check out the rest of this post at ODSC!