Forecasting Amazon Holiday Sales: Gathering, Organizing & Exploration

This post was originally published at ODSC. Check out their curated blog posts!

 

When I worked at Amazon more than three years ago, our holiday season was always an adrenaline rush. My role was in customer service operations. We had record-breaking volume each holiday season, which meant customer service was the front line for customer issues, technical support and even catching fraud. Amazon's retail business, like all retail businesses in the US, is extremely cyclical. Working in operations and workforce planning means a lot of diligence is put into forecasting the Q4 spike.
To get a historical sense of the customer service spike (correlated with revenue) you can track the velocity on Amazon's public customer service forums, which see a 10x spike in Q4 each year. This makes working in an Amazon operational role one of the most lovingly frustrating and rewarding professional challenges I have ever been a part of. You have to hire many thousands of seasonal customer service agents worldwide, train them, plan the holiday staffing season to minimize mandatory overtime (which affects morale and cost), likely create new processes for some new device or service, and then wind it all down in an orderly fashion in Q1. When I was there, each holiday season Amazon launched a new Kindle, the Android app store or a video on demand service, which meant agents needed to be cross-trained accordingly. The agents' skills have to shift from answering "where's my stuff?" before the holidays to "how does this Kindle Fire I got as a gift work?" The intense period starting with Black Friday and Cyber Monday is one part chaos, one part beautiful orchestra music and two parts hard work so as not to fail our customers.
Since customer service is really a supply and demand problem, a lot of deliberation was put into Q4 planning. Given some unknown launch of "x" plus organic growth, the organization could expect a forecasted volume of emails, phone calls and chats every half hour across 25 queues globally. These contacts represent the demand curve on the ecosystem. Operational capacity planning analysis then forecast the needed supply of agents to staff each of these contact channels worldwide.
In this post I show two methods for understanding a time series. Since we can't truly know Amazon's contact volume outside of the organization, I will use time series decomposition and Holt-Winters forecasting on Amazon's quarterly revenue. Our goal will be to understand the revenue cycles and forecast this holiday season's revenue, which kicks off on Black Friday!

Gathering the Data

Surprisingly, finding Amazon's quarterly revenue from a free online source was painful. In the end I scraped it from www.wikinvest.com using rvest.
In the code below I load the packages to extract and organize the quarterly revenue. rvest is used to harvest web pages. pbapply provides progress bar versions of the base apply functions. Next, qdap has one of my favorite functions, mgsub, for multiple global substitutions. The last two, ggthemes and ggplot2, are used for creating nice visualizations. I also set a global option in R so character strings are not treated as categorical factors.
library(rvest)     # web scraping
library(pbapply)   # progress bar versions of the apply functions
library(qdap)      # mgsub for multiple global substitutions
library(ggthemes)  # extra ggplot2 themes such as theme_gdocs
library(ggplot2)   # plotting

options(stringsAsFactors = F)  # keep strings as characters, not factors

I found the information I was looking for on the page shown below. The screenshot shows the table of 2015 quarterly revenue that I want to extract and organize into our time series.

[Screenshot: wikinvest's table of Amazon's 2015 quarterly revenue]

Amazon’s annual wikinvest page can be found here.
Reviewing wikinvest's URL I noticed that the final parameter represents the year. Thus, I need to create a URL for Amazon for each year from 1996 to 2015 in order to extract each page's information. Using paste0 and seq, the object all.years is a text vector of the URLs I intend to web scrape. paste0 concatenates strings without separation. seq creates an integer sequence between two numbers, here 1996, 1997 and so on up to 2015. Since R recycles when needed, the base URL starting with "http…" repeats for each of the sequence numbers.
all.years<-paste0('http://www.wikinvest.com/stock/Amazon.com_(AMZN)/Data/Revenue/',seq(1996,2015))
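To spot-check the construction, the first and last elements of all.years should simply swap the year at the end of the base URL:

head(all.years, 1)
# [1] "http://www.wikinvest.com/stock/Amazon.com_(AMZN)/Data/Revenue/1996"
tail(all.years, 1)
# [1] "http://www.wikinvest.com/stock/Amazon.com_(AMZN)/Data/Revenue/2015"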

After some trial and error I found the exact table containing the revenue numbers is called "nv_related_nums." Further, cells 2 through 5 of that table hold the revenue numbers themselves. So I wrote a custom function called rev.get that reads a web page. Once the page is read, the node containing that table is selected, and the specific cells within it are extracted. Within the node > table > cell in question, just the html_text is extracted. The four revenue numbers from cells 2, 3, 4 and 5 are then organized into a data frame called revenues.

Follow this pseudo-code to understand the xpath within the function:

[Image: pseudo-code explanation of the rev.get xpath]
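If you'd like to see the xpath in action before scraping the real pages, here is a tiny illustration on a made-up HTML fragment. This is not the real wikinvest markup; I am assuming the id sits on the revenue row, which is how the xpath behaves on the live page.

library(rvest)

# made-up fragment for illustration only -- not the actual wikinvest markup
demo.page <- read_html('
  <table>
    <tr id="nv_related_nums">
      <td>Revenue</td><td>$1.0 billion</td><td>$1.1 billion</td>
      <td>$1.2 billion</td><td>$1.5 billion</td>
    </tr>
  </table>')

# //*[@id="nv_related_nums"] matches any element whose id is "nv_related_nums";
# /td[2] grabs its second <td> cell (Q1). Cells 3 through 5 hold Q2 through Q4.
html_text(html_node(demo.page, xpath = '//*[@id="nv_related_nums"]/td[2]'))
# [1] "$1.0 billion"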

rev.get <- function(x) {
  # read the page, then pull the text from cells 2 through 5 of the
  # "nv_related_nums" table, which hold the four quarterly revenue figures
  x <- read_html(x)
  revenues <- data.frame(
    q1 = html_text(html_node(x, xpath = '//*[@id="nv_related_nums"]/td[2]')),
    q2 = html_text(html_node(x, xpath = '//*[@id="nv_related_nums"]/td[3]')),
    q3 = html_text(html_node(x, xpath = '//*[@id="nv_related_nums"]/td[4]')),
    q4 = html_text(html_node(x, xpath = '//*[@id="nv_related_nums"]/td[5]'))
  )
  return(revenues)
}

Armed with this function I can now run my pre-constructed URLs through it. The pblapply function takes the all.years vector and applies rev.get to each URL. I simply unlist the resulting list to get a text vector containing all the information from the table cells for Amazon's quarterly revenue.
all.rev<-pblapply(all.years,rev.get)
all.rev<-unlist(all.rev)
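Assuming every page scraped cleanly, twenty yearly pages with four quarters each should yield 80 values:

length(all.rev)
# [1] 80   (20 years x 4 quarters, assuming no page failed to scrape)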

To examine what we’ve captured just use the head function on all.rev. The resulting information is shown in the following table.
head(all.rev)



The first six web-scraped Amazon revenue values.
Notice how each value is actually a text string and the web page has NA listed for some initial quarters? We will need to clean that up, but first let's organize the scraped data into a data frame. The code below creates a data frame with an ID column running from 1 to the total length of the scraped data (all.rev) and another column containing the scraped values.
all.df<-data.frame(ID=seq(1:length(all.rev)),string_revenue=all.rev)

Organizing the Data

When scraping data from the web a fair amount of data manipulation is often needed. Here we have text when we want numbers, there are NA values, and since Amazon's revenue started near zero and has grown into the billions, values are listed in both "million" and "billion." The next step is to clean it all up for analysis.
First I create an object called patterns. This is a small text vector containing " billion," " million," and "$." Next another object called replacements is created. Replacements is a single empty character string, which means it doesn't contain anything, not even a space between the quotes. BTW I like to use single quotes when I code but you can use double quotes "" for an empty character too.
patterns<-c(' billion',' million','$')
replacements<-c('')

I then pass patterns, replacements and the column of scraped revenue data, all.df$string_revenue, into mgsub. The function searches for any of the string patterns and substitutes each one with the empty character. The last function parameter is the vector where the substitutions will occur. The code below adds another column to the data frame called revenue, holding the cleaned version of string_revenue.
all.df$revenue<-mgsub(patterns,replacements,all.df$string_revenue)
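To see what mgsub is doing, here is a quick illustration on a single made-up value:

mgsub(patterns, replacements, '$25.36 billion')
# [1] "25.36"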

At this point the all.df$revenue column needs to be changed to a numeric value instead of a character. This is easy enough with as.numeric, shown below. By default this coerces the "N.A." strings to NA, so you will get a warning message, but that is expected behavior.
all.df$revenue<-as.numeric(all.df$revenue)
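For example, converting one of the placeholder strings by itself shows the coercion and the warning (the exact placeholder text on the page may differ slightly, but anything non-numeric coerces to NA):

as.numeric('N.A.')
# [1] NA
# Warning message: NAs introduced by coercion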

Recall that the values represent both millions and billions. Thus, the order of magnitude of each value needs to be adjusted to reflect reality. For example, the scraped data had "$27.9 million" and later "$25.36 billion." Obviously, there is a big difference between these values, so a change has to occur.
To account for this change I used an ifelse statement. An ifelse statement follows this logic:

If this logical test is TRUE then do this first operation otherwise perform the second operation.

The logical test in the code below uses grepl. I am searching for any string that contains "billion" in the column of raw scraped data. grepl returns TRUE if the pattern is found and FALSE if it is not. As a result, the test identifies any rows in the data set where the revenue is in billions; the FALSE rows are the ones where revenue was only in the millions.
The next operation is what to do when a “billion” is found in the row. Here the column addressed is all.df$revenue. I multiply the revenue number by 1,000,000,000.
If the billion was not found in the row, a FALSE from grepl, then I multiply the number by 1,000,000.
In the end I can reword the ifelse statement to now read like this:

If a row of the scraped data has “billion” in it then multiply the revenue value by 1,000,000,000 otherwise multiply the revenue row by 1,000,000.
all.df$revenue <- ifelse(grepl('billion', all.df$string_revenue),
                         all.df$revenue * 1000000000,
                         all.df$revenue * 1000000)
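Here is the same logic on two made-up values, one in millions and one in billions:

# grepl gives c(FALSE, TRUE), so the first value is scaled by 1e6, the second by 1e9
ifelse(grepl('billion', c('$27.9 million', '$25.36 billion')),
       c(27.9, 25.36) * 1000000000,
       c(27.9, 25.36) * 1000000)
# [1] 2.790e+07 2.536e+10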

Lastly, to keep track of the periodicity in the data frame, I added another column called period. Since this is quarterly data I paste together "Q", a sequence from 1 to 4, and the year. This is repeated for all years from 1996 to 2015. The tricky part is that you have to create four 1996s, four 1997s and so on, so within the rep function the last parameter, each=4, ensures the years line up with the Q1 to Q4 sequence.
all.df$period<-paste("Q",seq(1:4),rep(seq(1996,2015), each=4), sep="_")
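A quick peek confirms the quarters and years line up as intended:

head(all.df$period)
# [1] "Q_1_1996" "Q_2_1996" "Q_3_1996" "Q_4_1996" "Q_1_1997" "Q_2_1997"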
Finally we’re all set to start examining and forecasting Amazon’s holiday revenue!

Exploring the Data

Using the base summary function I like to examine the fundamental stats of the data. Not surprisingly, the minimum comes from the earliest quarters, where the page had no revenue listed, and the max is the most recent fourth quarter.
summary(all.df$revenue)

To set up a base ggplot, pass in the data frame, all.df, along with an x axis mapped to the ID column and a y axis mapped to the cleaned revenue values. Once that is stored in amzn.rev, add another layer using geom_line. Within geom_line I like a color that stands out, so I usually use "red" or "darkred". The next layer applies a pre-constructed aesthetic, theme_gdocs, to mimic Google Docs. The theme_gdocs layer comes from ggthemes and is useful for saving time. The last layer adds a title, "Amzn Quarterly Rev", using ggtitle.
amzn.rev <- ggplot(all.df, aes(x = ID, y = revenue))
amzn.rev + geom_line(color = 'darkred') + theme_gdocs() + ggtitle('Amzn Quarterly Rev')

[Figure: Amazon's quarterly revenue, 1996 to 2015]
Wow, it looks like each fourth quarter has the peak I experienced when I worked at Amazon! Visually examining this data reveals some interesting patterns. First, there is a clear and steep upward trend. Second, there is strong periodicity, a repeating pattern every four quarters. If we could obtain the data, there are likely monthly, weekly and possibly daily patterns that a forecaster could examine to produce a good future forecast.
To look at an individual quarter I set up a small custom function called quarter.explore. The function accepts a data frame and a string called qtr. The default value of qtr is "Q_1." To examine another quarter, change this parameter to "Q_2" and so on. The function produces a basic revenue plot for only the quarter in question and prints a summary for each of the quarters, as sketched below.
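The original function is in the full ODSC post; here is a minimal sketch of what it might look like, assuming it subsets on the period column built earlier.

quarter.explore <- function(df, qtr = 'Q_1') {
  # print a revenue summary for each of the four quarters
  print(tapply(df$revenue, substr(df$period, 1, 3), summary))
  # plot revenue for just the quarter in question
  qtr.df <- df[grepl(qtr, df$period), ]
  ggplot(qtr.df, aes(x = ID, y = revenue)) +
    geom_line(color = 'darkred') + theme_gdocs() +
    ggtitle(paste('Amzn Revenue,', qtr))
}

quarter.explore(all.df, qtr = 'Q_4')  # e.g. focus on the holiday quarters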

 
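As a preview of where this is all headed, here is a minimal sketch of turning the cleaned revenue into a quarterly ts object and running a base R decomposition and Holt-Winters fit; the full treatment is in the ODSC post. Note that the early quarters with missing values need to be dropped first, and the start argument should match whichever quarter actually begins your cleaned series.

# drop the early quarters that came through as NA
rev.clean <- all.df[!is.na(all.df$revenue), ]

# assumption: the cleaned series starts in Q3 1996; adjust start to match your data
rev.ts <- ts(rev.clean$revenue, start = c(1996, 3), frequency = 4)

plot(decompose(rev.ts))    # trend, seasonal and remainder components
hw <- HoltWinters(rev.ts)  # exponential smoothing with trend and seasonality
predict(hw, n.ahead = 4)   # a rough look at the next four quarters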

Now that you have the data set, check out the rest of this post at ODSC!
