This post was originally published at ODSC. Check out their curated blog posts!
Let’s face it: if you’re American and you celebrate Christmas, chances are you have visited Target or Wal-Mart or shopped at amazon.com this holiday season. Even if you haven’t, you are probably getting something from one of these retailers…let’s hope it’s a good old-fashioned data science book. The only other gift worth having is a ticket to the next ODSC!
Combined, these retailers made $187B in Q4 last year. To support that much product moving around the country there has to be an extensive logistics network. I wanted to explore this unseen backbone of US consumption, so I painstakingly gathered warehouse information from various sites. Not surprisingly, complete and unified data was hard to come by. I suspect each retailer wants to keep its distribution philosophy as secret as possible, to avoid competing for the same labor pool in an area and revealing how it weighs the cost of real estate against distance to population centers. Still, with some late nights I persevered and assembled a nearly complete data set, largely by collecting it manually and cross-validating against multiple sources.
In this post we will pretend Santa has to visit each of the warehouses so the Elves’ toys can be disseminated to all the children and good data scientists in the US. Along the way you will learn some basic mapping and routing techniques you can use the next time you have geospatial data.
Set Up
This work is an example of GIS, which stands for geographic information systems. Admittedly, I am no expert in this field, but I find it fascinating to visualize geospatial data. I like to use the basic maps package during EDA for geospatial data. It offers quick and easy, although somewhat unappealing, map information for plotting. Then I load both ggplot2 and ggthemes, which are among my favorite R visualization packages. Next, the TSP package is a simple package for “traveling salesperson problems.” In this type of problem you try to minimize the total distance traveled among multiple points in space. As you can imagine, this is complex and has real-world implications. Delivery companies, or in our case Santa, do not want to waste effort on a suboptimal route. In this post, we suppose Santa is like a traveling salesman who has to travel efficiently between Amazon, Target and Wal-Mart warehouses because he only has a day to do it.
Lastly, I use options to change the number of decimals R will print to 15, which is helpful during EDA when working with exact latitude and longitude values. I also use set.seed so your code will create the same visualizations as mine for the TSP map.
library(maps)
library(ggplot2)
library(ggthemes)
library(TSP)
options(digits=15)
set.seed(2016)
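Before touching the warehouse data, a toy run shows how the TSP pieces fit together. The five points below are made up purely for illustration:

```r
# A toy ETSP: five random points in the plane
pts <- matrix(runif(10), ncol = 2)
toy <- ETSP(pts)

# Solve with the two_opt heuristic and inspect the result
toy.tour <- solve_TSP(toy, method = 'two_opt')
tour_length(toy.tour)   # total Euclidean distance of the loop
as.integer(toy.tour)    # the order in which the points are visited
```

The same three calls, on the real warehouse coordinates, are all we need later in the post.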
Next, go get the data here and load it using read.csv. When the data frame is parsed some of the vector classes come in wrong, so I use as.character followed by as.numeric to fix $sq_ft, which read.csv imports as a factor. You may want to change the others as you explore the data more.
warehouses<-read.csv('warehouse_stats_w_latlon.csv')
warehouses$sq_ft<-as.numeric(as.character(warehouses$sq_ft))
Exploring the Warehouse Locations
It’s always a good idea to perform basic EDA on a new data set. I usually start with head to examine a few rows. This code prints the top 10 rows to your console; the default is 6. You will notice the column names and also some NA values. This data was challenging to obtain, so in some cases I entered NA, or NA was returned when I geocoded a location for its lat/lon.
head(warehouses,10)
Since I collected multiple retailers, you may be interested in how the warehouses are distributed among the companies. Using table will tally them by retailer. I also use barplot in the second line to visualize the frequency of warehouses. Surprisingly, Amazon has more warehouses than Wal-Mart, despite the fact that Wal-Mart does about 4x the annual revenue. Clearly the organizations think about distribution efficiency differently.
table(warehouses$retailer)
barplot(table(warehouses$retailer))
Amazon has more warehouses even when Sam’s Club is added to the Wal-Mart total.
A popular base function is summary, which can be applied to an entire data set. However, a more nuanced way to explore data is to summarize one feature by another. summary can be applied this way using tapply along with the vectors $sq_ft and $retailer. The tapply function applies summary to each group formed by the intersection of the vectors. The result is a list summarizing the square footage of warehouses for each retailer.
tapply(warehouses$sq_ft, warehouses$retailer, summary)
Comparing the median values from the code above shows that Wal-Mart is committed to larger warehouses, which means it needs fewer of them than Amazon. Specifically, Amazon’s median warehouse size is 449,000 square feet compared to Wal-Mart’s 880,000.
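Since tapply accepts any summary function, you can also pull out just the medians directly; na.rm = TRUE guards against the missing square footages noted earlier:

```r
# Median warehouse size in square feet, by retailer
tapply(warehouses$sq_ft, warehouses$retailer, median, na.rm = TRUE)
```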
To get more feature interactions, the aggregate function can be helpful. This code provides the mean square footage by retailer and by warehouse type. In the second parameter, pass in a list containing the features to subset by.
aggregate(warehouses$sq_ft, by=list(warehouses$retailer,warehouses$type), FUN=mean,na.rm = T)
This code lets you compare the size of each retailer’s warehouses according to their function. For example, the table below is the portion of the aggregate result comparing average “food” distribution center size for the three retailers.
Retailer | Type | Sq_Ft
--- | --- | ---
Amazon | Food | 317,589
Target | Food | 431,600
Wal-Mart | Food | 802,195
Again, you can see that Wal-Mart is committed to large warehouse centers by type.
During your EDA you should have observed multiple NAs. Geospatial functions often fail on NA lat/lon values, and a quick way to exclude them is with subset. Since a row with an NA latitude will also have an NA longitude, this single line of code removes every offending row by testing latitude alone. The complete.cases function supplies the logical declaration, returning TRUE when the value is not NA.
complete.latlon<-subset(warehouses,
complete.cases(warehouses$lat)==T)
Quick Map
To get a quick EDA map visual I like to use the map function. First plot the line map, specifying “state” to get the lower 48. Next add points to represent each fulfillment center; the X values are longitude and the Y values are latitude. Lastly, I tell R to use 4 colors and 4 shapes, one for each of the 4 retailers: Wal-Mart, Sam’s Club, Target and Amazon.
map('state')
# Index colors and shapes by retailer so each retailer gets its own color/shape
points(complete.latlon$lon, complete.latlon$lat,
       col = c('orange','blue','red','yellow')[as.factor(complete.latlon$retailer)],
       pch = c(16,17,18,19)[as.factor(complete.latlon$retailer)])
The result isn’t pretty, but during EDA I am less concerned with aesthetics. The point is that there are dense East Coast and West Coast clusters no matter the retailer. This type of quick map is useful for finding insights but isn’t really presentation worthy.
A quick map showing a lot of Eastern warehouses and none in Idaho, Montana, Wyoming, and the Dakotas.
The Traveling Santa Problem
Santa will be busy and will not have much time to get all these gifts to the warehouses around the US. Luckily Mrs. Claus, a Kaggle Grandmaster, knows that data science can help. To get started, the ETSP function accepts an X and a Y vector within a matrix. This is a Euclidean traveling salesperson problem (ETSP) since Santa can fly straight between points, measured as Euclidean distance. In “reality” Earth is a sphere, so a great-circle approach is what airlines would actually use. Remember, these points represent the warehouse locations, and the goal is to minimize the distance Santa has to travel. The solve_TSP function applies heuristic solutions to the minimization problem. Here I selected “two_opt”, which essentially tries to minimize how often Santa backtracks over his route. A good article on the approach is here. Another method in solve_TSP is “nn”, which stands for nearest neighbor. In this case the function always travels to the nearest unvisited warehouse next, but allows Santa’s path to cross itself.
latlon.m<-ETSP(cbind(complete.latlon$lat, complete.latlon$lon))
santa.tour<-solve_TSP(latlon.m, method = 'two_opt')
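If you are curious how much the choice of heuristic matters, the TSP package’s tour_length function lets you compare routes. nn.tour below is just a name I made up for the nearest-neighbor alternative:

```r
# Solve the same problem with nearest neighbor and compare tour lengths
nn.tour <- solve_TSP(latlon.m, method = 'nn')
tour_length(santa.tour)   # two_opt route length (in lat/lon units)
tour_length(nn.tour)      # nearest-neighbor route length
```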
To examine the result, print santa.tour in your console. This shows the class and the length of Santa’s trip, but the length is in Euclidean lat/lon units, so it isn’t that helpful. Instead, use plot to see the entire route firsthand. In this example I did not specify a start = value. Technically Santa would be coming from the North Pole, which is lat 90, lon 0, but I didn’t bother adding it ☺
santa.tour
plot(latlon.m, santa.tour)
The basic Santa Tour plot isn’t very informative!
To get the order of warehouses that Santa will visit, call as.integer on the tour object. This extracts the row indices from santa.tour. In this example, Santa should start at row 81, then visit rows 80, 82 and so on. You can see the full order by calling route.order in your console.
route.order<-as.integer(santa.tour)
route.order
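As a quick sanity check, you can reorder the warehouse data frame itself into Santa’s visiting sequence. santa.stops is a hypothetical name, and the columns selected are the ones used earlier in the post:

```r
# Put the warehouses in tour order so row 1 is Santa's first stop
santa.stops <- complete.latlon[route.order, c('retailer', 'lat', 'lon')]
head(santa.stops)
```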
Now that you have the data set, check out the rest of this post at ODSC!