This post was originally published at DataCamp. Check out their other curated blog posts.
We’re back with our third post investigating H-1B data. If you want to learn how to programmatically collect the data and perform exploratory data analysis (EDA), check out the first post here.
In this final installment, you will learn how to geocode locations to get latitude and longitude coordinates from an API. Then you will map the data. Next, you will identify the top H-1B software developer employers and compare their salaries by year.
I hope these views help you see where you should apply, how much to negotiate for, and which regions of the country have high volume and good salaries for software developers.
I chose software developers because these are among the most widespread job titles in our data set. Plus, I have a few technically minded friends in need of sponsorship. Hopefully this code and the visuals can help them hone their employment efforts!
Get the Data
You could review the original H-1B posts here to web scrape the data like I did. If you do not have the patience, I suggest heading to my GitHub repo to get the CSV file.
The first thing to do is set up your R session with the correct packages. The bit64 package lets data.table work efficiently on vectors of 64-bit integers. The data.table package is once again used because it is so efficient. Next you load lubridate in case you want to perform date-related functions. After that, pbapply gives your R session progress bars when using apply functions. The jsonlite package is used to parse the geocoded API response from Bing Maps. Lastly, plyr is a great package for manipulating data.
The options() line sets two global options: scipen=999 gets rid of scientific notation and digits=15 increases the number of decimal places printed, since you often need many decimal places when working with latitude and longitude. Using set.seed() ensures your visuals will look similar to those in this post by selecting a specific seed rather than a random one.
library(bit64)
library(data.table)
library(lubridate)
library(pbapply)
library(jsonlite)
library(plyr)
options(scipen=999, digits=15)
set.seed(1234)
The next step is to read in the CSV data from the repository. The fread() function is the "fast and friendly file finagler" that quickly reads the file and provides a progress update. In my experience, it is the best way to import large data frames.
The next lines of code apply ymd() to the date-related vectors. If you followed along from the previous posts, you would not need to do this, but using fread() means you will need to correct the vector types.
h1b.data<-fread('h1b_data_clean.csv')
h1b.data$start_date<-ymd(h1b.data$start_date)
h1b.data$submit_date<-ymd(h1b.data$submit_date)
Now that your data is in, let’s identify the most frequent job titles. To create h1b.jobs, you first change the $job_title vector to a factor. Next, use table() to tally the unique job titles and nest the result in as.data.frame() so it can be easily manipulated. The third line orders the new data frame by the new $Freq vector. Since the default behavior of order() is to arrange values in ascending order, you can review the six most frequent job titles with tail().
h1b.jobs<-as.factor(h1b.data$job_title)
h1b.jobs<-as.data.frame(table(h1b.jobs))
h1b.jobs<-h1b.jobs[order(h1b.jobs$Freq),]
tail(h1b.jobs)
Reviewing h1b.jobs, you can pick out the jobs you are interested in. Here, you create jobs by concatenating "SOFTWARE DEVELOPER", "COMPUTER PROGRAMMER" and "SOFTWARE ENGINEER". Notice how each is separated, without spaces, by the pipe operator. The "|" is above your Enter key and represents "OR", so you are going to subset h1b.data by "A OR B OR C", e.g. "A|B|C".
jobs<-c('SOFTWARE DEVELOPER|COMPUTER PROGRAMMER|SOFTWARE ENGINEER')
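If the pipe syntax is new to you, here is a quick toy check, using made-up strings rather than the H-1B data, that shows how "|" behaves inside grepl():
# Toy example: the pipe means OR, so any title containing one of the terms is TRUE
toy.titles<-c('SENIOR SOFTWARE ENGINEER','ACCOUNTANT','COMPUTER PROGRAMMER II')
grepl('SOFTWARE DEVELOPER|COMPUTER PROGRAMMER|SOFTWARE ENGINEER',toy.titles)
# [1]  TRUE FALSE  TRUE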
The subset() function keeps a row of data whenever a logical statement is TRUE. Using grepl() to get a logical outcome, pass in jobs and search within $job_title. Now h1b.devs has ~262K rows instead of h1b.data’s ~1.8M.
h1b.devs<-subset(h1b.data,grepl(jobs,h1b.data$job_title)==T)
Geocoding with Bing Maps
To create a map, you will need lat/lon coordinates. There are multiple services for looking up locations, such as Google Maps. However, I like to use Bing Maps since the limit is 25,000 lookups per day. To use this code, sign up here for a Bing Maps API key. You could look up all locations row by row; however, you would reach your Bing limit quickly and have to wait to finish. Instead, I suggest looking up each unique location and then joining the information later to save time.
To find the unique values in a character vector, change $location with as.factor() and then use levels() to extract the unique levels. In this data set, the 262,659 Software Developer H-1B applications come from only 1,763 unique locations.
all.locs<-levels(as.factor(h1b.devs$location))
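As a quick sanity check, the length of all.locs should match the 1,763 unique locations mentioned above:
length(all.locs)
# [1] 1763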
Some towns, like New York, have a space in the name that needs to be encoded to work properly in a URL. The URLencode() function fills in the spaces to make "new%20york". In this code, URLencode() is applied to the unique location vector, all.locs, and the result is unlisted back to a character vector.
all.locs<-unlist(pblapply(all.locs,URLencode))
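For instance, calling the function on a two-word town shows the space being replaced:
URLencode('new york')
# [1] "new%20york"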
After signing up, paste your API key between the quotes to make bing.key. Create the location URLs using paste0(). Pass in the base section of the URL, then the all.locs vector, some more URL information and, lastly, your API key string. R will recycle the individual address parts as paste0() works through all.locs.
bing.key<-'BING_MAPS_API_KEY'
bing.urls<-paste0('http://dev.virtualearth.net/REST/v1/Locations?q=',all.locs,'&maxResults=1&key=', bing.key)
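If the recycling behavior is unfamiliar, a toy call shows how paste0() repeats the single strings across a longer vector, which is exactly what happens with the URL pieces above:
paste0('start-',c('a','b','c'),'-end')
# [1] "start-a-end" "start-b-end" "start-c-end"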
Next you need a function so you do not have to write code for each URL with location information.
The bing.get() function accepts an individual URL and collects the response with fromJSON(). Then, two ifelse() statements identify the individual lat/lon values. In each statement, if the API provided a response, the lat and lon are selected from within the response; if the API could not find the location, NA is returned instead. This handles API responses without lat/lon values so your code does not break. Lastly, the lat/lon values are organized into a small data frame.
bing.get<-function(bing.url){
  # Parse the JSON response from the Bing Maps API
  bing.resp<-fromJSON(bing.url, simplifyDataFrame = T)
  # If the API found the location, extract the latitude; otherwise return NA
  ifelse(bing.resp$resourceSets$estimatedTotal > 0,
         lat<-bing.resp$resourceSets$resources[[1]]$point$coordinates[[1]][1], lat<-'NA')
  # Same check for the longitude
  ifelse(bing.resp$resourceSets$estimatedTotal > 0,
         lon<-bing.resp$resourceSets$resources[[1]]$point$coordinates[[1]][2], lon<-'NA')
  # Organize the pair into a one-row data frame
  lat.lon<-data.frame(lat=lat,lon=lon)
  return(lat.lon)
}
To test the function, apply it to the first element of bing.urls. If the code works, your R console will print the lat/lon for the first location.
bing.get(bing.urls[1])
Now that you have verified that the function works and the URLs have been created correctly, you can apply it to the entire bing.urls vector. Using pblapply(), the progress-bar version of lapply(), pass in the Bing URLs and the bing.get function. This returns a list of 1,763 data frames, each with $lat and $lon columns.
A fast way to combine the data frames is with rbindlist(), which binds the list elements row-wise to create a single data frame of lat/lon values.
all.coords<-pblapply(bing.urls,bing.get)
coords.df<-rbindlist(all.coords)
In order to join the original Software Developer H-1B visa application data to the location data, you need a common value. This code adds $location using the same code as before to extract the unique locations. Then $lat and $lon are changed from character to numeric values. The final data frame can be examined with head().
coords.df$location<-levels(as.factor(h1b.devs$location))
coords.df$lat<-as.numeric(as.character(coords.df$lat))
coords.df$lon<-as.numeric(as.character(coords.df$lon))
head(coords.df)
Next you need to add the lat/lon values to h1b.devs. Using join() treats coords.df as a lookup table: every time one of the locations in h1b.devs is found in the coords.df data frame, the corresponding $lat and $lon values are added. To check the number of NA values returned from the API, use is.na() along with sum().
h1b.devs<- join(h1b.devs,coords.df,by='location')
sum(is.na(h1b.devs$lat))
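If any lookups failed, one option (not part of the original workflow) is to drop those rows before mapping, for example:
# Keep only the applications whose locations were successfully geocoded
h1b.devs<-h1b.devs[!is.na(h1b.devs$lat),]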
The h1b.devs object now contains all H-1B visa applications for Software Developers, along with the lat/lon of each job location.
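As a quick check (assuming the column names from the earlier posts were kept), you can peek at a few of the relevant columns:
head(h1b.devs[,c('employer','job_title','location','lat','lon')])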
Mapping H-1B Software Developer Information
You can construct a map now that the Software Developer H-1B data has the appropriate lat and lon values. To start, let’s create a static map with unique locations only.
The code below references the 1,763 unique lat/lon pairs to demonstrate software developer "hotspots". For example, San Francisco and San José are distinct locations, but using lat/lon you can see how the locations cluster on the map.
For this code, add ggplot2 and ggthemes. These are popular grammar-of-graphics plotting libraries:
library(ggplot2)
library(ggthemes)
Start by loading the state map information. Calling map_data() with "state" creates a data frame with coordinates and plotting information. This is used as the base layer of the US map.
state.map <- map_data("state")
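A quick look at the first rows shows the columns the polygon layer uses (long, lat and group):
head(state.map)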
Next, create the base ggplot. The polygon layer declares the state.map data and maps x and y to long and lat, respectively. This part of the code is separated so that you can see how the layers are added and changed in the next part.
ggplot() + geom_polygon(data = state.map, aes(x=long,y=lat,group=group),
colour = "grey", fill = NA)
This is the base USA Map layer.
On top of this layer you will add points with geom_point(). In this layer you declare the coords.df object with lon and lat. You make the dots semi-transparent with alpha and specify the color "red". Next, you limit the plot to center it on the US, add a title and a theme, and finally remove a number of plot elements.
ggplot() +
geom_polygon(data = state.map, aes(x=long,y=lat,group=group),
colour = "grey", fill = NA) +
geom_point(data=coords.df, aes(x=lon, y=lat), alpha=0.05, col="red") +
xlim(-125,-65) + ylim(25,50) +
ggtitle("Certified H1B Visas") +theme_gdocs() +
theme(panel.border = element_blank(),
axis.text = element_blank(),
line = element_blank(),
axis.title = element_blank())
The plot is a static image showing clusters of H-1B locations:
Software Developer H-1B Application Locations
It’s better to have a dynamic map so you can interact with the data more freely. The JavaScript leaflet library has an R package for creating HTML-based interactive maps. Load the leaflet library along with htmltools to get started.
library(leaflet)
library(htmltools)
Since the plotting of the points is done client-side with JavaScript, it can be time-consuming, so you can sample the data. Sampling a data table can be done while indexing by using sample() with .N for the number of rows and the sample size, 1000. Next, use jitter() to slightly adjust each lat/lon value in case there are duplicates.
leaf.sample<-h1b.devs[sample(.N,1000),]
leaf.sample$lat<-jitter(leaf.sample$lat)
leaf.sample$lon<-jitter(leaf.sample$lon)
An easy way to use leaflet is with the %>% operator. Confusingly, this is also called the pipe! This pipe operator, %>%, forwards an object into the next function, which makes code concise and easy to follow.
Here, the leaf.sample data frame is passed into the leaflet() function. The object is forwarded into addTiles() to get the map graphics, and that again is forwarded into addMarkers(). This could be done with three lines of code but takes only one with %>% (a non-piped version is shown after the map code for comparison).
Within the addMarkers() function, a popup is added so that you can click on a marker to see the employer.
leaflet(leaf.sample)%>% addTiles() %>% addMarkers(popup = ~htmlEscape(employer))
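For comparison, here is a sketch of the same map written without the pipe, as separate assignments:
# The same map without %>%: each step is assigned explicitly
leaf.map<-leaflet(leaf.sample)
leaf.map<-addTiles(leaf.map)
leaf.map<-addMarkers(leaf.map, popup = ~htmlEscape(employer))
leaf.map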
Now that you have the data set, check out the rest of this post at DataCamp!