Exploring H-1B Data with R: Part 3

This post was originally published at DataCamp. Check out their other curated blog posts.

We’re back with our third post investigating H-1B data. If you want to learn how to programmatically collect and perform exploratory data analysis (EDA), check out the first post here.

In this final installment, you will learn how to geocode locations to latitude and longitude coordinates using an API. Then you will create a map of the data. Next, you will identify the top H-1B software developer employers and compare their salaries by year.

I hope these views help you see where you should apply, how much to negotiate for, and which regions of the country have high application volume and good salaries for software developers.

I chose software developers because these are among the most widespread job titles in our data set. Plus, I have a few technically minded friends in need of sponsorship. Hopefully this code and the visuals can help them hone their employment efforts!

Get the Data

You could review the original H-1B posts here to web scrape the data like I did. If you do not have the patience, I suggest heading to my GitHub repo to get the CSV file.

The first thing to do is set up your R session with the correct packages. The bit64 package lets data.table work efficiently on vectors of 64-bit integers.

The data.table package is once again used since it is so efficient. Next, lubridate is loaded in case you want to perform date-related operations. After that, pbapply gives your R session progress bars when using apply functions. The jsonlite package is used to parse the geocoded API response from Bing Maps. Lastly, plyr is a great package for manipulating data.

The options() line sets global options to get rid of scientific notation and to increase the number of decimal places printed to 15; you often need many decimal places when working with latitude and longitude. Using set.seed() ensures your visuals will look similar to this post's by selecting a specific seed instead of a random one.

library(bit64)
library(data.table)
library(lubridate)
library(pbapply)
library(jsonlite)
library(plyr)
options(scipen=999, digits=15)
set.seed(1234)

The next step is to read in the CSV data from the repository. The fread() function is a "fast and friendly file finagler" that reads data quickly and provides a progress update. In my experience, it is the best way to import large data frames.

The next lines of code apply ymd() to the date vectors.

If you followed along from the previous posts, you wouldn't need to do this, but reading the CSV back in with fread() means you will need to correct the vector types.

h1b.data<-fread('h1b_data_clean.csv')
h1b.data$start_date<-ymd(h1b.data$start_date)
h1b.data$submit_date<-ymd(h1b.data$submit_date)
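If you want to confirm that the conversion worked, you can check the column classes; this is just a quick sanity check and is not required for the rest of the post:

class(h1b.data$start_date)  # should now be "Date"
class(h1b.data$submit_date) # should now be "Date"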

Now that your data is in, let's identify the most frequent job titles. To create h1b.jobs, you should change the $job_title vector to categorical. Next, use table() to tally the unique job titles and nest the result in as.data.frame() so it can be easily manipulated. The third line orders the new data frame by the new $Freq vector. Since the default behavior of order() is to arrange values in ascending order, you can review the six most frequent job titles with tail().

h1b.jobs<-as.factor(h1b.data$job_title)
h1b.jobs<-as.data.frame(table(h1b.jobs))
h1b.jobs<-h1b.jobs[order(h1b.jobs$Freq),]
tail(h1b.jobs)
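As a side note, the same top-of-the-tally view can be produced directly in data.table syntax. This is just an equivalent alternative; the rest of the post does not rely on it:

h1b.data[, .N, by = job_title][order(-N)][1:6]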

Reviewing h1b.jobs, you can pick out the jobs you are interested in.

Here, you can create jobs by concatenating “SOFTWARE DEVELOPER,” “COMPUTER PROGRAMMER” and “SOFTWARE ENGINEER.”

Do you notice how each is separated, without a space, by the pipe operator? The "|" is above the Enter key on most keyboards and represents "OR". So you are going to subset your h1b.data by "A OR B OR C", i.e. "A|B|C".
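As a quick illustration of how the alternation works, here is grepl() on a couple of made-up job titles (these strings are just for demonstration, not from the data set):

grepl('SOFTWARE DEVELOPER|COMPUTER PROGRAMMER', c('SOFTWARE DEVELOPER II', 'DATA ANALYST'))
# TRUE FALSE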

jobs<-c('SOFTWARE DEVELOPER|COMPUTER PROGRAMMER|SOFTWARE ENGINEER')

The subset() function will keep a row of data whenever a logical statement is TRUE. To get a logical outcome with grepl(), pass in jobs and search within $job_title. Now h1b.devs has ~262K rows instead of h1b.data's ~1.8M.

h1b.devs<-subset(h1b.data,grepl(jobs,h1b.data$job_title)==T)

Geocoding with Bing Maps

To create a map, you will need lat/lon coordinates. There are multiple services for looking up locations, such as Google Maps.

However, I like to use Bing Maps since the limit is 25,000 lookups per day. To use this code, sign up here for a Bing Maps API key. You could look up all locations row by row, but you would reach your Bing limit quickly and have to wait to finish.

Instead, I suggest looking up each unique location and then joining the information later to save time.

To find the unique values in a character vector, convert $location with as.factor() and then use levels() to extract the unique levels.

In this data set, the 262,659 software developer H-1B applications come from only 1,763 unique locations.

all.locs<-levels(as.factor(h1b.devs$location))

Some towns, like New York, have a space that needs to be encoded to work properly in a URL. The URLencode() function will fill in spaces to make "new%20york". In this code, URLencode() is applied to the unique location vector, all.locs, and the result is unlisted back into a character vector.

all.locs<-unlist(pblapply(all.locs,URLencode))
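If you are curious what the encoding looks like, you can try URLencode() on a single made-up location string:

URLencode('NEW YORK, NY')
# "NEW%20YORK,%20NY"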

After signing up, paste your API key in between the quotes to make bing.key. Create the location URLs using paste0(). Pass in the base section of the URL, then the all.locs vector, some more URL information and lastly your API key string. R will recycle all the individual address parts as paste0() works through all.locs.

bing.key<-'BING_MAPS_API_KEY'
bing.urls<-paste0('http://dev.virtualearth.net/REST/v1/Locations?q=',all.locs,'&maxResults=1&key=', bing.key)

Next you need a function so you do not have to write code for each URL with location information.

The bing.get() function accepts an individual URL. The response is collected using fromJSON().

Then, two ifelse() statements identify the individual lat/lon values. In each statement, if the API provided a result, the lat or lon is selected from within the response; if the API couldn't find the location, NA is returned instead. This handles API responses without lat/lon values so your code doesn't break.

Lastly, the lat/lon values are organized into a small data frame.

bing.get<-function(bing.url){
  # request and parse the JSON response for a single location URL
  bing.resp<-fromJSON(bing.url, simplifyDataFrame = T)
  # if the API found the location, extract the latitude; otherwise return NA
  ifelse(bing.resp$resourceSets$estimatedTotal > 0,
         lat<-bing.resp$resourceSets$resources[[1]]$point$coordinates[[1]][1], lat<-'NA')
  # same check for the longitude
  ifelse(bing.resp$resourceSets$estimatedTotal > 0,
         lon<-bing.resp$resourceSets$resources[[1]]$point$coordinates[[1]][2], lon<-'NA')
  # organize the pair into a one-row data frame
  lat.lon<-data.frame(lat=lat,lon=lon)
  return(lat.lon)
}

To test the function, apply it to the first element of bing.urls. If the code works, your R console will print the lat/lon for the first location.

bing.get(bing.urls[1])

Now that you have verified that the function works and the URLs have been created correctly, you can apply the function to the entire bing.urls vector. Using the progress bar version of lapply(), pblapply(), pass in the Bing URLs and the bing.get function. This will return a list of 1,763 data frames, each with $lat and $lon columns.

A fast way to combine the data frames is with rbindlist(), which will bind the list elements row-wise to create a single data frame of lat/lon values.

all.coords<-pblapply(bing.urls,bing.get)
coords.df<-rbindlist(all.coords)

In order to join the original Software Developer H-1B visa application data to the location data you need a common value. This code adds $location using the same code to extract unique locations as before. Then $lat and $lon are changed from character to numeric values. The final data frame can be examined with head().

coords.df$location<-levels(as.factor(h1b.devs$location))
coords.df$lat<-as.numeric(as.character(coords.df$lat))
coords.df$lon<-as.numeric(as.character(coords.df$lon))
head(coords.df)

Next you need to add the lat/lon values to h1b.devs. Using join() treats coords.df as a lookup table: every time one of the locations in h1b.devs is found in the coords.df data frame, the corresponding $lat and $lon values are added. To check the number of NA values returned from the API, use is.na() along with sum().

h1b.devs<- join(h1b.devs,coords.df,by='location')
sum(is.na(h1b.devs$lat))
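If that sum is greater than zero, you may want to drop the rows that failed to geocode before mapping. A minimal way to do that, assuming you are fine discarding those applications:

h1b.devs<-h1b.devs[!is.na(h1b.devs$lat),]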

The h1b.devs object contains all H-1B Visa applications for Software Developers along with lat/lon for the job location.

Mapping H-1B Software Developer Information

You can construct a map now that the Software Developer H-1B Data has the appropriate lat and lon. To start, let’s create a static map with unique locations only.

The code below references the 1,763 unique lat/lon pairs to demonstrate software developer "hotspots". For example, San Francisco and San José are distinct locations, but using lat/lon you can see how they cluster on the map.

For this code, add ggplot2 and ggthemes, two popular grammar-of-graphics plotting libraries:

library(ggplot2)
library(ggthemes)

Start by loading the state map information. Using map_data() with "state" will create a data frame with coordinates and plotting information (map_data() pulls from the maps package, so make sure it is installed). This is used as the base layer of the US map.

state.map <- map_data("state")
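You can peek at the result with head() to see the columns ggplot2 will use, including long, lat and group:

head(state.map)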

Next, create a base ggplot. The polygon layer declares state.map as the data and maps x and y to long and lat respectively. This part of the code is shown separately so you can see how the layers are added and changed in the next part.

ggplot() + geom_polygon(data = state.map, aes(x=long,y=lat,group=group), 
               colour = "grey", fill = NA)

This is the base USA Map layer.

On top of this layer you will add points with geom_point(). In this layer you declare the coords.df object with lon and lat. You make the dots semi-transparent with alpha and specify the color "red". Next, you limit the plot to center it on the US, add a title and theme, and finally remove a number of plot elements.

ggplot() + 
  geom_polygon(data = state.map, aes(x=long,y=lat,group=group), 
               colour = "grey", fill = NA) + 
  geom_point(data=coords.df, aes(x=lon, y=lat), alpha=0.05, col="red")  +
  xlim(-125,-65) + ylim(25,50) + 
  ggtitle("Certified H1B Visas") + theme_gdocs() +
  theme(panel.border = element_blank(),
        axis.text = element_blank(),
        line = element_blank(),
        axis.title = element_blank())

The plot is a static image showing clusters of H-1B locations:

Software Developer H-1B Application Locations
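If you want to keep a copy of the static map, ggsave() writes the most recently displayed ggplot to disk; the file name and dimensions below are just examples:

ggsave('h1b_dev_map.png', width = 10, height = 6)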

It’s better to have a dynamic map so you can interact with the data more freely. The JavaScript leaflet library has an R package, leaflet, for creating HTML-based interactive maps. Load the leaflet library along with htmltools to get started.

library(leaflet)
library(htmltools)

Since the plotting of the points is done client-side with JavaScript, rendering can be time-consuming. As a result, you can sample the data. Sampling a data.table can be done while indexing by using sample() with .N for the number of rows and the sample size, 1,000. Next, use jitter() to slightly adjust each lat/lon value in case there are duplicates.

leaf.sample<-h1b.devs[sample(.N,1000),]
leaf.sample$lat<-jitter(leaf.sample$lat)
leaf.sample$lon<-jitter(leaf.sample$lon) 

An easy way to use leaflet is with the %>% operator.

Confusingly, this is also called the pipe!

This time the pipe operator, %>%, forces an object forward into the next function. This makes code concise and easy to follow.

Here, the leaf.sample data frame is passed into the leaflet() function. The object is forwarded into addTiles() to get map graphics, and that again is forwarded into addMarkers(). This could be done in three lines of code but takes only one with %>%.

Within the addMarkers() function a popup is added so that you can click on a marker to see the employer.

leaflet(leaf.sample) %>% addTiles() %>% addMarkers(popup = ~htmlEscape(employer))
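For comparison, here is what the same map would look like without the pipe, built in three separate steps; it is functionally equivalent, just more verbose:

leaf.map<-leaflet(leaf.sample)
leaf.map<-addTiles(leaf.map)
leaf.map<-addMarkers(leaf.map, popup = ~htmlEscape(employer))
leaf.map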

 

Now that you have the data set, check out the rest of this post at DataCamp!

 
