Two years as a Data Scientist at Stack Overflow
This article is originally published at http://varianceexplained.org
Last Friday marked my two year anniversary working as a data scientist at Stack Overflow. At the end of my first year I wrote a blog post about my experience, both to share some of what I’d learned and as a form of self-reflection.
After another year, I’d like to revisit the topic. While my first post focused mostly on the transition from my PhD to an industry position, here I’ll be sharing what has changed for me in my job in the last year, and what I hope the next year will bring.
Hiring a Second Data Scientist
In last year’s blog post, I noted how difficult it could be to be the only data scientist on a team:
Most of my current statistical education has to be self-driven, and I need to be very cautious about my work: if I use an inappropriate statistical assumption in a report, it’s unlikely anyone else will point it out.
This continued to be a challenge, and fortunately in December we hired our second data scientist, Julia Silge.
I have some very exciting news! I am joining the data team at @StackOverflow. ✨✨✨ — Julia Silge (@juliasilge) December 13, 2016
We started hiring for the position in September, and I got to meet and review a lot of terrific candidates during the application process. But I was particularly excited to welcome Julia to the team because we’d been working together during the course of the year, ever since we met and created the tidytext package at the 2016 rOpenSci unconference.
Julia, like me, works on analysis and visualization rather than building and productionizing features, and having a second person in that role has made our team much more productive. This is not just because Julia is an exceptional colleague, but because the two of us can now collaborate on statistical analyses or split them up to give each more focus. I did enjoy being the first data scientist at the company, but I’m glad I’m no longer the only one. Julia’s also a skilled writer and communicator, which was essential in achieving the next goal.
Company blog posts
In last year’s post, I shared some of the work that I’d done to explore the landscape of software developers, and set a goal for the following year (emphasis is new):
I’m also just intrinsically pretty interested in learning about and visualizing this kind of information; it’s one of the things that makes this a fun job. One plan for my second year here is to share more of these analyses publicly. In a previous post I looked at which technologies were the most polarizing, and I’m looking forward to sharing more posts like that soon.
I’m happy to say that we’ve made this a priority in the last six months. Since December I’ve gotten the opportunity to write a number of posts for the Stack Overflow company blog:
- How Do Software Developers in New York, San Francisco, London and Bangalore Differ?
- Developers, Webmasters, and Ninjas: What’s in a Job Title?
- Developers Without Borders: The Global Stack Overflow network
- How Do Students Use Stack Overflow?
- Does Anyone Actually Visit Stack Overflow’s Home Page?
- What Programming Languages Are Used Late at Night?
- Introducing Stack Overflow Trends
- Exploring the State of Mobile Development with Stack Overflow Trends
- Stack Overflow: Helping One Million Developers Exit Vim
- Developers Who Use Spaces Make More Money Than Those Who Use Tabs
Other members of the team have written data-driven blog posts as well, including:
- The Changing Landscape of Programming Technologies (Kevin Montrose)
- Benefits for Developers from San Francisco to Sweden (Julia Silge)
- Women in the 2016 Stack Overflow Survey (Julia Silge)
- What Programming Languages Are Used Most on Weekends? (Julia Silge)
- Developer Hiring Trends in 2017 (Alyssa Mazzina and Julia Silge)
- And the Most Realistic Developer in Fiction is… (Julia Silge)
- A Dive Into Stack Overflow Jobs Search (Aurélien Gasser)
- New Kids on the Block: Understanding Developers Entering the Workforce Today (Julia Silge)
I’ve really enjoyed sharing these snapshots of the software developer world, and I’m looking forward to sharing a lot more on the blog this next year.
Teaching R at Stack Overflow
Last year I mentioned that part of my work has been developing data science architecture, and trying to spread the use of R at the company.
This also has involved building R tutorials and writing “onboarding” materials… My hope is that as the data team grows and as more engineers learn R, this ecosystem of packages and guides can grow into a true internal data science platform.
At the time, R was used mostly by three of us on the data team (Jason Punyon, Nick Larsen, and me). I’m excited to say it’s grown since then, and not just because of my evangelism.
"I've been thinking of switching to R, do you have any opinions on that?" he asked me at lunch, ill-advisedly— David Robinson (@drob) March 1, 2017
Every Friday since last September, I’ve met with a group of developers to run internal “R sessions”, in which we analyze some of our data to develop insights and models. Together we’ve made discoveries that have led to real projects and features, for both the Data Team and other parts of the engineering department.
There are about half a dozen developers who regularly take part, and they all do great work. But I especially appreciate Ian Allen and Jisoo Shin for coming up with the idea of these sessions back in September, and for following through in the months since. Ian and Jisoo joined the company last summer, and were interested in learning R to complement their development of product features. Their curiosity, and that of others in the team, has helped prove that data analysis can be a part of every engineer’s workflow.
Writing production code
My relationship to production code (the C# that runs the actual Stack Overflow website) has also changed. In my first year I wrote much more R code than C#, but in the second I’ve stopped writing C# entirely. (My last commit to production was more than a year ago, and I often go weeks without touching my Windows partition). This wasn’t really a conscious decision; it came from a gradual shift in my role on the engineering team. I’d usually rather be analyzing data than shipping features, and focusing entirely on R rather than splitting attention across languages has been helpful for my productivity.
Instead, I work with engineers to implement product changes based on analyses and push models into production. One skill I’ve had to work on is writing technical specifications, both for data sources that I need to query and for models that I’m proposing for production. One developer I’d like to acknowledge specifically is Nick Larsen, who works with me on the Data Team. Many of the blog posts I mention above answer questions like “What tags are visited in New York vs San Francisco”, or “What tags are visited at what hour of the day”, and these wouldn’t have been possible without Nick. Until recently, this kind of traffic data was very hard to extract and analyze, but he developed processes that extract and transform the data into more readily queryable tables. This has made many important analyses possible besides the blog posts, and I can’t appreciate this work enough.
(Nick also recently wrote an awesome post, How to talk about yourself in a developer interview, that’s worth checking out).
Working with other teams
One team that I’ve worked with that I hadn’t in the first year is Display Ads. Display Ads are separate from job ads, and are purchased by companies with developer-focused products and services.
For example, I’ve been excited to work more closely with Steve Feldman on the Display Ad Operations team. If you’re wondering why I’m not ashamed to work on ads, please read Steve’s blog post on how we sell display ads at Stack Overflow; he explains it better than I could. We’ve worked on several new methods for display ad targeting and evaluation, and I think there’s a lot of potential for data to have a positive impact for the company.
Changes in the rest of my career
There’ve been other changes in my second year out of academia. In my first year, I attended only one conference (NYR 2016) but I’ve since had more of a chance to travel, including to useR and JSM 2017, PLOTCON, rstudio::conf 2017, and NYR 2017. I spoke at a few of these, about my broom package, about gganimate and about the history of R as seen by Stack Overflow.
Julia and I wrote and published an O’Reilly book, Text Mining with R (now available on Amazon and free online here). I also self-published an e-book, Introduction to Empirical Bayes: Examples from Baseball Statistics, based on a series of blog posts. I really enjoyed the experience of turning blog posts into a larger narrative, and I’d like to continue doing so this next year.
There are some goals I didn’t achieve. I’ve had a longstanding interest in getting R into production (and we’ve idly investigated some approaches like Microsoft R Server), but as of now we’re still productionizing models by rewriting them in C#. And there are many teams at Stack Overflow that I’d like to give better support to; prioritizing the Data Team’s time has been a challenge, though having a second data scientist has helped greatly. But I’m still happy with how my work has gone, and excited about the future.
In any case, this made the whole year worthwhile: