Thank you for attending the 2020 New York R Conference. To see what the most recent virtual conference was like, you can keep scrolling.

Conference Program

Speakers

Andrew Gelman

Professor,
Department of Statistics and Department of Political Science, Columbia University
@StatModeling

Emily Robinson

Senior Data Scientist,
Warby Parker
@robinson_es

Max Kuhn

Scientist,
RStudio
@topepos

Heather Nolis

Principal ML Engineer,
T-Mobile
@heatherklus

Jon Krohn

Chief Data Scientist,
untapt
@JonKrohnLearns

Jared P. Lander

Chief Data Scientist,
Lander Analytics
@jaredlander

Ludmila Janda

Data Scientist,
Amplify
@ludmila_janda

David Robinson

Principal Data Scientist,
Heap
@drob

Erin LeDell

Chief Machine Learning Scientist,
H2O.ai
@ledell

Andreas Mueller

Principal Engineer,
Microsoft
@amuellerml

Jonah Gabry

Researcher in Statistics and Stan Developer,
Columbia University
@mcmc_stan

Rob J Hyndman

Head of the Department of Econometrics & Business Statistics,
Monash University
@robjhyndman

Wes McKinney

Director,
Ursa Labs
@wesmckinn

Jacqueline Nolis

Principal Data Scientist,
Brightloom
@skyetetra

Dan Chen

Doctoral Candidate,
Virginia Tech
@chendaniely

Vivian Peng

Data Scientist,
Lander Analytics
@create_self

Emily Dodwell

Principal Inventive Scientist,
AT&T Labs Research
@emdodwell

Camelia Hssaine

Data Scientist,
Codecademy
@cameliacassetet

Brooke Watson Madubuonwu

Director of Legal Analytics & Quantitative Research,
ACLU
@brookLYNevery1

Adam Obeng

Research Scientist,
Facebook
@Adam_Obeng

Kaz Sakamoto

Data Scientist,
Lander Analytics
@urbandigitized

Amanda Dobbyn

Data Engineer,
Deck
@dobbleobble

David Smith

Cloud Advocate,
Microsoft
@revodavid

Laura Gabrysiak

Data Science Manager,
Visa & R-Ladies Miami
@lauragabrysiak

Monica Thieu

PhD Student,
Department of Psychology, Columbia University
@monica_too_

Neal Richardson

Director of Engineering,
Ursa Labs
@enpiar

Sonia Ang

Microsoft Advanced Analytics and AI, Sr. Cloud Solutions Architect,
Microsoft
@galleontrade

Sebastian Teran Hidalgo

Data Scientist,
Vroom
@steranhidalgo

Tom Mock

Customer Success Rep,
RStudio
@thomas_mock

Virtual Stand-Up Comedy

Workshops

Join Max Kuhn on a tour through Machine Learning in R, with emphasis on using the software as opposed to general explanations of model building. This workshop is an abbreviated introduction to the tidymodels framework for modeling. You'll learn about data preparation, model fitting, model assessment and predictions. The focus will be on data splitting and resampling, data pre-processing and feature engineering, model creation, evaluation, and tuning. Pre-requisites: some experience with modeling in R and the tidyverse (don't need to be experts); prior experience with lm is enough to get started and learn advanced modeling techniques. In case participants can’t install the packages on their machines, RStudio Server Pro instances will be available that are pre-loaded with the appropriate packages and GitHub repository.

Geospatial expert and Columbia Professor Kaz Sakamoto is leading this class on all things GIS. You'll learn all about map projections, spatial regression, plotting interactive heatmaps with leaflet and working with shapefiles. This course is designed for those who have familiarity with R and want to explore incorporating spatial data into their work. The AM session will be an introduction to Geographic Information Systems (GIS), spatial features (sf package), Coordinate Reference Systems (CRS), and map making basics. The PM session will introduce spatial operations, geometric operations, statistical geography, spatial point pattern analysis and geostatistics. By the end of the day participants should be able to read and work with spatial data, understand projections, use geoprocessing techniques, and understand basic spatial statistics.

Daniel Chen, author of Pandas for Everyone, has given multiple talks at the New York R Conference about the data science workflow. In this workshop he'll teach how to use Git and project management for better organization and faster iteration. This workshop will have four parts: 1) Git on Your Own, 2) Working with Remotes, 3) Git with Branches, and 4) Collaborating with Git. Part I will cover creating a git repository, adding and committing files, looking at differences between files, looking at your history, moving around your history, reverting changes, and undeleting files. Part II will cover going from your computer to a remote (e.g., GitHub, BitBucket, GitLab), syncing your files by pushing and pulling, and dealing with conflicts. Part III will cover creating branches, moving around different branches, making commits in branches, merging branches, using branches with remotes, pull requests (aka merge requests), merging pull requests, and syncing up with your remote. In Part IV, we will discuss how the skills you learned directly apply to collaboration with other people.

The tidyverse is a powerful collection of packages following a standard set of principles for usability. During this workshop David will demonstrate an exploratory data analysis in R using tidy tools. He will demonstrate the use of tools such as dplyr and ggplot2 for data transformation and visualization, as well as other packages from the tidyverse as they're needed. He'll narrate his thought process as attendees follow along and offer their own solutions. The workshop expects some familiarity with dplyr and ggplot2—enough to work with data using functions like mutate, group_by, and summarize and to create graphs like scatterplots or bar plots in ggplot2. These concepts will be re-introduced to ensure a smooth workshop, but it isn't designed for brand new R programmers. The workshop is designed to be interactive and participants are expected to type along on their own keyboards.

Andreas Mueller is a Principal Engineer at Microsoft and has been a core developer of scikit-learn for over 7 years. He was previously a Research Scientist at Columbia University. He's also the author of the book "Introduction to Machine Learning with Python", co-authored with Sarah Guido. The workshop will go through the basics of machine learning with Python, data representation and preprocessing, and then work through details of the scikit-learn API and how to build and evaluate machine learning models in Python. In particular, we will look at model selection and tuning with cross-validation and grid search, building complex machine learning workflows with pipelines, and how to evaluate classification models with a variety of metrics. The workshop requires working knowledge of numpy, matplotlib and pandas, and familiarity with working in Jupyter Notebooks.

During this workshop Amanda covers advanced web scraping with R, focusing on scraping dynamic sites using Selenium.

Introduction to data visualization, and the value of design to make data-informed decisions. Through case studies, we will explore how to frame business problems for design, understand target audience, and approach visualization. This course will cover Shiny at the introductory level.

Agenda

Registration & Opening Remarks: 8:00 AM - 9:00 AM EST

Open Registration: 8:00 AM - 8:50 AM EST
Opening Remarks: 8:50 AM - 9:00 AM EST

For over 50 years we have known that ensemble forecasts perform better than individual methods, yet they are not as widely used as they should be. Perhaps this is because users think it is more work, or that it is hard to get prediction intervals, or that it is difficult to determine the relative weights of the component methods. The fable package solves these problems and makes it easy to produce

Coming Soon

This talk will quantify various elements of R-Ladies NYC since the group’s start in 2017. We will look at things like attendance, talk topics, and book club books. Along the way, we will make visualizations, conduct some analyses, and consider some useful takeaways from the data on this group of women that use data.

Break & Networking: 10:10 AM - 10:40 AM EST

Have you ever had a “first this then that” question? For example, maybe you want all the times people clicked on an item and then added it to their cart, or the last page they visited before registering. This talk will introduce funneljoin, an R package that makes it easy to analyze sequences of events. I’ll illustrate how the powerful type argument lets you switch quickly between different kinds of funnels and then do a live demo of using funneljoin to analyze Stack Overflow R questions. After this talk, you’ll be able to specify and code any type of funnel in R.
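
As a rough illustration of the "first this then that" idea, here is a minimal Python sketch. It is not funneljoin's API: the function name `first_then_first_after`, the `(user, event_type, timestamp)` layout, and the sample events are all assumptions made for this example. It covers only one kind of funnel, pairing each user's first occurrence of one event with the first occurrence of another event that strictly follows it.

```python
def first_then_first_after(events, first_type, then_type):
    """Pair each user's first `first_type` event with the first
    `then_type` event that happens strictly after it.

    events: iterable of (user, event_type, timestamp) tuples (hypothetical layout).
    Returns {user: (first_timestamp, then_timestamp_or_None)}.
    """
    firsts = {}
    # Walk the events in time order so "first" and "first after" are well defined.
    for user, kind, ts in sorted(events, key=lambda e: e[2]):
        if kind == first_type and user not in firsts:
            firsts[user] = [ts, None]
        elif (kind == then_type and user in firsts
              and firsts[user][1] is None and ts > firsts[user][0]):
            firsts[user][1] = ts
    return {user: tuple(pair) for user, pair in firsts.items()}


# Hypothetical sample data: user "a" adds to cart once before ever clicking.
events = [
    ("a", "add_to_cart", 1),
    ("a", "click", 2),
    ("a", "click", 3),
    ("a", "add_to_cart", 5),
    ("a", "add_to_cart", 6),
    ("b", "click", 4),
]
funnel = first_then_first_after(events, "click", "add_to_cart")
print(funnel)
```

Users who never complete the second step keep `None` in the second slot, which mirrors a left join; dropping them instead would give the inner-join flavor of the same funnel.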

There are many ways to fit tree-based models in R, including the rpart, randomForest and xgboost packages. We compare their user interfaces and results to judge them on usability and accuracy.

Coming Soon

Lunch & Networking: 11:50 AM - 1:00 PM EST

This talk begins with a survey of the primary families of Deep Learning approaches: Convolutional Neural Networks, Recurrent Neural Networks, Generative Adversarial Networks, and Deep Reinforcement Learning. Via interactive demos, the meat of the talk will appraise the two leading Deep Learning libraries: TensorFlow and PyTorch. With respect to both model development and production deployment, the strengths and weaknesses of the two libraries will be covered – with a particular focus on the TensorFlow 2 release that formally integrates the easy-to-use, high-level Keras API into the library.

“Open science” is more than data sharing, replication, preregistration, partial pooling, and version control. “Doing statistics right” is more than swapping Bayesian methods for p-values. To resolve the larger problems of the push-a-button, take-a-pill model of science, engineering, and policy, we need to move toward collaboration between researchers, data analysts, and people who design studies and analyze the people whose data are being collected. But this in turn requires the ability to simulate, graph, and analyze data using flexible platforms such as Stan and R.

Break & Networking: 2:05 PM - 2:35 PM EST

As predictive models and machine learning become key components of production applications in every industry, an end-to-end Machine Learning Operations (MLOPS) process becomes critical for reliable and efficient deployment of applications that depend on R-based models. In this talk, I’ll outline the basics of the DevOps process and focus on the areas where MLOPS diverges. The talk will show the complete process of building and deploying an application driven by a machine learning model implemented with R. We will show the process of developing models, triggering model training on code changes, and triggering the CI/CD process for an application when a new version of a model is registered. We will use the Azure Machine Learning service and the “azuremlsdk” package to orchestrate the model training and management process, but the principles will apply to MLOPS processes generally, especially for applications that involve large amounts of data or require significant computing resources.

Abstract Coming Soon

Dual epidemics have dominated America in 2020: the global pandemic of COVID-19, and the ongoing epidemic of police brutality. Both issues have disproportionately impacted Black and Brown communities. These two issues collide in the COVID pandemic that plagues America’s jails, prisons, and detention facilities. This talk will discuss a pre-print and accompanying report that used R to model the epidemic spread of COVID-19 in US jails, and highlight how this work is being used to advocate for the public health benefits of a reduced carceral system.

Break & Networking: 3:45 PM - 4:15 PM EST

This talk will be a crash course on how to estimate causal treatment effects. Usually in a randomized experiment or A/B test it is easy to estimate the average causal effect of intervention A compared to B with a simple difference in means. However, in many situations randomized data is not available but a data scientist might still try to estimate the causal effect of A compared to B. Doubly robust estimators can estimate this causal effect, under certain assumptions, and allow the data scientist two chances to get the correct answer. This estimator will be built from the ground up using R.
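
To make the "two chances" idea concrete, here is a minimal Python sketch of one common doubly robust estimator, the augmented inverse-probability-weighted (AIPW) estimator. It is not the speaker's R code. The function names, the single discrete covariate `x`, and the toy data are assumptions for illustration. Both the propensity model and the outcome model are plain per-stratum averages, so the sketch needs every stratum to contain both treated and control units.

```python
from collections import defaultdict
from statistics import mean


def naive_difference(rows):
    """Difference in mean outcome between treated and control, ignoring x."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def aipw_ate(rows):
    """AIPW estimate of the average treatment effect.

    rows: list of (x, t, y) with a discrete covariate x, a binary
    treatment t, and an outcome y (hypothetical layout).
    """
    by_x = defaultdict(list)
    for x, t, y in rows:
        by_x[x].append((t, y))

    propensity, m1, m0 = {}, {}, {}
    for x, group in by_x.items():
        treated = [y for t, y in group if t == 1]
        control = [y for t, y in group if t == 0]
        propensity[x] = len(treated) / len(group)  # estimated P(T=1 | x)
        m1[x] = mean(treated)                      # estimated E[Y | T=1, x]
        m0[x] = mean(control)                      # estimated E[Y | T=0, x]

    scores = []
    for x, t, y in rows:
        e = propensity[x]
        # Outcome-model contrast plus inverse-probability-weighted residuals.
        scores.append(
            m1[x] - m0[x]
            + t * (y - m1[x]) / e
            - (1 - t) * (y - m0[x]) / (1 - e)
        )
    return mean(scores)


# Toy data where y = 2*t + 5*x, so the true effect is 2,
# and units with x = 1 are more likely to be treated.
rows = (
    [(0, 1, 2)] + [(0, 0, 0)] * 3
    + [(1, 1, 7)] * 3 + [(1, 0, 5)]
)
naive = naive_difference(rows)
ate = aipw_ate(rows)
print(naive, ate)
```

On this toy data the naive difference in means is inflated by the confounder, while the AIPW estimate recovers the effect built into the data. With richer covariates, the per-stratum averages would be replaced by fitted propensity and outcome models.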

After spending two years entrenched in building models on customer care text message conversations between our experts and our customers, our stakeholders had a “simple” request - “Do this, but for phone calls… Also, at 10 times the volume.” In this talk, I will walk through a few of the experiments we performed in an attempt to get our beautiful text-trained R-Keras models, presented at last year’s R Conference | NY, to work on real-time voice conversations at scale. This includes an overview of telephony audio data, the state of speech-to-text technology, and the things you didn’t know you need to know if you’re diving into speech for the first time.

Closing Remarks: 5:00 PM - 5:10 PM EST
Happy Hour with Mathematical Standup Comedian Rachel Lander: 5:20 PM - 6:35 PM EST

Open Registration: 9:00 AM - 9:50 AM EST
Opening Remarks: 9:50 AM - 10:00 AM EST

In 2016 I created Tweet Mashup, a website that lets you combine the tweets of two different people. After spending a year making it in .NET, when I launched the site it became an immediate sensation and was mentioned in places like the Verge. Years later, I was getting more and more frustrated maintaining the F# code and decided to see if I could recreate it in Shiny. Doing so would require having Shiny integrate with the Twitter API in ways that hadn’t been done by anyone before. Could I pull it off? Come to this talk to find out!

The Microsoft Research division has created a breakthrough automated ML capability. The approach combines ideas from collaborative filtering and Bayesian optimization to search an enormous space of possible machine learning pipelines intelligently and efficiently. It’s essentially a recommender system for machine learning pipelines: just as streaming services recommend movies to users, automated ML recommends machine learning pipelines for data sets. Automated ML is designed to generate pipelines without having to see the customer’s data, preserving privacy.

Break & Networking: 10:45 AM - 11:15 AM EST

Visualization of multivariate spatiotemporal data has represented a nontrivial challenge. Displaying the data visually requires the capacity for the user to examine different perspectives and make comparisons among and between features, including use of multiple displays to re-focus attention on each component and linked brushing to connect the plots. In this talk, Emily will discuss considerations for slicing on space and time to reveal spatial and temporal dependencies at varying granularities, e.g. day of week vs. hour of day and urban vs. rural. She will demonstrate a combination of graphical tools in R that support such an analysis, and their implementation to visualize the Australian bushfires that devastated parts of the country throughout the 2019-2020 season. Generous funding from ACEMS supported Emily’s visit to Melbourne this past spring for this collaboration with Professor Di Cook and Weihao Li of Monash University.

Abstract Coming Soon

Lunch & Networking: 12:25 PM - 1:35 PM EST

Abstract Coming Soon

The tidyverse offers powerful and usable tools for transformation, visualization, and modeling, based on the table as a shared data structure. Some data science tasks, however, are conceptually and computationally suited to “wide” matrices rather than “tidy” tables. Such operations include pairwise correlations, clustering, and dimensionality reduction, and generally any operation that compares across groups rather than within groups. In this talk I’ll introduce my widyr package, which fits those operations into the tidyverse by making the matrix operations invisible to the user. The package offers functions such as pairwise_count(), pairwise_cor(), and widely_svd(), each of which takes and returns a table. These functions are efficient but powerful, letting users answer questions like “which groups within this dataset are correlated” or “what are the most important principal components” as part of a tidy workflow. I’ll describe the widyr philosophy, share some examples of using widyr in a tidy analysis, and end with some glimpses of the future of the widyr package.

What happens when Bayesian (or non-Bayesian) models for multi-level and repeated measures designs are resampled using a full leave-subject-out scheme? Is the basic out-of-sample error consistent with the model-based estimate?

Break & Networking: 2:45 PM - 3:15 PM EST

The focus of this presentation is scalable and automatic machine learning in R using the H2O machine learning platform. H2O is an open source, distributed machine learning platform designed to scale to very large datasets that may not fit into RAM on a single machine. We will provide a brief overview of the field of Automatic Machine Learning, followed by a detailed look inside H2O’s AutoML algorithm. H2O AutoML provides an easy-to-use interface which automates data pre-processing, training and tuning a large selection of candidate models (including multiple stacked ensemble models for superior model performance), and due to the distributed nature of the H2O platform, H2O AutoML can scale to very large datasets. The result of the AutoML run is a “leaderboard” of H2O models which can be easily exported for use in production.

R continues to be a popular programming language that consistently attracts new users. Effective teaching of R concepts and techniques is critical for new R users to gain basic competence quickly, and ultimately stick with the language beyond the growing pains of a novice learner. I will discuss principles of scientific teaching and educational psychology, and how incorporating these ideas can help us teach coding more effectively. Attendees will learn how to apply these principles to create, revise, and deliver an R lesson on the topic of their choice.

A/B testing is the gold standard for experimentation, but in practice randomization is not always feasible when certain biases are in play. It is common then for data scientists to leverage quasi-experimental methods to measure average treatment effects. In this talk we’ll go through examples of industry applications within ridesharing and edtech, where methods such as difference-in-differences, synthetic controls and counterfactual analysis and estimation were used to infer causality.
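
Of the methods named above, difference-in-differences is the simplest to write down. The minimal Python sketch below uses made-up numbers and a hypothetical function name, `diff_in_diff`. It is not taken from the talk. The estimate is the change in the treated group minus the change in the control group. It can be read as a treatment effect only under the parallel-trends assumption, which says both groups would have changed by the same amount without the treatment.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Two-period difference-in-differences on group means."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    # Subtracting the control change removes the shared time trend.
    return treated_change - control_change


# Made-up outcomes for two groups, before and after a launch.
effect = diff_in_diff(
    treated_pre=[10, 12], treated_post=[15, 17],
    control_pre=[8, 10], control_post=[9, 11],
)
print(effect)
```

Here the treated group rises by 5 and the control group by 1, so the estimate attributes the remaining 4 to the treatment.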

Closing Remarks: 4:25 PM - 4:35 PM EST

Job Board

Our Sponsors would like to share some of the available roles that may be of interest to R Conference attendees. To apply, please visit the sponsor’s website and submit an application directly.

Codecademy




Sponsors

Platinum

RStudio
Microsoft Azure

Silver

Codecademy

Supporting

R Consortium
Pearson
CloudFactory
NausicaaDistribution
Springer
JetBrains
Chapman & Hall/CRC, Taylor & Francis Group
O'Reilly
Manning

Vibe

Matcha Bar/Hustle
Westland Distillery
Bruichladdich Distillery