# setup code: load packages (install first with install.packages("tidyverse") if needed)
library(tidyverse)
A typical data science project:
In the next 1.5 days, we learn
the life cycle of a data science project
some R ecosystems (tidyverse, tidymodels, dml) for Open Data Science
basic machine learning
policy evaluation using double machine learning
Dr. Roch Nianogo will lead the second part, from June 25 afternoon to June 26, with in-depth discussions of simulation modeling, causal inference, and linking data science and systems science.
All course materials are available on GitHub. During the course, you can
read the static tutorial pages, make comments, and ask questions; or
interactively run qmd files in RStudio on Posit Cloud (sign up for a free account); or
interactively run ipynb files in Jupyter Notebook on Binder (can be slow)
Adventurous readers can reproduce, improve, and generalize all the examples on their own computers with the following steps:
git clone the course repository
revise and render the qmd files
Please feel free to ask questions and make comments. You can
use the “raise hand” feature (✋) in Zoom
type your questions in the Zoom chat (💬)
make comments or ask questions on tutorial pages (requires signing up for an account on hypothes.is)
US Constitution Article I, Sections 2 and 9:
The actual enumeration shall be made within three years after the first meeting of the Congress of the United States, and within every subsequent term of ten years, in such manner as they shall by law direct.
Decennial census data. Administered every 10 years (1790, 1800, …, 2010, 2020) by the United States Census Bureau. It is a complete enumeration of the US population to assist with apportionment, and asks a limited set of questions on race, ethnicity, age, sex, and housing tenure.
American Community Survey (ACS). Before the 2010 decennial Census, 1 in 6 Americans also received the Census long form, which asked a wider range of demographic questions on income, education, language, housing, and more. The Census long form has since been replaced by the American Community Survey, which is now the premier source of detailed demographic information about the US population. The ACS is mailed to approximately 3.5 million households per year (around 3% of the US population), allowing for annual data updates. The Census Bureau releases two ACS datasets to the public:
1-year ACS: covers areas of population 65,000 and greater
5-year ACS: moving average of data over a 5-year period that covers geographies down to the Census block group.
The Current Population Survey (CPS), sponsored jointly by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics (BLS), is the primary source of labor force statistics for the population of the United States.
The CPS is one of the oldest, largest, and most well-recognized surveys in the United States. It is immensely important, providing information on many of the things that define us as individuals and as a society – our work, our earnings, and our education.
In addition to being the primary source of monthly labor force statistics, the CPS is used to collect data for a variety of other studies that keep the nation informed of the economic and social well-being of its people. This is done by adding a set of supplemental questions to the monthly basic CPS questions. Supplemental inquiries vary month to month and cover a wide variety of topics such as child support, volunteerism, health insurance coverage, school enrollment, and food security. A listing and brief description of the CPS supplements are available here.
Take the CPS Food Security Supplement December 2021 Public-Use Microdata File as an example. The Food Security Supplement was completed for 30,343 interviewed households with 71,571 person records.
The microdata file includes data in four general categories:
Food Security Supplement Questionnaire includes the following major sections:
Note that from 2015 through 2021, the Census Bureau changed how it processes some variables relative to previous years. Details are in the technical documentation, available here.
WIC administrative data.
The Tidyverse suite of packages creates and uses data structures, functions, and operators to make working with data more intuitive. The two most basic changes are the use of pipes and tibbles.
for influential work in statistical computing, visualization, graphics, and data analysis; for developing and implementing an impressively comprehensive computational infrastructure for data analysis through R software; for making statistical thinking and computing accessible to large audience; and for enhancing an appreciation for the important role of statistics among data scientists.
Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.
To make R code more human-readable, the Tidyverse tools use the pipe, %>%, which comes from the magrittr package and is installed automatically with the Tidyverse. The pipe passes the output of one command as the input to the next command, instead of nesting function calls.
# A single command
sqrt(83)
[1] 9.110434
# Base R method of running more than one command
round(sqrt(83), digits = 2)
[1] 9.11
# Running more than one command with piping
sqrt(83) %>% round(digits = 2)
[1] 9.11
The pipe represents a much easier way of writing and deciphering R code, and we will be taking advantage of it for all future activities.
R 4.1.0 introduced a native pipe operator, |>, which is mostly compatible with the pipe %>% offered by the tidyverse package magrittr. For some subtle differences, see this post by Hadley Wickham.
# R base pipe
sqrt(83) |> round(digits = 2)
[1] 9.11
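One of those subtle differences is the placeholder syntax. The sketch below assumes R 4.2.0 or later (the release that added the `_` placeholder to the native pipe) and an installed magrittr; the object names are illustrative. It pipes the number of digits, rather than the value, into round().

```r
library(magrittr)

# magrittr: `.` marks where the left-hand side is inserted
via_magrittr <- 2 %>% round(sqrt(83), digits = .)

# Native pipe (assumes R >= 4.2.0): `_` must be supplied to a named argument
via_native <- 2 |> round(sqrt(83), digits = _)

via_magrittr
via_native
```

Both calls are equivalent to round(sqrt(83), digits = 2).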
A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have column names that are not normally allowed, such as numbers/symbols.
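The point about column names can be checked with a small sketch. The column names and values here are made up for illustration; the comparison relies on the default check.names = TRUE behavior of data.frame().

```r
library(tibble)

# data.frame() repairs non-syntactic names by default (check.names = TRUE)
df <- data.frame(`2020` = 1:3, `growth %` = c(0.1, 0.2, 0.3))
names(df)

# tibble() keeps the names exactly as written
tb <- tibble(`2020` = 1:3, `growth %` = c(0.1, 0.2, 0.3))
names(tb)

# Backticks are needed to refer to such columns
tb$`growth %`
```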
The main differences between tibbles and data.frames relate to printing and subsetting.
iris is a data frame available in base R.
# By default, R displays ALL rows of a regular data frame!
iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
21 5.4 3.4 1.7 0.2 setosa
22 5.1 3.7 1.5 0.4 setosa
23 4.6 3.6 1.0 0.2 setosa
24 5.1 3.3 1.7 0.5 setosa
25 4.8 3.4 1.9 0.2 setosa
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
29 5.2 3.4 1.4 0.2 setosa
30 4.7 3.2 1.6 0.2 setosa
31 4.8 3.1 1.6 0.2 setosa
32 5.4 3.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
36 5.0 3.2 1.2 0.2 setosa
37 5.5 3.5 1.3 0.2 setosa
38 4.9 3.6 1.4 0.1 setosa
39 4.4 3.0 1.3 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
57 6.3 3.3 4.7 1.6 versicolor
58 4.9 2.4 3.3 1.0 versicolor
59 6.6 2.9 4.6 1.3 versicolor
60 5.2 2.7 3.9 1.4 versicolor
61 5.0 2.0 3.5 1.0 versicolor
62 5.9 3.0 4.2 1.5 versicolor
63 6.0 2.2 4.0 1.0 versicolor
64 6.1 2.9 4.7 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor
66 6.7 3.1 4.4 1.4 versicolor
67 5.6 3.0 4.5 1.5 versicolor
68 5.8 2.7 4.1 1.0 versicolor
69 6.2 2.2 4.5 1.5 versicolor
70 5.6 2.5 3.9 1.1 versicolor
71 5.9 3.2 4.8 1.8 versicolor
72 6.1 2.8 4.0 1.3 versicolor
73 6.3 2.5 4.9 1.5 versicolor
74 6.1 2.8 4.7 1.2 versicolor
75 6.4 2.9 4.3 1.3 versicolor
76 6.6 3.0 4.4 1.4 versicolor
77 6.8 2.8 4.8 1.4 versicolor
78 6.7 3.0 5.0 1.7 versicolor
79 6.0 2.9 4.5 1.5 versicolor
80 5.7 2.6 3.5 1.0 versicolor
81 5.5 2.4 3.8 1.1 versicolor
82 5.5 2.4 3.7 1.0 versicolor
83 5.8 2.7 3.9 1.2 versicolor
84 6.0 2.7 5.1 1.6 versicolor
85 5.4 3.0 4.5 1.5 versicolor
86 6.0 3.4 4.5 1.6 versicolor
87 6.7 3.1 4.7 1.5 versicolor
88 6.3 2.3 4.4 1.3 versicolor
89 5.6 3.0 4.1 1.3 versicolor
90 5.5 2.5 4.0 1.3 versicolor
91 5.5 2.6 4.4 1.2 versicolor
92 6.1 3.0 4.6 1.4 versicolor
93 5.8 2.6 4.0 1.2 versicolor
94 5.0 2.3 3.3 1.0 versicolor
95 5.6 2.7 4.2 1.3 versicolor
96 5.7 3.0 4.2 1.2 versicolor
97 5.7 2.9 4.2 1.3 versicolor
98 6.2 2.9 4.3 1.3 versicolor
99 5.1 2.5 3.0 1.1 versicolor
100 5.7 2.8 4.1 1.3 versicolor
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
106 7.6 3.0 6.6 2.1 virginica
107 4.9 2.5 4.5 1.7 virginica
108 7.3 2.9 6.3 1.8 virginica
109 6.7 2.5 5.8 1.8 virginica
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2.0 virginica
112 6.4 2.7 5.3 1.9 virginica
113 6.8 3.0 5.5 2.1 virginica
114 5.7 2.5 5.0 2.0 virginica
115 5.8 2.8 5.1 2.4 virginica
116 6.4 3.2 5.3 2.3 virginica
117 6.5 3.0 5.5 1.8 virginica
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
120 6.0 2.2 5.0 1.5 virginica
121 6.9 3.2 5.7 2.3 virginica
122 5.6 2.8 4.9 2.0 virginica
123 7.7 2.8 6.7 2.0 virginica
124 6.3 2.7 4.9 1.8 virginica
125 6.7 3.3 5.7 2.1 virginica
126 7.2 3.2 6.0 1.8 virginica
127 6.2 2.8 4.8 1.8 virginica
128 6.1 3.0 4.9 1.8 virginica
129 6.4 2.8 5.6 2.1 virginica
130 7.2 3.0 5.8 1.6 virginica
131 7.4 2.8 6.1 1.9 virginica
132 7.9 3.8 6.4 2.0 virginica
133 6.4 2.8 5.6 2.2 virginica
134 6.3 2.8 5.1 1.5 virginica
135 6.1 2.6 5.6 1.4 virginica
136 7.7 3.0 6.1 2.3 virginica
137 6.3 3.4 5.6 2.4 virginica
138 6.4 3.1 5.5 1.8 virginica
139 6.0 3.0 4.8 1.8 virginica
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
# Convert iris to a tibble
iris_tb <- as_tibble(iris)
iris_tb
# A tibble: 150 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# ℹ 140 more rows
# If subsetting a single column from a data.frame, R will output a vector
iris[, "Sepal.Length"]
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
[19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
[37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
[55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
[73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
[91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
# If subsetting a single column from a tibble, R will output a tibble
iris_tb[, "Sepal.Length"]
# A tibble: 150 × 1
Sepal.Length
<dbl>
1 5.1
2 4.9
3 4.7
4 4.6
5 5
6 5.4
7 4.6
8 5
9 4.4
10 4.9
# ℹ 140 more rows
Also note that if you use piping to subset a tibble, the notation is slightly different: a placeholder . is required before the [ ] or $.
# Return a vector
iris_tb$Sepal.Length
iris_tb[["Sepal.Length"]]
iris_tb[[1]]

# Return a tibble
iris_tb[, "Sepal.Length"]
iris_tb[, 1]

# Use piping
iris_tb %>% .$Sepal.Length
iris_tb %>% .[, "Sepal.Length"]
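The subsetting rules above can be verified by checking the type of each result instead of printing it. This is a minimal sketch; the drop = FALSE line is an addition showing how a base data frame can be made to behave like a tibble.

```r
library(tibble)

iris_tb <- as_tibble(iris)

# Single-bracket column subsetting: a data frame drops to a vector, a tibble does not
is.data.frame(iris[, "Sepal.Length"])
is_tibble(iris_tb[, "Sepal.Length"])

# drop = FALSE keeps the base data frame as a data frame
is.data.frame(iris[, "Sepal.Length", drop = FALSE])

# [[ ]] and $ return a plain vector for both
is.numeric(iris_tb[["Sepal.Length"]])
identical(iris_tb$Sepal.Length, iris[["Sepal.Length"]])
```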
The most useful tool in the tidyverse is dplyr. It’s a Swiss-army knife for data wrangling. dplyr has many handy functions that we recommend incorporating into your analysis.
arrange() changes the ordering of the rows.
filter() picks cases based on their values.
distinct() removes duplicate entries.
slice_*() selects rows by position.
select() extracts columns and returns a tibble.
mutate() adds new variables that are functions of existing variables.
rename() changes the name of one or more columns.
pull() extracts a single column as a vector.
group_by() groups data by one or more variables, so that later verbs operate within each group.
summarise() reduces multiple values down to a single summary.
*_join() functions merge two data frames together, including inner_join(), left_join(), right_join(), and full_join().
The Posit dplyr cheat sheet is extremely intuitive and helpful.
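A few of these verbs, rename(), slice_head(), distinct(), and pull(), can be tried on the built-in iris data. This is a minimal sketch; the object names are illustrative only.

```r
library(dplyr)

iris_tb <- as_tibble(iris)

# rename(): new_name = old_name
iris_renamed <- iris_tb |> rename(species = Species)

# slice_head(): first n rows by position
first_three <- iris_renamed |> slice_head(n = 3)

# distinct() + pull(): unique species as a plain character vector
species_levels <- iris_renamed |>
  distinct(species) |>
  pull(species) |>
  as.character()

first_three
species_levels
```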
Some examples of using dplyr functions.
Keep the rows with Sepal.Length greater than 5.0, arrange the data by Sepal.Length in descending order, and create a new column Sepal.Length_2 that is the square of Sepal.Length.
iris_tb |>
  filter(Sepal.Length > 5.0) |>
  arrange(desc(Sepal.Length)) |>
  mutate(Sepal.Length_2 = Sepal.Length^2) |>
  print()
# A tibble: 118 × 6
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_2
<dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 7.9 3.8 6.4 2 virginica 62.4
2 7.7 3.8 6.7 2.2 virginica 59.3
3 7.7 2.6 6.9 2.3 virginica 59.3
4 7.7 2.8 6.7 2 virginica 59.3
5 7.7 3 6.1 2.3 virginica 59.3
6 7.6 3 6.6 2.1 virginica 57.8
7 7.4 2.8 6.1 1.9 virginica 54.8
8 7.3 2.9 6.3 1.8 virginica 53.3
9 7.2 3.6 6.1 2.5 virginica 51.8
10 7.2 3.2 6 1.8 virginica 51.8
# ℹ 108 more rows
Select the column Species, and find the distinct values of Species.
iris_tb |>
  select(Species) |>
  distinct()
# A tibble: 3 × 1
Species
<fct>
1 setosa
2 versicolor
3 virginica
Count the number of rows for each Species.
iris_tb |>
  group_by(Species) |>
  summarize(n = n())
# A tibble: 3 × 2
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
# Shortcut for group_by() |> summarize(n = n())
iris_tb |>
  count(Species)
# A tibble: 3 × 2
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50
Find the mean of Sepal.Length for each Species.
iris_tb |>
  group_by(Species) |>
  summarize(mean_Sepal_Length = mean(Sepal.Length))
# A tibble: 3 × 2
Species mean_Sepal_Length
<fct> <dbl>
1 setosa 5.01
2 versicolor 5.94
3 virginica 6.59
Find the row(s) with the maximum Sepal.Length for each Species.
iris_tb |>
  group_by(Species) |>
  slice_max(Sepal.Length)
# A tibble: 3 × 5
# Groups: Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.8 4 1.2 0.2 setosa
2 7 3.2 4.7 1.4 versicolor
3 7.9 3.8 6.4 2 virginica
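Two details of slice_max() are easy to miss: ties are kept by default, and the result stays grouped. The sketch below uses a made-up toy tibble (the names scores, group, and value are hypothetical) to show both.

```r
library(dplyr)

# Hypothetical toy data with a tie for the maximum in group "a"
scores <- tibble(
  group = c("a", "a", "a", "b", "b"),
  value = c(3, 5, 5, 1, 2)
)

# Default (with_ties = TRUE): both tied rows of group "a" are returned
with_ties <- scores |>
  group_by(group) |>
  slice_max(value, n = 1)

# with_ties = FALSE returns exactly one row per group;
# ungroup() removes the grouping from the result
one_per_group <- scores |>
  group_by(group) |>
  slice_max(value, n = 1, with_ties = FALSE) |>
  ungroup()

nrow(with_ties)
nrow(one_per_group)
```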
(x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
))
# A tibble: 3 × 2
key val_x
<dbl> <chr>
1 1 x1
2 2 x2
3 3 x3
(y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
))
# A tibble: 3 × 2
key val_y
<dbl> <chr>
1 1 y1
2 2 y2
3 4 y3
inner_join(x, y, by = "key")
# A tibble: 2 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
An outer join keeps observations that appear in at least one of the tables.
Three types of outer joins: left join, right join, and full join.
A left join keeps all observations in x.
left_join(x, y, by = "key")
# A tibble: 3 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
A right join keeps all observations in y.
right_join(x, y, by = "key")
# A tibble: 3 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 4 <NA> y3
A full join keeps all observations in x and y.
full_join(x, y, by = "key")
# A tibble: 4 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 3 x3 <NA>
4 4 <NA> y3
x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  2, "x3",
  1, "x4"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2"
)
left_join(x, y, by = "key")
# A tibble: 4 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 2 x3 y2
4 1 x4 y1
x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  2, "x3",
  3, "x4"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  2, "y3",
  3, "y4"
)
left_join(x, y, by = "key")
Warning in left_join(x, y, by = "key"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
# A tibble: 6 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 2 x2 y3
4 2 x3 y2
5 2 x3 y3
6 3 x4 y4
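As the warning text itself suggests, the join can be told that duplicate keys on both sides are intended. The following is a minimal sketch, assuming a dplyr version that supports the relationship argument (1.1.0 or later); the name joined is illustrative.

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  2, "x3",
  3, "x4"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  2, "y3",
  3, "y4"
)

# Declare that duplicates on both sides are expected (assumes dplyr >= 1.1.0)
joined <- left_join(x, y, by = "key", relationship = "many-to-many")
nrow(joined)
```

Stating the expected relationship explicitly documents the intent; when duplicate keys are not expected, the warning is usually a sign that the keys need checking before joining.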
semi_join(x, y) keeps the rows in x that have a match in y.
x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)
semi_join(x, y, by = "key")
# A tibble: 2 × 2
key val_x
<dbl> <chr>
1 1 x1
2 2 x2
anti_join(x, y) keeps the rows in x that don't have a match in y.
anti_join(x, y, by = "key")
# A tibble: 1 × 2
key val_x
<dbl> <chr>
1 3 x3
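Filtering joins are not symmetric, so running anti_join() in both directions is a quick way to list unmatched keys on each side before merging. This is a sketch with the same x and y as above; the names only_in_x and only_in_y are illustrative.

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

# Keys that appear in x but not in y
only_in_x <- anti_join(x, y, by = "key") |> pull(key)

# Keys that appear in y but not in x
only_in_y <- anti_join(y, x, by = "key") |> pull(key)

only_in_x
only_in_y
```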
ggplot2 is a powerful and flexible package for creating plots in R.
The basic idea is to map data to aesthetics (color, shape, size, etc.) and then add layers (points, lines, bars, etc.) to the plot.
ggplot(iris_tb) +
  # Scatter plot
  geom_point(mapping = aes(
    x = Petal.Length,
    y = Petal.Width,
    color = Species
  )) +
  # Add one more layer of line plot
  geom_smooth(mapping = aes(
    x = Petal.Length,
    y = Petal.Width,
    color = Species
  )) +
  # Axis labels, title, caption, etc
  labs(
    x = "Petal Length (cm)",
    y = "Petal Width (cm)"
  )
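When several layers share the same aesthetics, the mapping can be stated once in ggplot() and inherited by every layer. The sketch below is a variation on the plot above, not the original code: it swaps the default smoother for a linear fit (method = "lm") and adds a title, both chosen here purely for illustration.

```r
library(ggplot2)

# Aesthetics set in ggplot() are inherited by all layers
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  # Assumption for illustration: a straight-line fit without confidence bands
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Petal Length (cm)",
    y = "Petal Width (cm)",
    title = "Petal width vs. petal length by species"  # illustrative title
  )

p
```

A layer can still override an inherited aesthetic by passing its own aes() to the geom.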
Again, the Posit ggplot2 cheat sheet is extremely intuitive and helpful.
Interactive plots are easily created with the plotly package.
Interactive apps can be developed with the popular shiny package.
Find the median Petal.Length of each Iris species.
# try to use pipe
Plot the histograms of Petal.Length, grouped by Species.
# geom_histogram