Data Science and R

Author

Dr. Roch Nianogo, Bowen Zhang, Dr. Hua Zhou

Code
# setup code: load packages (install tidyverse first if needed)
library(tidyverse)

1 Tell me about you

Slido

2 Roadmap

A typical data science project:

2.1 Learning objectives

In the next 1.5 days, we will learn

  • the life cycle of a data science project

  • some R ecosystems (tidyverse, tidymodels, dml) for Open Data Science

  • basic machine learning

  • policy evaluation using double machine learning

Dr. Roch Nianogo will lead the second part, from June 25 afternoon to June 26, with in-depth discussions of simulation modeling, causal inference, and linking data science and systems science.

2.2 Course materials

All course materials are available on GitHub. During the course, you can

  • read the static tutorial pages, make comments, and ask questions; or

  • interactively run qmd files in RStudio on Posit Cloud (sign up for a free account); or

  • interactively run ipynb files in Jupyter Notebook on Binder (can be slow)

If you are adventurous, you can reproduce, improve, and generalize all the examples on your own computer by the following steps:

2.3 Questions please

Please feel free to ask questions and make comments. You can

  • use the “raise hand” feature (✋) in Zoom

  • type your questions in the Zoom chat (💬)

  • make comments or ask questions on tutorial pages (requires signing up for an account on hypothes.is)

3 Data source

Slido

Census Bureau

3.1 Census and ACS

US Constitution Article I, Sections 2 and 9:

The actual enumeration shall be made within three years after the first meeting of the Congress of the United States, and within every subsequent term of ten years, in such manner as they shall by law direct.

  • Decennial census data. Every 10 years (1790, 1800, …, 2010, 2020) administered by the United States Census Bureau. Complete enumeration of the US population to assist with apportionment. A limited set of questions on race, ethnicity, age, sex, and housing tenure.

  • American Community Survey (ACS). Before the 2010 decennial Census, 1 in 6 Americans also received the Census long form, which asked a wider range of demographic questions on income, education, language, housing, and more. The Census long form has since been replaced by the American Community Survey, which is now the premier source of detailed demographic information about the US population. The ACS is mailed to approximately 3.5 million households per year (around 3% of the US population), allowing for annual data updates. The Census Bureau releases two ACS datasets to the public:

    • 1-year ACS: covers areas of population 65,000 and greater

    • 5-year ACS: moving average of data over a 5-year period that covers geographies down to the Census block group.

3.2 Current population survey (CPS)

The Current Population Survey (CPS), sponsored jointly by the U.S. Census Bureau and the U.S. Bureau of Labor Statistics (BLS), is the primary source of labor force statistics for the population of the United States.

The CPS is one of the oldest, largest, and most well-recognized surveys in the United States. It is immensely important, providing information on many of the things that define us as individuals and as a society – our work, our earnings, and our education.

In addition to being the primary source of monthly labor force statistics, the CPS is used to collect data for a variety of other studies that keep the nation informed of the economic and social well-being of its people. This is done by adding a set of supplemental questions to the monthly basic CPS questions. Supplemental inquiries vary month to month and cover a wide variety of topics such as child support, volunteerism, health insurance coverage, school enrollment, and food security. A listing and brief description of the CPS supplements are available here.

3.3 Food security supplement (CPS-FSS)

Take the CPS Food Security Supplement December 2021 Public-Use Microdata File as an example. The Food Security Supplement was completed for 30,343 interviewed households with 71,571 person records.

The microdata file includes data in four general categories:

  • Monthly labor force survey data (geographic, demographic, income, employment)
  • Food Security Supplement data (household food expenditures, use of food assistance programs, experiences and behaviors related to food security)
  • Food security status
  • Weighting variables

The Food Security Supplement questionnaire includes the following major sections:

  • Food Spending
  • Minimum Food Spending Needed
  • Food Assistance Program Participation
  • Food Sufficiency and Food Security
  • Ways of Avoiding or Ameliorating Food Deprivation

Note that from 2015 through 2021, the Census Bureau changed how it processes some variables relative to previous years. Details are in the technical documentation, available here.

3.4 Other data sources

4 Introduction to R

Slido

4.1 Tidyverse

  • tidyverse is a collection of R packages for data ingestion, wrangling, and visualization.

The tidyverse suite of packages creates and uses data structures, functions, and operators that make working with data more intuitive. The two most basic changes are the use of pipes and tibbles.

  • The lead developer, Hadley Wickham, won the 2019 COPSS Presidents’ Award (often called the Nobel Prize of Statistics), with the citation:

for influential work in statistical computing, visualization, graphics, and data analysis; for developing and implementing an impressively comprehensive computational infrastructure for data analysis through R software; for making statistical thinking and computing accessible to large audience; and for enhancing an appreciation for the important role of statistics among data scientists.

4.1.1 Pipes

Stringing together commands in R can be quite daunting. Also, trying to understand code that has many nested functions can be confusing.

To make R code more human-readable, the tidyverse tools use the pipe, %>%, which comes from the magrittr package and is made available automatically when the tidyverse is loaded. The pipe passes the output of one command as the input to the next command, so that calls can be chained instead of nested.

# A single command
sqrt(83)
[1] 9.110434
# Base R method of running more than one command
round(sqrt(83), digits = 2)
[1] 9.11
# Running more than one command with piping
sqrt(83) %>% round(digits = 2)
[1] 9.11

The pipe represents a much easier way of writing and deciphering R code, and we will be taking advantage of it for all future activities.
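The benefit is more visible with a longer chain. The sketch below compares a three-step computation written as nested calls and as a pipeline. The vector x is made up for illustration, and magrittr is loaded directly here, although library(tidyverse) also makes %>% available.

```r
library(magrittr)

# Hypothetical input values, chosen only for illustration
x <- c(4, 9, 16, 83)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), digits = 2)

# The same steps as a pipeline, read from left to right
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(digits = 2)

# Both forms give the same result
identical(nested, piped)
```

In the piped version each line corresponds to one step, which makes it easier to add, remove, or reorder steps later.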

Tip

R 4.1.0 introduced a native pipe operator |>, which is mostly compatible with the pipe %>% offered by the tidyverse package magrittr. For some subtle differences, see this post by Hadley Wickham.

# R base pipe
sqrt(83) |> round(digits = 2)
[1] 9.11
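One of the differences concerns the placeholder used when the piped value should not go into the first argument. The sketch below assumes R 4.2.0 or later (the underscore placeholder is not available in R 4.1) and uses the built-in mtcars data and lm() purely as an illustration.

```r
library(magrittr)

# magrittr pipe: the placeholder is a dot
fit_magrittr <- mtcars %>% lm(mpg ~ wt, data = .)

# Native pipe: the placeholder is an underscore and must be passed
# to a named argument (requires R 4.2.0 or later)
fit_native <- mtcars |> lm(mpg ~ wt, data = _)

# The two fits have the same coefficients
all.equal(coef(fit_magrittr), coef(fit_native))
```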

4.1.2 Tibbles

A core component of the tidyverse is the tibble. Tibbles are a modern rework of the standard data.frame, with some internal improvements to make code more reliable. They are data frames, but do not follow all of the same rules. For example, tibbles can have column names that are not normally allowed, such as numbers/symbols.
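As a small illustration of the column-name point, the sketch below builds the same one-column table with data.frame() and with tibble(). The column name 2024 and the values are made up for this example.

```r
library(tibble)

# data.frame() repairs non-syntactic names by default (check.names = TRUE)
df <- data.frame(`2024` = c(1.2, 3.4))
names(df)

# tibble() keeps the name as written; refer to it with backticks
tb <- tibble(`2024` = c(1.2, 3.4))
names(tb)
tb$`2024`
```

Here the data frame column ends up named X2024, while the tibble column keeps the name 2024.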

The main differences between tibbles and data.frames relate to printing and subsetting.

  • iris is a data frame available in base R

# By default, R displays ALL rows of a regular data frame!
iris
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
7            4.6         3.4          1.4         0.3     setosa
8            5.0         3.4          1.5         0.2     setosa
9            4.4         2.9          1.4         0.2     setosa
10           4.9         3.1          1.5         0.1     setosa
11           5.4         3.7          1.5         0.2     setosa
12           4.8         3.4          1.6         0.2     setosa
13           4.8         3.0          1.4         0.1     setosa
14           4.3         3.0          1.1         0.1     setosa
15           5.8         4.0          1.2         0.2     setosa
16           5.7         4.4          1.5         0.4     setosa
17           5.4         3.9          1.3         0.4     setosa
18           5.1         3.5          1.4         0.3     setosa
19           5.7         3.8          1.7         0.3     setosa
20           5.1         3.8          1.5         0.3     setosa
21           5.4         3.4          1.7         0.2     setosa
22           5.1         3.7          1.5         0.4     setosa
23           4.6         3.6          1.0         0.2     setosa
24           5.1         3.3          1.7         0.5     setosa
25           4.8         3.4          1.9         0.2     setosa
26           5.0         3.0          1.6         0.2     setosa
27           5.0         3.4          1.6         0.4     setosa
28           5.2         3.5          1.5         0.2     setosa
29           5.2         3.4          1.4         0.2     setosa
30           4.7         3.2          1.6         0.2     setosa
31           4.8         3.1          1.6         0.2     setosa
32           5.4         3.4          1.5         0.4     setosa
33           5.2         4.1          1.5         0.1     setosa
34           5.5         4.2          1.4         0.2     setosa
35           4.9         3.1          1.5         0.2     setosa
36           5.0         3.2          1.2         0.2     setosa
37           5.5         3.5          1.3         0.2     setosa
38           4.9         3.6          1.4         0.1     setosa
39           4.4         3.0          1.3         0.2     setosa
40           5.1         3.4          1.5         0.2     setosa
41           5.0         3.5          1.3         0.3     setosa
42           4.5         2.3          1.3         0.3     setosa
43           4.4         3.2          1.3         0.2     setosa
44           5.0         3.5          1.6         0.6     setosa
45           5.1         3.8          1.9         0.4     setosa
46           4.8         3.0          1.4         0.3     setosa
47           5.1         3.8          1.6         0.2     setosa
48           4.6         3.2          1.4         0.2     setosa
49           5.3         3.7          1.5         0.2     setosa
50           5.0         3.3          1.4         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
54           5.5         2.3          4.0         1.3 versicolor
55           6.5         2.8          4.6         1.5 versicolor
56           5.7         2.8          4.5         1.3 versicolor
57           6.3         3.3          4.7         1.6 versicolor
58           4.9         2.4          3.3         1.0 versicolor
59           6.6         2.9          4.6         1.3 versicolor
60           5.2         2.7          3.9         1.4 versicolor
61           5.0         2.0          3.5         1.0 versicolor
62           5.9         3.0          4.2         1.5 versicolor
63           6.0         2.2          4.0         1.0 versicolor
64           6.1         2.9          4.7         1.4 versicolor
65           5.6         2.9          3.6         1.3 versicolor
66           6.7         3.1          4.4         1.4 versicolor
67           5.6         3.0          4.5         1.5 versicolor
68           5.8         2.7          4.1         1.0 versicolor
69           6.2         2.2          4.5         1.5 versicolor
70           5.6         2.5          3.9         1.1 versicolor
71           5.9         3.2          4.8         1.8 versicolor
72           6.1         2.8          4.0         1.3 versicolor
73           6.3         2.5          4.9         1.5 versicolor
74           6.1         2.8          4.7         1.2 versicolor
75           6.4         2.9          4.3         1.3 versicolor
76           6.6         3.0          4.4         1.4 versicolor
77           6.8         2.8          4.8         1.4 versicolor
78           6.7         3.0          5.0         1.7 versicolor
79           6.0         2.9          4.5         1.5 versicolor
80           5.7         2.6          3.5         1.0 versicolor
81           5.5         2.4          3.8         1.1 versicolor
82           5.5         2.4          3.7         1.0 versicolor
83           5.8         2.7          3.9         1.2 versicolor
84           6.0         2.7          5.1         1.6 versicolor
85           5.4         3.0          4.5         1.5 versicolor
86           6.0         3.4          4.5         1.6 versicolor
87           6.7         3.1          4.7         1.5 versicolor
88           6.3         2.3          4.4         1.3 versicolor
89           5.6         3.0          4.1         1.3 versicolor
90           5.5         2.5          4.0         1.3 versicolor
91           5.5         2.6          4.4         1.2 versicolor
92           6.1         3.0          4.6         1.4 versicolor
93           5.8         2.6          4.0         1.2 versicolor
94           5.0         2.3          3.3         1.0 versicolor
95           5.6         2.7          4.2         1.3 versicolor
96           5.7         3.0          4.2         1.2 versicolor
97           5.7         2.9          4.2         1.3 versicolor
98           6.2         2.9          4.3         1.3 versicolor
99           5.1         2.5          3.0         1.1 versicolor
100          5.7         2.8          4.1         1.3 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica
104          6.3         2.9          5.6         1.8  virginica
105          6.5         3.0          5.8         2.2  virginica
106          7.6         3.0          6.6         2.1  virginica
107          4.9         2.5          4.5         1.7  virginica
108          7.3         2.9          6.3         1.8  virginica
109          6.7         2.5          5.8         1.8  virginica
110          7.2         3.6          6.1         2.5  virginica
111          6.5         3.2          5.1         2.0  virginica
112          6.4         2.7          5.3         1.9  virginica
113          6.8         3.0          5.5         2.1  virginica
114          5.7         2.5          5.0         2.0  virginica
115          5.8         2.8          5.1         2.4  virginica
116          6.4         3.2          5.3         2.3  virginica
117          6.5         3.0          5.5         1.8  virginica
118          7.7         3.8          6.7         2.2  virginica
119          7.7         2.6          6.9         2.3  virginica
120          6.0         2.2          5.0         1.5  virginica
121          6.9         3.2          5.7         2.3  virginica
122          5.6         2.8          4.9         2.0  virginica
123          7.7         2.8          6.7         2.0  virginica
124          6.3         2.7          4.9         1.8  virginica
125          6.7         3.3          5.7         2.1  virginica
126          7.2         3.2          6.0         1.8  virginica
127          6.2         2.8          4.8         1.8  virginica
128          6.1         3.0          4.9         1.8  virginica
129          6.4         2.8          5.6         2.1  virginica
130          7.2         3.0          5.8         1.6  virginica
131          7.4         2.8          6.1         1.9  virginica
132          7.9         3.8          6.4         2.0  virginica
133          6.4         2.8          5.6         2.2  virginica
134          6.3         2.8          5.1         1.5  virginica
135          6.1         2.6          5.6         1.4  virginica
136          7.7         3.0          6.1         2.3  virginica
137          6.3         3.4          5.6         2.4  virginica
138          6.4         3.1          5.5         1.8  virginica
139          6.0         3.0          4.8         1.8  virginica
140          6.9         3.1          5.4         2.1  virginica
141          6.7         3.1          5.6         2.4  virginica
142          6.9         3.1          5.1         2.3  virginica
143          5.8         2.7          5.1         1.9  virginica
144          6.8         3.2          5.9         2.3  virginica
145          6.7         3.3          5.7         2.5  virginica
146          6.7         3.0          5.2         2.3  virginica
147          6.3         2.5          5.0         1.9  virginica
148          6.5         3.0          5.2         2.0  virginica
149          6.2         3.4          5.4         2.3  virginica
150          5.9         3.0          5.1         1.8  virginica
  • Convert a regular data frame to tibble, which by default only displays the first 10 rows of data.
# Convert iris to a tibble
iris_tb <- as_tibble(iris)
iris_tb
# A tibble: 150 × 5
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
 1          5.1         3.5          1.4         0.2 setosa 
 2          4.9         3            1.4         0.2 setosa 
 3          4.7         3.2          1.3         0.2 setosa 
 4          4.6         3.1          1.5         0.2 setosa 
 5          5           3.6          1.4         0.2 setosa 
 6          5.4         3.9          1.7         0.4 setosa 
 7          4.6         3.4          1.4         0.3 setosa 
 8          5           3.4          1.5         0.2 setosa 
 9          4.4         2.9          1.4         0.2 setosa 
10          4.9         3.1          1.5         0.1 setosa 
# ℹ 140 more rows
# If subsetting a single column from a data.frame, R will output a vector
iris[, "Sepal.Length"]
  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9
# If subsetting a single column from a tibble, R will output a tibble
iris_tb[, "Sepal.Length"]
# A tibble: 150 × 1
   Sepal.Length
          <dbl>
 1          5.1
 2          4.9
 3          4.7
 4          4.6
 5          5  
 6          5.4
 7          4.6
 8          5  
 9          4.4
10          4.9
# ℹ 140 more rows

Also note that if you use the magrittr pipe %>% to subset a tibble, the notation is slightly different: a placeholder . is required before the [ ] or $.

# Return a vector
iris_tb$Sepal.Length
iris_tb[["Sepal.Length"]]
iris_tb[[1]]

# Return a tibble
iris_tb[, "Sepal.Length"]
iris_tb[, 1]

# Use piping
iris_tb %>% .$Sepal.Length
iris_tb %>% .[, "Sepal.Length"]
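An alternative that avoids the placeholder altogether is to use dplyr verbs, which work the same way with either pipe. This is a minimal sketch; iris_tb is recreated so that the block runs on its own.

```r
library(dplyr)

iris_tb <- as_tibble(iris)

# pull() extracts one column as a vector and works with either pipe
v1 <- iris_tb |> pull(Sepal.Length)
v2 <- iris_tb %>% pull(Sepal.Length)

# select() keeps the result as a one-column tibble
t1 <- iris_tb |> select(Sepal.Length)

is.numeric(v1)           # a plain numeric vector
inherits(t1, "tbl_df")   # still a tibble
```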

4.1.3 dplyr

Arguably the most useful tool in the tidyverse is dplyr, a Swiss-army knife for data wrangling. It has many handy functions that we recommend incorporating into your analysis.

  • Operations on rows:
    • arrange() changes the ordering of the rows.
    • filter() picks cases based on their values.
    • distinct() removes duplicate entries.
    • slice_*() selects rows by position.
  • Operations on columns:
    • select() extracts columns and returns a tibble.
    • mutate() adds new variables that are functions of existing variables.
    • rename() changes the names of one or more columns.
    • pull() extracts a single column as a vector.
  • Grouped operations:
    • group_by() groups data by one or more variables, so that subsequent operations are carried out within each group.
    • summarise() reduces multiple values down to a single summary.
  • Join operations: the *_join() functions merge two data frames together, including inner_join(), left_join(), right_join(), and full_join().
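Several of the verbs listed above can be combined in a single pipeline. The sketch below uses a small made-up table; the name households and its columns are hypothetical and are not part of the course data.

```r
library(dplyr)

# Hypothetical survey-style records, made up for illustration
households <- tribble(
  ~id, ~state, ~hh_size,
  1, "CA", 3,
  2, "CA", 1,
  3, "NY", 4,
  3, "NY", 4
)

cleaned <- households |>
  distinct() |>                          # drop the duplicated row
  rename(household_size = hh_size) |>    # give the column a clearer name
  select(state, household_size) |>       # keep two columns
  slice_head(n = 2)                      # first two rows by position

cleaned

# pull() returns a plain vector rather than a tibble
states <- households |> distinct() |> pull(state)
states
```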

The Posit dplyr cheat sheet is extremely intuitive and helpful.

Some examples of using dplyr functions.

  • Filter observations with Sepal.Length greater than 5.0, arrange the data by Sepal.Length in descending order, and create a new column Sepal.Length_2 that is the square of Sepal.Length.
iris_tb |>
  filter(Sepal.Length > 5.0) |>
  arrange(desc(Sepal.Length)) |>
  mutate(Sepal.Length_2 = Sepal.Length^2) |>
  print()
# A tibble: 118 × 6
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   Sepal.Length_2
          <dbl>       <dbl>        <dbl>       <dbl> <fct>              <dbl>
 1          7.9         3.8          6.4         2   virginica           62.4
 2          7.7         3.8          6.7         2.2 virginica           59.3
 3          7.7         2.6          6.9         2.3 virginica           59.3
 4          7.7         2.8          6.7         2   virginica           59.3
 5          7.7         3            6.1         2.3 virginica           59.3
 6          7.6         3            6.6         2.1 virginica           57.8
 7          7.4         2.8          6.1         1.9 virginica           54.8
 8          7.3         2.9          6.3         1.8 virginica           53.3
 9          7.2         3.6          6.1         2.5 virginica           51.8
10          7.2         3.2          6           1.8 virginica           51.8
# ℹ 108 more rows
  • Select the column Species, and find the distinct values of Species
iris_tb |>
  select(Species) |>
  distinct()
# A tibble: 3 × 1
  Species   
  <fct>     
1 setosa    
2 versicolor
3 virginica 
  • Count the number of rows in each species
iris_tb |>
  group_by(Species) |> 
  summarize(n = n())
# A tibble: 3 × 2
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50
# Shortcut for group_by() |> summarize(n = n())
iris_tb |>
  count(Species)
# A tibble: 3 × 2
  Species        n
  <fct>      <int>
1 setosa        50
2 versicolor    50
3 virginica     50
  • Calculate the mean of Sepal.Length for each Species
iris_tb |>
  group_by(Species) |>
  summarize(mean_Sepal_Length = mean(Sepal.Length))
# A tibble: 3 × 2
  Species    mean_Sepal_Length
  <fct>                  <dbl>
1 setosa                  5.01
2 versicolor              5.94
3 virginica               6.59
  • Find the observation with the maximum Sepal.Length for each Species
iris_tb |>
  group_by(Species) |>
  slice_max(Sepal.Length)
# A tibble: 3 × 5
# Groups:   Species [3]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
         <dbl>       <dbl>        <dbl>       <dbl> <fct>     
1          5.8         4            1.2         0.2 setosa    
2          7           3.2          4.7         1.4 versicolor
3          7.9         3.8          6.4         2   virginica 
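Two details are worth keeping in mind here: slice_max() keeps tied rows by default, and the result is still grouped. The sketch below uses a small made-up table (scores is hypothetical) in which one group has a tie for its maximum.

```r
library(dplyr)

# Hypothetical data: group "a" has two rows tied for the maximum
scores <- tribble(
  ~group, ~score,
  "a", 2,
  "a", 2,
  "b", 1
)

# Default behaviour: tied rows are all kept
with_ties_kept <- scores |>
  group_by(group) |>
  slice_max(score) |>
  ungroup()

# Exactly one row per group, with ties broken by row order
one_per_group <- scores |>
  group_by(group) |>
  slice_max(score, n = 1, with_ties = FALSE) |>
  ungroup()

nrow(with_ties_kept)
nrow(one_per_group)
```

Calling ungroup() at the end removes the grouping, so that later steps operate on the whole table again.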

4.1.4 Combine variables (columns)

  • Demo tables
(x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
))
# A tibble: 3 × 2
    key val_x
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
(y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
))
# A tibble: 3 × 2
    key val_y
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y3   
  • An inner join matches pairs of observations whenever their keys are equal:

inner_join(x, y, by = "key")
# A tibble: 2 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
  • An outer join keeps observations that appear in at least one of the tables.

  • Three types of outer joins: left join, right join, and full join.

  • A left join keeps all observations in x.
left_join(x, y, by = "key")
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
  • A right join keeps all observations in y.
right_join(x, y, by = "key")
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y3   
  • A full join keeps all observations in x and y.
full_join(x, y, by = "key")
# A tibble: 4 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y3   
  • One table has duplicate keys.

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  2, "x3",
  1, "x4"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2"
)
left_join(x, y, by = "key")
# A tibble: 4 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     2 x3    y2   
4     1 x4    y1   
  • Both tables have duplicate keys. For each duplicated key you get all possible combinations of the matching rows, the Cartesian product:

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  2, "x3",
  3, "x4"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  2, "y3",
  3, "y4"
)

left_join(x, y, by = "key")
Warning in left_join(x, y, by = "key"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 2 of `x` matches multiple rows in `y`.
ℹ Row 2 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# A tibble: 6 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     2 x2    y3   
4     2 x3    y2   
5     2 x3    y3   
6     3 x4    y4   
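As the warning suggests, the expected relationship can be declared explicitly. The sketch below reuses the same two tables and assumes dplyr 1.1.0 or later, where the relationship argument is available.

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  2, "x3",
  3, "x4"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  2, "y3",
  3, "y4"
)

# Declare that a many-to-many match is intended; no warning is raised
joined <- left_join(x, y, by = "key", relationship = "many-to-many")
nrow(joined)

# Declaring a stricter relationship turns unexpected duplicates into an error
strict <- try(
  left_join(x, y, by = "key", relationship = "many-to-one"),
  silent = TRUE
)
inherits(strict, "try-error")
```

Stating the relationship documents the intent of the join and catches duplicated keys early instead of silently multiplying rows.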

4.1.5 Combine cases (rows)

  • semi_join(x, y) keeps the rows in x that have a match in y.

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)

y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

semi_join(x, y, by = "key")
# A tibble: 2 × 2
    key val_x
  <dbl> <chr>
1     1 x1   
2     2 x2   
  • anti_join(x, y) keeps the rows in x that don’t have a match in y.

anti_join(x, y, by = "key")
# A tibble: 1 × 2
    key val_x
  <dbl> <chr>
1     3 x3   
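A common use of these filtering joins is to check for unmatched keys before merging. The sketch below reuses the same demo tables and swaps the arguments to find keys in y that are missing from x.

```r
library(dplyr)

x <- tribble(
  ~key, ~val_x,
  1, "x1",
  2, "x2",
  3, "x3"
)
y <- tribble(
  ~key, ~val_y,
  1, "y1",
  2, "y2",
  4, "y3"
)

# Rows of y whose key does not appear in x
unmatched_y <- anti_join(y, x, by = "key")
unmatched_y

# semi_join() and anti_join() together split x into matched and unmatched rows
nrow(semi_join(x, y, by = "key")) + nrow(anti_join(x, y, by = "key")) == nrow(x)
```

Unlike the joins in the previous section, these functions never add columns from y; they only filter the rows of the first table.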

4.2 Data visualization using ggplot2

  • ggplot2 is a powerful and flexible package for creating plots in R.

  • The basic idea is to map data to aesthetics (color, shape, size, etc.) and then add layers (points, lines, bars, etc.) to the plot.

ggplot(iris_tb) +
  # Scatter plot
  geom_point(mapping = aes(
    x = Petal.Length, 
    y = Petal.Width, 
    color = Species
    )) + 
  # Add a second layer: a smoothed trend line for each species
  geom_smooth(mapping = aes(
    x = Petal.Length, 
    y = Petal.Width, 
    color = Species
    )) +
  # Axis labels (title, caption, etc. can be set here too)
  labs(
    x = "Petal Length (cm)",
    y = "Petal Width (cm)"
  )
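When several layers share the same mapping, the aesthetics can be declared once in ggplot() and are then inherited by every layer. The variant below is a sketch rather than the plot above: the linear fit (method = "lm"), the hidden confidence band (se = FALSE), and the facets are choices made here for illustration.

```r
library(ggplot2)

p <- ggplot(
  iris,
  aes(x = Petal.Length, y = Petal.Width, color = Species)
) +
  # Both layers inherit the mapping declared in ggplot()
  geom_point() +
  # Straight-line fit per species, without the confidence band
  geom_smooth(method = "lm", se = FALSE) +
  # One panel per species
  facet_wrap(~ Species) +
  labs(
    x = "Petal Length (cm)",
    y = "Petal Width (cm)"
  )

# Printing the object draws the plot
p
```

Storing the plot in an object such as p makes it possible to add further layers later or to save it with ggsave().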

  • Again the Posit ggplot2 cheat sheet is extremely intuitive and helpful.

  • Interactive plots are easy to create with the plotly package.

  • Interactive apps can be developed with the popular shiny package.

4.3 Exercises

  1. Find the median Petal.Length of each Iris species.

    # try to use pipe

  2. Plot the histograms of Petal.Length, grouped by Species.

    # geom_histogram

5 Feedback

Slido