STA 3100: Programming with Data

Class Number (Section): 26313 (3DTA)

Meets: MWF 3:00 – 3:50 PM (Period 8)

Location: FLO 100

Web: https://ufl.instructure.com/courses/498971

Instructor: Dr. Brett Presnell

Teaching Assistant: Dipshi Roychowdhury

Dr. Brett Presnell | Dipshi Roychowdhury |
---|---|
Email: presnell@ufl.edu | Email: droychowdhury@ufl.edu |
Web: https://www.stat.ufl.edu/~presnell/ | Office: FLO 117D |
Office: FLO 225 | Virtual Office: Zoom 658 792 6980 |
Virtual Office: Zoom 940 1233 3509 | Office Hrs: Tue 2-3:30 (in person) |
Office Hrs: MW 4-5 PM | Fri 1-2:30 (online) |

An introduction to statistical computing and programming with data. Topics include basic programming in R; data types and data structures in R; importing and cleaning data; specifying statistical models in R; statistical graphics; statistical simulation using pseudo-random numbers; reproducible research and the documentation of statistical analyses.

Prerequisites: STA 3032 (minimum grade of B-), STA 2023 (minimum grade of B), or AP Statistics (score of 4).

You will learn to do the following:

- Import data into R and prepare the data for analysis.
- Write functions in R making effective use of data structures and control structures.
- Formulate statistical models in the R language.
- Perform, document, and interpret common statistical analyses.
- Carry out statistical/probabilistic simulations.
- Determine statistical graphics appropriate to a statistical analysis and produce them using R.
- Document and report the results of data analyses and simulations in a reproducible way.

We will use a variety of on-line texts and other resources. Class notes and other materials will be made available on the course website. Most readings will be taken from the following (free, on-line) texts, which students are encouraged to peruse on their own:

- r4ds2e: R for Data Science (2e): Visualize, Model, Transform, Tidy, and Import Data
- rp4ds: R Programming for Data Science
- hopr: Hands-On Programming with R: Write Your Own Functions and Simulations
- advr: Advanced R (2nd Ed)
- rgraphics: R Graphics Cookbook, 2nd edition

Chang, Winston. 2018. *R Graphics Cookbook: Practical Recipes for
Visualizing Data*. 2nd ed. Sebastopol, California:
O’Reilly Media, Inc. https://r-graphics.org/.

Grolemund, Garrett. 2014. *Hands-on Programming with R:
Write Your Own Functions and Simulations*. Sebastopol, CA:
O’Reilly Media, Inc. https://rstudio-education.github.io/hopr/.

Healy, Kieran. 2018. *Data Visualization: A Practical
Introduction*. Princeton University Press. https://socviz.co/.

Peng, Roger D. 2016. *R Programming for Data Science*. 5+ ed.
Lulu.com. https://bookdown.org/rdpeng/rprogdatascience/.

Wickham, Hadley. 2019. *Advanced R*. 2nd ed. Boca
Raton, Florida: Chapman and Hall/CRC. https://adv-r.hadley.nz/.

Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023.
*R for Data Science: Import, Tidy, Transform, Visualize, and Model
Data*. 2nd ed. Sebastopol, California:
O’Reilly Media, Inc. https://r4ds.hadley.nz/.

Wickham, Hadley, Danielle Navarro, and Thomas Lin Pedersen. 2022.
*Ggplot2: Elegant Graphics for Data Analysis*. 3rd ed. Springer.
https://ggplot2-book.org/.

There will be regular online quizzes to help you refine your knowledge and understanding of the course material. Homework assignments and projects will put this knowledge to use. These will be weighted in the final course average (percentage) as follows:

- 80% Homework/Projects
- 20% Quizzes

Letter grades in the course will be determined from the final course average according to the following scale (after rounding to the nearest integer):

A | A- | B+ | B | B- | C+ | C | D | E |
---|---|---|---|---|---|---|---|---|
94-100 | 90-93 | 87-89 | 84-86 | 80-83 | 77-79 | 67-76 | 60-66 | 0-59 |

Further information may be found in the university’s grades and grading policies.
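As a rough illustration of the weighting and grading scale described above, the following R sketch computes a final course average and looks up the corresponding letter grade. The function names `course_average` and `letter_grade` are hypothetical (they are not part of the course materials), and the sketch assumes that the homework/project and quiz scores are already expressed as percentages. Note that R's `round()` sends exact ties such as 86.5 to the even integer, which may or may not match how the instructor rounds; the instructor's calculation is authoritative.

```r
# Hypothetical helpers: an illustration only, not official course code.

# Weighted final average: 80% homework/projects, 20% quizzes.
course_average <- function(homework, quizzes) {
  0.80 * homework + 0.20 * quizzes
}

# Map a final average to a letter grade using the scale in the table.
letter_grade <- function(average) {
  rounded <- round(average)  # assumption: R's default tie handling
  # Lower bound of each grade band, in increasing order.
  lower  <- c(0, 60, 67, 77, 80, 84, 87, 90, 94)
  grades <- c("E", "D", "C", "C+", "B-", "B", "B+", "A-", "A")
  grades[findInterval(rounded, lower)]
}

avg <- course_average(homework = 91, quizzes = 82)
avg                # 0.80 * 91 + 0.20 * 82 = 89.2
letter_grade(avg)  # 89.2 rounds to 89, which falls in the B+ band
```

`findInterval()` returns the index of the band whose lower bound is the largest one not exceeding the rounded average, so the two vectors must be kept in the same order.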

Homework and projects must be submitted on time, and it is the student’s responsibility to allocate sufficient time to complete each assignment by the due date.

Late assignments will be accepted in cases of documented emergency or illness, but you must inform the instructor in advance of any illness which may lead to a late submission.

In all other cases, acceptance of late assignments will be at the discretion of the instructor. Scores on late submissions which are accepted will be reduced by 10% plus an additional 5% for each additional day between the due date and the time of submission.
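One possible reading of the late penalty can be sketched in R as follows. This is only an illustration under stated assumptions: the function name `late_score` is hypothetical, the reduction is assumed to be a percentage of the earned score, the initial 10% is assumed to cover any submission up to one day late, and each further day (or part of a day) is assumed to add another 5%. The instructor's interpretation governs in all cases.

```r
# Hypothetical helper illustrating one reading of the late policy.
# score:     score the submission would have earned if on time
# days_late: time between the due date and submission, in days
late_score <- function(score, days_late) {
  if (days_late <= 0) {
    return(score)  # on time: no reduction
  }
  # Assumption: 10% for the first day or part of a day,
  # plus 5% for each further day or part of a day.
  penalty <- 0.10 + 0.05 * (ceiling(days_late) - 1)
  score * max(0, 1 - penalty)  # the reduced score never goes below zero
}

late_score(90, 0)    # on time: no reduction
late_score(90, 0.5)  # 10% reduction
late_score(90, 3)    # 10% + 2 * 5% = 20% reduction
```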

Nota bene, it is the student’s responsibility to correctly submit
their work for every assignment, so **always double check that you
have submitted the correct file(s) for each assignment**.
Similarly, losing the internet connection in your residence at the last
minute is **not** an acceptable excuse for a late
submission. (If you insist on submitting your assignments at the last
hour, then be sure that you know how to use your mobile phone as a Wi-Fi
hotspot.)

**If you have not submitted the correct file(s) by the due
date, then any subsequent submission will be treated as a late
submission.**

If you feel that an error has been made in grading an assignment, please first contact your TA during their office hours or by email. If, after consulting with the TA, you still feel that your assignment has been graded incorrectly, you may submit a written (typed, not handwritten) appeal to the instructor detailing precisely how your assignment was misgraded.

Students will be held accountable to the UF Honor Code.

Unless otherwise specified in writing by the instructor, students are
expected to work independently or in *assigned* groups. General
discussion of the course material is encouraged, but offering or
accepting solutions from others is plagiarism. When in doubt, direct
your questions to the instructor or TA.

As in all courses at UF, unauthorized recording and unauthorized sharing of recorded materials by students or any other party is prohibited.

Except in special circumstances, class sessions will not be recorded by the instructor. In case a class session is recorded, students who participate with their camera engaged or who utilize a profile image are agreeing to have their video or image recorded. If you are unwilling to consent to have your profile or video image recorded, be sure to keep your camera off and do not use a profile image. Likewise, students who un-mute during class and participate orally are agreeing to have their voice recorded. If you are not willing to consent to have your voice recorded during class, you will need to keep your mute button activated and communicate exclusively using the “chat” feature, which allows students to type questions and comments live. The chat will not be recorded or shared.

Students requesting accommodation for disabilities must register with UF’s Disability Resource Center. The DRC will provide documentation to the student, who must then provide this documentation to the instructor when requesting accommodation. You must submit this documentation prior to submitting any assignments or taking any exam or quiz for which you are requesting accommodation.

Students are expected to provide feedback on the quality of instruction in this course by completing course evaluations online via GatorEvals. Guidance on how to give feedback in a professional and respectful manner is available at https://gatorevals.aa.ufl.edu/students/. Students will be notified when the evaluation period opens, and can complete evaluations through the email they receive from GatorEvals, in their Canvas course menu under GatorEvals, or via https://ufl.bluera.com/ufl/. Summaries of course evaluation results are available to students at https://gatorevals.aa.ufl.edu/public-results/.

This is an aspirational schedule for the course. It may be altered or rearranged to adapt to the backgrounds, abilities, and interests of the students in the class. There are 43 scheduled class meetings.

- Getting started
- Vectors and Vectorized Operations
- Introduction to R Markdown
- Distributions and Descriptive Statistics
- Writing Your Own Functions
- Matrices and Arrays
- Lists
- Data frames (and tibbles)
- Importing and Exporting Data
- Column and row operations on data frames
- Pipes and more operations on data frames
- Joining/Merging Data Frames
- Dates and times in base R
- The lubridate package
- Tidy Data and Pivoting
- Character strings and the stringr package
- String matching with regular expressions
- Detecting string matches
- Extracting string matches
- String replacement and string splitting
- An extended example with character strings
- Introduction to Data Scraping
- Factors in base R
- The forcats package
- Elementary statistical inference
- Simple linear regression
- Multiple regression
- Factors and dummy variables in regression
- Interactions
- Simple logistic regression
- Multiple logistic regression
- More graphics in R
- Working with lists: the purrr package
- More on data scraping and/or simulation.

Links to slides and reading assignments for each lecture will be added here throughout the semester.

- Day 1 (Mon, Jan 8)
  - Review syllabus
  - Get R and Friends (PDF version)
  - Diagnosing Problems with Your R Setup (PDF version)
  - Introduction to R and RStudio
  - A Brief History of S and R (PDF)
  - R Basics (Slides 1–10) (PDF) (R code)
    - New R functions: `+`, `-`, `*`, `/`, `^`, `%/%`, `%%`, `c`, `<-`, `length`, `abs`, `round`, `log`, `log10`, `log2`, `sin`, `asin`, `factorial`, `choose`, `lfactorial`, `exp`, `gamma`, `incr`, `sqrt`, `function`
  - Readings
    - rp4ds: Chapter 2 (History and Overview of R), Chapter 3 (Getting Started with R), and Sections 4.1–4.6 (Entering Input; Evaluation; R Objects; Numbers; Attributes; Creating Vectors)
    - r4ds2e: Chapter 1 (Introduction)
    - advr: Chapter 1 (Introduction) (only so you will know what this book is about; otherwise optional)

- Day 2 (Wed, Jan 10)
  - Files, Directories, and Paths
  - Finish R Basics (Slides 11–19)

- Day 3 (Fri, Jan 12)
  - Atomic Types (PDF) (R code)
    - New R functions: `print`, `writeLines`, `is.double`, `is.integer`, `is.character`, `is.logical`, `is.list`, `get`, `typeof`, `as.double`, `sum`, `mean`, `>`, `paste`, `as.character`, `as.logical`, `rnorm`
  - Readings
    - rp4ds: Sections 4.7–4.8 (Mixing Objects; Explicit Coercion)
    - r4ds2e: Chapter 3 (Workflow: basics)
    - (optional) advr: Sections 3.1–3.3.2

- Day 4 (Wed, Jan 17)
  - Finish Atomic Types
  - Sequences and Repetition (PDF) (R code)
    - New R functions: `:`, `seq`, `rep`
  - Indexing (PDF) (R code)
    - New R functions: `[`, `names`, `sample`, `sort`
  - Readings
    - rp4ds: Section 9.1 (Subsetting a Vector)

- Day 5 (Fri, Jan 19)
  - Probability Distributions (PDF) (R code)
    - New R functions: `dnorm`, `pnorm`, `qnorm`, `set.seed`
  - Single Variable Statistics (PDF) (R code)
    - New R functions: `summary`, `median`, `min`, `max`, `range`, `quantile`, `sd`, `IQR`, `t.test`, `ecdf`, `qt`
  - Readings
    - rp4ds: Sections 20.1–20.2 (Generating Random Numbers, Setting the Random Number Seed)
    - r4ds2e: Chapter 1 (Data Visualization)
    - rgraphics: Chapter 2 (Quickly Exploring Data)
    - (optional) socviz: Chapters 1–3

- Day 6 (Mon, Jan 22)
  - Finish Single Variable Statistics
  - Readings and References for R Markdown
    - R Markdown Cheatsheet
    - R Markdown Reference Guide
    - R Markdown: The Definitive Guide (for reference).
    - r4ds: Chapter 27
    - r4ds2e: Chapter 29 (Quarto, optional)

- Day 7 (Wed, Jan 24)
  - Logic (PDF) (R code)
    - New R functions: `<`, `<=`, `>=`, `==`, `!=`, `!`, `|`, `&`, `xor`, `table`, `all`, `any`, `runif`
  - Simulation Example: Monte Carlo Estimation of \(\pi\) (PDF) (R code) (R markdown file)

- Day 8 (Fri, Jan 26)
  - Matrices (PDF) (R code)
    - New R functions: `matrix`, `is.matrix`, `attributes`, `dim`, `dimnames`, `crossprod`, `%*%`, `apply`, `t`, `diag`, `solve`, `cbind`, `rbind`, `attr`, `colnames`, `list`, `rownames`, `drop`
  - Readings
    - rp4ds: Section 4.9, Section 9.2
    - (optional) advr: Sections 3.3.3, 4.2.3

- Day 9 (Mon, Jan 29)
  - Finish Matrices.

- Day 10 (Wed, Jan 31)
  - Work in class on a preliminary version of assignment 030-Matrix-Index-Calc

- Day 11 (Fri, Feb 2)

- Day 12 (Mon, Feb 5)
  - Lists (PDF) (R code)
    - New R functions: `[[`, `$`, `str`, `lm`, `as.Date`
  - Readings
    - rp4ds: Sections 9.3–9.5
    - r4ds2e: Section 24.2 through 24.2.1 (Lists and Hierarchy)
    - (optional) advr: Sections 3.5, 4.2.2, 4.3–4.4, Chapter 13

- Day 13 (Wed, Feb 7)
  - Finish lists.
  - S3 Primer (PDF) (R code)
    - New R functions: `getS3method`, `methods`
  - Readings
    - (optional) advr: Chapter 13

- Day 14 (Fri, Feb 9):
  - Random Numbers and Simulation (PDF) (R code)
    - New R functions: `sample.int`, `%in%`, `for`, `noquote`, `replicate`, `rle`, `vector`, `unclass`, `rmax_run_len`
  - Readings
    - r4ds2e: Sections 26.1 and 26.2
    - rp4ds: Chapters 13 and 14
    - (optional) advr: Chapters 5 and 6

- Day 15 (Mon, Feb 12):
  - Discussion and practice with `apply()` and anonymous functions.
  - Finish Random Numbers and Simulation.

- Day 16 (Wed, Feb 14):
  - Timing and Functions (PDF) (R code)
    - New R functions: `system.time`, `fmaxrl`, `if`, `is.null`

- Day 17 (Fri, Feb 16):
  - Evaluating Simulations (PDF) (R code)
    - New R functions: `binom.test`, `prop.test`, `simtosses`, `seq_along`, `curve`, `repeat`, `invisible`, `readline`, `as.numeric`, `break`, `paste0`, `dt`

- Day 18 (Mon, Feb 19):
  - Finish Evaluating Simulations
  - Data Frames (PDF) (R code)
    - New R functions: `data.frame`, `factor`
  - Readings
    - rp4ds: Sections 4.13–4.15
    - (optional) advr: Sections 3.6.1–3.6.5 and Sections 4.2.4–4.2.5

- Day 19 (Wed, Feb 21):
  - Tibbles (PDF) (R code)
    - New R functions: `library`, `tibble`, `as_tibble`, `I`
  - Readings
    - r4ds2e: Section 8.6
    - (optional) advr: Sections 3.6.6–3.6.7

- Day 20 (Fri, Feb 23):
  - Finish Tibbles
  - Row and Column Operations in Base R (PDF) (R code)
    - New R functions: `subset`, `order`, `head`, `tail`, `tapply`, `with`, `aggregate`, `transform`, `union`, `setdiff`
  - Readings

- Day 21 (Mon, Feb 26):

- Day 22 (Wed, Feb 28):
  - Row and Column Operations with dplyr (PDF) (R code)
    - New R functions: `filter`, `select`, `arrange`, `slice`, `slice_head`, `slice_tail`, `slice_max`, `slice_sample`, `summarise`, `group_vars`, `is.array`, `summarize`, `mutate`, `pivot_wider`, `rename`, `relocate`, `desc`, `group_by`, `n`, `ungroup`, `xtabs`
  - Readings

- Day 23 (Fri, Mar 1):
  - Finish “Row and Column Operations with dplyr”
  - Joining/Merging Data Frames (PDF) (R code)
    - New R functions: `merge`, `inner_join`, `anti_join`, `full_join`, `left_join`, `right_join`, `tribble`, `as.data.frame`, `ifelse`, `join_by`, `case_match`
  - Readings
    - r4ds2e: Chapter 21
    - dplyr <-> Base R: section *Two-table verbs*
    - Two-Table Verbs Vignette
    - dplyr 1.1.0: Joins

- Day 24 (Mon, Mar 4):
  - Finish “Joining/Merging Data Frames”
  - Importing and Exporting Data (PDF) (R code)
    - Data: guitars.csv, houseSalesGNV.csv, ufAcademicPrograms.xlsx
    - New R functions: `cat`, `glimpse`, `write_csv`, `write_rds`, `identical`, `readLines`, `read.csv`, `read_csv`, `read.table`, `read_excel`, `fill`, `read_sheet`, `read_rds`, `which`
  - Readings

- Day 25 (Wed, Mar 6):
  - Finish “Importing and Exporting Data”

- Day 26 (Fri, Mar 8):
  - Importing the FEC Santos Data (PDF) (R code)
    - Data: schedule_b-2023-02-11T18 48 59.csv
    - New R functions: `problems`, `spec`, `cols_condense`, `cols`, `hour`, `minute`, `second`, `cols_only`, `col_character`, `pick`, `col_integer`, `col_datetime`, `col_double`, `vapply`, `character`

- Day 27 (Mon, Mar 18):
  - Dates and Times in Base R (PDF) (R code)
    - New R functions: `ISOdate`, `Sys.timezone`, `as.POSIXct`, `difftime`, `as.difftime`, `is.numeric`, `Sys.Date`, `Sys.time`, `as.POSIXlt`
  - Readings

- Day 28 (Wed, Mar 20):
  - The Lubridate Package (PDF) (R code)
    - Data: tvRndOf32-2019.csv
    - New R functions: `as_date`, `mdy`, `dmy`, `make_date`, `pull`, `ymd_hm`, `strftime`, `year`, `quarter`, `month`, `day`, `wday`, `mday`, `qday`, `yday`, `is.factor`, `is.ordered`, `dyears`, `years`, `dhours`, `hours`, `isS4`, `int_overlaps`, `intersect`, `today`, `now`, `ymd_hms`, `interval`, `ymd_h`, `%--%`, `int_shift`, `int_end`, `str_replace`, `days`, `str_c`, `weeks`
  - Readings
  - Extra Lubridate Slides (PDF) (R code)
    - Data: tvRndOf32-2019.csv
    - New R functions: `separate_wider_regex`, `rows_update`, `str_replace_all`, `dminutes`, `coalesce`

- Day 29 (Fri, Mar 22):
  - Tidy Data with the tidyr Package (PDF) (R code)
    - Data: annual_border_crossings.tsv, annual_border_crossings.xlsx
    - New R functions: `Sys.getlocale`, `locale`, `distinct`, `read_tsv`, `pivot_longer`, `janitor::clean_names`, `rename_with`
  - Readings

- Day 30 (Mon, Mar 25):
  - String Basics (PDF) (R code)
    - New R functions: `str_length`, `str_flatten`, `str_sub`, `rev`
  - Readings

- Day 31 (Wed, Mar 27):
  - Regular Expressions (PDF) (R code)
    - New R functions: `str_view`, `str_view_all`, `rphone`
  - Readings
    - r4ds2e: Chapter 16
    - Regular Expressions Vignette
    - RegexOne: Learn Regular Expressions with simple, interactive exercises.
    - Regular-Expressions.info: Encyclopedic information about regular expressions.

- Day 32 (Fri, Mar 29):
  - Finish Regular Expressions
  - In-class regular expression exercises

- Day 33 (Mon, Apr 1):
  - Using Regular Expressions (PDF) (R code)
    - New R functions: `str_detect`, `str_which`, `str_subset`, `str_count`, `slice_min`, `str_extract`, `str_extract_all`, `str_match`, `str_match_all`, `str_split`, `regex`, `boundary`, `list_c`, `str_to_lower`
  - Readings
    - r4ds2e: Chapter 16
    - Regular Expressions Vignette
    - RegexOne: Learn Regular Expressions with simple, interactive exercises.
    - Regular-Expressions.info: Encyclopedic information about regular expressions.

- Day 34 (Wed, Apr 3):
  - Finish Using Regular Expressions
  - Meta String (PDF) (R code)
    - Data: stringr_lines.txt
    - New R functions: `separate_wider_position`, `unique`, `apropos`, `semi_join`, `separate_wider_delim`, `read_lines`, `unnest_longer`, `str_split_i`, `as.integer`, `map_int`
  - Readings
    - r4ds2e: Sections 15.4 and 16.7

- Day 35 (Fri, Apr 5):
  - Finish Meta String
  - Work on web scraping example.

- Day 36 (Mon, Apr 8):
  - Factors (PDF) (R code)
    - Data: hcv.csv
    - New R functions: `levels`, `options`, `count`, `fct_rev`, `saveRDS`, `relevel`, `ordered`, `cumsum`, `optim`, `cut`, `qweibull`, `labs`, `rweibull`, `as_factor`, `fct_collapse`, `fct_lump_min`, `fct_lump_prop`, `fct_lump_n`, `fct_infreq`, `fct_lump_lowfreq`, `fct_relevel`, `ggplot`, `geom_function`
  - Readings
    - r4ds2e: Chapter 17
    - rp4ds: Section 4.11
    - (optional) advr: Section 3.4.1

- Day 37 (Wed, Apr 10):
  - Continue Factors

- Day 38 (Fri, Apr 12):
  - Finish Factors
  - Linear Regression (PDF) (R code)
    - Data: houseSalesGNV2020.rds
    - New R functions: `source`, `knitr::include_graphics`, `coef`, `confint`, `anova`, `=`, `readRDS`, `geom_histogram`, `geom_density`, `formatC`, `update`, `scale_colour_viridis_d`, `expand_grid`, `predict`, `str_to_upper`, `aes`, `nrow`, `fmt`, `geom_line`, `geom_point`, `bind_cols`

- Day 39 (Mon, Apr 15):
  - Continue Linear Regression

- Day 40 (Wed, Apr 17):
  - Continue Linear Regression

- Day 41 (Fri, Apr 19):
  - Finish Linear Regression
  - Discuss Assignment 080

- Day 42 (Mon, Apr 22):
  - Logistic Regression (PDF) (R code)
    - Data: hcv.csv
    - New R functions: `contr.treatment`, `contr.poly`, `glm`

- Day 43 (Wed, Apr 24):
  - Finish Logistic Regression