Relearning R
A hodgepodge of notes for learning R for my reference, segmented so it is easy to read.
–
Visualisations
ggplot(data = <DATA>) +
<GEOM_FUNCTION>(
mapping = aes(<MAPPINGS>),
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
Use facets: subplots that each display a subset of the data, split by a categorical variable.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)

Can also use facet_grid() to facet on the combination of two variables:
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl)

geom_point vs geom_smooth: the geom decides which geometric object is used to draw the data, and every geom function takes a mapping argument. Instead of linetype below, can also use group to separate the categories while keeping the same linetype:
ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
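A minimal sketch of the group variant mentioned above, assuming the mpg dataset that ships with ggplot2. The object names p_group and p_linetype are only for illustration.

```r
library(ggplot2)

# group splits the data by drv without changing how each line looks
p_group <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

# linetype also splits by drv, and additionally varies the line style
p_linetype <- ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))
```

Printing either object draws the plot; the group version has no legend because group is not a visible aesthetic.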

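Repetition isn’t necessary: mappings passed to ggplot() are shared by every layer, and a layer can still add its own mappings or use its own data:
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE) # filter() comes from dplyr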

Bar chart from pre-computed frequencies with stat = "identity" (bar heights come from freq instead of row counts; note that a character x is still drawn in alphabetical order):
demo <- tribble(
  ~cut, ~freq,
  "Fair", 1610,
  "Good", 4906,
  "Very Good", 12082,
  "Premium", 13791,
  "Ideal", 21551
)
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
# for proportions with the default stat, use y = after_stat(prop) and group = 1 (older ggplot2: stat(prop))
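To control the order in which categorical bars are drawn, one possible approach (not part of the original notes, so treat it as an assumption) is to turn the variable into a factor with explicit levels. The data frame demo_ordered below is a hypothetical copy of the demo data.

```r
library(ggplot2)

# hypothetical copy of the demo data, built with base R only
demo_ordered <- data.frame(
  cut = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
  freq = c(1610, 4906, 12082, 13791, 21551)
)

# a factor is drawn in the order of its levels, not alphabetically
demo_ordered$cut <- factor(
  demo_ordered$cut,
  levels = c("Fair", "Good", "Very Good", "Premium", "Ideal")
)

p_ordered <- ggplot(data = demo_ordered) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
```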

Line plots (with summary stats) / boxplots:
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot() # add + coord_flip() to make it horizontal

Can use fill (also for categorical variables) to fill whole bars, or the normal color for outlines.
position = "identity" places each object exactly where it falls in the context of the graph. This is not very useful for bars, because it overlaps them. To see that overlapping we either need to make the bars slightly transparent by setting alpha to a small value, or completely transparent by setting fill = NA. Basically it doesn’t stack like fill; the bars overlap each other so the original values can be read against y.
position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
Use position = "jitter" for scatterplots when points overplot each other; geom_jitter() is a shorthand for this.
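A rough side-by-side sketch of the position adjustments described above, assuming the diamonds and mpg datasets that ship with ggplot2. The p_* names are only for illustration.

```r
library(ggplot2)

# clarity is mapped to fill so every cut has several groups to position
base <- ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity))

p_identity <- base + geom_bar(alpha = 1 / 5, position = "identity") # bars overlap
p_fill <- base + geom_bar(position = "fill")     # equal heights, compare proportions
p_dodge <- base + geom_bar(position = "dodge")   # side by side, compare counts

# jitter adds a little random noise so overplotted points become visible
p_jitter <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = "jitter")
```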
Can use bar + coord_polar() to turn bar charts into pie charts.
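A small sketch of the polar-coordinates trick, assuming the diamonds dataset from ggplot2. As far as I can tell, a plain bar chart in polar coordinates gives a Coxcomb chart, while a classic pie needs a single stacked bar with the angle mapped to y.

```r
library(ggplot2)

# a regular bar chart in polar coordinates gives a Coxcomb chart
coxcomb <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = cut), width = 1, show.legend = FALSE) +
  coord_polar()

# a single stacked bar with the angle mapped to y gives a classic pie chart
pie <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = factor(1), fill = cut), width = 1) +
  coord_polar(theta = "y")
```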
Explore labs() for tags, titles, etc.
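A quick sketch of what labs() can set, assuming the mpg dataset from ggplot2. The label texts are made up for illustration.

```r
library(ggplot2)

p_labelled <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  labs(
    title = "Highway mileage vs engine displacement", # main title
    subtitle = "One point per car model",             # line under the title
    caption = "Data: ggplot2::mpg",                   # bottom-right note
    tag = "A",                                        # panel tag, top-left
    x = "Engine displacement (litres)",               # axis titles
    y = "Highway miles per gallon",
    color = "Car class"                               # legend title
  )
```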
–
The Basics

Do not use = to assign variables; instead press ALT + - to insert the <- assignment operator.
Use snake case for styling convention (underscore to separate lowercase):
example:
i_use_snake_case

Use CTRL + UP_ARROW to look at all commands executed in the console (a fast and easy way to redo past commands).
Variables are case-sensitive – typos will happen

When cycling through different functions using TAB autofill, press F1 to get more detailed documentation.
ALT + SHIFT + K to access all keyboard shortcuts.
dplyr masks some base R functions (for example filter() and lag() from the stats package); to access the originals, use the full name, e.g. stats::filter()
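A short sketch of the masking behaviour, using a made-up vector x. With dplyr loaded, a bare lag() or filter() call resolves to the dplyr version, so the package prefix makes the choice explicit.

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5)

# dplyr's lag() shifts the values one position to the right
shifted <- dplyr::lag(x)

# stats::filter() is the original time-series filter, here a 3-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))
```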

Easy access to functions / objects inside a package using ::
 one example: nycflights13::flights to access the dataframe

To print and save assignments, wrap in parentheses:
 example:
(dec25 <- filter(flights, month == 12, day == 25))

When using == to compare floating-point numbers, use near() instead, since computers store them with finite precision:
sqrt(2) ^ 2 == 2
#> [1] FALSE
1 / 49 * 49 == 1
#> [1] FALSE
near(sqrt(2) ^ 2, 2)
#> [1] TRUE
near(1 / 49 * 49, 1)
#> [1] TRUE

Ctrl + Shift + P resends the block that was most recently run to the console; useful when editing a block constantly.
Data Transformation

tibbles are dataframes, but tweaked to work better under the tidyverse

& = “and”, | = “or”, ! = “not”
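A tiny base R sketch of the three logical operators on a made-up vector; they work element by element.

```r
x <- c(1, 5, 10, 15)

both <- x > 2 & x < 12    # "and": FALSE TRUE TRUE FALSE
either <- x < 2 | x > 12  # "or":  TRUE FALSE FALSE TRUE
negated <- !(x > 2)       # "not": TRUE FALSE FALSE FALSE
```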


The following abbreviations are used for types of variables:
int: integers
dbl: doubles, or real numbers
chr: character vectors, or strings
dttm: date-times (a date + a time)
lgl: logical (booleans), vectors with only TRUE or FALSE
fct (sometimes written fctr): factors (used to represent categorical variables with fixed possible values)
–

The basics of dplyr, in which the first argument is always a df:
Pick observations by their values: filter().

example:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
# De Morgan's law: !(x & y) is the same as !x | !y,
# and !(x | y) is the same as !x & !y

Can use between(x, left, right) instead of the boolean condition x >= left & x <= right
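A minimal sketch showing that between() matches the spelled-out condition, using a made-up tibble named delays (not the real flights data). Both ends of the range are inclusive.

```r
library(dplyr)

# hypothetical toy data
delays <- tibble(
  flight = c("A", "B", "C", "D"),
  dep_delay = c(-5, 30, 60, 200)
)

in_range_1 <- filter(delays, dep_delay >= 0 & dep_delay <= 60)
in_range_2 <- filter(delays, between(dep_delay, 0, 60))
```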


Reorder the rows: arrange()
 sorts rows in ascending order; can use desc() to reorder in descending order:
arrange(flights, desc(dep_delay))

Pick variables by their names: select().
 Can select columns between two variables (inclusive):
select(flights, year:day)
 Can select all columns except those between two variables (inclusive):
select(flights, -(year:day))
 Some helper functions, see ?select for more:
starts_with("abc") matches names that begin with “abc”
ends_with("xyz") matches names that end with “xyz”
contains("ijk") matches names that contain “ijk”
everything() pulls in all the other variables, useful for reordering
num_range("x", 1:3) matches variables x1 to x3
 Use rename() to rename a variable while keeping all variables not mentioned (select() would drop them):
rename(flights, tail_num = tailnum)
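A small sketch of the select() helpers on a made-up one-row tibble, so the resulting column names are easy to check. The tibble df and its x1 to x3 columns are hypothetical.

```r
library(dplyr)

# hypothetical one-row data frame with flights-like column names
df <- tibble(
  year = 2013, month = 1, day = 1,
  dep_delay = 2, arr_delay = 11,
  tailnum = "N14228",
  x1 = 1, x2 = 2, x3 = 3
)

range_cols <- names(select(df, year:day))               # year, month, day
other_cols <- names(select(df, -(year:day)))            # everything else
delay_cols <- names(select(df, ends_with("delay")))     # dep_delay, arr_delay
x_cols <- names(select(df, num_range("x", 1:3)))        # x1, x2, x3
reordered <- names(select(df, tailnum, everything()))   # tailnum moved to the front
renamed <- names(rename(df, tail_num = tailnum))        # all columns kept
```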

Create new variables with functions of existing variables: mutate().

similar to apply() from pandas in Python; can also refer to columns that you’ve just created in the same call; the result is ONLY TEMPORARY unless stored
function must be vectorized: it must take a vector of values as input and return a vector with the same number of values as output.

adds new variables at the end of the dataset; use View(df) to see all columns
Example:
flights_sml <- select(flights,
  year:day,
  ends_with("delay"),
  distance,
  air_time
)
mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
#> # A tibble: 336,776 x 9
#>    year month   day dep_delay arr_delay distance air_time  gain speed
#>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
#> 1  2013     1     1         2        11     1400      227    -9  370.

if you only want to keep the new variables, use transmute()

%/% = integer division, dividing and then rounding down to the nearest integer
%% = mod, or just the remainder
Other functions: lead(), lag() (refer to the value that comes after or before the current one), cumsum(), cumprod(), cummin(), cummax(), cummean()
Also ranking functions: min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
Example of %/% and %% with transmute():
transmute(flights,
  dep_time,
  hour = dep_time %/% 100,
  minute = dep_time %% 100
)
#> # A tibble: 336,776 x 3
#>   dep_time  hour minute
#>      <int> <dbl>  <dbl>
#> 1      517     5     17
#> 2      533     5     33
#> 3      542     5     42
#> 4      544     5     44
#> 5      554     5     54
#> 6      554     5     54
#> # … with 336,770 more rows
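A small sketch of how the ranking functions listed above treat ties and missing values, on a made-up vector y.

```r
library(dplyr)

y <- c(10, 20, 20, 40, NA)

r_min <- min_rank(y)          # ties share the lowest rank, gaps follow: 1 2 2 4 NA
r_dense <- dense_rank(y)      # ties share a rank, no gaps: 1 2 2 3 NA
r_row <- row_number(y)        # ties broken by position: 1 2 3 4 NA
r_desc <- min_rank(desc(y))   # largest value gets rank 1

r_pct <- percent_rank(y)      # min_rank rescaled to the 0-1 range
r_cume <- cume_dist(y)        # proportion of values less than or equal to each value
r_tile <- ntile(y, 2)         # split into 2 roughly equal buckets
```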


Collapse many values down to a single summary: summarise().
 use in tandem with group_by() to change the unit of analysis from the complete dataset to individual groups.
 count = n() as a sub function to count the # of data points
 sum(!is.na(x)) as a sub function to count the # of non-missing values
 n_distinct(x) for the # of distinct values
 or can calculate statistics of the overall dataframe, like the mean:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
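A minimal sketch of the counting helpers inside a grouped summarise(), using a made-up tibble flights_small rather than the real flights data.

```r
library(dplyr)

# hypothetical toy data with one missing arrival delay
flights_small <- tibble(
  dest = c("ATL", "ATL", "ATL", "BOS", "BOS"),
  carrier = c("DL", "DL", "FL", "B6", "B6"),
  arr_delay = c(10, NA, 30, -5, 5)
)

by_dest <- flights_small %>%
  group_by(dest) %>%
  summarise(
    count = n(),                           # rows per destination
    n_known = sum(!is.na(arr_delay)),      # non-missing delays
    n_missing = sum(is.na(arr_delay)),     # missing delays
    carriers = n_distinct(carrier),        # distinct carriers
    delay = mean(arr_delay, na.rm = TRUE)  # mean of the known delays
  )
```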

And then can operate by groups: group_by().
 use ungroup() to remove the grouping
 also use filters and mutations on groups
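A short sketch of a grouped filter and a grouped mutate followed by ungroup(), on a made-up tibble scores.

```r
library(dplyr)

# hypothetical toy data
scores <- tibble(
  team = c("a", "a", "b", "b", "b"),
  points = c(3, 7, 2, 4, 6)
)

result <- scores %>%
  group_by(team) %>%
  filter(n() > 2) %>%                       # keep only teams with more than 2 rows
  mutate(share = points / sum(points)) %>%  # share of the team's total points
  ungroup()                                 # drop the grouping again
```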

Piping: %>%, read as “then,” combines all transformations into one code block that is easier to read:
 sequential: x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
#> `summarise()` ungrouping output (override with `.groups` argument)
delay <- filter(delay, count > 20, dest != "HNL")

## VS. Piping way
delays <- flights %>%
  group_by(dest) %>%
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")

sum of one variable (total # of miles a certain plane flew), where wt refers to the weight of the variable being counted. think of this as a sum of one variable grouped by another variable
# not_cancelled: flights with known departure and arrival delays
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
  count(tailnum, wt = distance)
#> # A tibble: 4,037 x 2
#>   tailnum      n
#>   <chr>    <dbl>
#> 1 D942DN    3418
#> 2 N0EGMQ  239143
#> 3 N10156  109664
#> 4 N102UW   25722
#> 5 N103US   24619
#> 6 N104UW   24616
#> # … with 4,031 more rows
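A minimal sketch showing that count() with wt gives the same numbers as group_by() plus a summed summarise(), using a made-up tibble trips.

```r
library(dplyr)

# hypothetical toy data: three trips by two planes
trips <- tibble(
  tailnum = c("N1", "N1", "N2"),
  distance = c(100, 250, 400)
)

weighted <- trips %>% count(tailnum, wt = distance)

# the longer spelling of the same calculation
summed <- trips %>%
  group_by(tailnum) %>%
  summarise(n = sum(distance))
```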

Handling Missing Values

NA represents unknown values, and any operation involving one will also be unknown
NA == NA will result in NA, given this same premise
use is.na(x) to determine if a value is missing
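A tiny base R sketch of how NA spreads through operations and how is.na() and na.rm deal with it; the vector x is made up.

```r
cmp <- NA > 5     # NA: unknown compared with anything is unknown
same <- NA == NA  # NA: two unknowns may or may not be equal

x <- c(1, NA, 3)
missing <- is.na(x)              # FALSE TRUE FALSE
total <- sum(x)                  # NA, because one value is unknown
total_known <- sum(x, na.rm = TRUE) # 4, missing values dropped first
```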
missing values can be preserved when filtering (filter() drops them by default):
filter(df, is.na(x) | x > 1)
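A short sketch of the difference on a made-up tibble df: filter() only keeps rows where the condition is TRUE, so NA rows disappear unless asked for explicitly.

```r
library(dplyr)

# hypothetical toy data
df <- tibble(x = c(1, NA, 3))

dropped <- filter(df, x > 1)            # keeps only 3
kept <- filter(df, is.na(x) | x > 1)    # keeps NA and 3
```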

When working with arrange(), missing values will always be sorted at the end; use an is.na() indicator to have them at the beginning:
arrange(flights, desc(is.na(dep_time)), dep_time)
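A minimal sketch of where arrange() puts missing values, on a made-up tibble df.

```r
library(dplyr)

# hypothetical toy data
df <- tibble(x = c(5, 2, NA))

asc <- arrange(df, x)                       # 2, 5, NA
dsc <- arrange(df, desc(x))                 # 5, 2, NA (NA still last)
na_first <- arrange(df, desc(is.na(x)), x)  # NA, 2, 5
```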

–
Data Exploration
–
For more resources:
 Notes taken from an online book here