[R] Convert Data between Wide and Long Format

Here I summarize how to convert data between wide and long format based on tidyverse grammar.

1. Wide to Long format

(1) spread function.

spread(data, key, value, fill)

key는 wide type으로 바꿨을 때 변수들로 바뀔 내용들이 있는 변수
value는 그 변수들의 칼럼의 값이 될 것
fill은 빈칸에 넣을 것

(2) pivot_wider() #새로 나온 spread의 업그레이드 버전. 이제 요걸 사용하도록 하자!

pivot_wider (data, names_from = 변수, values_from = 변수, values_fill = 0)

key를 names_from으로, value를 values_from, fill을 values_fill이라고 변경하여 보다 직관적인 의미를 제공하고 있다.
generate column names from multiple variables 의 경우, values_from = c(변수1, 변수2) 요렇게 하여 가능!


us_rent_income original dataset	us_rent_income 1 column wide us_rent_income %>% pivot_wider(names_from = variable, values_from = estimate)	us_rent_income 2 columns wide pivot_wider(names_from = variable, values_from = c(estimate,moe))

You can control whether 'names_from' values vary fastest or slowest relative to the 'values_from' column names using 'names_vary'. 여러개의 values_from 변수가 있을 때, 어떤 순서로 컬럼 결과를 보여줄지에 대한 옵션 구문.
- "fastest" of the form: value1_name1, value1_name2, value2_name1, value2_name2
- "slowest" of the form: value1_name1, value2_name1, value1_name2, value2_name2
When there are multiple 'names_from' or 'values_from', you can use 'names_sep' or 'names_glue' to control the output variable names
- names_sep = "." -> ex) estimate.income, estimate.rent, moe.income, moe.rent
- names_glue = "{variable}_{.value}" -> ex)income_estimate, income_moe, rent_estimate, rent_moe
Can perform aggregation with 'values_fn'
- values_fn = mean will return mean of the value per row
- values_fn = ~mean(.x, na.rm = TRUE) will return mean of the value ignoring NA rows (mean could go up)

Ref: https://tidyr.tidyverse.org/reference/pivot_wider.html

2. Long to Wide format

(1) gather function.

gather(data, key, value, na.rm)

key는 long format으로 바꿨을 때 변수들의 이름을 data로 갖는 변수의 이름
value는 변수들의 값을 넣을 변수의 이름
na.rm은 NA인 행을 없앰

(2) pivot_longer #updated approach to gather()

cols: colums to pivot into longer format. ex. cols = starts_with("wk")
names_to: a character vector specifying the new column(s) to create from the information stored in column names of data specified by cols.
names_prefix: a regular expression used to remove matching text from the start of each variable name.
values_to: a string specifying the name of the column to create from the data stored in cell values.
values_drop_na: if TRUE, will drop rows that contain only NAs in the value_to column.

who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)", values_to = "count")

*minor additional information (tibble vs. dataframe)

printing: only the first 10 rows and all the columns that fit on screen. name + type information
subsetting: no need to do drop=FALSE for maintaining dimension of subsetted dataframe.
A handful of functions are not working with tibbles as those fuctions expect df[,1] to return a vector, not a data frame. In this case use as.data.frame() to turn a tibble back to a data frame
Ref: https://www.rstudio.com/blog/tibble-1-0-0/#:~:text=Tibbles%20vs%20data%20frames,to%20work%20with%20large%20data.

'Coding 공부기록' 카테고리의 다른 글

Leetcode 뿌시기 - Tree 문제 (0)	2022.11.17
Monotonic Stack - Identify Pattern (0)	2022.10.24
타임리 부트캠프 시즌2 리트코드 숙제1 (0)	2022.09.23
Top 8 Data Structures for Coding Interviews (0)	2022.09.03
Behavior Questions Preparation (~6/28) (0)	2022.06.28

My little finding & feeling

[R] Convert Data between Wide and Long Format

'Coding 공부기록' 카테고리의 다른 글

티스토리툴바

[R] Convert Data between Wide and Long Format

'Coding 공부기록' 카테고리의 다른 글

'Coding 공부기록' Related Articles

티스토리툴바