Here I summarize how to convert data between wide and long format based on tidyverse grammar.
1. Wide to Long format
(1) spread function.
spread(data, key, value, fill)
- key는 wide type으로 바꿨을 때 변수들로 바뀔 내용들이 있는 변수
- value는 그 변수들의 칼럼의 값이 될 것
- fill은 빈칸에 넣을 것
(2) pivot_wider() #새로 나온 spread의 업그레이드 버전. 이제 요걸 사용하도록 하자!
pivot_wider (data, names_from = 변수, values_from = 변수, values_fill = 0)
- key를 names_from으로, value를 values_from, fill을 values_fill이라고 변경하여 보다 직관적인 의미를 제공하고 있다.
- generate column names from multiple variables 의 경우, values_from = c(변수1, 변수2) 요렇게 하여 가능!
us_rent_income original dataset | us_rent_income 1 column wide us_rent_income %>% pivot_wider(names_from = variable, values_from = estimate) |
us_rent_income 2 columns wide pivot_wider(names_from = variable, values_from = c(estimate,moe)) |
- You can control whether 'names_from' values vary fastest or slowest relative to the 'values_from' column names using 'names_vary'. 여러개의 values_from 변수가 있을 때, 어떤 순서로 컬럼 결과를 보여줄지에 대한 옵션 구문.
- "fastest" of the form: value1_name1, value1_name2, value2_name1, value2_name2
- "slowest" of the form: value1_name1, value2_name1, value1_name2, value2_name2
- When there are multiple 'names_from' or 'values_from', you can use 'names_sep' or 'names_glue' to control the output variable names
- names_sep = "." -> ex) estimate.income, estimate.rent, moe.income, moe.rent
- names_glue = "{variable}_{.value}" -> ex)income_estimate, income_moe, rent_estimate, rent_moe
- Can perform aggregation with 'values_fn'
- values_fn = mean will return mean of the value per row
- values_fn = ~mean(.x, na.rm = TRUE) will return mean of the value ignoring NA rows (mean could go up)
Ref: https://tidyr.tidyverse.org/reference/pivot_wider.html
2. Long to Wide format
(1) gather function.
gather(data, key, value, na.rm)
- key는 long format으로 바꿨을 때 변수들의 이름을 data로 갖는 변수의 이름
- value는 변수들의 값을 넣을 변수의 이름
- na.rm은 NA인 행을 없앰
(2) pivot_longer #updated approach to gather()
- cols: colums to pivot into longer format. ex. cols = starts_with("wk")
- names_to: a character vector specifying the new column(s) to create from the information stored in column names of data specified by cols.
- names_prefix: a regular expression used to remove matching text from the start of each variable name.
- values_to: a string specifying the name of the column to create from the data stored in cell values.
- values_drop_na: if TRUE, will drop rows that contain only NAs in the value_to column.
who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = c("diagnosis", "gender", "age"), names_pattern = "new_?(.*)_(.)(.*)", values_to = "count") |
*minor additional information (tibble vs. dataframe)
- printing: only the first 10 rows and all the columns that fit on screen. name + type information
- subsetting: no need to do drop=FALSE for maintaining dimension of subsetted dataframe.
- A handful of functions are not working with tibbles as those fuctions expect df[,1] to return a vector, not a data frame. In this case use as.data.frame() to turn a tibble back to a data frame
- Ref: https://www.rstudio.com/blog/tibble-1-0-0/#:~:text=Tibbles%20vs%20data%20frames,to%20work%20with%20large%20data.
'Coding 공부기록' 카테고리의 다른 글
Leetcode 뿌시기 - Tree 문제 (0) | 2022.11.17 |
---|---|
Monotonic Stack - Identify Pattern (0) | 2022.10.24 |
타임리 부트캠프 시즌2 리트코드 숙제1 (0) | 2022.09.23 |
Top 8 Data Structures for Coding Interviews (0) | 2022.09.03 |
Behavior Questions Preparation (~6/28) (0) | 2022.06.28 |