본문 바로가기

Coding 공부기록

[R] Convert Data between Wide and Long Format

Here I summarize how to convert data between wide and long format based on tidyverse grammar.

 

1. Wide to Long format

(1) spread function.

spread(data, key, value, fill) 

  • key는 wide type으로 바꿨을 때 변수들로 바뀔 내용들이 있는 변수
  • value는 그 변수들의 칼럼의 값이 될 것
  • fill은 빈칸에 넣을 것

(2) pivot_wider() #새로 나온 spread의 업그레이드 버전. 이제 요걸 사용하도록 하자!

pivot_wider (data, names_from = 변수, values_from = 변수, values_fill = 0)

  • key를 names_from으로, value를 values_from, fill을 values_fill이라고 변경하여 보다 직관적인 의미를 제공하고 있다.
  • generate column names from multiple variables 의 경우, values_from = c(변수1, 변수2) 요렇게 하여 가능!
us_rent_income original dataset us_rent_income 1 column wide
us_rent_income %>%
pivot_wider(names_from = variable,
values_from = estimate)
us_rent_income 2 columns wide
pivot_wider(names_from = variable,
values_from = c(estimate,moe))
  • You can control whether 'names_from' values vary fastest or slowest relative to the 'values_from' column names using 'names_vary'. 여러개의 values_from 변수가 있을 때, 어떤 순서로 컬럼 결과를 보여줄지에 대한 옵션 구문.
    • "fastest" of the form: value1_name1, value1_name2, value2_name1, value2_name2
    • "slowest" of the form: value1_name1, value2_name1, value1_name2, value2_name2
  • When there are multiple 'names_from' or 'values_from', you can use 'names_sep' or 'names_glue' to control the output variable names
    • names_sep = "." -> ex) estimate.income, estimate.rent, moe.income, moe.rent
    • names_glue = "{variable}_{.value}" -> ex)income_estimate, income_moe, rent_estimate, rent_moe
  •  Can perform aggregation with 'values_fn'
    • values_fn = mean will return mean of the value per row
    • values_fn = ~mean(.x, na.rm = TRUE) will return mean of the value ignoring NA rows (mean could go up)

Ref: https://tidyr.tidyverse.org/reference/pivot_wider.html

 

2. Long to Wide format

(1) gather function.

gather(data, key, value, na.rm)

  • key는 long format으로 바꿨을 때 변수들의 이름을 data로 갖는 변수의 이름
  • value는 변수들의 값을 넣을 변수의 이름
  • na.rm은 NA인 행을 없앰

(2) pivot_longer #updated approach to gather()

  • cols: colums to pivot into longer format. ex. cols = starts_with("wk")
  • names_to: a character vector specifying the new column(s) to create from the information stored in column names of data specified by cols.
  • names_prefix: a regular expression used to remove matching text from the start of each variable name.
  • values_to: a string specifying the name of the column to create from the data stored in cell values.
  • values_drop_na: if TRUE, will drop rows that contain only NAs in the value_to column. 


who %>% pivot_longer(cols = new_sp_m014:newrel_f65, names_to = c("diagnosis", "gender", "age"),
                                      names_pattern = "new_?(.*)_(.)(.*)", values_to = "count")

*minor additional information (tibble vs. dataframe)