How to use cast or another function to create a binary table in R
Original data:
x <- data.frame(id=c(1,1,2,3,3), region=factor(c(2,3,2,1,1)))
> x
id region
1 1 2
2 1 3
3 2 2
4 3 1
5 3 1
Group up the data:
aggregate(model.matrix(~ region - 1, data=x), x["id"], max)
Result:
id region1 region2 region3
1 1 0 1 1
2 2 0 1 0
3 3 1 0 0
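To see why this works, the one-liner can be split into its two steps. This is a minimal sketch on the same data; the intermediate names ind and res are only illustrative.

```r
# Same data as above
x <- data.frame(id = c(1, 1, 2, 3, 3), region = factor(c(2, 3, 2, 1, 1)))

# Step 1: one 0/1 indicator column per factor level, one row per observation.
# The "- 1" drops the intercept so that every level gets its own column.
ind <- model.matrix(~ region - 1, data = x)

# Step 2: collapse rows that share an id; max() yields 1 if the id was
# seen in that region at least once, and 0 otherwise.
res <- aggregate(ind, x["id"], max)
res
```

Using sum instead of max in the second step would give counts per id and region rather than a binary table.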
How to programmatically create binary columns based on a categorical variable in data.table?
data.table has its own dcast implementation, which uses data.table's internals and should be fast. Give this a try:
dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L)
# id a b c d e
# 1: 1 0 1 1 1 1
# 2: 2 1 0 1 0 1
# 3: 3 1 0 1 1 1
Just thought of another way to handle this by preallocating and updating by reference (perhaps dcast's logic should be done like this to avoid intermediates).
ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
All that's left is to fill the existing combinations with 1L.
dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
# id b d c e a
# 1: 1 1 1 1 1 0
# 2: 2 0 0 1 1 1
# 3: 3 0 1 1 1 1
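The input dt is not shown above. The sketch below uses a small hypothetical dt, reconstructed so that it is consistent with the printed result, to make the preallocate-and-update approach runnable end to end.

```r
library(data.table)

# Hypothetical input (an assumption, chosen to match the output shown above)
dt <- data.table(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L),
  y  = c("b", "d", "c", "e", "c", "e", "a", "d", "c", "e", "a")
)

# Preallocate: one row per id, one integer column of zeros per distinct y
ans <- data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]

# Fill by reference: for the .GRP-th group, set the columns named by its y values to 1L
dt[, {set(ans, i = .GRP, j = unique(y), value = 1L); NULL}, by = id]
ans
```

Note that this relies on the rows of ans being in the same order as the groups formed by by = id. Both unique() and by follow order of first appearance, so row .GRP of ans belongs to the group currently being processed.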
Okay, I've gone ahead and benchmarked on OP's data dimensions, with ~10 million rows and 10 columns.
require(data.table)
set.seed(45L)
y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))
system.time(ans1 <- AnsFunction()) # 2.3s
system.time(ans2 <- dcastFunction()) # 2.2s
system.time(ans3 <- TableFunction()) # 6.2s
setcolorder(ans1, names(ans2))
setcolorder(ans3, names(ans2))
setorder(ans1, id)
setkey(ans2, NULL)
setorder(ans3, id)
identical(ans1, ans2) # TRUE
identical(ans1, ans3) # TRUE
where,
AnsFunction <- function() {
ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
ans
# reorder columns outside
}
dcastFunction <- function() {
# no need to load reshape2. data.table has its own dcast as well
# no need for setDT
df <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L,value.var = "y")
}
TableFunction <- function() {
# need to return integer results for identical results
# fixed 1 -> 1L; as.numeric -> as.integer
df <- as.data.frame.matrix(table(dt$id, dt$y))
df[df > 1L] <- 1L
df <- cbind(id = as.integer(row.names(df)), df)
setDT(df)
}
R, change a character string in dataframe to binary values
Just use ifelse
#your data
data = data.frame(Landtype = c("Rural", "Urban", "Rural", "Urban"))
#ifelse condition
data$Landtype = ifelse(data$Landtype == "Rural", 1,0)
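As a possible alternative, the comparison itself can be coerced to an integer. The sketch below writes the result to a new column (is_rural is a name made up for this example) so the original labels are kept.

```r
# Same data as above
data <- data.frame(Landtype = c("Rural", "Urban", "Rural", "Urban"))

# The comparison yields TRUE/FALSE; as.integer() maps these to 1/0
data$is_rural <- as.integer(data$Landtype == "Rural")
data
```

Keeping the original column makes it easy to check the coding afterwards.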
selecting columns using a binary table in R
You can use apply to iterate over the columns of a binary matrix, bin, subsetting a data frame, dat:
# create test data
set.seed(1)
dat <- as.data.frame(matrix(rnorm(18), nrow=2))
colnames(dat) <- paste0('c', 1:9)
dat
# c1 c2 c3 c4 c5 c6 c7 c8
# 1 -0.6264538 -0.8356286 0.3295078 0.4874291 0.5757814 1.5117812 -0.6212406 1.12493092
# 2 0.1836433 1.5952808 -0.8204684 0.7383247 -0.3053884 0.3898432 -2.2146999 -0.04493361
# c9
# 1 -0.01619026
# 2 0.94383621
bin <- matrix(sample(0:1, 27, replace = TRUE), nrow = 9)
bin
# [,1] [,2] [,3]
# [1,] 1 1 0
# [2,] 0 0 0
# [3,] 1 0 0
# [4,] 0 1 1
# [5,] 1 1 1
# [6,] 1 0 0
# [7,] 1 1 1
# [8,] 1 0 0
# [9,] 1 0 0
# subset columns of dat, using binary vector columns defined in bin;
# drop = FALSE is included to prevent any columns with only a single "1" from
# being cast to a vector
apply(bin, 2, function(x) { dat[, as.logical(x), drop = FALSE] })
# [[1]]
# c1 c3 c5 c6 c7 c8 c9
# 1 -0.6264538 0.3295078 0.5757814 1.5117812 -0.6212406 1.12493092 -0.01619026
# 2 0.1836433 -0.8204684 -0.3053884 0.3898432 -2.2146999 -0.04493361 0.94383621
#
# [[2]]
# c1 c4 c5 c7
# 1 -0.6264538 0.4874291 0.5757814 -0.6212406
# 2 0.1836433 0.7383247 -0.3053884 -2.2146999
#
# [[3]]
# c4 c5 c7
# 1 0.4874291 0.5757814 -0.6212406
# 2 0.7383247 -0.3053884 -2.2146999
#
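Because the example above depends on random draws, here is a smaller sketch with hand-written values where the selection is easy to verify by eye. The data and the name subsets are illustrative only.

```r
# Three columns of data, two selection vectors (one per column of bin)
dat <- data.frame(c1 = 1:2, c2 = 3:4, c3 = 5:6)
bin <- matrix(c(1, 0, 1,    # first selector: keep c1 and c3
                0, 1, 0),   # second selector: keep c2 only
              nrow = 3)

# One subset of dat per column of bin
subsets <- apply(bin, 2, function(x) dat[, as.logical(x), drop = FALSE])

names(subsets[[1]])  # "c1" "c3"
names(subsets[[2]])  # "c2"
```

The second element is still a data frame, even though it has a single column, because of drop = FALSE.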
R - Function to make a binary variable
You can use:
df[] <- +(df == 4 | df == 5)
df
# var1 var2 var3
#1 0 0 NA
#2 1 0 1
#3 0 1 1
#4 0 1 0
The comparison df == 4 | df == 5 returns logical values (TRUE/FALSE); the unary + then turns those logical values into integer values (1/0), respectively.
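The same mechanics can be seen on a plain vector. This is a small sketch with made-up values; it also shows how NA is handled, and %in% as one possible alternative if NA should be coded as 0 instead.

```r
v <- c(1L, 4L, NA, 5L)

# Comparisons with NA give NA, so the logical result keeps the NA
flag <- v == 4 | v == 5

# Unary plus coerces logical to integer; NA stays NA
plus_coded <- +flag

# %in% never returns NA, so missing values become 0 here
in_coded <- as.integer(v %in% c(4, 5))

plus_coded
in_coded
```

Which behaviour is preferable depends on whether missing values should stay missing in the binary variable.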
If you want to apply this for selected columns you can subset the columns by position or by name.
cols <- 1:3 #Position
#cols <- grep('var', names(df)) #Name
df[cols] <- +(df[cols] == 4 | df[cols] == 5)
As far as your function is concerned, you can do:
making_binary <- function (var){
var <- as.integer(var >= 4)
#which is faster version of
#var <- ifelse(var >= 4, 1, 0)
return(var)
}
df[] <- lapply(df, making_binary)
data
df <- structure(list(var1 = c(1L, 4L, 3L, 2L), var2 = c(1L, 3L, 4L,
5L), var3 = c(NA, 4L, 5L, 3L)), class = "data.frame", row.names = c(NA, -4L))
Reshape data in R, cast function arguments
The OP asked for help with the arguments to the cast() function of the reshape package. However, the reshape package was superseded by the reshape2 package from the same package author. According to the package description, the reshape2 package is
A Reboot of the Reshape Package
Using reshape2, the desired result can be produced with
reshape2::dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length,
value.var = "TARGET_TYPE")
# PARENT_MOL_CHEMBL_ID ABL EGFR TP53
#1 C10 1 1 0
#2 C939 0 0 1
BTW: The data.table package has implemented (and enhanced) dcast() as well. So, the same result can be produced with
data.table::dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length,
value.var = "TARGET_TYPE")
Additional columns
The OP mentioned other columns in the data frame which should be shown together with the spread or wide data. Unfortunately, the OP hasn't supplied particular sample data, so we have to consider two use cases.
Case 1: Additional columns go along with the id column
The data could look like
wc
# PARENT_MOL_CHEMBL_ID TARGET_TYPE extra_col1
#1 C10 ABL a
#2 C10 EGFR a
#3 C939 TP53 b
Note that the values in extra_col1 are in line with PARENT_MOL_CHEMBL_ID.
This is an easy case, because the formula in dcast() accepts ... which represents all other variables not used in the formula:
reshape2::dcast(wc, ... ~ TARGET_TYPE, fun.aggregate = length,
value.var = "TARGET_TYPE")
# PARENT_MOL_CHEMBL_ID extra_col1 ABL EGFR TP53
#1 C10 a 1 1 0
#2 C939 b 0 0 1
The resulting data.frame does contain all other columns.
Case 2: Additional columns don't go along with the id column
Now, another column is added:
wc
# PARENT_MOL_CHEMBL_ID TARGET_TYPE extra_col1 extra_col2
#1 C10 ABL a 1
#2 C10 EGFR a 2
#3 C939 TP53 b 3
Note that extra_col2 has two different values for C10. This will cause the simple approach to fail. So, a two-step approach has to be implemented: reshaping first and joining afterwards with the original data frame. The data.table package is now used for both steps:
library(data.table)
# reshape from long to wide, result has only one row per id column
wide <- dcast(setDT(wc), PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE, fun.aggregate = length,
value.var = "TARGET_TYPE")
# right join, i.e., all rows of wc are included
wide[wc, on = "PARENT_MOL_CHEMBL_ID"]
# PARENT_MOL_CHEMBL_ID ABL EGFR TP53 TARGET_TYPE extra_col1 extra_col2
#1: C10 1 1 0 ABL a 1
#2: C10 1 1 0 EGFR a 2
#3: C939 0 0 1 TP53 b 3
The result shows the aggregated values in wide format together with any other columns.
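For a self-contained run of the two-step approach, wc can be rebuilt from the Case 2 table printed above. This is a sketch under that assumption; the column types are guessed (character ids and an integer extra_col2), and joined is an illustrative name.

```r
library(data.table)

# wc rebuilt from the printed Case 2 table (column types are an assumption)
wc <- data.table(
  PARENT_MOL_CHEMBL_ID = c("C10", "C10", "C939"),
  TARGET_TYPE          = c("ABL", "EGFR", "TP53"),
  extra_col1           = c("a", "a", "b"),
  extra_col2           = 1:3
)

# Step 1: long to wide, one row per id, counting occurrences of each target type
wide <- dcast(wc, PARENT_MOL_CHEMBL_ID ~ TARGET_TYPE,
              fun.aggregate = length, value.var = "TARGET_TYPE")

# Step 2: join back so that every row of wc gets the wide columns
joined <- wide[wc, on = "PARENT_MOL_CHEMBL_ID"]
joined
```

The joined result has as many rows as wc, so the wide indicator columns are repeated for ids that occur more than once.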
How to convert two character columns to a binary matrix?
You can use:
library(tidyverse)
df %>%
  # name id_cols explicitly; recent tidyr versions no longer accept it positionally
  pivot_wider(id_cols = y,
              names_from = x,
              values_from = x,
              values_fn = list(x = length),
              values_fill = list(x = 0))
y A B C
<chr> <int> <int> <int>
1 m 1 0 0
2 n 1 0 0
3 o 0 1 0
4 p 0 0 1
5 q 0 0 1
6 r 0 0 1
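The input df is not shown. As a rough cross-check that does not need the tidyverse, the sketch below rebuilds a hypothetical df that is consistent with the output above and produces the same 0/1 layout with base R's table(). This swaps in a different technique and is not the pivot_wider() call itself.

```r
# Hypothetical input (an assumption, reconstructed from the printed result)
df <- data.frame(
  x = c("A", "A", "B", "C", "C", "C"),
  y = c("m", "n", "o", "p", "q", "r")
)

# Cross-tabulate y against x, then turn counts into 0/1 indicators
wide <- as.data.frame.matrix(table(df$y, df$x))
wide[] <- lapply(wide, function(col) as.integer(col > 0))

# Move the row names into a regular y column
wide <- cbind(y = rownames(wide), wide)
rownames(wide) <- NULL
wide
```

Unlike values_fn = length, this caps repeated pairs at 1, which may or may not be what is wanted when the same x and y combination occurs more than once.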