R data.table duplicate rows with a pair of columns
The linked answer (https://stackoverflow.com/a/25151395/496803) is nearly a duplicate, and so is https://stackoverflow.com/a/25298863/496803, but here goes again, with a slight twist:
dt[!duplicated(data.table(pmin(Gene1,Gene2),pmax(Gene1,Gene2)))]
#    Gene1 Gene2          Ens.ID.1         Ens.ID.2      CORR
#1:  FOXA1   MYC ENSG000000129.13. ENSG000000129.11 0.9953311
#2:   EGFR   CD4     ENSG000000129 ENSG000000129.12 0.9947215
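To try this on data you can run, here is a minimal sketch on a made-up table. The name `dt_toy` and its values are hypothetical, not the asker's data. `pmin` and `pmax` put each pair into a canonical order, so `A, B` and `B, A` compare as equal.

```r
library(data.table)

# Hypothetical toy data: rows 1 and 3 hold the same pair in swapped order
dt_toy <- data.table(
  Gene1 = c("FOXA1", "EGFR", "MYC"),
  Gene2 = c("MYC",   "CD4",  "FOXA1"),
  CORR  = c(0.99,    0.98,   0.99)
)

# Canonical order per row: the smaller name goes in 'lo', the larger in 'hi'
key_cols <- dt_toy[, .(lo = pmin(Gene1, Gene2), hi = pmax(Gene1, Gene2))]

# Keep only the first occurrence of each unordered pair
dedup <- dt_toy[!duplicated(key_cols)]
print(dedup)
```

The original columns are returned untouched. The reordered pair is only used to build the logical index.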
If you have more than two keys to de-duplicate by, you are probably best off converting to long format, sorting the values within each row, casting back to wide format and then de-duplicating, like so:
dupvars <- c("Gene1","Gene2")
sel <- !duplicated(
  dcast(
    melt(dt[, c(.SD, id=.(.I)), .SDcols=dupvars], id.vars="id")[
      order(id,value), grp := seq_len(.N), by=id],
    id ~ grp
  )[,-1])
dt[sel,]
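The long/sort/wide steps can be easier to follow when written out one at a time. The sketch below assumes a made-up table `dt3` with three hypothetical key columns `K1`, `K2`, `K3`. It uses `setorder` for the sorting step instead of the chained `order(...)` call, which is a substitution made here for readability.

```r
library(data.table)

# Hypothetical data: rows 1-3 contain the same three keys in different order
dt3 <- data.table(
  K1  = c("a", "b", "c", "x"),
  K2  = c("b", "c", "a", "y"),
  K3  = c("c", "a", "b", "z"),
  val = 1:4
)
dupvars <- c("K1", "K2", "K3")

# 1. Long format: one row per (row id, key value)
long <- melt(dt3[, c(list(id = .I), .SD), .SDcols = dupvars], id.vars = "id")

# 2. Sort the key values within each original row and number them
setorder(long, id, value)
long[, grp := seq_len(.N), by = id]

# 3. Back to wide: columns now hold the sorted keys, one row per id
wide <- dcast(long, id ~ grp, value.var = "value")

# 4. Flag the first occurrence of each sorted key combination
sel <- !duplicated(wide[, !"id"])
result <- dt3[sel]
print(result)
```

Because `dcast` returns one row per `id` in ascending order, `sel` lines up with the rows of `dt3`.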
Keep first row by multiple columns in an R data.table
data.table provides S3 methods for unique, duplicated and anyDuplicated. The call
unique(dt, by = c('x','y'))
will give you what you want.
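As a small illustration, assuming a made-up table `dt_xy` whose columns `x`, `y` and `z` are hypothetical, the three methods behave like this with the `by` argument:

```r
library(data.table)

# Hypothetical data: 'x' and 'y' identify a group, 'z' varies within it
dt_xy <- data.table(
  x = c(1, 1, 2, 2),
  y = c("a", "a", "b", "c"),
  z = c(10, 20, 30, 40)
)

# First row of each distinct (x, y) combination
first_rows <- unique(dt_xy, by = c("x", "y"))
print(first_rows)

# Logical mask: TRUE where (x, y) repeats an earlier row
dup_mask <- duplicated(dt_xy, by = c("x", "y"))

# Index of the first duplicated row, or 0 if there is none
first_dup <- anyDuplicated(dt_xy, by = c("x", "y"))
```

Only the second row repeats an earlier `(x, y)` pair, so it is the one that gets dropped.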
Extracting unique rows from a data table in R
Before data.table v1.9.8, the default behavior of the unique.data.table method was to use the key to determine the columns by which the unique combinations should be returned. If the key was NULL (the default), one would get the original data set back (as in the OP's situation).
As of data.table 1.9.8+, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) to unique (replacing DT in the call to key with the name of the data.table).
Hence, the old behavior would be something like
library(data.table) # v1.9.7 and earlier
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3
While for data.table v1.9.8+, just
b <- data.table(a)
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # in case you have keys and want to use them
Or without a copy
setDT(a)
dim(unique(a))
## [1] 8 3
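In the example above every column is part of the key, so the two defaults return the same result. The difference only shows when the key covers a subset of the columns. Here is a minimal sketch, assuming data.table 1.9.8 or later and a made-up table `kd` keyed on a hypothetical column `g`:

```r
library(data.table)

# Hypothetical table keyed on 'g' only; 'v' differs within the first group
kd <- data.table(g = c(1, 1, 2), v = c("a", "b", "c"), key = "g")

# Default in data.table >= 1.9.8: all columns are compared
n_all <- nrow(unique(kd))

# Key columns only, as older versions did by default
n_key <- nrow(unique(kd, by = key(kd)))

print(c(all_columns = n_all, key_only = n_key))
```

No two rows match on both columns, so the default keeps all of them, while `by = key(kd)` keeps one row per value of `g`.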
Removing duplicate rows from data frame in R
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), group by pmin(A, B) and pmax(A, B), and if the number of rows in a group is greater than 1, take the first row; otherwise return the rows as they are.
library(data.table)
setDT(df1)[, if(.N >1) head(.SD, 1) else .SD ,.(A=pmin(A, B), B= pmax(A, B))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4
Or we can just use duplicated on the pmax, pmin output to return a logical index and subset the data based on that.
setDT(df1)[!duplicated(cbind(pmax(A, B), pmin(A, B)))]
# A B prob
#1: 1 2 0.1
#2: 1 3 0.2
#3: 1 4 0.3
#4: 2 3 0.1
#5: 2 4 0.4
Unique rows, considering two columns, in R, without order
There are lots of ways to do this; here is one:
unique(t(apply(df, 1, sort)))
duplicated(t(apply(df, 1, sort)))
The first gives the unique rows (with the values sorted within each row); the second gives the logical mask of duplicated rows.
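A short base R sketch of the same idea, assuming a made-up two-column numeric data frame `df_toy`, shows how the mask can be used to keep the original rows rather than the sorted ones. Note that `apply` converts the data frame to a matrix first, so columns of mixed types would be coerced to character.

```r
# Hypothetical data: rows 1 and 3 hold the same pair, rows 2 and 4 are identical
df_toy <- data.frame(a = c(1, 3, 2, 3), b = c(2, 4, 1, 4))

# Sort within each row, then transpose back to one row per original row
sorted_pairs <- t(apply(df_toy, 1, sort))

# TRUE where a row repeats an earlier unordered pair
mask <- duplicated(sorted_pairs)

# Original rows, keeping the first occurrence of each pair
df_unique <- df_toy[!mask, ]
print(df_unique)
```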