Creating co-occurrence matrix
I'd use a combination of the reshape2 package and matrix algebra:
#read in your data
dat <- read.table(text="TrxID Items Quant
Trx1 A 3
Trx1 B 1
Trx1 C 1
Trx2 E 3
Trx2 B 1
Trx3 B 1
Trx3 C 4
Trx4 D 1
Trx4 E 1
Trx4 A 1
Trx5 F 5
Trx5 B 3
Trx5 C 2
Trx5 D 1", header=TRUE)
#making the boolean matrix
library(reshape2)
dat2 <- melt(dat, id.vars = c("TrxID", "Items")) #state the id columns explicitly
w <- dcast(dat2, Items~TrxID)
x <- as.matrix(w[,-1])
x[is.na(x)] <- 0
x <- apply(x, 2, function(x) as.numeric(x > 0)) #recode as 0/1
v <- x %*% t(x) #the magic matrix
diag(v) <- 0 #replace diagonal
dimnames(v) <- list(w[, 1], w[,1]) #name the dimensions
v
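For readers working in Python, the same two steps (build a 0/1 item-by-transaction matrix, then multiply it by its transpose) can be sketched with pandas. This is only an illustrative translation of the R code above, not part of the original answer; the variable names simply mirror the R ones.

```python
import numpy as np
import pandas as pd

# Same transactions as in the R example above
dat = pd.DataFrame({
    "TrxID": ["Trx1"] * 3 + ["Trx2"] * 2 + ["Trx3"] * 2 + ["Trx4"] * 3 + ["Trx5"] * 4,
    "Items": list("ABC") + list("EB") + list("BC") + list("DEA") + list("FBCD"),
    "Quant": [3, 1, 1, 3, 1, 1, 4, 1, 1, 1, 5, 3, 2, 1],
})

# Items x transactions table of counts, recoded as 0/1
x = (pd.crosstab(dat["Items"], dat["TrxID"]) > 0).astype(int)

# Equivalent of x %*% t(x): number of transactions shared by each pair of items
v = x.dot(x.T)

# Zero the diagonal on an explicit copy of the underlying array
arr = v.to_numpy(copy=True)
np.fill_diagonal(arr, 0)
v = pd.DataFrame(arr, index=v.index, columns=v.index)
print(v)
```

The quantities are ignored here, as in the R version: only the presence of an item in a transaction counts.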
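For the graph, maybe something like this with igraph:
library(igraph)
# graph.adjacency() is called graph_from_adjacency_matrix() in newer igraph releases
g <- graph.adjacency(v, weighted=TRUE, mode ='undirected')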
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
plot(g)
Constructing a co-occurrence matrix in python pandas
It's simple linear algebra: you multiply the transpose of the matrix by the matrix itself (your example contains strings, so don't forget to convert them to integers):
>>> df_asint = df.astype(int)
>>> coocc = df_asint.T.dot(df_asint)
>>> coocc
       Dop  Snack  Trans
Dop      4      2      3
Snack    2      3      2
Trans    3      2      4
If, as in the R answer, you want to reset the diagonal, you can use numpy's fill_diagonal:
>>> import numpy as np
>>> np.fill_diagonal(coocc.values, 0)
>>> coocc
       Dop  Snack  Trans
Dop      0      2      3
Snack    2      0      2
Trans    3      2      0
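Since the input frame is not shown in this answer, here is a self-contained sketch of the whole sequence. The six rows are invented purely so that the counts match the output above, and they are assumed to hold "1"/"0" strings. The diagonal is reset on an explicit copy because, depending on the pandas version and settings, the array returned by .values may not be writable.

```python
import numpy as np
import pandas as pd

# Hypothetical input: one row per record, flags stored as "1"/"0" strings
df = pd.DataFrame(
    {
        "Dop":   ["1", "1", "1", "1", "0", "0"],
        "Snack": ["1", "1", "0", "0", "1", "0"],
        "Trans": ["1", "1", "1", "0", "0", "1"],
    }
)

# Strings must become integers before the matrix product
df_asint = df.astype(int)
coocc = df_asint.T.dot(df_asint)

# Reset the diagonal on a copy, then rebuild the labelled frame
arr = coocc.to_numpy(copy=True)
np.fill_diagonal(arr, 0)
coocc_nodiag = pd.DataFrame(arr, index=coocc.index, columns=coocc.columns)

print(coocc)
print(coocc_nodiag)
```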
How to calculate a (co-)occurrence matrix from a data frame with several columns using R?
There may be better ways to do this, but try:
library(tidyverse)
df1 <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  xtabs(~ID + Country, data = ., sparse = FALSE) %>%
  crossprod(., .)
df_diag <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  mutate(Country2 = Country) %>%
  xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
  diag()
diag(df1) <- df_diag
df1
Country   China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1
How to create a co-occurrence matrix calculated from combinations by ID/row in R?
DATA
I modified your data so that it better represents your actual situation.
#    ID    CTR1    CTR2    CTR3  CTR4    CTR5    CTR6
# 1:  1 England England England China     USA England
# 2:  2 England   China   China   USA England   China
# 3:  3 England   China   China   USA     USA     USA
# 4:  4   China England England China     USA England
# 5:  5  Sweden    <NA>    <NA>  <NA>            <NA>
df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England",
"England", "China", "Sweden"), CTR2 = c("England", "China", "China",
"England", NA), CTR3 = c("England", "China", "China", "England",
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA",
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA",
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA,
-5L))
UPDATE
After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.
library(data.table)
# Transform the data to long format. Remove rows whose value is an empty string ("") or NA.
melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo
# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2
# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Following the question linked above, create the matrix with crossprod().
crossprod(table(foo2[, c(1,3)])) -> mymat
# Finally, adjust the diagonal values. If a value is less than or equal to one,
# change it to zero. Otherwise, keep the original diagonal value.
diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))
#         value
# value    China England Sweden USA
#   China      4       4      0   4
#   England    4       4      0   4
#   Sweden     0       0      0   0
#   USA        4       4      0   4
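The same pipeline can be sketched in pandas for comparison. This is an illustrative translation of the data.table steps above (long format, distinct countries per ID, cross product, diagonal adjustment), not part of the original answer; names such as `long` and `inc` are made up for the sketch.

```python
import numpy as np
import pandas as pd

# Same data as the R `df` above ("" and None mark missing countries)
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "CTR1": ["England", "England", "England", "China", "Sweden"],
    "CTR2": ["England", "China", "China", "England", None],
    "CTR3": ["England", "China", "China", "England", None],
    "CTR4": ["China", "USA", "USA", "China", None],
    "CTR5": ["USA", "England", "USA", "USA", ""],
    "CTR6": ["England", "China", "USA", "England", None],
})

# Long format, dropping missing and empty values
long = df.melt(id_vars="ID", value_name="value").dropna(subset=["value"])
long = long[long["value"].str.len() > 0]

# Keep one row per (ID, country) pair
long = long.drop_duplicates(subset=["ID", "value"])

# ID x country incidence table, then t(X) %*% X as crossprod() does
inc = pd.crosstab(long["ID"], long["value"])
mymat = inc.T.dot(inc)

# Diagonal entries <= 1 (country seen in a single row only) become 0
arr = mymat.to_numpy(copy=True)
d = np.diag(arr).copy()
np.fill_diagonal(arr, np.where(d <= 1, 0, d))
mymat = pd.DataFrame(arr, index=mymat.index, columns=mymat.columns)
print(mymat)
```

With this data the result agrees with the R output: China, England and USA co-occur in four rows each, and Sweden, which appears alone in one row, ends up with zeros everywhere.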
Creating a co-occurrence matrix
You can do this in a straightforward way using OneHotEncoder() and a dot product:
- Turn each element in the dataframe into a string
- Use a one-hot encoder to convert the dataframe into one-hot vectors over the unique vocabulary of categorical elements
- Take the dot product of the one-hot matrix with itself to get the co-occurrence counts
- Recreate a dataframe from the co-occurrence matrix and the feature names from the one-hot encoder
# assuming this is your dataset
                 0               1                2             3
0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = df.astype(str) # turn each element into a string
# get the one-hot representation of the dataframe (a sparse matrix)
l = OneHotEncoder()
data = l.fit_transform(df.values)
# get the co-occurrence matrix as a dot product; the sparse matrix's own
# dot() is used because np.dot() does not reliably handle sparse input
co_occurance = data.T.dot(data)
# get the vocab (columns and index) for the co-occurrence matrix
# the feature names carry a prefix such as "x0_", which is stripped here for readability
# (get_feature_names_out() replaces get_feature_names() in recent scikit-learn versions)
vocab = [i[3:] for i in l.get_feature_names_out()]
# create the co-occurrence dataframe
ddf = pd.DataFrame(co_occurance.toarray(), columns=vocab, index=vocab)
print(ddf)
                 (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  \
(-1.774, 1.145]              2.0             2.0              1.0
(-3.21, 0.533]               2.0             2.0              1.0
(0.0166, 2.007]              1.0             1.0              1.0
(2.007, 3.993]               1.0             1.0              0.0
(2.0, 3.997]                 2.0             2.0              1.0
                 (2.007, 3.993]  (2.0, 3.997]
(-1.774, 1.145]             1.0           2.0
(-3.21, 0.533]              1.0           2.0
(0.0166, 2.007]             0.0           1.0
(2.007, 3.993]              1.0           1.0
(2.0, 3.997]                1.0           2.0
As you can verify from the output above, it's exactly what the co-occurrence matrix should be.
The advantages of this approach are that you can scale it using the transform method of the one-hot encoder object, and that most of the processing happens in sparse matrices until the final step of creating the dataframe, so it's memory efficient.