Creating Co-Occurrence Matrix

Creating co-occurrence matrix

I'd use a combination of the reshape2 package and matrix algebra:

#read in your data
dat <- read.table(text="TrxID Items Quant
Trx1 A 3
Trx1 B 1
Trx1 C 1
Trx2 E 3
Trx2 B 1
Trx3 B 1
Trx3 C 4
Trx4 D 1
Trx4 E 1
Trx4 A 1
Trx5 F 5
Trx5 B 3
Trx5 C 2
Trx5 D 1", header=TRUE)

#making the boolean matrix   
library(reshape2)
dat2 <- melt(dat, id.vars = c("TrxID", "Items"))  #long format; Quant becomes "value"
w <- dcast(dat2, Items~TrxID, value.var = "value")
x <- as.matrix(w[,-1])
x[is.na(x)] <- 0
x <- apply(x, 2,  function(x) as.numeric(x > 0))  #recode as 0/1
v <- x %*% t(x)                                   #the magic matrix 
diag(v) <- 0                                      #replace diagonal
dimnames(v) <- list(w[, 1], w[,1])                #name the dimensions
v
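
For readers who work in Python, the same steps (build a 0/1 item-by-transaction matrix, multiply it by its transpose, zero the diagonal) can be sketched with pandas. This is only a rough equivalent of the R code above, not part of the original answer; the variable names `incidence` and `cooc` are my own.

```python
import numpy as np
import pandas as pd

# Same transactions as the R example above, as (TrxID, Items) pairs.
pairs = [
    ("Trx1", "A"), ("Trx1", "B"), ("Trx1", "C"),
    ("Trx2", "E"), ("Trx2", "B"),
    ("Trx3", "B"), ("Trx3", "C"),
    ("Trx4", "D"), ("Trx4", "E"), ("Trx4", "A"),
    ("Trx5", "F"), ("Trx5", "B"), ("Trx5", "C"), ("Trx5", "D"),
]
dat = pd.DataFrame(pairs, columns=["TrxID", "Items"])

# Item-by-transaction incidence matrix, recoded as 0/1.
incidence = (pd.crosstab(dat["Items"], dat["TrxID"]) > 0).astype(int)

# Item-by-item co-occurrence counts (x %*% t(x) in the R code).
cooc = incidence.dot(incidence.T)

# Zero the diagonal by multiplying element-wise with (1 - identity).
cooc = cooc * (1 - np.eye(len(cooc), dtype=int))
print(cooc)
```

Each off-diagonal entry counts the transactions that contain both items, so B and C, which appear together in Trx1, Trx3 and Trx5, get a count of 3.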

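For graphing, something like this with the igraph package might work:

library(igraph)
# newer igraph versions name this function graph_from_adjacency_matrix()
g <- graph.adjacency(v, weighted=TRUE, mode ='undirected')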
g <- simplify(g)
# set labels and degrees of vertices
V(g)$label <- V(g)$name
V(g)$degree <- degree(g)
plot(g)

Constructing a co-occurrence matrix in python pandas

This is simple linear algebra: you multiply the transpose of the matrix by the matrix itself. (Your example contains strings, so don't forget to convert them to integers first.)

>>> df_asint = df.astype(int)
>>> coocc = df_asint.T.dot(df_asint)
>>> coocc
       Dop  Snack  Trans
Dop      4      2      3
Snack    2      3      2
Trans    3      2      4
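
To see why the product gives co-occurrence counts, here is a small self-contained sketch. The data below is invented for illustration (it is not the asker's data); only the column names are borrowed from the output above. Entry (i, j) of the product is the sum over rows of column i times column j, which for 0/1 data is the number of rows where both are 1.

```python
import pandas as pd

# Hypothetical data: "1"/"0" strings marking which labels are present in each row.
df = pd.DataFrame({
    "Dop":   ["1", "1", "0", "1"],
    "Snack": ["0", "1", "1", "1"],
    "Trans": ["1", "0", "0", "1"],
})

df_asint = df.astype(int)            # strings -> integers
coocc = df_asint.T.dot(df_asint)     # label-by-label co-occurrence counts

# Brute-force check for one pair: rows where both Dop and Snack are 1.
both = ((df_asint["Dop"] == 1) & (df_asint["Snack"] == 1)).sum()
print(coocc)
print(both == coocc.loc["Dop", "Snack"])
```

The diagonal holds the number of rows in which each label occurs at all, which is why some answers reset it.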

If, as in the R answer, you want to reset the diagonal, you can use NumPy's fill_diagonal:

>>> import numpy as np
>>> np.fill_diagonal(coocc.values, 0)
>>> coocc
       Dop  Snack  Trans
Dop      0      2      3
Snack    2      0      2
Trans    3      2      0
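
One caveat: filling `coocc.values` in place relies on `.values` handing back a writable view of the frame's data, which may not hold in every pandas setup (for example with copy-on-write enabled). A more defensive sketch, using the matrix printed above as input, works on an explicit copy and rebuilds the frame; the names `values` and `no_diag` are my own.

```python
import numpy as np
import pandas as pd

# The co-occurrence matrix printed above.
labels = ["Dop", "Snack", "Trans"]
coocc = pd.DataFrame([[4, 2, 3], [2, 3, 2], [3, 2, 4]], index=labels, columns=labels)

values = coocc.to_numpy().copy()     # explicit, writable copy
np.fill_diagonal(values, 0)          # zero the diagonal on the copy
no_diag = pd.DataFrame(values, index=coocc.index, columns=coocc.columns)
print(no_diag)
```

This leaves the original `coocc` untouched, which is handy if you still need the diagonal counts later.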

How to calculate a (co-)occurrence matrix from a data frame with several columns using R?

There may be better ways to do this, but try:

library(tidyverse)

df1 <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  xtabs(~ID + Country, data = ., sparse = FALSE) %>%
  crossprod(., .)

df_diag <- df %>%
  pivot_longer(-ID, names_to = "Category", values_to = "Country") %>%
  mutate(Country2 = Country) %>%
  xtabs(~Country + Country2, data = ., sparse = FALSE) %>%
  diag()

diag(df1) <- df_diag 

df1

Country   China England Greece USA
  China       2       2      2   0
  England     2       6      1   1
  Greece      2       1      3   1
  USA         0       1      1   1

How to create a co-occurrence matrix calculated from combinations by ID/row in R?

DATA

I modified your data so that it better represents your actual situation.

#   ID    CTR1    CTR2    CTR3  CTR4    CTR5    CTR6
#1:  1 England England England China     USA England
#2:  2 England   China   China   USA England   China
#3:  3 England   China   China   USA     USA     USA
#4:  4   China England England China     USA England
#5:  5  Sweden    <NA>    <NA>  <NA>            <NA>


df <- structure(list(ID = c(1, 2, 3, 4, 5), CTR1 = c("England", "England", 
"England", "China", "Sweden"), CTR2 = c("England", "China", "China", 
"England", NA), CTR3 = c("England", "China", "China", "England", 
NA), CTR4 = c("China", "USA", "USA", "China", NA), CTR5 = c("USA", 
"England", "USA", "USA", ""), CTR6 = c("England", "China", "USA", 
"England", NA)), class = c("data.table", "data.frame"), row.names = c(NA, 
-5L))

UPDATE

After seeing the OP's previous question, I got a clear picture in my mind. I think this is what you want, Seb.

library(data.table)

# Transform the data to long format. Remove rows whose value is an empty string (i.e., "") or NA.

melt(setDT(df), id.vars = "ID", measure = patterns("^CTR"))[nchar(value) > 0 & complete.cases(value)] -> foo

# Get distinct value (country) in each ID group (each row)
unique(foo, by = c("ID", "value")) -> foo2

# https://stackoverflow.com/questions/13281303/creating-co-occurrence-matrix
# Seeing this question, you want to create a matrix with crossprod().

crossprod(table(foo2[, c(1,3)])) -> mymat

# Finally, you need to change diagonal values. If a value is equal to one,
# change it to zero. Otherwise, keep the original value.

diag(mymat) <- ifelse(diag(mymat) <= 1, 0, diag(mymat))

#value
#value     China England Sweden USA
#China       4       4      0   4
#England     4       4      0   4
#Sweden      0       0      0   0
#USA         4       4      0   4
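
For comparison, the same pipeline can be sketched in Python with pandas. This is my own rough translation under the same assumptions as the R code (drop NA and empty strings, count each country once per ID, zero any diagonal entry that is 1 or less), not something from the original answer.

```python
import numpy as np
import pandas as pd

# Same data as the R structure() call above.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "CTR1": ["England", "England", "England", "China", "Sweden"],
    "CTR2": ["England", "China", "China", "England", None],
    "CTR3": ["England", "China", "China", "England", None],
    "CTR4": ["China", "USA", "USA", "China", None],
    "CTR5": ["USA", "England", "USA", "USA", ""],
    "CTR6": ["England", "China", "USA", "England", None],
})

# Long format, dropping missing values and empty strings.
long = df.melt(id_vars="ID", value_name="country").dropna(subset=["country"])
long = long[long["country"] != ""]

# Keep each country at most once per ID.
long = long.drop_duplicates(subset=["ID", "country"])

# ID-by-country incidence table, then country-by-country counts (crossprod in R).
incidence = pd.crosstab(long["ID"], long["country"])
mymat = incidence.T.dot(incidence)

# Diagonal rule: a count of 1 or less becomes 0, larger counts are kept.
values = mymat.to_numpy().copy()
diag = np.diag(values).copy()
diag[diag <= 1] = 0
np.fill_diagonal(values, diag)
mymat = pd.DataFrame(values, index=mymat.index, columns=mymat.columns)
print(mymat)
```

The result should match the R matrix above: 4 for every pairing among China, England and USA, and 0 everywhere in the Sweden row and column.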

Creating a co-occurrence matrix

You can do this in a straightforward way using OneHotEncoder() and a matrix product:

1. Turn each element in the dataframe into a string
2. Use a one-hot encoder to convert the dataframe into one-hots over a unique vocabulary of the categorical elements
3. Take a dot product with itself to get the co-occurrence counts
4. Recreate a dataframe using the co-occurrence matrix and the feature names from the one-hot encoder

#assuming this is your dataset
                 0               1                2             3
0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]

import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = df.astype(str) #turn each element to string

#get one hot representation of the dataframe
l = OneHotEncoder()
data = l.fit_transform(df.values)  #sparse matrix, one column per (column, value) pair

#get co-occurrence matrix using a sparse matrix product
co_occurance = data.T @ data

#get vocab (columns and indexes) for co-occurrence matrix
#feature names carry a column prefix such as "x0_", which I am removing for better readability here
#(scikit-learn versions before 1.0 name this method get_feature_names())
vocab = [i.split("_", 1)[1] for i in l.get_feature_names_out()]

#create co-occurrence matrix
ddf = pd.DataFrame(co_occurance.todense(), columns=vocab, index=vocab)
print(ddf)

                 (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  \
(-1.774, 1.145]              2.0             2.0              1.0   
(-3.21, 0.533]               2.0             2.0              1.0   
(0.0166, 2.007]              1.0             1.0              1.0   
(2.007, 3.993]               1.0             1.0              0.0   
(2.0, 3.997]                 2.0             2.0              1.0   

                 (2.007, 3.993]  (2.0, 3.997]  
(-1.774, 1.145]             1.0           2.0  
(-3.21, 0.533]              1.0           2.0  
(0.0166, 2.007]             0.0           1.0  
(2.007, 3.993]              1.0           1.0  
(2.0, 3.997]                1.0           2.0

As you can verify from the output above, it's exactly what the co-occurrence matrix should be.

An advantage of this approach is that it scales: once the one-hot encoder is fitted, its transform method can be applied to further data, and most of the processing happens on sparse matrices until the final step of creating the dataframe, so it's memory efficient.
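
One way to read the scaling remark is sketched below: fit the encoder once to learn the vocabulary, then transform the data chunk by chunk and add up the per-chunk sparse products. The toy data, the chunk size of two rows, and the names `encoder`, `total` and `part` are all my own assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two categorical columns, four rows.
df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": ["p", "q", "p", "p"]}).astype(str)

encoder = OneHotEncoder()
encoder.fit(df.values)                       # learn the vocabulary once

total = None
for start in range(0, len(df), 2):           # process two rows at a time
    chunk = encoder.transform(df.values[start:start + 2])  # sparse one-hot rows
    part = chunk.T @ chunk                   # sparse co-occurrence for this chunk
    total = part if total is None else total + part

# Same computation in one pass, for comparison.
full = encoder.transform(df.values)
full_co = (full.T @ full).toarray()
chunked_co = total.toarray()
print((chunked_co == full_co).all())
```

Because the co-occurrence matrix is a sum over rows, adding the per-chunk products gives the same counts as processing everything at once, while only one chunk needs to be encoded at a time.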
