Merging Two Files by a Single Column in Unix

How to merge two .txt files in Unix based on one common column

Thanks for adding your own attempts to solve the problem - it makes troubleshooting a lot easier.

This answer is a bit convoluted, but here is a potential solution (GNU join):

join -t $'\t' -1 2 -2 1 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sam_ID Sub_ID v1 code V3 V4
#2253734 1878372 SAMN06396112 20481 NA DNA
#2275341 1884646 SAMN06432785 20483 NA DNA
#2277481 1860945 SAMN06407597 20488 NA DNA

Explanation:

  • join's -t option takes a single character as the separator, so the two-character string "\t" won't work, but in bash you can use $'\t', which expands to a literal tab
  • -1 2 and -2 1 mean "for the first file, use the second field" and "for the second file, use the first field" as the join key
  • in each process substitution (<(...)), sort the file by the Sam_ID column but keep the header out of the sort (see the question "Is there a way to ignore header lines in a UNIX sort?")
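For shells without process substitution or $'...' quoting, the same three steps can be sketched with temporary files. This is only an illustration: the sample rows are reconstructed from the output shown above (the real File1.txt and File2.txt may differ), and sort_body is a hypothetical helper name.

```shell
# Sketch for plain sh: temp files instead of <(...), printf instead of $'\t'.
# Sample rows are reconstructed from the output above; real inputs may differ.
tmp=$(mktemp -d)
tab=$(printf '\t')

printf 'Sub_ID\tSam_ID\tv1\n1884646\t2275341\tSAMN06432785\n1878372\t2253734\tSAMN06396112\n' > "$tmp/File1.txt"
printf 'Sam_ID\tcode\tV3\tV4\n2275341\t20483\tNA\tDNA\n2253734\t20481\tNA\tDNA\n' > "$tmp/File2.txt"

# Hypothetical helper: print the header untouched, then the remaining
# rows sorted on a single tab-separated column.
sort_body() {  # usage: sort_body FILE COLUMN
    head -n 1 "$1"
    tail -n +2 "$1" | sort -t "$tab" -k "$2,$2"
}

sort_body "$tmp/File1.txt" 2 > "$tmp/File1.sorted"
sort_body "$tmp/File2.txt" 1 > "$tmp/File2.sorted"

# Join field 2 of the first file against field 1 of the second.
merged=$(join -t "$tab" -1 2 -2 1 "$tmp/File1.sorted" "$tmp/File2.sorted")
printf '%s\n' "$merged"
rm -rf "$tmp"
```

The header lines pair with each other because both carry Sam_ID in the join field. With GNU join, the --header option is another way to keep the first line of each file out of the comparison.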

Edit

To specify the order of the columns in the output (to put the Sub_ID before the Sam_ID), you can use the -o option, e.g.

join -t $'\t' -1 2 -2 1 -o 1.1,1.2,1.3,2.2,2.3,2.4 <(head -n 1 File1.txt && tail -n +2 File1.txt | sort -k2,2 ) <(head -n 1 File2.txt && tail -n +2 File2.txt | sort -k1,1)

#Sub_ID Sam_ID v1 code V3 V4
#1878372 2253734 SAMN06396112 20481 NA DNA
#1884646 2275341 SAMN06432785 20483 NA DNA
#1860945 2277481 SAMN06407597 20488 NA DNA

How to merge two files based on one column and print both matching and non-matching?

Assuming your real files are sorted like your samples are:

$ join -o 0,1.2,2.2 -e0 -a1 -a2 tmptest1.txt tmptest2.txt
aaa 231 222
bbb 132 0
ccc 111 0
ddd 0 132

If not sorted and using bash, zsh, ksh93 or another shell that understands <(command) redirection:

join -o 0,1.2,2.2 -e0 -a1 -a2 <(sort tmptest1.txt) <(sort tmptest2.txt)
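To see what each flag contributes, here is a small self-contained run. The input rows are an assumption, reconstructed from the output above and shuffled on purpose so that the sort step matters.

```shell
tmp=$(mktemp -d)
# Deliberately unsorted sample rows, reconstructed from the output above.
printf 'ccc 111\naaa 231\nbbb 132\n' > "$tmp/tmptest1.txt"
printf 'ddd 132\naaa 222\n' > "$tmp/tmptest2.txt"

# join expects both inputs sorted on the join field.
sort "$tmp/tmptest1.txt" > "$tmp/sorted1.txt"
sort "$tmp/tmptest2.txt" > "$tmp/sorted2.txt"

# -a1 -a2 : also print lines that have no partner in the other file
# -e0     : fill fields that are missing with 0
# -o ...  : output the join key (0), then field 2 of each file
outer=$(join -o 0,1.2,2.2 -e0 -a1 -a2 "$tmp/sorted1.txt" "$tmp/sorted2.txt")
printf '%s\n' "$outer"
rm -rf "$tmp"
```

Dropping -a2 would leave out the ddd row, and dropping -a1 would leave out bbb and ccc; with neither, only the aaa row, which is present in both files, is printed.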

Merging two files by a single column in unix

Check out join(1). In your case, you don't even need any flags:

$ join file_b file_a
subjectid prob_disease name age
12 0.009 Jane 16
24 0.738 Kristen 90
15 0.392 Clarke 78
23 1.2E-5 Joann 31

Merging two files based on 1st matching columns using awk command

Could you please try the following (tested with the provided samples only).

awk '
BEGIN{
FS=OFS=","
}
FNR>1 && FNR==NR{
a[$1]=$2 OFS $3
next
}
FNR>1{
print $1,$2,$3,a[$1]?a[$1]:","
}
' Test2.txt Test1.txt

Explanation: the same code with a comment on each line.

awk '
BEGIN{ ##Starting BEGIN section from here, which will be executed before reading Input_file(s).
FS=OFS="," ##Setting FS and OFS value as comma here.
} ##Closing BEGIN section here.
FNR>1 && FNR==NR{ ##FNR==NR is TRUE only while the 1st Input_file is being read; FNR>1 skips its header line.
a[$1]=$2 OFS $3 ##Creating an array named a whose index is $1 and whose value is $2 OFS $3.
next ##next skips all further statements for this line.
}
FNR>1{ ##FNR>1 runs for every line of the 2nd Input_file except its header line.
print $1,$2,$3,a[$1]?a[$1]:"," ##Printing $1, $2, $3 and the value of a[$1]; if that value is empty, print a comma instead.
}
}
' Test2.txt Test1.txt ##Mentioning Input_file names here.
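The original sample files are not shown here, so the following run uses hypothetical CSV inputs to make the behaviour visible. It is a slight variant of the code above, not the same program: it tests ($1 in a) instead of the value a[$1], which avoids creating empty array entries for keys that have no match.

```shell
tmp=$(mktemp -d)
# Hypothetical inputs: the original sample files are not shown in this answer.
printf 'id,x,y\n1,a,b\n3,c,d\n' > "$tmp/Test2.txt"          # lookup file, read first
printf 'id,p,q\n1,foo,bar\n2,baz,qux\n' > "$tmp/Test1.txt"  # main file, read second

lookup=$(awk '
BEGIN { FS = OFS = "," }
FNR == 1 { next }                        # skip the header line of each file
FNR == NR { a[$1] = $2 OFS $3; next }    # first file: remember columns 2 and 3 by key
{ print $1, $2, $3, (($1 in a) ? a[$1] : ",") }   # second file: append the match, or two empty columns
' "$tmp/Test2.txt" "$tmp/Test1.txt")
printf '%s\n' "$lookup"
rm -rf "$tmp"
```

Key 1 exists in both files, so its row gains the two looked-up columns; key 2 has no match and ends in two empty columns. Key 3 appears only in the lookup file and is not printed at all.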

Merge two files based on two common columns, and replace the blank to 0

Could you please try the following, written and tested only with the shown samples in GNU awk.

awk '
FNR==NR{
a[$1 OFS $2]=$NF
next
}
{
if(($1 OFS $2) in a){
d[$1 OFS $2]
$(NF+1)=a[$1 OFS $2]
}
else{
$(NF+1)=0
}
print
}
END{
for(i in a){
if(!(i in d)){
print i,"0",a[i]
}
}
}
' Input_file2 Input_file1 | sort -k1

Output will be as follows.

chr1 1000001 135 377
chr1 5500002 0 320
chr2 1000002 57 0
chr2 4400002 117 0
chr6 1000003 172 432
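For experimenting with this, the inputs can be reconstructed from the output above. The file contents below are that reconstruction and therefore an assumption, and the awk program is a commented restatement of the same idea (with a hypothetical seen array in place of d), not the original code.

```shell
tmp=$(mktemp -d)
# Inputs reconstructed from the output above; the real files may differ.
printf 'chr1 1000001 135\nchr2 1000002 57\nchr2 4400002 117\nchr6 1000003 172\n' > "$tmp/Input_file1"
printf 'chr1 1000001 377\nchr1 5500002 320\nchr6 1000003 432\n' > "$tmp/Input_file2"

merged2=$(awk '
FNR == NR { a[$1 OFS $2] = $NF; next }   # first file given: value keyed by columns 1 and 2
{
    key = $1 OFS $2
    if (key in a) { seen[key] = 1; print $0, a[key] }   # key present in both files
    else          { print $0, 0 }                        # key only in the second file given
}
END {
    for (key in a)                       # keys that never matched: pad the missing value with 0
        if (!(key in seen)) print key, 0, a[key]
}
' "$tmp/Input_file2" "$tmp/Input_file1" | sort -k1,1 -k2,2n)
printf '%s\n' "$merged2"
rm -rf "$tmp"
```

The final sort is needed because rows found only in Input_file2 are emitted in the END block, after everything else and in no guaranteed order.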

Unix: How to combine separate columns into one column

You can put string literals inside awk's print command.

Here's an example:

$ cat a
1 2 3 [AUTORESTART] Mar 17 21:21:32 GMT 2022
$ cat a | awk '{print $4 "," $6 " " $7 " " $8 " " $9 " " $10}'
[AUTORESTART],17 21:21:32 GMT 2022

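You can see that I print the 4th column, then a literal comma, then the 6th column, then a literal space, and so on up to the 10th column. (The sample line as shown has only nine fields, so $10 is empty here and the output ends with a trailing space.)

You can then redirect it to a CSV file: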

$ cat a | awk '{print $4 "," $6 " " $7 " " $8 " " $9 " " $10}' > mycsv.csv

Merge Two files of columns but insert columns of second file into columns of first file

You can use a loop in awk, for example

paste file_A file_B | awk '{ 
half = NF/2;
for(i = 1; i < half; i++)
{
printf("%s %s ", $i, $(i+half));
}
printf("%s %s\n", $half, $NF);
}'

or

paste file_A file_B | awk '{ 
i = 1; j = NF/2 + 1;
while(j < NF)
{
printf("%s %s ", $i, $j);
i++; j++;
}
printf("%s %s\n", $i, $j);
}'

The code assumes that the number of columns in awk's input is even.
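A compact variant that makes the even-column assumption explicit could look like the sketch below. The two three-column sample files are made up for illustration, and skipping odd-width lines with a message on stderr is just one possible policy.

```shell
tmp=$(mktemp -d)
# Made-up sample files with three columns each.
printf 'a1 a2 a3\nb1 b2 b3\n' > "$tmp/file_A"
printf 'x1 x2 x3\ny1 y2 y3\n' > "$tmp/file_B"

interleaved=$(paste "$tmp/file_A" "$tmp/file_B" | awk '
NF % 2 { print "line " NR ": odd number of columns, skipped" > "/dev/stderr"; next }
{
    half = NF / 2
    # Pair column i of file_A with column i of file_B; end the line after the last pair.
    for (i = 1; i <= half; i++)
        printf "%s %s%s", $i, $(i + half), (i < half ? " " : "\n")
}')
printf '%s\n' "$interleaved"
rm -rf "$tmp"
```

paste separates the two files with a tab, which awk's default field splitting treats like any other whitespace, so each pasted line arrives as six fields.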


