Using Awk to Count the Number of Occurrences of a Word in a Column

Using awk to count the number of occurrences of patterns from another file

awk '
NR==FNR{a[$0]; next}
{
    for(i=1; i<=NF; i++){
        if ($i in a){ a[$i]++ }
    }
}
END{
    for(key in a){ printf "%s %d\n", key, a[key] }
}
' list.txt target.txt
  • NR==FNR{a[$0]; next} The condition NR==FNR is only true for the first file, so
    the keys of array a are lines of list.txt.

  • for(i=1; i<=NF; i++) Now for the second file, this loops over all
    its fields.

    • if ($i in a){ a[$i]++ } This checks if the field $i is present as a key
      in the array a. If yes, the value (initially zero) associated with that key is incremented.
  • At the END, we just print the key followed by the number of occurrences a[key] and a newline (\n).

Output:

blonde 2
red 0
black 0

Notes:

  1. Because of %d, the printf statement forces the conversion of a[key] to an integer in case it is still unset. The whole statement could be replaced by a simpler print key, a[key]+0. I missed that when writing the answer, but now you know two ways of doing the same thing. ;)

  2. In your attempt you were, for some reason, only addressing field 2 ($2), ignoring other columns.
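To make note 1 concrete, here is a minimal sketch of the print key, a[key]+0 variant. The file names come from the answer, but their contents below are made up for illustration, since the question's real data is not shown. The output is piped through sort only because awk does not guarantee the order of a for-in loop.

```shell
# Hypothetical sample data; the real files from the question are not shown.
tmp=$(mktemp -d)
printf '%s\n' blonde red black > "$tmp/list.txt"
printf '%s\n' 'anna blonde tall' 'bea brown short' 'cleo blonde short' > "$tmp/target.txt"

result=$(awk '
NR==FNR { a[$0]; next }        # first file: register each word as a key
{
    for (i = 1; i <= NF; i++)  # second file: scan every field
        if ($i in a) a[$i]++
}
END { for (key in a) print key, a[key]+0 }   # +0 turns an unset count into 0
' "$tmp/list.txt" "$tmp/target.txt" | sort)

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample input, words that never appear in target.txt are still listed, with a count of 0.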

awk to count number of occurrences in a Column and print in new column

Could you please try the following.

awk '{print $0,"exon_number:"++a[$9]}'  Input_file

Explanation of the above code:

print: awk's built-in statement for printing a variable or a line.

$0: In awk, $0 is the current line (so this prints the current line).

,: The comma separates the arguments to print. On output, awk puts the output field separator OFS (a single space by default) between $0 and the next string.

"exon_number:": Prints the literal string exon_number: as in the OP's expected output.

++a[$9]: Creates an array named a indexed by the 9th column. The ++ before it increments the stored value first and then prints it, so each line shows the running occurrence number of its 9th-column value.

If you need a TAB between the original line and the new column, set OFS in a BEGIN block, i.e. start the program with BEGIN{OFS="\t"} before the {print ...} action. The original line itself is printed unchanged.
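As a rough illustration of the TAB variant, the sketch below runs the same idea on three made-up lines (the column contents are placeholders, not the OP's data). The parentheses around ++a[$9] are an extra precaution of mine so that the increment cannot be misread next to the string; they are not part of the original answer.

```shell
# Hypothetical three-line input; only column 9 matters for the counter.
result=$(printf '%s\n' \
  'c1 c2 c3 c4 c5 c6 c7 c8 geneA' \
  'c1 c2 c3 c4 c5 c6 c7 c8 geneB' \
  'c1 c2 c3 c4 c5 c6 c7 c8 geneA' |
awk 'BEGIN { OFS = "\t" } { print $0, "exon_number:" (++a[$9]) }')

printf '%s\n' "$result"
```

The second geneA line gets exon_number:2, and only the separator in front of the new column is a TAB.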

Count the number of occurrences in a file for a particular column using awk

Storing the values you want in the array key might be sufficient.

$ awk -F, '{a[$2 FS $3]++} END {for(i in a){print i,a[i]}}' OFS=, input.txt
1_2_34_47.csv,2345,1
1_2_34_46.csv,2346,1
1_2_34_45.csv,2345,3

Note that with an awk script this simple, the output order cannot be guaranteed. (That is, array order is not guaranteed.) If you want to control the order, you'd be best to use an additional array:

$ awk -F, '{k=$2 FS $3} !a[k]++{o[i++]=k} END {for(j=0;j<i;j++){print o[j],a[o[j]]}}' OFS=, input.txt
1_2_34_45.csv,2345,3
1_2_34_46.csv,2346,1
1_2_34_47.csv,2345,1

The second array has an incrementing key that we can step through using a for loop as a counter. The counter preserves the original order of "new" keys in the input stream.

count the occurrences of value in column 1 for each string in column 2 using awk

With both languages it is easy (any language, really); it all depends on your knowledge.

awk

awk '{
    count[$7]++;
    memory_1[NR] = $1;
    memory_7[NR] = $7;
}
END{
    for(i=1; i<=NR; ++i) print memory_1[i] OFS memory_7[i] OFS count[memory_7[i]]
}' file

python

from collections import Counter

# Split each line into whitespace-separated fields.
with open("file") as f:
    records = [line.split() for line in f]
count = Counter(r[6] for r in records)  # occurrences of each column-7 value
print("\n".join("\t".join((r[0], r[6], str(count[r[6]]))) for r in records))

You get:

chr1:66997989-67000678 geneA 2
chr1:66997824-67000456 geneA 2
chr2:33544389-33548489 geneB 3
chr2:33546285-33547055 geneB 3
chr2:44567890-44568980 geneB 3
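If holding every row in memory is a concern, one alternative (not from the answer above, just a sketch) is to read the file twice with the NR==FNR idiom: the first pass counts column 7, the second prints each row with its total. The input below is invented and only mimics the 7-column layout.

```shell
# Hypothetical 7-column input; columns 2-6 are filler.
tmp=$(mktemp)
printf '%s\n' \
  'chr1:100-200 a b c d e geneA' \
  'chr1:150-250 a b c d e geneA' \
  'chr2:300-400 a b c d e geneB' > "$tmp"

# Pass 1 (NR==FNR) counts column 7; pass 2 prints each row with its total.
result=$(awk 'NR==FNR { count[$7]++; next } { print $1, $7, count[$7] }' "$tmp" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

This trades a second read of the file for not storing the rows, and it keeps the input order without any extra array.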

Awk: Count occurrences of negative values in each column and transpose CSV

If you can use GNU awk, you can control array traversal with the PROCINFO["sorted_in"] setting:

#!gawk
BEGIN {FS = OFS = ", "}

NR == 1 {
    for (i = 2; i <= NF; i++) quality[i] = $i
    next
}

{
    for (i = 2; i <= NF; i++) {
        if ($i + 0 <= 0) {
            countries[i] = countries[i] OFS $1
            count[i]++
        }
    }
}

END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (i in count) {
        printf "%d %s: %s\n", count[i], quality[i], gensub(OFS, "", 1, countries[i])
    }
}

then

gawk -f script.gawk file.csv

outputs

4 FREEDOM TO MAKE LIFE CHOICES: Afghanistan, Albania, Algeria, Argentina
4 GENEROSITY: Afghanistan, Albania, Algeria, Argentina
3 DELIVERY QUALITY: Afghanistan, Albania, Algeria
2 CONFIDENCE IN NATIONAL GOVERNMENT: Afghanistan, Albania
2 DEMOCRATIC QUALITY: Afghanistan, Algeria

How to count occurrences no matter its case?

There is a little problem in your syntax: you either say var == "string" or var ~ regexp, but you are saying var ~ /"string"/. Using the correct combination makes your command work:

$ awk '$7 ~ /^[Cc][Aa]$/{++count} END {print count+0}' file
5
$ awk 'BEGIN {IGNORECASE = 1} $7=="CA" {++count} END {print count+0}' file
5

Also, you may want to use toupper() (or tolower()) to check this instead of the IGNORECASE flag, which is a GNU awk extension and has no effect in other awks:

awk 'toupper($7) == "CA" {++count} END {print count+0}' file

Note the trick of printing count + 0 instead of just count. This way, the variable is converted to 0 if it was never set. With this, it prints 0 whenever there were no matches; if we just printed count, it would print an empty string.
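A small sketch of the difference, using invented input in which column 7 never matches (the variable names plain and forced are only for this illustration):

```shell
# Hypothetical input in which column 7 never equals "CA".
data='a b c d e f NY
a b c d e f tx'

plain=$(printf '%s\n' "$data" | awk 'toupper($7) == "CA" { ++count } END { print count }')
forced=$(printf '%s\n' "$data" | awk 'toupper($7) == "CA" { ++count } END { print count + 0 }')

printf 'plain=[%s] forced=[%s]\n' "$plain" "$forced"
```

The first command prints an empty line, while the second prints 0, which is usually what a calling script wants to see.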

awk - pull out pair columns and get the count of occurrences

Here is a single-pass awk command to get it done:

 awk '/^x_/ {xk[$0]; next} {s=$0; sub(/[0-9]+$/, "", s); xv[$0]=s} END {for (i in xv) if ("x_" i in xk) {print "x_" i, i; ++fq[xv[i]]}; print "== Summary =="; for (i in fq) print i, fq[i]}' file

x_rev1 rev1
x_rate1 rate1
x_rate2 rate2
x_rate3 rate3
x_rate_r1 rate_r1
x_pay1 pay1
x_rate_r2 rate_r2
x_pay2 pay2
== Summary ==
rate_r 2
rate 3
rev 1
pay 2

A more readable form:

awk '
/^x_/ {
    xk[$0]
    next
}
{
    s = $0
    sub(/[0-9]+$/, "", s)
    xv[$0] = s
}
END {
    for (i in xv)
        if ("x_" i in xk) {
            print "x_" i, i
            ++fq[xv[i]]
        }
    print "== Summary =="
    for (i in fq)
        print i, fq[i]
}' file

Awk: How do I count occurrences of a string across columns and find the maximum across rows?

What about this?

awk '{ for (i=2;i<NF;i++) { if ($i=="y") { a[$1" "$i]++ } } } END { print "Yes tally"; l=0; for (i in a) { print i,a[i]; if (a[i]>=l) { l=a[i]; name=i } } split(name,w," "); print "Winner is",w[1],"with",l,"votes" } ' f
Yes tally
name3 y 6
Markopoulos y 6
Karydhs y 7
Winner is Karydhs with 7 votes

awk Count number of occurrences

Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:

awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
  • Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
  • n++ increments a counter whenever the condition is matched.
  • The magic condition END is true after the last line of input has been processed.

Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?

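Oh, and you might want to confirm whether you really need -F" ". By default, awk splits on runs of whitespace (spaces and tabs), and -F" " just sets that same default explicitly, so the option is redundant here. You would only need -F if your fields themselves contain spaces, for example -F'\t' for tab-separated data.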


UPDATE #1 based on the edited question...

If what you're really after is a pair counter, an awk array may be the way to go. Something like this:

awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt

Here's the breakdown.

  • The first statement runs on every line, and increments a counter stored in an array (a[]) whose key is built from $4 and $5.
  • In the END block, we step through the array in a for loop, and for each index, print the index name and the value.

The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.

Example:

$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
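If you do want a predictable order, one option (my suggestion, not part of the answer) is to leave the awk script alone and sort its output. The sketch below reuses the sample rows shown above and sorts by count, highest first, then by pair name.

```shell
result=$(printf '%s\n' \
  '>seq1 284 284 A G 27 100 16 11 16 11' \
  '>seq1 266 266 C T 27 100 16 11 16 11' \
  '>seq1 227 227 T C 25 100 13 12 13 12' \
  '>seq1 194 194 A G 24 100 12 12 12 12' \
  '>seq1 185 185 T A 24 100 10 14 10 14' |
awk '/^>seq/ { a[$4 $5]++ } END { for (p in a) print p, a[p] }' |
sort -k2,2nr -k1,1)   # count descending, then pair name ascending

printf '%s\n' "$result"
```

Sorting outside awk keeps the script portable across awk implementations.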

UPDATE #2 based on the revised input data and previously undocumented requirements.

With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:

$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2

This works by first (in the magic BEGIN block) defining an array, v[], to hold the "valid" values. The condition on the counter simply verifies that both $4 and $5 are members of the array. All else works the same.

At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.

#!/usr/bin/awk -f

BEGIN {
    v["G"]; v["A"]; v["C"]; v["T"]
}

$4 in v && $5 in v {
    a[$4 $5]++
}

END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}

Much easier to read that way.

And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.

#!/usr/bin/awk -f

BEGIN {
    a["AG"]; a["TA"]; a["CT"]; a["TC"]
}

($4 $5) in a {
    a[$4 $5]++
}

END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}

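This only counts pairs that already exist as array indices. Those indices are created in BEGIN with empty (unset) values, so a pair that never occurs is still printed, with a count of 0.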

The parentheses in the increment condition are not required, and are included only for clarity.
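Here is a quick sketch of that last script as a one-off command, run on four invented rows (one of which, G A, is deliberately not in the seeded list). The output is piped through sort because array order is not guaranteed.

```shell
# Hypothetical rows; the G A pair is not seeded and should be ignored.
result=$(printf '%s\n' \
  '>seq1 284 284 A G' \
  '>seq1 266 266 C T' \
  '>seq1 194 194 A G' \
  '>seq1 185 185 G A' |
awk '
BEGIN { a["AG"]; a["TA"]; a["CT"]; a["TC"] }   # seed the only pairs to report
($4 $5) in a { a[$4 $5]++ }                    # unlisted pairs such as GA are skipped
END { for (p in a) printf("%s %d\n", p, a[p]) }
' | sort)

printf '%s\n' "$result"
```

Seeded pairs that never occur (TA and TC here) show up with a count of 0, and GA does not appear at all.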


