using awk to count the number of occurrences of pattern from another file
awk '
NR==FNR{a[$0]; next}
{
    for(i=1; i<=NF; i++){
        if ($i in a){ a[$i]++ }
    }
}
END{
    for(key in a){ printf "%s %d\n", key, a[key] }
}
' list.txt target.txt
NR==FNR{a[$0]; next}
The condition NR==FNR is only true for the first file, so the keys of array a are the lines of list.txt.
for(i=1; i<=NF; i++)
Now, for the second file, this loops over all its fields.
if ($i in a){ a[$i]++ }
This checks whether the field $i is present as a key in the array a. If yes, the value associated with that key (initially empty, which counts as zero) is incremented.
At the END, we just print the key followed by the number of occurrences a[key] and a newline (\n).
Output:
blonde 2
red 0
black 0
Notes:
Because of %d, the printf statement forces the conversion of a[key] to an integer in case it is still unset. The whole statement could be replaced by a simpler print key, a[key]+0. I missed that when writing the answer, but now you know two ways of doing the same thing. ;)
In your attempt you were, for some reason, only addressing field 2 ($2), ignoring other columns.
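To try the technique end to end, here is a minimal sketch using the simpler print key, a[key]+0 form. The sample contents of list.txt and target.txt are assumptions made up for illustration, and the output is piped through sort only because awk does not guarantee array traversal order.

```shell
# Hypothetical sample data; file contents are assumptions, not the OP's files.
tmpdir=$(mktemp -d)
printf '%s\n' blonde red black > "$tmpdir/list.txt"
printf '%s\n' 'anna blonde tall' 'bob brown blonde' > "$tmpdir/target.txt"

# Keys come from the first file; counts are collected from the second file.
# a[key]+0 turns a still-unset value into 0; sort makes the order stable.
result=$(awk '
NR==FNR { a[$0]; next }
{ for (i = 1; i <= NF; i++) if ($i in a) a[$i]++ }
END { for (key in a) print key, a[key]+0 }
' "$tmpdir/list.txt" "$tmpdir/target.txt" | sort)

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With this sample data, blonde is counted twice and the other two keys stay at 0.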
awk to count number of occurrences in a Column and print in new column
Could you please try the following.
awk '{print $0,"exon_number:"++a[$9]}' Input_file
Explanation of the above code:
print: awk's built-in statement for printing a variable or a line.
$0: in awk, $0 is the current line, so this prints the current line.
The comma (,): separates the arguments of print; awk puts the output field separator (a single space by default) between $0 and the next string.
"exon_number:": prints the literal string exon_number: as required by the OP's output.
++a[$9]: creates an array named a whose index is the 9th column. The ++ in front increments the value first and then prints it, so the printed value is simply the running occurrence number of that 9th-column value.
If you need the output to be TAB-separated, add a BEGIN{OFS="\t"} block at the start of the awk program in the above code.
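The running-count idea can be checked on a small input. This is only a sketch: the three-column records are made up, the key is in column 3 rather than column 9, and the array name seen is an arbitrary choice. The parentheses around the increment are not required by the original one-liner; they are added here to make the concatenation explicit.

```shell
# Hypothetical three-column input; the key column is $3 here, not $9.
result=$(printf '%s\n' 'chr1 100 geneA' 'chr1 200 geneA' 'chr2 300 geneB' |
  awk '{ print $0, "exon_number:" (++seen[$3]) }')

printf '%s\n' "$result"
```

Each line gets the number of times its key has been seen so far, so the second geneA line is numbered 2 while the first geneB line starts again at 1.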
count the number of occurrences in a file for a particular column using awk
Storing the values you want in the array key might be sufficient.
$ awk -F, '{a[$2 FS $3]++} END {for(i in a){print i,a[i]}}' OFS=, input.txt
1_2_34_47.csv,2345,1
1_2_34_46.csv,2346,1
1_2_34_45.csv,2345,3
Note that with an awk script this simple, the output order cannot be guaranteed. (That is, array order is not guaranteed.) If you want to control the order, you'd be best to use an additional array:
$ awk -F, '{k=$2 FS $3} !a[k]++{o[i++]=k} END {for(j=0;j<i;j++){print o[j],a[o[j]]}}' OFS=, input.txt
1_2_34_45.csv,2345,3
1_2_34_46.csv,2346,1
1_2_34_47.csv,2345,1
The second array has an incrementing key that we can step through using a for loop as a counter. The counter preserves the original order of "new" keys in the input stream.
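A self-contained sketch of the same order-preserving pattern follows. The CSV rows are invented for illustration, and the array names count and order and the counter n are renamed from the one-liner above for readability.

```shell
# Hypothetical CSV rows; only columns 2 and 3 form the key.
result=$(printf '%s\n' 'x,b.csv,10' 'x,a.csv,20' 'x,b.csv,10' |
  awk -F, -v OFS=, '
    { k = $2 FS $3 }
    !count[k]++ { order[n++] = k }   # true only the first time a key is seen
    END { for (j = 0; j < n; j++) print order[j], count[order[j]] }
  ')

printf '%s\n' "$result"
```

Because the END loop walks order[] by its numeric index instead of using for (k in count), the keys come out in first-seen order on any awk implementation.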
count the occurrences of value in column 1 for each string in column 2 using awk
With both languages it is easy (with any language, really); it all depends on your knowledge.
awk
awk '{
    count[$7]++;
    memory_1[NR] = $1;
    memory_7[NR] = $7;
}
END{
    for(i=1; i<=NR; ++i) print memory_1[i] OFS memory_7[i] OFS count[memory_7[i]]
}' file
python
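from collections import Counter
# Read every record, splitting each line on whitespace
with open("file") as fh:
    records = [line.split() for line in fh]
# Count how often each value of column 7 occurs
count = Counter(r[6] for r in records)
# Print column 1, column 7 and the count for that column-7 value
print("\n".join("\t".join((r[0], r[6], str(count[r[6]]))) for r in records))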
you get:
chr1:66997989-67000678 geneA 2
chr1:66997824-67000456 geneA 2
chr2:33544389-33548489 geneB 3
chr2:33546285-33547055 geneB 3
chr2:44567890-44568980 geneB 3
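The awk version above keeps every record in memory until END. If that is a concern, a common alternative (a different technique from the answer, sketched here as an assumption about what might suit large files) is to read the file twice: the first pass counts, the second pass prints. The sample records and the use of columns 1 and 2 instead of 1 and 7 are made up for brevity.

```shell
# Hypothetical two-column input written to a temporary file.
tmpfile=$(mktemp)
printf '%s\n' 'r1 geneA' 'r2 geneB' 'r3 geneA' > "$tmpfile"

# Pass 1 (NR==FNR) counts column 2; pass 2 prints each line with its count.
result=$(awk 'NR==FNR { count[$2]++; next } { print $1, $2, count[$2] }' "$tmpfile" "$tmpfile")

printf '%s\n' "$result"
rm -f "$tmpfile"
```

This only works when the input is a regular file that can be read twice, not a pipe.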
Awk: Count occurrences of negative values in each column and transpose CSV
If you can use GNU awk, you can control array traversal with the PROCINFO["sorted_in"] setting:
#!gawk
BEGIN {FS = OFS = ", "}
NR == 1 {
    for (i = 2; i <= NF; i++) quality[i] = $i
    next
}
{
    for (i = 2; i <= NF; i++) {
        if ($i + 0 <= 0) {
            countries[i] = countries[i] OFS $1
            count[i]++
        }
    }
}
END {
    PROCINFO["sorted_in"] = "@val_num_desc"
    for (i in count) {
        printf "%d %s: %s\n", count[i], quality[i], gensub(OFS, "", 1, countries[i])
    }
}
then
gawk -f script.gawk file.csv
outputs
4 FREEDOM TO MAKE LIFE CHOICES: Afghanistan, Albania, Algeria, Argentina
4 GENEROSITY: Afghanistan, Albania, Algeria, Argentina
3 DELIVERY QUALITY: Afghanistan, Albania, Algeria
2 CONFIDENCE IN NATIONAL GOVERNMENT: Afghanistan, Albania
2 DEMOCRATIC QUALITY: Afghanistan, Algeria
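If GNU awk is not available, a portable fallback (a different technique from the script above, not part of the answer) is to print the count first and let sort do the ordering. The input here is a made-up list of single-word keys, purely to show the idea.

```shell
# Hypothetical input: one key per line.
result=$(printf '%s\n' b a b c b a |
  awk '{ count[$1]++ } END { for (k in count) print count[k], k }' |
  sort -k1,1nr -k2,2)   # count descending, then key ascending for ties

printf '%s\n' "$result"
```

The trade-off is an extra process and the need to put the sort key in a position sort can address, whereas PROCINFO["sorted_in"] keeps everything inside gawk.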
How to count occurrences no matter its case?
There is a little problem in your syntax: you either say var == "string" or var ~ regexp, but you are saying var ~ /"string"/. Using the correct combination makes your command work:
$ awk '$7 ~ /^[Cc][Aa]/{++count} END {print count+0}' file
5
$ awk 'BEGIN {IGNORECASE = 1} $7=="CA" {++count} END {print count+0}' file
5
Also, you may want to use toupper() (or tolower()) to check this, instead of using the IGNORECASE flag, which is specific to GNU awk:
awk 'toupper($7) == "CA" {++count} END {print count+0}' file
Note the trick of printing count + 0 instead of just count. This way, the variable is converted to 0 if it wasn't set before. With this, it will print 0 whenever there were no matches; if we just printed count, it would print an empty string.
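Both points, the toupper() comparison and the count+0 trick, can be seen in a small sketch. The seven-column records are invented for illustration; only column 7 matters.

```shell
# Hypothetical records; column 7 holds the value to compare.
data='1 2 3 4 5 6 CA
1 2 3 4 5 6 ca
1 2 3 4 5 6 Ca
1 2 3 4 5 6 NY'

# Case-insensitive match on column 7.
matches=$(printf '%s\n' "$data" | awk 'toupper($7) == "CA" { ++count } END { print count+0 }')
# No record matches TX, so count is never set; count+0 still prints 0.
none=$(printf '%s\n' "$data" | awk 'toupper($7) == "TX" { ++count } END { print count+0 }')

printf 'CA: %s, TX: %s\n' "$matches" "$none"
```

This form works in any POSIX awk, since it does not rely on IGNORECASE.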
awk - pull out pair columns and get the count of occurrences
Here is a single-pass awk command to get it done:
awk '/^x_/ {xk[$0]; next} {s=$0; sub(/[0-9]+$/, "", s); xv[$0]=s} END {for (i in xv) if ("x_" i in xk) {print "x_" i, i; ++fq[xv[i]]}; print "== Summary =="; for (i in fq) print i, fq[i]}' file
x_rev1 rev1
x_rate1 rate1
x_rate2 rate2
x_rate3 rate3
x_rate_r1 rate_r1
x_pay1 pay1
x_rate_r2 rate_r2
x_pay2 pay2
== Summary ==
rate_r 2
rate 3
rev 1
pay 2
A more readable form:
awk '
/^x_/ {
    xk[$0]
    next
}
{
    s = $0
    sub(/[0-9]+$/, "", s)
    xv[$0] = s
}
END {
    for (i in xv)
        if ("x_" i in xk) {
            print "x_" i, i
            ++fq[xv[i]]
        }
    print "== Summary =="
    for (i in fq)
        print i, fq[i]
}' file
Awk: How do I count occurrences of a string across columns and find the maximum across rows?
What about this?
awk '{ for (i=2;i<NF;i++) { if ($i=="y") { a[$1" "$i]++} } } END { print "Yes tally"; l=0; for (i in a) { print i,a[i]; if (l>a[i]) { l=l } else { l=a[i];name=i } } split(name,a," "); print "Winner is ",a[1],"with ",l,"votes" } ' f
Yes tally
name3 y 6
Markopoulos y 6
Karydhs y 7
Winner is Karydhs with 7 votes
awk Count number of occurrences
Yes, everything you're trying to do can likely be done within the awk script. Here's how I'd count lines based on a condition:
awk -F" " '$4=="A" && $5=="G" {n++} END {printf("AG = %d\n", n)}' file.txt
- Awk scripts consist of condition { statement } pairs, so you can do away with the if entirely -- it's implicit.
- n++ increments a counter whenever the condition is matched.
- The magic condition END is true after the last line of input has been processed.
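Is this what you're after? Why were you adding NR to your output if all you wanted was the line count?
Oh, and you might want to confirm whether you really need -F" ". A single space is already awk's default field separator, and it means "split on runs of whitespace" (spaces and tabs alike), so the option changes nothing. If your fields contain embedded tabs and you need to split on spaces only, use a regular expression such as -F'[ ]' instead.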
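To see that the condition { statement } form really replaces an explicit if, here is a small sketch that runs both variants on the same made-up records (the data and the variable names with_pattern and with_if are assumptions for illustration).

```shell
# Hypothetical records with the bases in columns 4 and 5.
data='s1 284 284 A G
s2 266 266 C T
s3 194 194 A G'

# Pattern form: the condition selects the lines, the action only counts.
with_pattern=$(printf '%s\n' "$data" |
  awk '$4 == "A" && $5 == "G" { n++ } END { printf("AG = %d\n", n) }')
# Equivalent form with an explicit if inside an unconditional action.
with_if=$(printf '%s\n' "$data" |
  awk '{ if ($4 == "A" && $5 == "G") n++ } END { printf("AG = %d\n", n) }')

printf '%s\n%s\n' "$with_pattern" "$with_if"
```

Both commands print the same line, so the shorter pattern form is usually preferred.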
UPDATE #1 based on the edited question...
If what you're really after is a pair counter, an awk array may be the way to go. Something like this:
awk '{a[$4 $5]++} END {for (pair in a) printf("%s %d\n", pair, a[pair])}' file.txt
Here's the breakdown.
- The first statement runs on every line, and increments a counter stored in an array (a[]) whose key is built from $4 and $5.
- In the END block, we step through the array in a for loop, and for each index, print the index name and the value.
The output will not be in any particular order, as awk does not guarantee array order. If that's fine with you, then this should be sufficient. It should also be pretty efficient, because its max memory usage is based on the total number of combinations available, which is a limited set.
Example:
$ cat file
>seq1 284 284 A G 27 100 16 11 16 11
>seq1 266 266 C T 27 100 16 11 16 11
>seq1 227 227 T C 25 100 13 12 13 12
>seq1 194 194 A G 24 100 12 12 12 12
>seq1 185 185 T A 24 100 10 14 10 14
$ awk '/^>seq/ {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' file
CT 1
TA 1
TC 1
AG 2
UPDATE #2 based on the revised input data and previously undocumented requirements.
With the extra data, you can still do this with a single run of awk, but of course the awk script is getting more complex with each new requirement. Let's try this as a longer one-liner:
$ awk 'BEGIN{v["G"]; v["A"]; v["C"]; v["T"]} $4 in v && $5 in v {a[$4 $5]++} END {for (p in a) printf("%s %d\n", p, a[p])}' i
CT 1
TA 1
TC 1
AG 2
This works by first (in the magic BEGIN block) defining an array, v[], to record "valid" records. The condition on the counter simply verifies that both $4 and $5 contain members of the array. All else works the same.
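A quick way to check the validity filter is to feed it a record that should be rejected. In this sketch the records are made up, including one with N in column 4, and the output is piped through sort only to get a stable order.

```shell
# Hypothetical records; the third one has an invalid base (N) in column 4.
result=$(printf '%s\n' '>seq1 284 284 A G' '>seq1 266 266 C T' \
                       '>seq1 227 227 N G' '>seq1 194 194 A G' |
  awk 'BEGIN { v["G"]; v["A"]; v["C"]; v["T"] }
       $4 in v && $5 in v { a[$4 $5]++ }
       END { for (p in a) printf("%s %d\n", p, a[p]) }' | sort)

printf '%s\n' "$result"
```

The NG pair never reaches the counter because N is not a key of v[].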
At this point, with the script running onto multiple lines anyway, I'd probably separate this into a small file. It could even be a stand-alone script.
#!/usr/bin/awk -f
BEGIN {
    v["G"]; v["A"]; v["C"]; v["T"]
}
$4 in v && $5 in v {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
Much easier to read that way.
And if your goal is to count ONLY the combinations you mentioned in your question, you can handle the array slightly differently.
#!/usr/bin/awk -f
BEGIN {
    a["AG"]; a["TA"]; a["CT"]; a["TC"]
}
($4 $5) in a {
    a[$4 $5]++
}
END {
    for (p in a)
        printf("%s %d\n", p, a[p])
}
This only counts pairs that already have array indices, which are created with empty (null) values in the BEGIN block.
The parentheses in the increment condition are not required, and are included only for clarity.
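One consequence of pre-seeding the array is that combinations which never occur are still reported, with a count of 0. The following sketch shows this on made-up records (including a GG pair that is not in the list); sort is used only to make the order predictable.

```shell
# Hypothetical records; GG is not one of the pre-seeded keys.
result=$(printf '%s\n' '>seq1 284 284 A G' '>seq1 266 266 C T' \
                       '>seq1 227 227 G G' '>seq1 194 194 A G' |
  awk 'BEGIN { a["AG"]; a["TA"]; a["CT"]; a["TC"] }
       ($4 $5) in a { a[$4 $5]++ }
       END { for (p in a) printf("%s %d\n", p, a[p]) }' | sort)

printf '%s\n' "$result"
```

TA and TC appear with 0 because %d converts their empty values, while GG is ignored entirely.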