Counters in Ruby
I just read Pontakorn's blog post about the usefulness of Python's collections.Counter class. I think it is very useful, so I wanted to find ways to do the same thing in Ruby.
Ruby does not have a Counter class, but the same functionality can be achieved using just Ruby's built-in APIs and data structures.
Functional counters
Post-Ruby 2.7
Here, we have a collection of data that we want to count. For example, given a string s = "abcabcabczza", we want to count the occurrences of each character:
Separate the string into characters^[We can skip creating an intermediate array in this step by using .each_char instead of .chars.]:
> s.chars #=> ["a", "b", "c", "a", "b", "c", "a", "b", "c", "z", "z", "a"]
Use Enumerable#tally:
> s.chars.tally #=> {"a"=>4, "b"=>3, "c"=>3, "z"=>2}
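As a minimal sketch, the two steps can be combined in a plain script outside irb. It uses each_char, as the footnote suggests, so no intermediate array is created. The variable name counts is only illustrative.

```ruby
# Requires Ruby 2.7 or later, where Enumerable#tally was introduced.
s = "abcabcabczza"

# Without a block, each_char returns an Enumerator. Enumerator includes
# Enumerable, so tally can be called on it directly.
counts = s.each_char.tally

p counts
```

The resulting hash maps each character to its number of occurrences, for example counts["a"] is 4.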
Pre-Ruby 2.7
Here, we have a collection of data that we want to count. For example, given a string s = "abcabcabczza", we want to count the occurrences of each character:
Separate the string into characters^[We can skip creating an intermediate array in this step by using .each_char instead of .chars.]:
> s.chars #=> ["a", "b", "c", "a", "b", "c", "a", "b", "c", "z", "z", "a"]
Group by itself:
> s.chars.group_by(&:itself)
#=> {"a"=>["a", "a", "a", "a"],
#    "b"=>["b", "b", "b"],
#    "c"=>["c", "c", "c"],
#    "z"=>["z", "z"]}
Transform the grouping into count:
> s.chars.group_by(&:itself).transform_values(&:count) #=> {"a"=>4, "b"=>3, "c"=>3, "z"=>2}
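The group_by step builds an array per character just to count it. As an alternative sketch (not one of the steps above), each_with_object can accumulate the counts directly into a hash with a default value of 0. The names counts, grouped, and acc are illustrative.

```ruby
s = "abcabcabczza"

# Accumulate counts into a hash whose default value is 0, so the first
# increment for an unseen character starts from 0.
counts = s.each_char.each_with_object(Hash.new(0)) do |char, acc|
  acc[char] += 1
end

# For comparison: the group_by / transform_values chain from the steps above.
grouped = s.chars.group_by(&:itself).transform_values(&:count)

# Hash#== compares keys and values only, so the default value is ignored.
p counts == grouped #=> true
```

Both approaches work on Ruby versions older than 2.7. Which one reads better is largely a matter of taste.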
Hamlet word counting
This is the example from Python's docs for Counter^[The numbers are different because I used a different hamlet.txt than the one used in the example.]:
>>> from collections import Counter
>>> import re
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
# [('the', 993), ('and', 863), ('to', 685), ('of', 610), ('i', 574),
# ('you', 527), ('a', 511), ('my', 502), ('it', 419), ('in', 400)]
Let's do the same in Ruby.
Read the file:
> File.read('hamlet.txt') #=> "[The Tragedie of Hamlet by William Shakespeare 1599]\n\n\nActus Primus..."
Transform to lowercase^[In irb (Interactive Ruby), the _ variable holds the result of the previous expression.]:
> _.downcase #=> "[the tragedie of hamlet by william shakespeare 1599]\n\n\nactus primus..."
Get all the words:
> _.scan(/\w+/)
#=> ["the", "tragedie", "of", "hamlet", "by", "william", "shakespeare", "1599",
#    "actus", "primus", "scoena", "prima", "enter", "barnardo", "and", ...]
Group by itself and get the count:
> _.group_by(&:itself).transform_values(&:count)
#=> {"the"=>993, "tragedie"=>4, "of"=>610, "hamlet"=>100, "by"=>107,
#    "william"=>1, "shakespeare"=>1, "1599"=>1, "actus"=>2, "primus"=>1, ...}
Sort by count^[When a hash is enumerated, each entry is represented as a [key, value] pair. So, item.first is the key and item.last is the value. Consequently, hash.sort_by(&:first) means sort by key and hash.sort_by(&:last) means sort by value.]:
> _.sort_by(&:last)
#=> [... ["his", 285], ["not", 300], ["is", 328], ["ham", 337], ["that", 377],
#    ["in", 400], ["it", 419], ["my", 502], ["a", 511], ["you", 527],
#    ["i", 574], ["of", 610], ["to", 685], ["and", 863], ["the", 993]]
We only need the top ten. So take the last 10 entries and reverse to get the final result:
> _.last(10).reverse
#=> [["the", 993], ["and", 863], ["to", 685], ["of", 610], ["i", 574],
#    ["you", 527], ["a", 511], ["my", 502], ["it", 419], ["in", 400]]
Combining the previous steps into a single-line program:
File.read('hamlet.txt').downcase.scan(/\w+/).group_by(&:itself).transform_values(&:count).sort_by(&:last).last(10).reverse
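The last three steps (sort, take the last entries, reverse) can likely be shortened with Enumerable#max_by, which accepts a count and returns the largest entries in descending order. The sketch below assumes a short stand-in string instead of hamlet.txt so that it runs on its own. The names text and top are illustrative.

```ruby
# Stand-in text; the article reads 'hamlet.txt' with File.read instead.
text = "The cat and the dog and the bird"

top = text.downcase
          .scan(/\w+/)                          # extract the words
          .group_by(&:itself)                   # group identical words
          .transform_values(&:count)            # word => number of occurrences
          .max_by(2) { |_word, count| count }   # two most frequent, highest first

p top
```

Here top is [["the", 3], ["and", 2]]. When several words share the same count, the order among them may differ from the sort_by(&:last).last(10).reverse version.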
As you can see, in Ruby many problems can be solved by writing code strictly from left to right. Meanwhile, in Python you often need to mix function calls with method calls (e.g. str.lower(), len(str)) and list comprehensions.
Imperative counters
When counting imperatively we start from zero and tally up the counts.
This is the example from Python's docs for Counter:
>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
# Counter({'blue': 3, 'red': 2, 'green': 1})
In Ruby, hashes can have a default value. We can use a hash with a default value of 0 to simulate a counter; no specialized data structure is needed:
> cnt = Hash.new(0)
* ['red', 'blue', 'red', 'green', 'blue', 'blue'].each do |word|
*   cnt[word] += 1
> end
> cnt
#=> {"red"=>2, "blue"=>3, "green"=>1}
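Two details of the default-value hash are worth seeing in a script: reading a missing key returns the default without storing the key, and the hash keeps insertion order rather than count order. The sketch below shows both, then sorts by descending count to get an ordering similar to the Counter output above. The names missing, has_key, and ordered are illustrative.

```ruby
cnt = Hash.new(0)
%w[red blue red green blue blue].each do |word|
  cnt[word] += 1
end

# Reading a missing key returns the default (0) but does not add the key.
missing = cnt["purple"]
has_key = cnt.key?("purple")

# Hashes preserve insertion order, so sort explicitly to rank by count.
ordered = cnt.sort_by { |_word, count| -count }

p missing  #=> 0
p has_key  #=> false
p ordered  #=> [["blue", 3], ["red", 2], ["green", 1]]
```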