Tea & Tech (🍵)

Profiling Clojure Programs with Tufte

December 23, 2019

Hi, Friends! 👋

In our previous adventures, we were working on making our Advent of Code Day 04 solution faster. Here’s a quick recap of the situation:

  • Our initial solution took over 9000(!) ms to run on an Intel i7-8700K.
  • We employed type hinting to get the runtime down to ~510ms.
  • Then we used reducers to further double the speed, resulting in a current benchmark of ~230ms.

So now we’re at a point where, if we want to go faster, we need to be intelligent about the changes we make. It’s time to start measuring the performance of individual functions!

Using Tufte to Profile Clojure Code

A quick search for profiling tools in Clojure turned up Tufte, and in the short time that I’ve spent with it so far, I have come away quite impressed. Make sure you give @ptaoussanis some love on Twitter!

Let’s look at setting it up for use in our code.

Step 1: Add Project Dependency

In our project.clj file, we need to update our dependencies to include Tufte:

--- a/project.clj
+++ b/project.clj
@@ -3,7 +3,8 @@
   :url "https://github.com/ajpierce/advent-2019"
   :license {:name "MIT"
             :url "https://opensource.org/licenses/MIT"}
-  :dependencies [[org.clojure/clojure "1.10.1"]]
+  :dependencies [[org.clojure/clojure "1.10.1"]
+                 [com.taoensso/tufte "2.1.0"]]
   :global-vars {*warn-on-reflection* true}
   :profiles {:uberjar {:aot :all}
              :day01 {:main advent-2019.day01

Step 2: Configure Tufte in our Namespace

Next, we’ll require Tufte into the namespace we’re interested in profiling, and then configure it to print out stats using the basic println handler:

(ns advent-2019.day04
  (:gen-class)
  (:require [clojure.core.reducers :as r]
            [taoensso.tufte :as tufte :refer (defnp p profile)]))

(tufte/add-basic-println-handler! ;; 👈 Added, along with the :require above
  {:format-pstats-opts {:columns [:n-calls :max :mean :mad :clock :total]}})

Step 3: Specify Functions to Profile

Selecting functions to profile is very straightforward. According to the Tufte docs:

Just wrap them with p and give them an id (namespaced keyword).

I also noticed in the Tufte README the defnp form, which defines a function whose entire body is profiled, letting you omit the explicit p wrapping.
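To illustrate the two styles side by side, here's a minimal sketch (slow-inc and slow-dec are made-up example functions, not part of our solution):

```clojure
(require '[taoensso.tufte :as tufte :refer [defnp p]])

;; Style 1: wrap any expression in p and give it a namespaced keyword id
(defn slow-inc [n]
  (p ::slow-inc
     (Thread/sleep 1)
     (inc n)))

;; Style 2: defnp defines the function and profiles its whole body under
;; an auto-generated id (it shows up as e.g. defn_slow-dec in the output)
(defnp slow-dec [n]
  (Thread/sleep 1)
  (dec n))
```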

Ok, so we need to pick some functions where we think our program is most likely to spend its time. Here’s our original code:

(defn part1 [numbers]
  (->> numbers
       vec
       (r/map get-digits)
       (r/filter valid-number?)
       r/foldcat
       count))

I’m guessing most of the work is done in the get-digits and valid-number? functions, so let’s change their declarations from defn to defnp and see what happens:

-(defn get-digits
+(defnp get-digits
   "Transforms a string or a number into a list of individual digits"
   [^Integer number] (map get-digit (str number)))

-(defn valid-number? [digits]
+(defnp valid-number? [digits]
   (and (not (apply distinct? digits))
        (apply <= digits)))

I’m also going to wrap the body of part1 in a p form so we can see the total time it takes to run:

 (defn part1 [numbers]
-  (->> numbers
-       vec
-       (r/map get-digits)
-       (r/filter valid-number?)
-       r/foldcat
-       count))
+  (p ::part1
+     (->> numbers
+          vec
+          (r/map get-digits)
+          (r/filter valid-number?)
+          r/foldcat
+          count)))

Step 4: Profile!

The final step is to set up a profile form and evaluate it. It’s a good idea when profiling to run the same code multiple times, so we’ll use the dotimes form to ensure that our code gets sufficiently exercised:

(let [numbers (range 402328 (inc 864247))]
  (profile {} (dotimes [_ 20] ;; Run "part1" 20 times
                (part1 numbers))))

After evaluating this block, the following prints out in my REPL:

pId                     nCalls        Max       Mean   MAD      Clock  Total
part1                       20     1.04s    879.33ms   ±6%    17.59s    100%
defn_valid-number?     9238400   189.72ms     1.53μs  ±44%    14.15s     80%
:tufte/compaction           23    93.60ms    65.46ms  ±22%     1.51s      9%
defn_get-digits        9238400    65.84ms    40.52ns ±135%   374.30ms     2%
Accounted                                                     33.62s    191%
Clock                                                         17.59s    100%

Cool! We’re profiling individual functions!

From this table it becomes QUITE apparent that we are spending most (80%!) of our time in the valid-number? function. Let’s see if we can refactor it to make it faster.

Here’s our original implementation:

(defnp valid-number? [digits]
  (and (not (apply distinct? digits))
       (apply <= digits)))

The first thing that jumps out at me is the order of these checks. I wonder: would the function fail faster if we did the check to see if our digits are in ascending order before we checked to see if it contains duplicates? Let’s refactor it and see:

--- a/src/advent_2019/day04.clj
+++ b/src/advent_2019/day04.clj
@@ -13,8 +13,8 @@
 (defnp valid-number? [digits]
-  (and (not (apply distinct? digits))
-       (apply <= digits)))
+  (and (apply <= digits)
+       (not (apply distinct? digits))))

Let’s profile it again and see if our times are any different:

pId                     nCalls        Max       Mean   MAD      Clock  Total
part1                       20   624.12ms   503.04ms  ±12%    10.06s    100%
defn_valid-number?     9238400   185.27ms   687.43ns  ±63%     6.35s     63%
:tufte/compaction           23   116.17ms    69.65ms  ±24%     1.60s     16%
defn_get-digits        9238400    92.09ms    42.18ns ±139%   389.64ms     4%
Accounted                                                     18.40s    183%
Clock                                                         10.06s    100%

Oh nice, we now spend only 63% of our time in that function instead of 80%, and our mean execution time has dropped from 879ms to 503ms! It’s still a very hot spot in our code, though, so we should look for any other optimizations we can make. Let’s figure out which of the two checks is taking the most time:

--- a/src/advent_2019/day04.clj
+++ b/src/advent_2019/day04.clj
@@ -13,8 +13,8 @@
 (defnp valid-number? [digits]
-  (and (apply <= digits)
-       (not (apply distinct? digits))))
+  (and (p ::ascending-digits? (apply <= digits))
+       (p ::repeat-digits? (not (apply distinct? digits)))))

When we run our profile again, we now see:

pId                     nCalls        Max       Mean   MAD      Clock  Total

part1                       20   802.12ms   571.52ms  ±15%    11.43s    100%
defn_valid-number?     9238400   202.84ms   842.79ns  ±74%     7.79s     68%
ascending-digits?      9238400   127.25ms   651.57ns  ±56%     6.02s     53%
:tufte/compaction           34   249.79ms    79.53ms  ±41%     2.70s     24%
defn_get-digits        9238400    97.20μs    31.72ns ±138%   293.07ms     3%
repeat-digits?            9100    17.10μs   567.46ns  ±70%     5.16ms     0%
vector-transformation       20   300.00ns   190.00ns  ±24%     3.80μs     0%

Accounted                                                     28.24s    247%
Clock                                                         11.43s    100%

Wow! We spend 53% of our time checking to see if the digits are in ascending order! I definitely thought the issue was going to be finding duplicates. But this makes sense if you consider that we’re inside an (and) block and we’re only passing the first condition 0.1% of the time. nCalls for ascending-digits? is over 1000x larger than for repeat-digits?.

So we want to speed up our ascending-digits? check. Well, apply is notoriously expensive. I wonder if there’s a faster way to assert that the digits are in ascending order? Let’s do a quick test to compare performance of each approach:

(profile {} (dotimes [_ 9000000]
             (p ::expanded (<= 1 2 3 4 5 6))
             (p ::with-apply (apply <= '(1 2 3 4 5 6)))))
pId           nCalls        Max       Mean   MAD      Clock  Total

expanded      9000000   191.50ms   321.38ns  ±76%     2.89s     34%
with-apply    9000000    89.80μs   236.68ns  ±22%     2.13s     25%
:tufte/compaction  22   228.93ms    93.36ms  ±42%     2.05s     24%

Accounted                                             7.08s     84%
Clock                                                 8.43s    100%

Well that was a red herring. Funny story: I spent an hour learning how to write macros to expand our digits before testing that hypothesis 😅. So maybe we just can’t make that ascending digits check any faster.
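For the curious, the macro experiment went roughly like this (a sketch; lte-expanded is a hypothetical name). The catch is that a macro can only splice in arguments known at compile time, so it can't help with a runtime sequence of digits anyway:

```clojure
(defmacro lte-expanded
  "Expands its arguments into a direct <= call at macroexpansion time,
   avoiding apply -- but only works when the digits are literals."
  [& digits]
  `(<= ~@digits))

;; (lte-expanded 1 2 3 4 5 6) macroexpands to (<= 1 2 3 4 5 6)
```

And as the benchmark above shows, the expanded form wasn't faster than apply in the first place.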

If we look at the times from our most recent profile, you can see that there’s quite a bit of unaccounted-for work:

pId                     nCalls        Max       Mean   MAD      Clock  Total
part1                       20   802.12ms   571.52ms  ±15%    11.43s    100%
defn_valid-number?     9238400   202.84ms   842.79ns  ±74%     7.79s     68%
...
defn_get-digits        9238400    97.20μs    31.72ns ±138%   293.07ms     3%

It looks like we’re missing 29% of all the work the part1 function does 🤔. Let’s refactor part1 with excessive let bindings to be a little more granular about each step:

 (defn part1 [numbers]
-  (->> numbers
-       vec
-       (r/map (p ::get-digits get-digits))
-       (r/filter (p ::valid-number? valid-number?))
-       r/foldcat
-       count))
+  (let [vec-nums (p ::to-vector (vec numbers))
+        digits (p ::to-digits (r/map get-digits vec-nums))
+        filtered (p ::filter (r/filter valid-number? digits))
+        collected (p ::collection (r/foldcat filtered))]
+    (p ::count (count collected))))

When we run our profile again, we get:

pId            nCalls        Max       Mean   MAD      Clock  Total

collection         20    62.29ms    47.55ms   ±7%   950.91ms    89%
to-vector          20    11.24ms     5.87ms   ±9%   117.41ms    11%
to-digits          20     7.30μs     4.85μs   ±9%    96.90μs     0%
filter             20     1.90μs     1.37μs  ±10%    27.30μs     0%
count              20     1.00μs   670.00ns  ±21%    13.40μs     0%

Accounted                                             1.07s    100%
Clock                                                 1.07s    100%

By moving the profiling calls outside of our r/map and r/filter functions, the profiler no longer steps down into the ::repeat-digits? and ::ascending-digits? blocks, and we can see that most of the computational work happens in the r/foldcat step. I wonder if, instead of collecting the results and counting them, we can use a fold to sum 1 for every valid number and return that total?

Let’s do some refactoring:

@@ -1,6 +1,14 @@
-(defn part1 [numbers]
+(defn part1-reducers-count [numbers]
   (let [vec-nums (p ::to-vector (vec numbers))
         digits (p ::to-digits (r/map get-digits vec-nums))
         filtered (p ::filter (r/filter valid-number? digits))
         collected (p ::collection (r/foldcat filtered))]
     (p ::count-collected (count collected))))
+
+(defn part1-reducers-fold [numbers]
+  (let [vec-nums (p ::to-vector (vec numbers))
+        digits (p ::to-digits (r/map get-digits vec-nums))
+        filtered (p ::filter (r/filter valid-number? digits))]
+    (p ::count-folded (->> filtered
+                           (r/map (constantly 1))
+                           (r/fold +)))))

Let’s compare them:

pId                      nCalls        Max       Mean   MAD      Clock  Total

part1-reducers-count         20    66.14ms    52.11ms   ±5%     1.04s     50%
part1-reducers-fold          20    59.42ms    51.29ms   ±2%     1.03s     50%
collection                   20    54.38ms    46.18ms   ±5%   923.64ms    45%
count-folded                 20    53.69ms    45.66ms   ±3%   913.24ms    44%
to-vector                    40    11.61ms     5.76ms   ±5%   230.52ms    11%
to-digits                    40     8.00μs     1.86μs  ±28%    74.20μs     0%
filter                       40     1.80μs   907.50ns  ±13%    36.30μs     0%
count-collected              20     9.70μs   965.00ns  ±91%    19.30μs     0%
Accounted                                                       4.14s    200%
Clock                                                           2.07s    100%

Only marginally faster. At this point, though, I’m curious to see how far we’ve come since we started. Let’s see how our performance compares with where we were before we added reducers!

Profiling Three Implementations

(defn part1-raw
  "Seq-based map/filter without using reducers"
  [numbers]
  (let [digits (p ::to-digits (map get-digits numbers))
        filtered (p ::filter (filter valid-number? digits))]
    (count filtered)))

(defn part1-reducers-count
  "Our first foldcat-based reducers implementation"
  [numbers]
  (let [vec-nums (p ::to-vector (vec numbers))
        digits (p ::to-digits (r/map get-digits vec-nums))
        filtered (p ::filter (r/filter valid-number? digits))
        collected (p ::collection (r/foldcat filtered))]
    (p ::count-collected (count collected))))

(defn part1-reducers-fold
  "Our new fold-based counting implementation"
  [numbers]
  (let [vec-nums (p ::to-vector (vec numbers))
        digits (p ::to-digits (r/map get-digits vec-nums))
        filtered (p ::filter (r/filter valid-number? digits))]
    (p ::count-folded (->> filtered
                           (r/map (constantly 1))
                           (r/fold +)))))

Let’s see what we get when we run it like this:

(let [numbers (range 402328 (inc 864247))]
  (profile {} (dotimes [_ 20]
                (p ::part1-raw (part1-raw numbers))
                (p ::part1-reducers-cat (part1-reducers-count numbers))
                (p ::part1-reducers-fold (part1-reducers-fold numbers)))))
pId               nCalls        Max       Mean   MAD      Clock  Total

part1-raw             20   664.77ms   573.35ms   ±9%    11.47s     72%
...
part1-reducers-cat    20   247.77ms   126.40ms  ±31%     2.53s     16%
...
part1-reducers-fold   20   224.75ms   100.47ms  ±29%     2.01s     13%
...
Accounted                                               37.64s    235%
Clock                                                   16.00s    100%

Oh wow, over 20 runs, our new folding reducer is averaging just over 100ms per run!

Bear in mind that some of this speedup was thanks to the JVM warming up. You can see that our original reducer implementation that we thought took around 230ms to run actually clocks in at 126ms (with profiling and after some changes). But this repeated profiling also confirms that our original implementation (with the type hinting) clocks in at well over half a second.

Note as well that the above metrics were taken after we swapped the order of the checks in our valid-number? function. If we set it back, our speed looks like this:

Performance using the old valid-number? implementation:

pId                    nCalls        Max       Mean   MAD      Clock  Total

part1-raw                  20     1.08s      1.01s    ±3%    20.13s     74%
...
part1-reducers-cat-count   20   267.80ms   181.09ms  ±26%     3.62s     13%
...
part1-reducers-fold        20   236.38ms   173.09ms  ±22%     3.46s     13%
...
Accounted                                                     1.13m    248%
Clock                                                        27.21s    100%

Conclusion

So by leveraging Tufte and profiling our functions, we were able to:

  • More accurately measure the performance of our application (181ms average execution, not 236ms).
  • Shave 45% off our execution time (from 181ms down to 100ms)!

What an adventure!

Over the last five days we’ve brought our execution time down from 9000ms to 100ms — very nearly two orders of magnitude faster. I think by now we’re reaching the point of diminishing returns. Do you know any other tricks that we could use to speed this up? Please let me know on Twitter!

Thanks, and see you next time! 👋


Andrew J. Pierce collects yixing teapots and lives in Virginia with his wife, son, and cat Ziggy. You can follow him on Twitter.