Profiling Clojure Programs with Tufte
December 23, 2019
Hi, Friends! 👋
In our previous adventures, we were working on making our Advent of Code Day 04 solution faster. Here's a quick recap of the situation:
- Our initial solution took over 9000(!) ms to run on an Intel i7-8700K.
- We employed type hinting to get the runtime down to ~510ms.
- Then we used reducers to roughly double the speed again, resulting in a current benchmark of ~230ms.
So now we're at a point where, if we want to go faster, we need to be intelligent about the changes we make. It's time to start measuring the performance of individual functions!
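As a reminder, the whole-run numbers above came from coarse measurements; a minimal way to reproduce that kind of figure is clojure.core's time macro, run against our existing part1 and the puzzle's input range:

```clojure
;; Coarse whole-run timing: `time` prints "Elapsed time: N msecs" to *out*
;; and returns the body's result. `part1` is our existing solution fn.
(let [numbers (range 402328 (inc 864247))]
  (time (part1 numbers)))
```

This is fine for ballpark numbers, but it tells us nothing about where inside part1 the time goes, which is exactly what we need next.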
Using Tufte to Profile Clojure Code
A quick search for profiling tools in Clojure turned up Tufte, and in the short time that I've spent with it so far, I have come away quite impressed. Make sure you give @ptaoussanis some love on Twitter!
Let's look at setting it up for use in our code.
Step 1: Add Project Dependency
In our project.clj file, we need to update our dependencies to include Tufte:
--- a/project.clj
+++ b/project.clj
@@ -3,7 +3,8 @@
:url "https://github.com/ajpierce/advent-2019"
:license {:name "MIT"
:url "https://opensource.org/licenses/MIT"}
- :dependencies [[org.clojure/clojure "1.10.1"]]
+ :dependencies [[org.clojure/clojure "1.10.1"]
+ [com.taoensso/tufte "2.1.0"]]
:global-vars {*warn-on-reflection* true}
:profiles {:uberjar {:aot :all}
:day01 {:main advent-2019.day01
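(As an aside, if you're on the Clojure CLI rather than Leiningen, the equivalent deps.edn entry would look something like this, assuming the same versions:)

```clojure
;; deps.edn equivalent of the Leiningen dependency above
{:deps {org.clojure/clojure {:mvn/version "1.10.1"}
        com.taoensso/tufte  {:mvn/version "2.1.0"}}}
```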
Step 2: Configure Tufte in our Namespace
Next, we'll require Tufte into the namespace we're interested in profiling, and then configure it to print out stats using the basic println handler:
(ns advent-2019.day04
(:gen-class)
(:require [clojure.core.reducers :as r]
[taoensso.tufte :as tufte :refer (defnp p profile)]))
(tufte/add-basic-println-handler! ;; 👈 Added these three lines
  {:format-pstats-opts
   {:columns [:n-calls :max :mean :mad :clock :total]}})
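(If I'm reading the Tufte docs right, you can also skip the handler entirely and capture the stats as data: the profiled form returns the body's result alongside a pstats object, which format-pstats renders as a table string. A quick sketch, with a throwaway ::work id:)

```clojure
;; Sketch: capture profiling stats as data instead of printing them.
;; `profiled` returns [result pstats]; `format-pstats` renders the table.
(let [[result pstats] (tufte/profiled {} (p ::work (reduce + (range 1e6))))]
  (println (tufte/format-pstats pstats))
  result)
```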
Step 3: Specify Functions to Profile
Selecting functions to profile is very straightforward. According to the Tufte docs:
Just wrap them with p and give them an id (namespaced keyword).

I also noticed in the Tufte README a defnp form that allows you to omit the p wrapping.
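In other words, the two forms below should be roughly equivalent (my reading of the docs; the exact id Tufte generates for defnp may differ):

```clojure
;; Wrapping the function body manually with `p`...
(defn square [n] (p ::square (* n n)))

;; ...versus letting `defnp` do the wrapping for us:
(defnp square [n] (* n n))
```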
Ok, so we need to pick some functions where we think our program is most likely to spend its time. Here's our original code:
(defn part1 [numbers]
(->> numbers
vec
(r/map get-digits)
(r/filter valid-number?)
r/foldcat
count))
I'm guessing most of the work is done in the get-digits and valid-number? functions, so let's change their declarations from defn to defnp and see what happens:
-(defn get-digits
+(defnp get-digits
"Transforms a string or a number into a list of individual digits"
[^Integer number] (map get-digit (str number)))
-(defn valid-number? [digits]
+(defnp valid-number? [digits]
(and (apply <= digits)
(not (apply distinct? digits))))
I'm also going to add a p binding to part1 so we can see the total time it takes to run:
(defn part1 [numbers]
- (->> numbers
- vec
- (r/map get-digits)
- (r/filter valid-number?)
- r/foldcat
- count))
+ (p ::part1
+ (->> numbers
+ vec
+ (r/map get-digits)
+ (r/filter valid-number?)
+ r/foldcat
+ count)))
Step 4: Profile!
The final step is to set up a profile form and evaluate it. It's a good idea when profiling to run the same code multiple times, so we're going to use the dotimes form to ensure that our code gets sufficiently exercised:
(let [numbers (range 402328 (inc 864247))]
(profile {} (dotimes [_ 20] ;; Run "part1" 20 times
(part1 numbers))))
After evaluating this block, the following prints out in my REPL:
pId                   nCalls       Max      Mean    MAD     Clock  Total

part1                     20     1.04s  879.33ms    ±6%   17.59s   100%
defn_valid-number?   9238400  189.72ms    1.53μs   ±44%   14.15s    80%
:tufte/compaction         23   93.60ms   65.46ms   ±22%    1.51s     9%
defn_get-digits      9238400   65.84ms   40.52ns  ±135% 374.30ms     2%

Accounted                                                 33.62s    191%
Clock                                                     17.59s    100%
Cool! We're profiling individual functions!

From this chart it becomes QUITE apparent that we are spending most (80%!) of our time in the valid-number? function. Let's see if we can refactor it to make it faster.
Here's our original implementation:
(defnp valid-number? [digits]
(and (not (apply distinct? digits))
(apply <= digits)))
The first thing that jumps out at me is the order of these checks. I wonder: would the function fail faster if we checked that our digits are in ascending order before checking whether they contain duplicates? Let's refactor it and see:
--- a/src/advent_2019/day04.clj
+++ b/src/advent_2019/day04.clj
@@ -13,8 +13,8 @@
(defnp valid-number? [digits]
- (and (not (apply distinct? digits))
- (apply <= digits)))
+ (and (apply <= digits)
+ (not (apply distinct? digits))))
Let's profile it again and see if our times are any different:
pId                   nCalls       Max      Mean    MAD     Clock  Total

part1                     20  624.12ms  503.04ms   ±12%   10.06s   100%
defn_valid-number?   9238400  185.27ms  687.43ns   ±63%    6.35s    63%
:tufte/compaction         23  116.17ms   69.65ms   ±24%    1.60s    16%
defn_get-digits      9238400   92.09ms   42.18ns  ±139% 389.64ms     4%

Accounted                                                 18.40s    183%
Clock                                                     10.06s    100%
Oh nice, we now spend only 63% of our time in the function instead of 80%, and our mean execution time has dropped from 879ms to 503ms! It's still a very hot spot in our code, though, so we should look to see if there are any other optimizations we can make. Let's figure out which of these two checks is taking the most time:
--- a/src/advent_2019/day04.clj
+++ b/src/advent_2019/day04.clj
@@ -13,8 +13,8 @@
(defnp valid-number? [digits]
- (and (apply <= digits)
- (not (apply distinct? digits))))
+ (and (p ::ascending-digits? (apply <= digits))
+ (p ::repeat-digits? (not (apply distinct? digits)))))
When we run our profile again, we now see:
pId                      nCalls       Max      Mean    MAD     Clock  Total

part1                        20  802.12ms  571.52ms   ±15%   11.43s   100%
defn_valid-number?      9238400  202.84ms  842.79ns   ±74%    7.79s    68%
ascending-digits?       9238400  127.25ms  651.57ns   ±56%    6.02s    53%
:tufte/compaction            34  249.79ms   79.53ms   ±41%    2.70s    24%
defn_get-digits         9238400   97.20μs   31.72ns  ±138% 293.07ms     3%
repeat-digits?             9100   17.10μs  567.46ns   ±70%   5.16ms     0%
vector-transformation        20  300.00ns  190.00ns   ±24%   3.80μs     0%

Accounted                                                    28.24s    247%
Clock                                                        11.43s    100%
Wow! We spend 53% of our time checking to see if the digits are in ascending order! I definitely thought the issue was going to be finding duplicates.

But this makes sense if you consider that we're inside an (and) block and we're only passing the first condition 0.1% of the time: nCalls for ascending-digits? is over 1000x larger than for repeat-digits?.
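A quick back-of-the-envelope check on the nCalls column from the table above bears this out:

```clojure
;; repeat-digits? only ran when ascending-digits? passed:
;; 9100 calls out of 9238400.
(/ 9100 9238400.0)
;; => roughly 0.00098, i.e. about 0.1% of candidates pass the first check
```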
So we want to speed up our ascending-digits? check. Well, apply is notoriously expensive. I wonder if there's a faster way to assert that the digits are in ascending order? Let's do a quick test to compare the performance of each approach:
(profile {} (dotimes [_ 9000000]
(p ::expanded (<= 1 2 3 4 5 6))
(p ::with-apply (apply <= '(1 2 3 4 5 6)))))
pId                 nCalls       Max      Mean   MAD    Clock  Total

expanded           9000000  191.50ms  321.38ns  ±76%   2.89s    34%
with-apply         9000000   89.80μs  236.68ns  ±22%   2.13s    25%
:tufte/compaction       22  228.93ms   93.36ms  ±42%   2.05s    24%

Accounted                                              7.08s    84%
Clock                                                  8.43s   100%
Well, that was a red herring. Funny story: I spent an hour learning how to write macros to expand our digits before testing that hypothesis 😅. So maybe we just can't make that ascending-digits check any faster.
If we look at our initial times, you can see that there's quite a bit of unaccounted-for work:
pId                  nCalls       Max      Mean    MAD     Clock  Total

part1                    20  802.12ms  571.52ms   ±15%   11.43s   100%
defn_valid-number?  9238400  202.84ms  842.79ns   ±74%    7.79s    68%
...
defn_get-digits     9238400   97.20μs   31.72ns  ±138% 293.07ms     3%
It looks like we're missing 29% of all the work the part1 function does 🤔. Let's refactor part1 with excessive let bindings to be a little more granular about each step:
(defn part1 [numbers]
- (->> numbers
- vec
- (r/map (p ::get-digits get-digits))
- (r/filter (p ::valid-number? valid-number?))
- r/foldcat
- count))
+ (let [vec-nums (p ::to-vector (vec numbers))
+ digits (p ::to-digits (r/map get-digits vec-nums))
+ filtered (p ::filter (r/filter valid-number? digits))
+ collected (p ::collection (r/foldcat filtered))]
+ (p ::count (count collected))))
When we run our profile again, we get:
pId          nCalls      Max      Mean   MAD     Clock  Total

collection       20  62.29ms   47.55ms   ±7%  950.91ms    89%
to-vector        20  11.24ms    5.87ms   ±9%  117.41ms    11%
to-digits        20   7.30μs    4.85μs   ±9%   96.90μs     0%
filter           20   1.90μs    1.37μs  ±10%   27.30μs     0%
count            20   1.00μs  670.00ns  ±21%   13.40μs     0%

Accounted                                        1.07s   100%
Clock                                            1.07s   100%
By moving the profiling step outside of our r/map and r/filter functions, it looks like the profiler is no longer stepping down into the ::repeat-digits? and ::ascending-digits? blocks, and most of the computational energy is now being spent on the r/foldcat step.

I wonder if we can instead use folding to just sum a value for every result and return that? Let's do some refactoring:
@@ -1,6 +1,14 @@
-(defn part1 [numbers]
+(defn part1-reducers-count [numbers]
(let [vec-nums (p ::to-vector (vec numbers))
digits (p ::to-digits (r/map get-digits vec-nums))
filtered (p ::filter (r/filter valid-number? digits))
collected (p ::collection (r/foldcat filtered))]
(p ::count-collected (count collected))))
+
+(defn part1-reducers-fold [numbers]
+ (let [vec-nums (p ::to-vector (vec numbers))
+ digits (p ::to-digits (r/map get-digits vec-nums))
+ filtered (p ::filter (r/filter valid-number? digits))]
+ (p ::count-folded (->> filtered
+ (r/map (constantly 1))
+ (r/fold +)))))
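For what it's worth, r/fold also has a three-argument arity that takes separate combine and reduce functions, which would let us skip the (r/map (constantly 1)) step entirely. An untested sketch:

```clojure
;; Sketch: count matching items directly in the reduce step.
;; `+` with no args returns 0, so it doubles as the combine fn's identity
;; and merges the per-chunk partial counts.
(defn part1-fold-direct [numbers]
  (->> (vec numbers)
       (r/map get-digits)
       (r/filter valid-number?)
       (r/fold + (fn [acc _] (inc acc)))))
```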
Let's compare them:
pId                   nCalls      Max      Mean   MAD     Clock  Total

part1-reducers-count      20  66.14ms   52.11ms   ±5%     1.04s    50%
part1-reducers-fold       20  59.42ms   51.29ms   ±2%     1.03s    50%
collection                20  54.38ms   46.18ms   ±5%  923.64ms    45%
count-folded              20  53.69ms   45.66ms   ±3%  913.24ms    44%
to-vector                 40  11.61ms    5.76ms   ±5%  230.52ms    11%
to-digits                 40   8.00μs    1.86μs  ±28%   74.20μs     0%
filter                    40   1.80μs  907.50ns  ±13%   36.30μs     0%
count-collected           20   9.70μs  965.00ns  ±91%   19.30μs     0%

Accounted                                                 4.14s   200%
Clock                                                     2.07s   100%
Only marginally faster. At this point, though, I'm curious to see how far we've come since we started. Let's see how our performance compares with how it was before we added reducers!
Profiling Three Implementations
(defn part1-raw
"Seq-based map/filter without using reducers"
[numbers]
(let [digits (p ::to-digits (map get-digits numbers))
filtered (p ::filter (filter valid-number? digits))]
(count filtered)))
(defn part1-reducers-count
"Our first foldcat-based reducers implementation"
[numbers]
(let [vec-nums (p ::to-vector (vec numbers))
digits (p ::to-digits (r/map get-digits vec-nums))
filtered (p ::filter (r/filter valid-number? digits))
collected (p ::collection (r/foldcat filtered))]
(p ::count-collected (count collected))))
(defn part1-reducers-fold
"Our new fold-based counting implementation"
[numbers]
(let [vec-nums (p ::to-vector (vec numbers))
digits (p ::to-digits (r/map get-digits vec-nums))
filtered (p ::filter (r/filter valid-number? digits))]
(p ::count-folded (->> filtered
(r/map (constantly 1))
(r/fold +)))))
Let's see what we get when we run it like this:
(let [numbers (range 402328 (inc 864247))]
(profile {} (dotimes [_ 20]
(p ::part1-raw (part1-raw numbers))
(p ::part1-reducers-cat (part1-reducers-count numbers))
(p ::part1-reducers-fold (part1-reducers-fold numbers)))))
pId                  nCalls       Max      Mean   MAD   Clock  Total

part1-raw                20  664.77ms  573.35ms   ±9%  11.47s    72%
...
part1-reducers-cat       20  247.77ms  126.40ms  ±31%   2.53s    16%
...
part1-reducers-fold      20  224.75ms  100.47ms  ±29%   2.01s    13%
...
Accounted                                              37.64s   235%
Clock                                                  16.00s   100%
Oh wow, over 20 runs, our new folding reducer is averaging just over 100ms per run!
Bear in mind that some of this speedup was thanks to the JVM warming up. You can see that our original reducer implementation that we thought took around 230ms to run actually clocks in at 126ms (with profiling and after some changes). But this repeated profiling also confirms that our original implementation (with the type hinting) clocks in at well over half a second.
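(If you want benchmarks that handle JVM warmup, GC, and timer resolution more rigorously than repeated profile runs, the Criterium library is the usual tool. I didn't use it in this post, but a sketch would look something like this, assuming criterium "0.4.5" has been added to :dependencies:)

```clojure
;; Sketch: steady-state benchmarking with Criterium.
;; quick-bench runs warmup iterations first, then prints mean
;; execution time and standard deviation for the expression.
(require '[criterium.core :refer [quick-bench]])

(let [numbers (range 402328 (inc 864247))]
  (quick-bench (part1 numbers)))
```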
Bear in mind as well that the above metrics are after we swapped the order of the checks in our valid-number? function. If we set it back, our speed looks like this:

Performance using the old valid-number? implementation:
pId                       nCalls       Max      Mean   MAD   Clock  Total

part1-raw                     20     1.08s     1.01s   ±3%  20.13s    74%
...
part1-reducers-cat-count      20  267.80ms  181.09ms  ±26%   3.62s    13%
...
part1-reducers-fold           20  236.38ms  173.09ms  ±22%   3.46s    13%
...
Accounted                                                    1.13m   248%
Clock                                                       27.21s   100%
Conclusion
So by leveraging Tufte and profiling our functions, we were able to:
- More accurately measure the performance of our application (181ms avg execution, not 236ms)
- Shave 45% off our execution time (181ms down to 100ms)!
What an adventure!
Over the last five days we've brought our execution time down from 9000ms to 100ms, very nearly two orders of magnitude faster. I think by now we're reaching the point of diminishing returns. Do you know any other tricks that we could use to speed this up? Please let me know on Twitter!
Thanks, and see you next time! 👋