* Profiling and current results

Profiling with perf shows that this indeed works fast, with little left to
optimize. The code is well tuned and has reached performance levels where
further gains require counting memory accesses and minding cache alignment.

Currently my methodology is rather crude. I run cl-bench to compare ECL
versions against each other and against other implementations, and I generate
flamegraphs from a microbenchmark defined for ECL:

- reader :: this tests dispatch of optimized accessors (readers in microbench)
- method :: this tests dispatch of other generic funcs (2 args, empty body)

Both tests are designed so that every call hits the cache, provided the cache
is big enough, so we are testing the happy path here. We achieve that by
prewarming the cache.

This table tracks progress by measuring the effect of each optimization step
(time relative to the ECL baseline, lower is better):

|              | reader | method | notes                                     |
|--------------+--------+--------+-------------------------------------------|
| ECL (base)   |   1.00 |   1.00 | this is a baseline for comparisons        |
| ECL (ver1)   |   0.90 |   0.98 | this only moves keys to the stack         |
| ECL (ver2)   |   0.75 |   0.94 | optimize hot path, remove lru, resize     |
| ECL (ver3)   |   0.80 |   0.94 | make cache local to each generic function |
| ECL (ver4)   |   0.55 |   0.94 | smaller records, inlined key, better hash |
| CCL (base)   |   0.60 |   0.30 | this is CCL, the closest competitor       |


** ECL (ver1)

Optimizes only memory access to the specialization vector (keys).  Noticeable
improvement of 5% shows that we are memory bound.
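
A minimal sketch of the idea, assuming the keys previously lived in a shared
heap-allocated vector (the notes do not say). The names ~key_buffer~,
~fill_keys~ and ~MAX_KEYS~ are hypothetical and not taken from ECL:

#+begin_src c
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS 8   /* hypothetical cap on the number of specialized arguments */

/* Hypothetical key buffer; the caller declares it as a local variable,
   so it lives in the caller's stack frame rather than on the heap. */
typedef struct {
    size_t    count;
    uintptr_t keys[MAX_KEYS];
} key_buffer;

/* Copy the class identifiers of the arguments into the caller's buffer.
   Returns 0 when there are too many arguments for the fixed buffer. */
static int fill_keys(key_buffer *buf, const uintptr_t *arg_classes, size_t nargs) {
    if (nargs > MAX_KEYS) return 0;
    for (size_t i = 0; i < nargs; i++)
        buf->keys[i] = arg_classes[i];
    buf->count = nargs;
    return 1;
}
#+end_src

A dispatch function would declare ~key_buffer buf;~ locally, so the keys stay
in memory that is most likely already in cache.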

** ECL (ver2)

Optimizes a few things -- removes interrupt disabling from the search path and
replaces generation management with a single "used" tag. Also, when the cache
pressure is too high, we resize it.
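
The following C sketch shows one way a "used" tag and pressure-driven resizing
can fit together. It is not ECL's code: the probe window, the aging policy, the
~pressure~ counter and all names are assumptions made for illustration.

#+begin_src c
#include <stdint.h>
#include <stdlib.h>

#define PROBE_LIMIT 4              /* hypothetical probe window */

typedef struct {
    uintptr_t key;                 /* 0 marks an empty slot */
    void     *value;
    uint8_t   used;                /* set on every hit, cleared by aging */
} entry;

typedef struct {
    entry  *slots;
    size_t  size;                  /* always a power of two */
    size_t  pressure;              /* inserts that found no empty or stale slot */
} cache;

static entry *slot_at(cache *c, uintptr_t key, size_t i) {
    return &c->slots[(key + i) & (c->size - 1)];
}

static int cache_init(cache *c, size_t size) {
    c->slots = calloc(size, sizeof(entry));
    c->size = size;
    c->pressure = 0;
    return c->slots != NULL;
}

/* Hot path: no generation counters, just a tag write on a hit. */
static entry *cache_lookup(cache *c, uintptr_t key) {
    for (size_t i = 0; i < PROBE_LIMIT; i++) {
        entry *e = slot_at(c, key, i);
        if (e->key == key) { e->used = 1; return e; }
    }
    return NULL;
}

/* Call only after cache_lookup missed. Prefers an empty slot, then a slot
   whose tag is clear; otherwise ages the window and overwrites its first slot. */
static void cache_insert(cache *c, uintptr_t key, void *value) {
    entry *victim = NULL;
    for (size_t i = 0; i < PROBE_LIMIT && !victim; i++)
        if (slot_at(c, key, i)->key == 0) victim = slot_at(c, key, i);
    for (size_t i = 0; i < PROBE_LIMIT && !victim; i++)
        if (!slot_at(c, key, i)->used) victim = slot_at(c, key, i);
    if (!victim) {
        /* Every candidate was used recently: age the window, note the pressure. */
        for (size_t i = 0; i < PROBE_LIMIT; i++) slot_at(c, key, i)->used = 0;
        victim = slot_at(c, key, 0);
        c->pressure++;
    }
    victim->key = key;
    victim->value = value;
    victim->used = 1;
}

/* Double the table once `limit` pressure events have accumulated. */
static int cache_maybe_resize(cache *c, size_t limit) {
    cache bigger;
    if (c->pressure < limit || !cache_init(&bigger, c->size * 2)) return 0;
    for (size_t i = 0; i < c->size; i++)
        if (c->slots[i].key != 0)
            cache_insert(&bigger, c->slots[i].key, c->slots[i].value);
    free(c->slots);
    bigger.pressure = 0;
    *c = bigger;
    return 1;
}
#+end_src

The point of the design is that a hit costs one comparison and one byte store,
while all bookkeeping (aging, counting pressure, resizing) stays on the miss
path.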

We may observe that readers benefit much more from these optimizations than
ordinary methods. This may be explained by the overhead of calling the
effective method function (an optimized reader does only a single memory
access).

At this point the flamegraph indicates:
- reader :: hashing 20% / search 50% / rest 30%
- method :: hashing 15% / search 20% / rest 65%

The method flamegraph is less readable, but it seems that binding the (empty)
next methods and invoking the closure dominate the hashing and search time.
That is a separate angle for optimizing ECL generic function invocations. I'm
going to keep including results for method, but I don't expect drastic
improvements.

At this point it is worth considering whether I'm optimizing the right thing. I
believe that I am -- accessors are a very prominent class of methods and
improving them will benefit most programs.

*Backlog optimizations*
- optimize for no auxiliary methods (similar to SBCL's "fast method")
- make next-methods a lexical binding (no dynamic binding overhead)
- inline effective method in the discriminator (FGD)

It is worth noting that in the unmodified ~cl-bench~ CLOS benchmark the
difference is more substantial, because all test methods have bodies that
invoke accessors. Another noteworthy thing is that FGD, without inlining
effective methods, will likely hit a similar bottleneck.

** ECL (ver3)

This change moves from a per-thread global cache shared by all GFs to a
per-thread, per-GF cache. The initial cache size is smaller (128 vs 4096
entries).

As expected, the additional pointer chase costs performance compared to the
previous version for both ~reader~ and ~method~ (5%-7%). This change enables
further improvements: fewer keys and smaller records (see ver4).

Unexpectedly, ~cl-bench~ shows a ~30% speedup for its constructor tests
(CLOS/SIMPLE-INSTANTIATE and CLOS/INSTANTIATE). The most likely cause is that
each cache has far fewer entries, so we effectively have fewer collisions.
Since constructors span multiple generic functions, the effect compounds
because there is no cross-gf eviction. Fewer entries also mean shorter probe
chains. This seems to indicate that reducing cache misses has a substantial
effect.
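
A minimal sketch of the per-GF layout, assuming a direct-mapped table keyed by
a class identifier. The per-thread aspect is left out, and the type and
function names are hypothetical; only the initial size of 128 comes from the
text above.

#+begin_src c
#include <stdint.h>
#include <string.h>

#define GF_CACHE_SIZE 128          /* initial per-GF size quoted above */

typedef struct {
    uintptr_t class_id;            /* 0 is reserved for "empty" */
    void     *method;
} gf_entry;

/* Each generic function owns its cache, so the gf itself is not part
   of the key and entries of different gfs can never evict each other. */
typedef struct {
    gf_entry cache[GF_CACHE_SIZE];
} generic_function;

static void gf_invalidate(generic_function *gf) {
    /* Invalidating a gf simply empties its own cache. */
    memset(gf->cache, 0, sizeof gf->cache);
}

static void gf_store(generic_function *gf, uintptr_t class_id, void *method) {
    gf_entry *e = &gf->cache[class_id & (GF_CACHE_SIZE - 1)];
    e->class_id = class_id;
    e->method = method;
}

static void *gf_lookup(generic_function *gf, uintptr_t class_id) {
    gf_entry *e = &gf->cache[class_id & (GF_CACHE_SIZE - 1)];
    return e->class_id == class_id ? e->method : NULL;
}
#+end_src

Two generic functions specialized on the same class land in the same slot
index, but in different tables, which is the "no cross-gf eviction" property
mentioned above.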

** ECL (ver4)

In this iteration we incorporate numerous tricks to reduce memory pressure and
avoid computations (mostly for readers, but some also benefit normal methods):

1. capitalize on ver3 changes (no gf spec, no clearlist, spec is a vector)

   Thanks to making the cache local to a generic function, we don't need to
   maintain a thread-safe clearlist (because gf invalidation now empties the
   whole cache), and we can reduce the number of keys by 1 (the gf is no longer
   a specializer). We also don't cons keys for arguments that are not
   specialized.

   Moreover, while not directly related, we change the specialization profile
   to be a vector (not a list) to reduce pointer chasing. This benefits
   functions with many arguments the most, because we skip one pointer chase
   per argument while filling the spec vector.

2. insert the first key directly in the hash table entry (-1 ptr chase/probe)

   This change is a win because storing the first key directly in the entry
   removes one pointer chase per _each_ probe, so for a probe chain of length N
   it saves N dereferences per dispatch.

3. various microoptimizations and prepare for better hash function

   This change has a minor yet noticeable impact on performance (~3% for
   readers). The speedup comes from inlining a call that checks whether
   instances are up to date. As a preparatory step we also duplicate the path
   for hashing 1 vs many keys. After this change readers are tied with CCL
   performance-wise.

4. use a better hash function from the xxhash family

   Even for a single key this provides a noticeable speed improvement of 5%.
   This change is presented as a separate step so it can be measured on its
   own.
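
Steps 2 and 4 can be sketched together in C. The entry layout and the function
names are assumptions, not ECL's definitions. The mixer reproduces only the
final avalanche step of XXH64, not the full algorithm (no seed, length or
round accumulation), and how ECL actually uses the hash is not shown in these
notes.

#+begin_src c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: the first key is stored inline, so a failed probe
   is decided without following the pointer to the remaining keys. */
typedef struct {
    uintptr_t        key0;
    const uintptr_t *rest;         /* keys 1..n-1, NULL for single-key entries */
    void            *value;
} inline_entry;

static int entry_matches(const inline_entry *e, const uintptr_t *keys, size_t n) {
    if (e->key0 != keys[0]) return 0;      /* most probes stop here */
    for (size_t i = 1; i < n; i++)
        if (e->rest[i - 1] != keys[i]) return 0;
    return 1;
}

/* Avalanche (finalization) step of XXH64. */
static uint64_t avalanche64(uint64_t h) {
    h ^= h >> 33;
    h *= 0xC2B2AE3D27D4EB4FULL;
    h ^= h >> 29;
    h *= 0x165667B19E3779F9ULL;
    h ^= h >> 32;
    return h;
}

/* Separate paths for one key and for many keys; the many-key
   combination used here is an illustrative choice only. */
static uint64_t hash_keys(const uintptr_t *keys, size_t n) {
    if (n == 1) return avalanche64((uint64_t)keys[0]);
    uint64_t h = 0;
    for (size_t i = 0; i < n; i++)
        h = avalanche64(h ^ (uint64_t)keys[i]);
    return h;
}
#+end_src

For a reader, ~n~ is 1, so a probe touches only the entry itself and the hash
is a handful of shifts and multiplications.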

Version 4 is composed of multiple optimizations that reduce the memory pressure
and speed up caching. None of these changes benefited the ~method~ benchmark
significantly. Here is a summary:

| ECL (ver4.1) |   0.70 |   0.95 | smaller keys, no removal, vector profile  |
| ECL (ver4.2) |   0.65 |   0.93 | inline the first key (cheaper probing)    |
| ECL (ver4.3) |   0.60 |   0.91 | inline the check for instance up-to-date  |
| ECL (ver4.4) |   0.55 |   0.92 | replace the hash with xxh64               |

At this point the flamegraph indicates:
- reader :: hashing 10% / search 55% / rest 35%
- method :: hashing 13% / search 17% / rest 70%

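Our improvements show in both profiles: the combined share of hashing and
search fell relative to the rest of the computation (reader 70% -> 65%, method
35% -> 30%). The search share for readers rose slightly only because the total
time shrank.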

The flamegraph also reaffirms our previous conclusion that there is still a
lot to win when it comes to accessors, while other methods won't benefit much
from improving dispatch. For example, speeding up search 5x would cut reader
time by about 44% (4/5 of the 55% share), but method call time by only about
14% (4/5 of the 17% share).
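
Such estimates follow from Amdahl-style arithmetic on the flamegraph shares. A
small helper, with a hypothetical name, makes the calculation explicit:

#+begin_src c
/* Fraction of the total time saved when a part that takes `share` of the
   total (0..1) becomes `speedup` times faster. */
static double time_saved(double share, double speedup) {
    return share - share / speedup;
}
#+end_src

For example, ~time_saved(0.55, 5.0)~ estimates the reader case and
~time_saved(0.17, 5.0)~ the method case, using the search shares listed above.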

Together these changes make optimized accessors faster than CCL's. What is left:
- precomputing the class hash so that 1-ary keys may avoid recomputing it [hash]
- moving ensure-up-to-date-instance to the cache miss path                [rest]
- caching a direct memory address of the slot value (not its index)       [rest]
- flattening the hash table and reducing the number of probes           [search]