From 5466296f02c2d53894aa51ef035ddf690b5d2224 Mon Sep 17 00:00:00 2001
From: Raymond Hettinger
Date: Fri, 2 May 2003 20:11:29 +0000
Subject: [PATCH] Research notes and explorations for optimizing Python dictionaries.
---
 Objects/dictnotes.txt | 189 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 189 insertions(+)
 create mode 100644 Objects/dictnotes.txt

diff --git a/Objects/dictnotes.txt b/Objects/dictnotes.txt
new file mode 100644
index 00000000000..dcfb7a0e272
--- /dev/null
+++ b/Objects/dictnotes.txt
@@ -0,0 +1,189 @@
NOTES ON OPTIMIZING DICTIONARIES
================================


Principal Use Cases for Dictionaries
------------------------------------

Passing keyword arguments
    Typically, one read and one write for 1 to 3 elements.
    Occurs frequently in normal Python code.

Class method lookup
    Dictionaries vary in size, with 8 to 16 elements being common.
    Usually written once with many lookups.
    When base classes are used, there are many failed lookups
    followed by a lookup in a base class.

Instance attribute lookup and Global variables
    Dictionaries vary in size. 4 to 10 elements are common.
    Both reads and writes are common.

Builtins
    Frequent reads. Almost never written.
    Size: 126 interned strings (as of Py2.3b1).
    A few keys are accessed much more frequently than others.

Uniquification
    Dictionaries of any size. Bulk of work is in creation.
    Repeated writes to a smaller set of keys.
    Single read of each key.

    * Removing duplicates from a sequence.
        dict.fromkeys(seqn).keys()
    * Counting elements in a sequence.
        for e in seqn: d[e] = d.get(e, 0) + 1
    * Accumulating items in a dictionary of lists.
        for k, v in itemseqn: d.setdefault(k, []).append(v)

Membership Testing
    Dictionaries of any size. Created once and then rarely changed.
    Single write to each key.
    Many calls to __contains__() or has_key().
    Similar access patterns occur with replacement dictionaries
    such as with the % formatting operator.


Data Layout (assuming a 32-bit box with 64 bytes per cache line)
----------------------------------------------------------------

Small dicts (8 entries) are attached to the dictobject structure
and the whole group nearly fills two consecutive cache lines.

Larger dicts use the first half of the dictobject structure (one cache
line) and a separate, contiguous block of entries (at 12 bytes each
for a total of 5.333 entries per cache line).


Tunable Dictionary Parameters
-----------------------------

* PyDict_MINSIZE. Currently set to 8.
  Must be a power of two. New dicts have to zero-out every cell.
  Each additional group of 8 entries consumes 1.5 cache lines.
  Increasing the size improves the sparseness of small dictionaries
  but costs time to read in the additional cache lines if they are
  not already in cache. That case is common when keyword arguments
  are passed.

* Maximum dictionary load in PyDict_SetItem. Currently set to 2/3.
  Increasing this ratio makes dictionaries more dense, resulting
  in more collisions. Decreasing it improves sparseness at the
  expense of spreading entries over more cache lines and at the
  cost of total memory consumed.

  The load test occurs in highly time-sensitive code. Efforts
  to make the test more complex (for example, varying the load
  for different sizes) have degraded performance.

* Growth rate upon hitting maximum load. Currently set to *2.
  Raising this to *4 results in half the number of resizes, less
  effort to resize, better sparseness for some (but not all) dict
  sizes, and potentially double memory consumption depending on
  the size of the dictionary. In effect, setting the rate to *4
  skips every other resize step (a rough simulation follows this
  list).
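
As a quick check on the growth-rate trade-off, the resize schedule can
be simulated directly. The sketch below is illustrative only:
count_resizes() is an invented name and the resize target is a
simplification of the actual dictresize() rule, but the constants
mirror the PyDict_MINSIZE of 8 and the 2/3 maximum load described
above.

    def count_resizes(n, growth=2, minsize=8):
        # Count table resizes while inserting n distinct keys,
        # applying the 2/3 maximum-load test before each insertion.
        size = minsize
        resizes = 0
        for used in range(1, n + 1):
            if used * 3 > size * 2:      # load would exceed 2/3
                newsize = minsize
                while newsize <= used * growth:
                    newsize <<= 1        # next power of two
                size = newsize
                resizes += 1
        return resizes

    # count_resizes(10000, growth=2) --> 11 resizes
    # count_resizes(10000, growth=4) --> 6 resizes
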
Tune-ups should be measured across a broad range of applications and
use cases. A change to any parameter will help in some situations and
hurt in others. The key is to find settings that help the most common
cases and do the least damage to the less common cases. Results will
vary dramatically depending on the exact number of keys, whether the
keys are all strings, whether reads or writes dominate, and the exact
hash values of the keys (some sets of values have fewer collisions than
others). Any one test or benchmark is likely to prove misleading.


Results of Cache Locality Experiments
-------------------------------------

When an entry is retrieved from memory, 4.333 adjacent entries are also
retrieved into a cache line. Since accessing items in cache is *much*
cheaper than a cache miss, an enticing idea is to probe the adjacent
entries as a first step in collision resolution. Unfortunately, the
introduction of any regularity into collision searches results in more
collisions than the current random chaining approach.

Exploiting cache locality at the expense of additional collisions fails
to pay off when the entries are already loaded in cache (the expense
is paid with no compensating benefit). This occurs in small dictionaries
where the whole dictionary fits into a pair of cache lines. It also
occurs frequently in large dictionaries that have the common access
pattern where some keys are accessed much more frequently than others.
The more popular entries *and* their collision chains tend to remain
in cache.

To exploit cache locality, change the collision resolution section
in lookdict() and lookdict_string(). Set i^=1 at the top of the
loop and move the i = (i << 2) + i + perturb + 1 step into an unrolled
version of the loop.
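
For illustration, the probe order that change would produce can be
sketched in Python (probe_order() is an invented name for this note;
PERTURB_SHIFT is 5 in the C implementation):

    def probe_order(h, mask, maxpairs=4):
        # Yield slot indexes in pairs: each slot followed by its
        # neighbor (slot ^ 1), which usually shares its cache line,
        # before taking the usual perturbed jump.
        i = h & mask
        perturb = h
        for _ in range(maxpairs):
            yield i & mask
            yield (i & mask) ^ 1         # adjacent entry probed next
            i = (i << 2) + i + perturb + 1
            perturb >>= 5                # PERTURB_SHIFT

    # list(probe_order(12345, 7)) --> [1, 0, 7, 6, 5, 4, 6, 7]
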
This optimization strategy can be leveraged in several ways:

* If the dictionary is kept sparse (through the tunable parameters),
then the occurrence of additional collisions is lessened.

* If lookdict() and lookdict_string() are specialized for small dicts
and for large dicts, then the versions for large dicts can be given
an alternate search strategy without increasing collisions in small dicts
which already have the maximum benefit of cache locality.

* If the use case for a dictionary is known to have a random key
access pattern (as opposed to the more common pattern following a
Zipf's law distribution), then there will be more benefit for large
dictionaries because any given key is no more likely than another to
already be in cache.


Optimizing the Search of Small Dictionaries
-------------------------------------------

If lookdict() and lookdict_string() are specialized for smaller dictionaries,
then a custom search approach can be implemented that exploits the small
search space and cache locality.

* The simplest example is a linear search of contiguous entries. This is
  simple to implement, guaranteed to terminate rapidly, never searches
  the same entry twice, and precludes the need to check for dummy entries.

* A more advanced example is a self-organizing search so that the most
  frequently accessed entries get probed first. The organization
  adapts if the access pattern changes over time. Treaps are ideally
  suited for self-organization: the most common entries stay near the
  top of the heap and can be found with a rapid binary search pattern.
  Most probes land near the top of the tree, allowing the hot entries
  to fit within one or two cache lines.

* Also, small dictionaries may be made more dense, perhaps filling all
  eight cells to take full advantage of two cache lines.


Strategy Pattern
----------------

Consider allowing the user to set the tunable parameters or to select a
particular search method. Since some dictionary use cases have known
sizes and access patterns, the user may be able to provide useful hints.

1) For example, if membership testing or lookups dominate runtime and memory
   is not at a premium, the user may benefit from setting the maximum load
   ratio at 5% or 10% instead of the usual 66.7%. This will sharply
   curtail the number of collisions (a rough estimate of the effect
   appears after this list).

2) Dictionary creation time can be shortened in cases where the ultimate
   size of the dictionary is known in advance. The dictionary can be
   pre-sized so that no resize operations are required during creation.
   Not only does this save resizes, but key insertion will go more
   quickly because the first half of the keys will be inserted into a
   sparser table than before. The preconditions for this strategy arise
   whenever a dictionary is created from a key or item sequence of known
   length.

3) If the key space is large and the access pattern is known to be random,
   then search strategies exploiting cache locality can be fruitful.
   The preconditions for this strategy arise in simulations and
   numerical analysis.

4) If the keys are fixed and the access pattern strongly favors some of
   the keys, then the entries can be stored contiguously and accessed
   with a linear search or treap. This exploits knowledge of the data,
   cache locality, and a simplified search routine. It also eliminates
   the need to test for dummy entries on each probe. The preconditions
   for this strategy arise in symbol tables and in the builtin dictionary.
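
To put rough numbers on strategy 1), the textbook uniform-hashing
approximation can be used (an estimate, not a measurement of CPython;
expected_probes() is an invented name): the average number of probes
for a successful lookup at load factor a is (1/a) * ln(1/(1-a)).

    from math import log

    def expected_probes(load):
        # Textbook estimate of average probes per successful lookup
        # in an open-addressed table under uniform hashing.
        return log(1.0 / (1.0 - load)) / load

    # expected_probes(2.0 / 3.0) --> about 1.65 probes per lookup
    # expected_probes(0.10)      --> about 1.05 probes per lookup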