HashMap中loadFactor用于计算扩容的阈值。扩容意味着空间(resize)和时间(rehash)的开销,不扩容则会导致使用(查询、操作)成本快速增长,平衡时间和空间是一门艺术。
问题
JDK 1.7
As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put). The expected number of entries in the map and its load factor should be taken into account when setting its initial capacity, so as to minimize the number of rehash operations. If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.
JDK 1.8
Ideally, under random hashCodes, the frequency of nodes in bins follows a Poisson distribution with a parameter of about 0.5 on average for the default resizing threshold of 0.75, although with a large variance because of resizing granularity. Ignoring variance, the expected occurrences of list size k are (exp(-0.5) pow(0.5, k) / factorial(k)). The first values are:
0: 0.60653066
1: 0.30326533
2: 0.07581633
3: 0.01263606
4: 0.00157952
5: 0.00015795
6: 0.00001316
7: 0.00000094
8: 0.00000006
jdk1.7中的注释中指出了default_loadFactor 为 0.75是时间与空间的tradoff,但并未提到该数值是怎么得来的。
jdk1.8中的注释中提到随机的hashCodes命中特定bin的频率在缺省阈值为0.75时,服从参数的泊松分布,亦未解释0.75的来源。
基础
为了解释0.75的来源,需要一些基础工具做铺垫。
**Hash function**
A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. One use is a data structure called a hash table, widely used in computer software for rapid data lookup. Hash functions accelerate table or database lookup by detecting duplicated records in a large file. An example is finding similar stretches in DNA sequences. They are also useful in cryptography. A cryptographic hash function allows one to easily verify that some input data maps to a given hash value, but if the input data is unknown, it is deliberately difficult to reconstruct it (or equivalent alternatives) by knowing the stored hash value. This is used for assuring integrity of transmitted data, and is the building block for HMACs, which provide message authentication.
密码学中所用的安全哈希函数(cryptographic hash function)和查找/存储用的哈希遵从不同的设计原则,除了数据压缩和快速计算,安全哈希还满足抗原象性、抗第二原象性和抗碰撞性。
**Hash Collision**
In computer science, a hash collision or hash clash is a situation that occurs when two distinct inputs into a hash function produce identical outputs.
All hash functions have potential collisions, though with a well-designed hash function, collisions should occur less often (compared with a poorly designed function) or be more difficult to find. In certain specialized applications where a relatively small number of possible inputs are all known ahead of time it is possible to construct a perfect hash function which maps all inputs to different outputs. However, many hash functions, including most cryptographic hash functions, produce a fixed size output from an arbitrarily long message. In such a design, there will always be collisions, because any given hash has to correspond to a very large number of possible inputs.
哈希碰撞及解决方案见这里。
**Distribution**
伯努利分布:一次实验中,事件发生的概率为 ,不出现的概率为。
二项分布:在只有两个结果的n次独立的伯努利试验中,所期望的结果出现次数的概率。
泊松分布:描述单位时间内事件发生的概率。
分析
一次实验中key落在某个bucket的概率为 ,hash结果只与输入有关( random hashCodes),多次hash可视为相互独立。即bucket为空的概率服从二项分布:
因此bucket在如下条件下可能为空(,如JDK1.8中提到的):
进行次hash后,bucket为空的概率为0.5,即如JDK1.8中所指单位时间内key落在某bucket的概率服从的泊松分布。
更进一步,当趋于无穷时,扩容阈值快速逼近log(2):
其中,理论上阈值定为log(2)更能平衡时间与空间的开销。
结论
- default_loadFactor = 0.75和泊松分布没有直接关系。泊松分布描述的是单位时间内bucket中node个数的概率,服从泊松分布是HashMap采用Chaining解决哈希冲突的性质决定的。
- default_loadFactor = 0.75和泊松分布的参数有直接关系。是单位时间(或单位面积)内随机事件的平均发生次数,此处为bucket是否为空的期望为0.5。
- default_loadFactor取log(2)理论上更合适,但考虑HashMap的default_capacity = 16,扩容规则为 ,无疑 便于计算且易于实现。
看到这儿了,留个赞呗。
参考文献
- 王小云, 于红波. 密码杂凑算法综述[J]. 信息安全研究, 2015, v.1;No.1(01):19-30.
- https://stackoverflow.com/questions/10901752/what-is-the-significance-of-load-factor-in-hashmap
- https://blog.csdn.net/kjfcpua/article/details/44238757
- https://www.jianshu.com/p/64f6de3ffcc1