正文
著名的生日问题表明,对于 365 个可能的“哈希值”,在哈希冲突可能性达到 50%之前,必须计算出 23 个唯一哈希值。如果有 2^32 个可能的哈希值,那么在达到 50%的哈希冲突可能性之前,必须计算出大约 77,164 个唯一哈希值。根据这个近似公式:
from math import exp
def prob(x):
return 1.0 -exp(-0.5 * x * (x-1) / 2**32)
prob(77163) # 0.4999978150170551
prob(77164) # 0.500006797931095
对于给定数量的独立哈希,预计会发生多少次冲突?所运的是,维基百科提供了一个封闭式方程式:
def count(d, n):
return n - d + d * ((d - 1) / d)**n
这种封闭式的解决方案可用于在实际的哈希函数表现中加入理论拟合。
那么 String.hashCode() 符合标准吗?试着运行这段代码:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.IOException;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.nio.charset.StandardCharsets;
public class HashTest {
private static Map collisions(Collection values) {
Map result=new HashMap<>();
for(T value : values) {
Integer hc=Integer.valueOf(value.hashCode());
Set bucket=result.get(hc);
if(bucket == null)
result.put(hc, bucket = new TreeSet<>());
bucket.add(value);
}
return result;
}
public static void main(String[] args) throws IOException {
System.err.println("Loading lines from stdin...");
Set lines=new HashSet<>();
try (BufferedReader r=new BufferedReader(new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
for(String line=r.readLine();line!=null;line=r.readLine())
lines.add(line);
}
// Warm up, if you please
System.err.print("Warming up");
for(int i=0;i<10;i++) {
System.err.print(".");
collisions(lines);
}
System.err.println();
System.err.println("Computing collisions...");
long start=System.nanoTime();
Map collisions=collisions(lines);
long finish=System.nanoTime();
long elapsed=finish-start;
int maxhc=0, maxsize=0;
for(Map.Entry e : collisions.entrySet()) {
Integer hc=e.getKey();
Set bucket=e.getValue();
if(bucket.size() > maxsize) {
maxhc = hc.intValue();
maxsize = bucket.size();
}
}
System.out.println("Elapsed time: "+elapsed+"ns");
System.out.println("Total unique lines: "+lines.size());
System.out.println("Time per hashcode: "+String.format("%.4f", 1.0*elapsed/lines.size())+"ns");
System.out.println("Total unique hashcodes: "+collisions.size());
System.out.println("Total collisions: "+(lines.size()-collisions.size()));
System.out.println("Collision rate: "+String.format("%.8f", 1.0*(lines.size()-collisions.size())/lines.size()));
if(maxsize != 0)
System.out.println("Max collisions: "+maxsize+" "+collisions.get(maxhc));
}
}
我们使用短字符串(words.txt,链接见文末)作为输入: