Tom K. answered 09/13/20
Knowledgeable and Friendly Math and Statistics Tutor
If there are N different features, each with D different values, so there are N^D possible values. We have P individuals in the population.
For N = 10 and D = 5, D^N = 9765625
The probability of at least one collision is 1 - the probability of no collisions.
The probability of no collisions is D^N-1/D^N * D^N-2/D^N *... * D^N-P+1/D^N
For D^N large, (D^N - x)/N^D = e^(-x/D^N), and 1 + 2 + ... + P-1 = P(P-1)/2, so 1-D^N-1/D^N * D^N-2/D^N *... * D^N-P+1/D^N is approximately 1 - e^-(P(P-1)/2/D^N is approximately 1 - e^-P^2/(2 D^N) (the sum would be less but the original fractions produce a larger negative coefficient, so this is a great idea).
a) For P = 100, if we use the final approximation 1 - e^-P^2/(2 D^N), we get
1 - e^-(100)^2/(2*9765625) = .000512
If we use the previous approximation 1 - e^-(P*(P-1))/(2 D^N), 1 - e^-(100*99)/(2*9765625) =
The exact answer is 1 - 9765624/9765625*9765623/9765625*...*9765526/9765625 = .000507
b) For P = 1000, if we use the final approximation 1 - e^-P^2/(2 D^N), we get
1 - e^-(1000)^2/(2*9765625) = 0.049911
If we use the previous approximation 1 - e^-(1000*999)/(2*9765625) = 0.049863
The exact answer is 1 - 9765624/9765625*9765623/9765625*...*9764626/9765625 = 0.049864
c) For P = 10000, if we use the final approximation 1 - e^-P^2/(2 D^N), we get
1 - e^-(10000)^2/(2*9765625) = 0.994024
If we use the previous approximation 1 - e^-(10000*9999)/(2*9765625) = 0.994021
The exact answer is 1 - 9765624/9765625*9765623/9765625*...*9755626/9765625 = 0.994031
Note how the first approximation, with less terms in the exponent, is not as good in the first 2 cases, with the second approximation being almost exactly correct, but with a large number of individuals, the approximation with more terms is less correct.