MD5 hashes (see the Wikipedia article) are a way to derive a normalized surrogate key from the business key (which may consist of more than one attribute). In Data Vault 2.0 we use hash keys rather than the sequences of DV 1.0 because they:
- provide interoperability, e.g. for building federated data vaults where we can distribute and join data across different systems (big data scenarios)
- can be used for physical partitioning (uniform distribution)
- can be calculated independently of the system (every hardware platform and OS will produce the same hash key, provided we follow some important rules for the calculation)
- enable even more parallelism while loading the vault, because hub lookups are no longer needed.
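To make the hash key system-independent, every loading process has to apply the same normalization rules before hashing. The sketch below illustrates one plausible set of rules (trim, upper-case, fixed delimiter); the exact rules and the `hash_key` helper are assumptions for illustration, not a fixed standard:

```python
import hashlib

def hash_key(*business_key_parts, delimiter=";"):
    """Derive a deterministic surrogate key from one or more business key attributes.

    Normalization rules here (trim whitespace, upper-case, join with a fixed
    delimiter) are an illustrative assumption; what matters is that every
    system applies the exact same rules, so the same business key always
    yields the same hash on any platform.
    """
    normalized = delimiter.join(str(p).strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# Two differently formatted inputs normalize to the same hash key:
print(hash_key("  acme ", 42) == hash_key("ACME", "42"))  # → True
```

The delimiter matters for composite keys: without it, ("AB", "C") and ("A", "BC") would collide by construction.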
1. What about hash collisions?
On average we would get an MD5 collision only after 100 years of creating approx. 6 billion new hashes per second (the calculation is based on the birthday paradox). This should be a risk we can take in typical data warehouse projects.
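The figure can be checked with the standard birthday-bound approximation: a 50% collision probability in a space of N values is reached after roughly sqrt(2 · ln 2 · N) draws. The rate of 6 billion hashes per second is the figure from the text:

```python
import math

BITS = 128                       # MD5 output size in bits
SPACE = 2 ** BITS                # number of possible hash values
RATE = 6_000_000_000             # hashes per second (figure from the text)
SECONDS_PER_YEAR = 365 * 24 * 3600

# Birthday bound: ~50% collision probability after sqrt(2 * ln(2) * N) hashes
hashes_for_50_percent = math.sqrt(2 * math.log(2) * SPACE)
years = hashes_for_50_percent / (RATE * SECONDS_PER_YEAR)
print(f"{years:.0f} years")      # on the order of 100 years
```

Note this estimates a 50% collision probability over the whole space; for the much smaller key populations of a typical warehouse, the probability is vanishingly small.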
2. How could we detect and handle collisions?
A collision will not break the load process; it may historize satellite data in each load (in the case of full loads) or overwrite satellite data (delta loads). If we want to detect collisions, we can look up the hub and compare the business keys.
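The hub lookup described above can be sketched as follows. The in-memory `dict` stands in for the hub table, and `detect_collision` is a hypothetical helper, not part of any DV tooling: if the hash key already exists but maps to a different business key, we have found a collision.

```python
def detect_collision(hub, hash_key, business_key):
    """Register a (hash key, business key) pair in the hub and flag collisions.

    `hub` is a dict standing in for the hub table (hash key -> business key).
    A mismatch between the stored and incoming business key for the same
    hash key means two different business keys produced the same hash.
    """
    existing = hub.get(hash_key)
    if existing is not None and existing != business_key:
        raise ValueError(f"hash collision: {business_key!r} vs {existing!r}")
    hub[hash_key] = business_key

hub = {}
detect_collision(hub, "a1b2", "CUSTOMER-001")   # first load: ok
detect_collision(hub, "a1b2", "CUSTOMER-001")   # re-load, same key: ok
# detect_collision(hub, "a1b2", "CUSTOMER-002") would raise ValueError
```

In a real load this comparison is one extra join against the hub, which trades away some of the parallelism gained by dropping hub lookups, so it is usually reserved for auditing rather than every load.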
3. What about performance issues? Shouldn't sequences be much faster?
4. Is it a problem to bind the business key to its surrogate key? We wanted to decouple those, right?