Archived
This forum has been archived. Please start a new discussion on GitHub.
Deprecating ice_getHash in C++?
It seems ice_getHash is deprecated in almost every language mapping except C++ - and, as I learned, for good reason, since the algorithm used yields pretty bad results. I realized this after porting CORBA code that used CORBA's Object::_hash method, which (in the case of my ORB) produced evenly distributed hashes - with ice_getHash I got collisions in as few as 100 objects (which all differed in object name only). I have changed the code to compare object identities instead, which is the better option anyway - in CORBA, hashing object references was necessary because of its wicked idea of opaque object references, and while porting the code I didn't change the methodology.
I would suggest the following:
a) Point out in the Ice Manual that these methods are deprecated in Java, Ruby, PHP, Python, etc. - right now that's only in the CHANGES file (which is not read by people using Ice for the first time).
b) Deprecate this call in the C++ mapping as well and document it in the Ice Manual - maybe also document that the hashing algorithm used is substandard.
c) Add a compile-time warning when ice_getHash is called (deprecation and substandard algorithm).
Alternatively, a real hashing algorithm could be used (SHA-something, e.g.), but in the end, thanks to Ice's straightforward design and its use of transparent proxies and object identities, there is no real use for such a method, and if really required it could be written by the user anyway. (In the case of proxies it also includes too much information - endpoints etc. - so it couldn't reliably be used for my purpose anyway.) Therefore I think your decision to deprecate it was correct; I just would have hoped this applied to C++ as well and was documented outside of the CHANGES file.
Comments
Hi Michael,
Thank you for the suggestion. For Java, C#, and Ruby, ice_getHash is deprecated because there is a built-in hash function; unfortunately there is no equivalent for C++, so there is no obvious way to fix an application that relies on ice_getHash. We could deprecate it in the next release and see whether it's really used much...
Then, in terms of documentation:
- deprecated API / features are usually removed from the Ice manual, and not kept as "deprecated"
- deprecated API / features are discussed in the release notes, for example: Upgrading your Application from Ice 3.3 - Ice 3.4.2 Release Notes
- when feasible, we add compile-time warnings for deprecated APIs. ice_hash, for example, is deprecated (in retrospect we should have gone one step further):

```cpp
ICE_DEPRECATED_API ::Ice::Int ice_hash() const { return ice_getHash(); }
```
Best regards,
Bernard
ice_getHash example
Hm, in case you really need to keep ice_getHash, I would strongly suggest changing the internals of HashUtil.h to a standard hashing algorithm that guarantees an even distribution of hashes.
I wrote a little test to demonstrate the shortcomings of the algorithm:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <map>
#include <Ice/Ice.h>

template<typename T>
inline std::string to_string(const T& val)
{
    std::ostringstream streamOut;
    streamOut << val;
    return streamOut.str();
}

int main(int argc, char** argv)
{
    typedef std::map<Ice::Int, Ice::ObjectPrx> SeenMap;
    SeenMap seen;
    unsigned int collisions = 0;
    unsigned int i = 0;
    unsigned int maxCollisions = 10;
    Ice::CommunicatorPtr communicator = Ice::initialize(argc, argv);
    for (i = 0; collisions < maxCollisions; ++i)
    {
        std::string oid = to_string(i) + ":default -p 10000";
        Ice::ObjectPrx obj = communicator->stringToProxy(oid);
        if (!seen.insert(std::make_pair(obj->ice_getHash(), obj)).second)
        {
            Ice::ObjectPrx existingObj = seen[obj->ice_getHash()];
            std::cerr << "Hash of " << obj->ice_toString()
                      << " == " << seen[obj->ice_getHash()]->ice_toString()
                      << " == " << obj->ice_getHash() << std::endl;
            ++collisions;
        }
    }
    std::cerr << "Found " << collisions << " collisions in " << i
              << " iterations" << std::endl;
}
```
By modifying maxCollisions you can determine the hit/miss ratio; in the case of maxCollisions == 10 the program output is:

```
Hash of 20 -t:tcp -p 10000 == 15 -t:tcp -p 10000 == 1490
Hash of 21 -t:tcp -p 10000 == 16 -t:tcp -p 10000 == 1495
Hash of 22 -t:tcp -p 10000 == 17 -t:tcp -p 10000 == 1500
Hash of 23 -t:tcp -p 10000 == 18 -t:tcp -p 10000 == 1505
Hash of 24 -t:tcp -p 10000 == 19 -t:tcp -p 10000 == 1510
Hash of 30 -t:tcp -p 10000 == 25 -t:tcp -p 10000 == 1515
Hash of 31 -t:tcp -p 10000 == 26 -t:tcp -p 10000 == 1520
Hash of 32 -t:tcp -p 10000 == 27 -t:tcp -p 10000 == 1525
Hash of 33 -t:tcp -p 10000 == 28 -t:tcp -p 10000 == 1530
Hash of 34 -t:tcp -p 10000 == 29 -t:tcp -p 10000 == 1535
Found 10 collisions in 35 iterations
```
That is almost 30% duplicates.
When run with maxCollisions == 10000, the output is:

```
...
Hash of 11984 -t:tcp -p 10000 == 11979 -t:tcp -p 10000 == 192535
Hash of 11990 -t:tcp -p 10000 == 11985 -t:tcp -p 10000 == 192540
Hash of 11991 -t:tcp -p 10000 == 11986 -t:tcp -p 10000 == 192545
Hash of 11992 -t:tcp -p 10000 == 11987 -t:tcp -p 10000 == 192550
Hash of 11993 -t:tcp -p 10000 == 11988 -t:tcp -p 10000 == 192555
Hash of 11994 -t:tcp -p 10000 == 11989 -t:tcp -p 10000 == 192560
Found 10000 collisions in 11995 iterations
```
That is more than 80% duplicates.
So whoever is relying on ice_getHash to get reasonably unique identifiers for proxies within the Ice::Int number space is in serious trouble right now. I would assume that's also true for Object::ice_getHash (it uses the same simplistic algorithm from HashUtil.h), even though it's a lot less likely to surface in the case of classes.
One more suggestion
Thinking about this a little longer, I would suggest classifying the algorithm used in ice_getHash() as a bug.
Besides the severe need to replace HashUtil, I would also suggest moving the functionality to free functions in the Ice namespace - this would ensure ice_getHash is removed from the Slice-style documentation (e.g. Proxy Methods - Ice 3.4 - ZeroC), as is done for other C++-specific functions:

```cpp
namespace Ice
{
    Ice::Int hash(const Object& obj);
    Ice::Int hash(const ObjectPrx& objPrx);
}
```
Though I would suggest using at least a 64-bit (Ice::Long) data type for the hash.
Hi Michael,
Thank you for this little test case, which shows that our C++ proxy-hashing algorithm needs work! This only affects applications using Ice, as Ice itself does not maintain any hash-map or otherwise use this function.
We actually use a different algorithm for Objects:

```cpp
Int
Ice::Object::ice_getHash() const
{
    return static_cast<Int>(reinterpret_cast<Long>(this) >> 4);
}
```

which could be better than the hash for proxies.
(The hash for proxies is computed from the proxy's attributes, since two distinct proxy objects can be equal and must therefore yield the same hash value.)
Cheers,
Bernard