Archived

This forum has been archived. Please start a new discussion on GitHub.

Deprecating ice_getHash in C++?

It seems like ice_getHash is deprecated in almost any language mapping besides C++ - and as I learned for a good reason, since the algorithm used yields pretty bad results. I realized this after porting CORBA code that used CORBAs Object::_hash method, which (in case of my ORB) led to evenly distributed hashes - using ice_getHash I got collisions in as little as 100 objects (which all differed in object name only). I changed the code to compare object identities now, which is the better option anyway - in CORBA hashing object references was necessary because of its wicked idea of opaque object references and while porting the code I didn't change the methodology.

I would suggest to

a) Point out in the Ice Manual that these methods are deprecated in Java, Ruby, PHP, Python etc. - right now that's only in the CHANGES file (which is not read my people using Ice for the first time).

b) Deprecate this call in the C++ mapping as well and document in the Ice Manual - maybe also document that the hashing algorithm used is substandard

c) Add a compile time warning when ice_getHash is called (deprecation and substandard algorithm)

Alternatively a real hashing algorithm could be used (SHAsomething e.g.), but in the end thanks to ICE's straightforward design and the use of transparent proxies and object ids there is no real use for such an method and if really required it could be written by the user anyway (in the case of proxies it also included too much information anyway - endpoints etc. so it couldn't reliably be used for what I was using it anyway) - therefore I think your decision to deprecate it was correct, I just would've hoped this was for C++ as well and documented outside of the CHANGES file.

Comments

  • bernard
    bernard Jupiter, FL
    Hi Michael,

    Thank you the suggestion. For Java, C#, and Ruby, ice_getHash is deprecated because there is a built-in hash function; unfortunately there is no equivalent for C++, so there is no obvious way to fix an application that relies on ice_getHash. We could deprecate it in the next release and find if it's really used much...

    Then, in terms of documentation:

    - deprecated API / features are usually removed from the Ice manual, and not kept as "deprecated"

    - deprecated API / features are discussed in the release notes, for example: Upgrading your Application from Ice 3.3 - Ice 3.4.2 Release Notes

    - when feasible, we add a compile-time warnings for deprecated APIs. ice_hash for example is deprecated (in retrospect we should have gone one step further):
        ICE_DEPRECATED_API ::Ice::Int ice_hash() const
        {
            return ice_getHash();
        }
    

    Best regards,
    Bernard
  • ice_getHash example

    Hm, in case you really need to keep ice_getHash I would strongly suggest to change the internals of HashUtil.h to a standard hashing algorithm that guarantees equal distribution of hashes.

    I wrote a little test to demonstrate the shortcomings of the algorithm:
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <map>
    #include <Ice/Ice.h>
    
    template<typename T>
    inline std::string to_string(const T& val)
    {
      std::ostringstream streamOut;
      streamOut << val;
      return streamOut.str();
    }
    
    
    int main(int argc, char** argv)
    {
      typedef std::map<Ice::Int, Ice::ObjectPrx> SeenMap;
      SeenMap seen;
      unsigned int collisions = 0;
      unsigned int i = 0;
      unsigned int maxCollisions = 10;
    
      Ice::CommunicatorPtr communicator = Ice::initialize(argc, argv);
      for (i = 0; collisions < maxCollisions; ++i)
      {
        std::string oid = to_string(i)+":default -p 10000";
        Ice::ObjectPrx obj = communicator->stringToProxy(oid);
        if (!seen.insert(std::make_pair(obj->ice_getHash(), obj)).second)
        {
          Ice::ObjectPrx existingObj = seen[obj->ice_getHash()];
          std::cerr << "Hash of "
                      << obj->ice_toString()
                      << " == "
                      << seen[obj->ice_getHash()]->ice_toString()
                      << " == "
                      << obj->ice_getHash()
                      << std::endl;
          ++collisions;
        }
      }
      std::cerr << "Found " << collisions << " collisions in " << i << " iterations" << std::endl;
    }
    

    By modifying maxCollisions you can determine the hit/miss ratio, in case of maxCollision == 10 the program output is:
    Hash of 20 -t:tcp -p 10000 == 15 -t:tcp -p 10000 == 1490
    Hash of 21 -t:tcp -p 10000 == 16 -t:tcp -p 10000 == 1495
    Hash of 22 -t:tcp -p 10000 == 17 -t:tcp -p 10000 == 1500
    Hash of 23 -t:tcp -p 10000 == 18 -t:tcp -p 10000 == 1505
    Hash of 24 -t:tcp -p 10000 == 19 -t:tcp -p 10000 == 1510
    Hash of 30 -t:tcp -p 10000 == 25 -t:tcp -p 10000 == 1515
    Hash of 31 -t:tcp -p 10000 == 26 -t:tcp -p 10000 == 1520
    Hash of 32 -t:tcp -p 10000 == 27 -t:tcp -p 10000 == 1525
    Hash of 33 -t:tcp -p 10000 == 28 -t:tcp -p 10000 == 1530
    Hash of 34 -t:tcp -p 10000 == 29 -t:tcp -p 10000 == 1535
    Found 10 collisions in 35 iterations
    

    Which is almost 30% duplicates.

    When run with maxCollisions == 10000, the output is:
    ...
    Hash of 11984 -t:tcp -p 10000 == 11979 -t:tcp -p 10000 == 192535
    Hash of 11990 -t:tcp -p 10000 == 11985 -t:tcp -p 10000 == 192540
    Hash of 11991 -t:tcp -p 10000 == 11986 -t:tcp -p 10000 == 192545
    Hash of 11992 -t:tcp -p 10000 == 11987 -t:tcp -p 10000 == 192550
    Hash of 11993 -t:tcp -p 10000 == 11988 -t:tcp -p 10000 == 192555
    Hash of 11994 -t:tcp -p 10000 == 11989 -t:tcp -p 10000 == 192560
    Found 10000 collisions in 11995 iterations
    

    which is more than 80% duplicates.

    So whoever is relying on ice_getHash to get reasonably unique identifiers for proxies within the Ice::Int number space right now is in serious trouble. I would assume that's also true for Object::ice_getHash (it uses the same simplistic algorithms from HashUtil.h), even though it's a lot less likely to surface in case of classes.
  • One more suggestion

    Thinking about this a little longer I would suggest to classify the algorithm used in ice_getHash() as a bug.

    Besides the severe need to replace HashUtil I would also suggest to move the functionality to free functions in the Ice namespace - this would make sure ice_getHash is removed from the slice style documentation (e.g. Proxy Methods - Ice 3.4 - ZeroC) like it is done for other C++ specific functions.
    namespace Ice
    {
      Ice::Int hash(const Object& obj);
      Ice::Int hash(const ObjectPrx& objPrx);
    }
    

    Even though I would suggest to use at least a 64bit/Ice::Long data type for the hash.
  • bernard
    bernard Jupiter, FL
    Hi Michael,

    Thank you for this little test case, which shows that our C++ proxy-hashing algorithm needs work! This only affects applications using Ice, as Ice itself does not maintain any hash-map or otherwise use this function.

    We actually use a different algorithm for Objects:
    Int
    Ice::Object::ice_getHash() const
    {
        return static_cast<Int>(reinterpret_cast<Long>(this) >> 4);
    }
    

    which could be better than the hash for proxies.

    (The hash for proxies is computed from the proxy attributes, since two different proxy objects can be equal and as a result must yield the same hash value).

    Cheers,
    Bernard