High load bug

Hello,

We are experiencing high load with our deployment of Ice 3.4.2. Our application uses a UDP message as a keep-alive
mechanism. If the remote party is killed, CPU usage on the sending side rises to 100% after about 35 seconds,
irrespective of the number of keep-alive messages sent in that time.
We have traced the problem to cpp/src/Ice/Selector.cpp:421, where the events are checked. The event on the socket is
EPOLLERR, however, which results in SocketOperationNone and a push_back onto the handlers queue. Since EPOLLERR is
still present on the descriptor, the next epoll_wait immediately returns 1 again, and so on for every call.

We have a large installed base of devices using 3.4.2, and the test effort to migrate to 3.5.1 would be extensive, so
we would much prefer to avoid it. Is there a patch available for 3.4.2 that solves this problem?

Our build flags are the defaults used for Linux.

Can someone confirm this behaviour?

Thanks in advance
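
The spin described above follows from epoll's default level-triggered behaviour: as long as an error condition stays on a descriptor and nobody consumes it, every epoll_wait call returns at once. Below is a minimal sketch of that effect, not the Ice code. It assumes Linux and uses a pipe whose read end is closed as a deterministic stand-in for the connected UDP socket; `countErrorWakeups` is a hypothetical helper name.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <cassert>

// Count how many of `rounds` consecutive epoll_wait calls report EPOLLERR
// on a descriptor whose error condition is never consumed.
// Returns -1 if the setup fails.
inline int countErrorWakeups(int rounds)
{
    int fds[2];
    if(pipe(fds) != 0)
    {
        return -1;
    }
    close(fds[0]); // no reader left: the write end now reports EPOLLERR

    int ep = epoll_create1(0);
    if(ep < 0)
    {
        close(fds[1]);
        return -1;
    }

    epoll_event ev = {};
    ev.events = 0; // EPOLLERR is always reported, even when not requested
    ev.data.fd = fds[1];
    if(epoll_ctl(ep, EPOLL_CTL_ADD, fds[1], &ev) != 0)
    {
        close(ep);
        close(fds[1]);
        return -1;
    }

    int wakeups = 0;
    for(int i = 0; i < rounds; ++i)
    {
        epoll_event out = {};
        // The 1000 ms timeout is never reached: the pending error makes
        // each call return immediately with one ready descriptor.
        int n = epoll_wait(ep, &out, 1, 1000);
        if(n == 1 && (out.events & EPOLLERR))
        {
            ++wakeups;
        }
    }

    close(ep);
    close(fds[1]);
    return wakeups;
}
```

Calling `countErrorWakeups(5)` is expected to report a wakeup on every round. A selector loop that ignores such an event and goes straight back to epoll_wait never blocks, which matches the 100% CPU symptom.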

Comments

  • benoit (Rennes, France)
    Hi,

    Thanks for the detailed report. A UDP socket does indeed trigger EPOLLERR when it is connected (so this is only a "client side" issue; UDP sockets are not connected on the server side). I was able to reproduce the problem. A fix is to check for EPOLLERR at line 420 in Selector.cpp and, if it occurs, mark the handler as ready for both reads and writes.

    I've attached a source patch that you can apply to a fresh Ice 3.4.2 source distribution with:
    $ cd Ice-3.4.2
    $ patch -p1 < patch.txt
    

    Could you give it a try and see if this solves the problem? We will look into doing more testing and integrating this fix into the next Ice release.

    Cheers,
    Benoit.
  • Is the OR between Read and Write (resulting in undefined value 3 in the enum) intentional?
  • benoit (Rennes, France)
    Hi,

    Yes, it's fine. It's a bit-flag value; we only check which bits are set when reading it.

    Cheers,
    Benoit.
  • Hi Benoit,

    Thanks for the response. It is comforting to know that this does not have adverse effects. We are not familiar enough with the code to evaluate the impact, and we appreciate the feedback.
  • benoit (Rennes, France)
    Hi,

    Note that you can run the Ice test suite using the allTests.py script located in Ice-3.4.2/cpp to ensure everything works fine on your platform.

    You can also ensure that the bug is fixed using the Ice hello demo (in the cpp/demo/Ice/hello directory):
    • start the server
    • start the client and send a datagram with 'd'
    • kill the server with kill -9
    • send another datagram with 'd'
    Before the fix, this caused the client process to consume 100% of the CPU. With the fix, CPU usage should be back to normal, and the datagram should succeed again after you restart the server.

    Cheers,
    Benoit.
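
The fix and the bit-flag answer discussed in the comments can be sketched as follows. This is only an illustration of the idea, not the attached patch. The enum `Op` and the helpers `readyOps`, `wantsRead` and `wantsWrite` are hypothetical stand-ins, and the enumerator values are an assumption consistent with the thread (Read | Write == 3).

```cpp
#include <sys/epoll.h>
#include <cstdint>
#include <cassert>

// Hypothetical stand-in for Ice's SocketOperation enum. The values are
// assumed so that Read | Write == 3, as mentioned in the thread.
enum Op
{
    OpNone = 0,
    OpRead = 1,
    OpWrite = 2
};

// Translate an epoll event mask into the operations a handler is ready for.
// EPOLLERR is mapped to "ready for read and write" so that the handler
// attempts I/O and observes the socket error instead of being skipped.
inline Op readyOps(std::uint32_t events)
{
    int ops = OpNone;
    if(events & EPOLLIN)
    {
        ops |= OpRead;
    }
    if(events & EPOLLOUT)
    {
        ops |= OpWrite;
    }
    if(events & EPOLLERR)
    {
        ops |= OpRead | OpWrite;
    }
    // The result may be 3, which has no enumerator of its own. That is
    // acceptable because callers only test individual bits.
    return static_cast<Op>(ops);
}

// Callers test bits rather than comparing against enumerators.
inline bool wantsRead(Op ready)
{
    return (ready & OpRead) != 0;
}

inline bool wantsWrite(Op ready)
{
    return (ready & OpWrite) != 0;
}
```

With this mapping, `readyOps(EPOLLERR)` yields a value for which both `wantsRead` and `wantsWrite` are true, so an error-only event no longer ends up as "no operation". For an unscoped enum with enumerators 0, 1 and 2, the value 3 is still within the range the type can represent, so the cast is well defined.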