Sporadic Freeze errors on concurrent icegridnode access

Ice: /opt/Ice-3.3.0
OS1: Mac 10.4 Darwin mac 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386
OS2: Linux necromancer 2.6.23-gentoo-r8 #2 SMP Tue Feb 19 19:27:38 GMT 2008 x86_64 Intel(R) Xeon(R) CPU X5355 @ 2.66GHz GenuineIntel GNU/Linux

While writing a restart script for our IceGrid installation, I periodically got the following errors:
icegridnode: failure occurred in daemon:
service caught unhandled Ice exception:
MapI.cpp:1211: Freeze::DatabaseException:
Db::put: Operation not permitted
or
icegridnode: failure occurred in daemon:
service caught unhandled Ice exception:
SharedDbEnv.cpp:546: Freeze::DatabaseException:
DbEnv::open: DB_RUNRECOVERY:
Fatal error, run database recovery

The grid is a single node with a collocated registry and several servers. After calling "icegridadmin -e 'node shutdown master'", a call is made to "icegridnode --deploy app.xml", which causes the failure.

It wasn't straightforward to come up with an exact reproducing test case, because of a rather large Java process which takes significant time to shut down.

To simulate that, I had to add ThreadControl::sleep() calls and run "icegridnode --deploy" several times in the background, which we obviously are not doing in production. Even then, the following script does not always fail with Freeze errors, but it does about 20% of the time.
set -e
set -x

##
## Cleanup from previous
##
killall icegridnode && {
  sleep 3
  killall -9 icegridnode || echo good
} || echo None found

## rm -rf var

#
# Using the pid for the app doesn't work since
# failed calls overwrite the pid before they are
# successful.
#
# test -e var/test.pid && kill -9 `cat var/test.pid` || echo Already stopped
# test -e var/app.pid && kill -9 `cat var/app.pid` || echo Already stopped

##
## Create test files
##
mkdir -p var etc lib
cat<<EOF > etc/app.xml
<icegrid>
  <application name="test">
    <node name="test">
     <server id="test" exe="lib/test" activation="always">
        <adapter name="adapter" register-process="true" endpoints="tcp"/>
      </server>
    </node>
  </application>
</icegrid>
EOF
cat<<EOF > etc/app.cfg
IceGrid.Registry.Client.Endpoints=tcp -h 127.0.0.1 -p 4061
IceGrid.Registry.Server.Endpoints=tcp -h 127.0.0.1
IceGrid.Registry.Internal.Endpoints=tcp -h 127.0.0.1
## IceGrid.Registry.SessionManager.Endpoints=tcp -h 127.0.0.1
IceGrid.Registry.Data=var
IceGrid.Registry.DynamicRegistration=0
IceGrid.Registry.PermissionsVerifier=IceGrid/NullPermissionsVerifier
IceGrid.Registry.AdminPermissionsVerifier=IceGrid/NullPermissionsVerifier
IceGrid.Node.CollocateRegistry=1
IceGrid.Node.Endpoints=tcp -h 127.0.0.1
IceGrid.Node.Name=test
IceGrid.Node.Data=var
Ice.StdOut=var/out
Ice.StdErr=var/err
EOF

cat<<EOF > etc/internal.cfg
Ice.Default.Locator=IceGrid/Locator:tcp -h 127.0.0.1 -p 4061
IceGridAdmin.Username=root
IceGridAdmin.Password=password
EOF

cat<<EOF > lib/test.cpp
#include <fstream>
#include <iostream>
#include <unistd.h>
#include <Ice/Ice.h>
#include <IceUtil/Time.h>
#include <IceUtil/Thread.h>
using namespace std;
int main(int argc, char* argv[]) {
    // Record our PID so the script can find the server later.
    ofstream out;
    out.open("var/test.pid");
    out << getpid();
    out.close();
    Ice::CommunicatorPtr ic = Ice::initialize(argc, argv);
    Ice::ObjectAdapterPtr adapter = ic->createObjectAdapter("adapter");
    ic->waitForShutdown();
    // Simulate a server that takes a long time to shut down.
    IceUtil::ThreadControl self;
    self.sleep(IceUtil::Time::seconds(6));
    ic->destroy();
    cout << "finished" << endl;
    return 0;
}
EOF
CXXFLAGS="-I/opt/Ice-3.3.0/include" LDFLAGS="-L/opt/Ice-3.3.0/lib -lIce -lIceUtil" make lib/test

start(){
  icegridnode --nochdir --daemon --pidfile var/app.pid --Ice.Config=etc/internal.cfg,etc/app.cfg --deploy etc/app.xml
}

test(){
  echo -n "."
  RESULT=`start 2>&1 || true`
  echo $RESULT | grep Freeze
}

start
icegridadmin --Ice.Config=etc/internal.cfg -e "node shutdown test"
# Causes the same problem:
# FIRST=`cat var/app.pid`
# kill $FIRST&
set +x
test&
test&
test&
test&
test&
test&
test&
test
wait

Even if such concurrent usage cannot be supported through some form of locking, it would be helpful if someone could explain why, in the above script, multiple calls seem to "recover" -- i.e. calling test.sh after a failing call to test.sh may succeed -- while in my production use, once the database is corrupt, I'm forced to "rm -rf var" before icegridnode will succeed.

I hope all of that is of some use; let me know if you need more information.
~Josh

P.S. As mentioned in the script, something I encountered while testing this was that the pidfile for icegridnode is overwritten even if the previous icegridnode is still active. I'm not sure if this is the intended behavior.

Comments

  • I've just seen another way to reproduce this, but don't have it isolated yet:
    josh@mac:~/code/omero.git/dist$ bin/omero admin stop 
    icegridadmin: could not contact the default locator:
    Network.cpp:1218: Ice::ConnectionRefusedException:
    connection refused: Connection refused
    Was the server already stopped?
    Waiting on shutdown. Use CTRL-C to exit
    icegridadmin: could not contact the default locator:
    Network.cpp:1218: Ice::ConnectionRefusedException:
    connection refused: Connection refused
    
    josh@mac:~/code/omero.git/dist$ pstree | grep node
     | | |   \--- 09742 josh grep node
    josh@mac:~/code/omero.git/dist$ bin/omero admin start
    No descriptor given. Using etc/grid/default.xml
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    SharedDbEnv.cpp:546: Freeze::DatabaseException:
    DbEnv::open: DB_RUNRECOVERY: Fatal error, run database recovery
    
    josh@mac:~/code/omero.git/dist$ bin/omero admin start
    No descriptor given. Using etc/grid/default.xml
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    SharedDbEnv.cpp:546: Freeze::DatabaseException:
    DbEnv::open: DB_RUNRECOVERY: Fatal error, run database recovery
    
    josh@mac:~/code/omero.git/dist$ bin/omero admin start
    No descriptor given. Using etc/grid/default.xml
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    SharedDbEnv.cpp:546: Freeze::DatabaseException:
    DbEnv::open: DB_RUNRECOVERY: Fatal error, run database recovery
    
    josh@mac:~/code/omero.git/dist$ bin/omero admin start
    No descriptor given. Using etc/grid/default.xml
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    SharedDbEnv.cpp:546: Freeze::DatabaseException:
    DbEnv::open: DB_RUNRECOVERY: Fatal error, run database recovery
    

    Stop calls "icegridadmin -e 'node shutdown master" followed by "icegridadmin -e 'node ping master'". Start calls "icegridnode --deploy" followed by another ping. Since there are no icegridnode processes after the stop, this would imply that concurrent access is not strictly required.
  • benoit (Rennes, France)
    Hi Josh,

    You should make sure that the previous IceGrid registry/node is no longer running before starting a new one. Otherwise, as you discovered, this might corrupt the database environment. So your scripts should wait for the process to be gone before trying to start a new IceGrid registry/node.
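
    A rough, untested sketch of such a wait loop in Python might look like this, assuming the old node's PID was recorded somewhere, e.g. via the --pidfile option used in your script:

        import errno
        import os
        import time

        def wait_for_exit(pid, timeout=60.0, interval=0.5):
            # Poll until the process with the given PID is gone; raise if it
            # is still there when the timeout expires.
            deadline = time.time() + timeout
            while time.time() < deadline:
                try:
                    os.kill(pid, 0)   # signal 0 only checks that the process exists
                except OSError as e:
                    if e.errno == errno.ESRCH:
                        return        # process has exited
                    raise             # e.g. EPERM: it exists but belongs to another user
                time.sleep(interval)
            raise RuntimeError("pid %d still running after %s seconds" % (pid, timeout))

        # e.g.: wait_for_exit(int(open("var/app.pid").read()))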

    While the IceGrid registry has a check to prevent this, it's not 100% foolproof: the check just verifies that no other registry is running by trying to connect to the client endpoint. If you start multiple registries concurrently using the same database directory, this check will likely fail. Similarly, this check doesn't work if the previous registry is being shut down. I believe this explains why you sometimes see the Freeze error and sometimes you don't.

    As for the pidfile, it's the intended behavior: the purpose of --pidfile is just to write the PID of the process to the given file, not to prevent multiple runs of the same service.

    Cheers,
    Benoit.
  • Benoit,

    Understood. My attempts to prevent db corruption, however, were also not sufficient, and users are now seeing this in the wild. I assume my only recourse will be to implement file-based locking within our Python scripts.

    If possible, please consider this an RFE to have the locks pushed down into the Ice executables. Failing that, it would be good to know where else (IceStorm, Freeze, etc.) I will need to include locking.

    Best wishes,
    ~Josh.
  • Another update, Benoit:

    The OMERO user with the initial problem has come back again looking for a workaround for his corrupted registry dbs. See this thread. Of course, this is an extreme case involving power outages, but I'd certainly like to keep our users from having to manually delete var/registry.

    Other than checking for a non-zero exit code and reading the log file, can you suggest any ways of working around this?

    Optimal would be some form of locking which prevents this from happening, or perhaps a self-healing step with a warning. If that required a flag like "--recreate", that'd be fine.** Another option might be a way to verify the registry db before calling icegridnode -- a "--verify" flag, say -- or should I just try to open the db with a small application?

    Any advice would be very welcome.
    Cheers,
    ~Josh

    (** Speaking of which, we use collocated registries along with the "--deploy" flag, in case that's related.)
  • benoit (Rennes, France)
    Hi Josh,

    Fixing this is on our TODO list. In the meantime, the only option I can think of to prevent this from happening would be for your script to use a file lock. For example, your script could open and lock a file; if the file doesn't exist already, it would start the IceGrid node, write the node PID into the file, and then close and unlock it. If the file already exists, the script would first check whether the process with the PID from the file is still running: if it isn't, the script would start the IceGrid node again; if it is still running, the script would print an error message.
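
    A rough, untested sketch of this in Python might look like the following; the start_icegridnode() helper is hypothetical and stands for however you actually launch the node:

        import errno
        import fcntl
        import os

        LOCKFILE = "var/node.lock"            # hypothetical lock file location

        def is_running(pid):
            try:
                os.kill(pid, 0)               # signal 0 only checks that the process exists
                return True
            except OSError as e:
                return e.errno != errno.ESRCH

        def start_with_lock(start_icegridnode):
            f = open(LOCKFILE, "a+")          # create the lock file if it is missing
            fcntl.flock(f, fcntl.LOCK_EX)     # serialize concurrent start attempts
            try:
                f.seek(0)
                contents = f.read().strip()
                if contents and is_running(int(contents)):
                    print("icegridnode already running (pid %s)" % contents)
                    return
                pid = start_icegridnode()     # fork/exec icegridnode and return its PID
                f.seek(0)
                f.truncate()
                f.write(str(pid))
                f.flush()
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)
                f.close()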

    Cheers,
    Benoit.
  • Hi Benoit,

    For the initial issue, of course you're right. To prevent user error via a restart mechanism, a lock would work. (For the moment, I've just prevented quick restarts.)

    This latest issue, however, wouldn't be fixed by locking, I don't think, since the corruption is happening due to power loss. There may not be much the icegrid{node,registry} executables can do to recover, but I was hoping it was something your planned fix could take into account.

    Best wishes,
    ~Josh.
  • benoit (Rennes, France)
    The IceGrid registry already automatically recovers the database environment on startup. This should be sufficient most of the time to recover from errors such as ungraceful shutdown of the IceGrid registry.

    If the registry still fails to start after this "standard" recovery, it indicates a more serious issue (the file system got corrupted, a disk failure, etc.) and fatal database recovery is needed. This might also occur if two registries are started simultaneously (which can be prevented using some file locking, as mentioned in my other posts above). In any case, such a recovery will most likely require a backup of the registry database environment log and possibly the database files.

    If you prefer to do the recovery manually before starting the icegridregistry you can use the db_recover utility from BerkeleyDB, for example:
        $ db_recover -h db/registry
    

    You can perform catastrophic/fatal recovery by using the "-c" option with the db_recover utility or by starting the IceGrid registry with the --Freeze.DbEnv.Registry.DbRecoverFatal option. In any case, such a fatal recovery will most likely require some intervention on the part of the user to restore some of the database environment files from a backup.
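
    For example, using the same database environment directory as above (the paths and any other options here are only illustrative and depend on your installation):

        $ db_recover -c -h db/registry
        $ icegridregistry --Ice.Config=<registry config> --Freeze.DbEnv.Registry.DbRecoverFatal=1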

    For more information on Berkeley DB recovery procedures see this link. For information on how to backup and restore a Freeze database environment, check out this section in the Ice manual.

    Cheers,
    Benoit.
  • Thanks, Benoit!

    That gives me several things to try.

    Would there be any negative consequences of using "Freeze.DbEnv.Registry.DbRecoverFatal" along with "icegridnode --deploy" in a collocated registry/node setup?

    ~Josh.