2.14.6. Troubleshooting Tips
If an operation seems to hang, examine the output of:
status ()
Check for the presence of the following conditions:
-
The cluster line shows 0% CPU, no message traffic and an unchanging number of buffers wired. This is probably a bug. To reset, restart the cluster, or only the offending process if it can be identified. A restart is done by executing:
raw_exit ();
over an SQL connection to the process in question.
-
The cluster line shows many threads waiting compared to total threads. If CPU is at 0% and the number of waiting threads does not change, a transaction may be holding locks indefinitely. To clear this, execute:
txn_killall (1);
Do this at a node that has local threads waiting. Waiting local threads are shown in the Lock Status paragraph of status ('') when connected to the node in question.
-
The cluster line shows a changing number in the pfs field. The system is swapping, which slows it down.
-
If status () itself hangs, try it on another process of the cluster. Check that no temporary atomic activity, such as a long checkpoint, is in progress. If the situation persists, there is a bug. A checkpoint in progress is indicated by the presence of the
checkpoint_in_progress
file in each server's working directory.
-
To check the integrity of database files, do:
cl_exec ('backup ''/dev/null''');
If this returns, the databases are OK. If one is found to be corrupt, the corresponding server exits.
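The checkpoint check described above can be scripted when a cluster has many server processes. The following is a minimal sketch in Python, not part of the product. It assumes the marker file is named checkpoint_in_progress (adjust the name if your installation differs) and that you already know each server's working directory. The function name servers_in_checkpoint is hypothetical.

```python
import os
import tempfile

# Assumed marker file name; change it if your installation uses a different one.
MARKER = "checkpoint_in_progress"


def servers_in_checkpoint(working_dirs, marker=MARKER):
    """Return the working directories that currently contain the marker file."""
    return [d for d in working_dirs if os.path.isfile(os.path.join(d, marker))]


if __name__ == "__main__":
    # Demonstration only: two throwaway directories stand in for
    # the working directories of two server processes.
    with tempfile.TemporaryDirectory() as a, tempfile.TemporaryDirectory() as b:
        open(os.path.join(a, MARKER), "w").close()
        busy = servers_in_checkpoint([a, b])
        print("directories with a checkpoint marker:", len(busy))
```

An empty result, while status () still hangs on every process, suggests that the hang is not caused by a long checkpoint.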