| Causes (Category 1) | Causes (Category 2) | Symptoms | Countermeasures |
|---|---|---|---|
| Errors caused by resource abnormalities | A failure has occurred in the network between the client and the cluster. | [Server] 130008 [Description] Network connection error [Format] Communication failure cause | If the failure is temporary, set the failover timeout so that the failure is resolved within that period. If the failure occurs regularly, consider measures such as making the network redundant. |
| Error caused by timeout setting | Either the server load is very high, or the load of the relevant transaction itself is high. | [Server] 50000, 50001 [Client] 70000 [Description] Transaction timeout [Format] Timeout elapsed period, partition no., connection address, failure cause | Check whether a single statement of the application can be executed within the transaction timeout period, and set a suitable value (see the client property sketch after this table). This phenomenon can also occur when the load of the entire server becomes high temporarily. Regularly check the total memory reserved for communication messages (/performance/memoryDetail/work.transactionMessageTotal) by executing the gs_stat command with the --memoryDetail option, and see whether it is growing (see the command sketch after this table). In particular, during asynchronous replication, processing may become concentrated temporarily and the load may increase on the backup side. If a timeout occurs because of these causes, change to the semi-synchronous replication mode. |
| Error caused by timeout setting | A deadlock occurs, or a lock is held for an extended period, in the application. | [Server] 50000, 50001 [Client] 70000 [Description] Transaction timeout [Format] Timeout elapsed period, partition no., connection address, failure cause | Check the total memory reserved for communication messages (/performance/memoryDetail/work.transactionMessageTotal) by executing the gs_stat command with the --memoryDetail option (see the command sketch after this table). If this value does not decrease over time, a deadlock or a lock wait has probably occurred. Review the application and terminate the transaction if necessary. Take appropriate action on the application side, as the server will not release the lock if no timeout limit is set. |
| Error caused by timeout setting | A failover timeout has occurred. | [Client] 70000 [Description] Client failover timeout [Format] Timeout elapsed period, partition no., connection address, failure cause | For large-scale data, a cluster failover can take time; in particular, when a single row such as a BLOB is extremely large, the failover (the synchronization process executed at that time) may take a while to complete, so set a longer failover timeout (see the client property sketch after this table). In addition, as failure detection is carried out at the heartbeat interval, a large heartbeat interval means it may take a while before the failure is detected, and the failover time will become longer in this case as well. |
| Error caused by a replication failure | A failover process started before the replication process completed normally. | [Server] 50002 [Description] Update operation continuity check error [Format] Partition no., connection address, failure cause | This symptom may appear when backup data is missing due to the timing of the node failure relative to the sending or receiving of messages in the replication process. As this symptom is especially likely to appear with asynchronous replication, if availability is a priority, consider the trade-off with performance and operate the cluster in the semi-synchronous replication mode (see the gs_cluster.json sketch after this table). |
| Error caused by a stop in the data service at the failover destination | The cluster configuration, which was valid when the failover started, is reset during the failover, leaving the cluster in a sub-cluster status. | [Server] 10010 [Description] Access when a cluster is not composed yet [Format] Partition no., connection address, failure cause | See "Problems related to cluster failure" for details on the causes of cluster failure. If the cluster configuration is reset because half or more of the nodes are down, get new nodes ready and return the cluster to a state in which the number of nodes constituting the cluster can be secured. |
| Error caused by a stop in the data service at the failover destination | Node failures have occurred simultaneously in more nodes than the number of replicas set in gs_cluster.json. | [Server] 10007 [Description] Data service stopped due to detection of data loss [Format] (Master node only) Partition no., LSN (Log Sequence Number) of the latest data including the down node in the corresponding partition, largest LSN in the current cluster, address of the node presumed to hold the latest data (reliability not guaranteed) [Command check] The same information as the error description above can be acquired with gs_partition --loss | As the cluster has detected that data consistency will break down if operation continues, the data service will be stopped for the partition concerned. Although there is a trade-off between availability and performance, if availability is a priority, set the number of replicas in gs_cluster.json to be equal to or higher than the number of nodes that are expected to be down simultaneously (see the gs_cluster.json sketch after this table). |
| Error caused by a stop in the data service at the failover destination | The number of nodes that failed simultaneously was equal to or less than the number of replicas set in gs_cluster.json, but there was a partition whose number of replicas was temporarily insufficient at the point the failure occurred. | [Server] 10007 [Description] Data service stopped due to detection of data loss [Format] (Master node only) Partition no., LSN of the latest data including the down node in the corresponding partition, largest LSN in the current cluster, address of the node presumed to hold the latest data [Command check] Use gs_partition --loss to check after the occurrence. If the system is operating with an insufficient number of replicas, REPLICA_LOSS will appear in /cluster/partitionStatus of gs_stat; check the current availability (see the command sketch after this table). To check the status of individual partitions, use gs_partition to check the number of replicas for each partition. | In order to reduce downtime during a failover, when GridStore judges that synchronizing a certain partition will take a certain amount of time (determined by how large the applicable log is), it first synchronizes only the group of nodes that can be synchronized within a short time and partially starts the data service. In this case, the remaining replicas are created by asynchronous execution in the background, and until they are recovered the cluster operates with an insufficient number of replicas. Therefore, note that even if the number of replicas set in gs_cluster.json is sufficient, this status will result if replica recovery in the background has not completed in time. |
| Error caused by a stop in the data service at the failover destination | A constant configuration error, such as an unstable heartbeat, has been detected in the cluster, and failover is repeated. | [Server] 50003 [Description] Access when a cluster is not composed yet [Format] Partition no., connection address, failure cause [Server] 50004 [Description] Access when a cluster is being composed [Format] Partition no., connection address, failure cause | Cluster failures occur regularly, making the cluster unstable. See "Problems related to cluster failure". |
| Error caused by a stop in the data service at the failover destination | After a network disruption, operation continues in the cluster in which a majority of the nodes can be secured, but the latest data exists on the disconnected side. | [Server] 10007 [Description] Data service stopped due to detection of data loss [Format] (Master node only) Partition no., LSN of the latest data including the down node in the corresponding partition, largest LSN in the current cluster, address of the node presumed to hold the latest data | A network disruption shows the same symptoms as when the disconnected node is down. If the latest data exists on the disconnected side, the data service of the relevant partition will be stopped temporarily, but once the disrupted network returns to normal, the data service that was stopped will restart automatically. |
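
The transaction timeout and failover timeout referred to in the table are set on the client side when connecting. The following is a minimal sketch, assuming the Java client API (com.toshiba.mwcloud.gs) and the connection property names transactionTimeout and failoverTimeout (in seconds); the connection values are placeholders, so check the API reference of your client version for the exact property names and defaults.

```java
import java.util.Properties;

import com.toshiba.mwcloud.gs.GSException;
import com.toshiba.mwcloud.gs.GridStore;
import com.toshiba.mwcloud.gs.GridStoreFactory;

public class TimeoutSettingExample {
    public static void main(String[] args) throws GSException {
        Properties props = new Properties();

        // Connection settings (placeholders -- replace with your environment).
        props.setProperty("notificationAddress", "239.0.0.1");
        props.setProperty("notificationPort", "31999");
        props.setProperty("clusterName", "myCluster");
        props.setProperty("user", "admin");
        props.setProperty("password", "admin");

        // Timeout settings (in seconds). Set transactionTimeout so that a single
        // statement can finish within it under the expected load, and lengthen
        // failoverTimeout when synchronizing large rows (e.g. BLOBs) is expected
        // to take time during failover.
        props.setProperty("transactionTimeout", "300");
        props.setProperty("failoverTimeout", "120");

        GridStore store = GridStoreFactory.getInstance().getGridStore(props);
        try {
            // ... application logic ...
        } finally {
            store.close();
        }
    }
}
```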
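
For the countermeasures that involve the number of replicas and the replication mode, the settings are made in gs_cluster.json. The following is a sketch of the relevant portion only, assuming the parameter names replicationNum (number of replicas) and replicationMode (0: asynchronous, 1: semi-synchronous) under the cluster section, with placeholder values; check the parameter reference of your version before editing the file.

```json
{
  "cluster": {
    "clusterName": "myCluster",
    "replicationNum": 2,
    "replicationMode": 1,
    "heartbeatInterval": "5s",
    "notificationAddress": "239.0.0.1",
    "notificationPort": 20000,
    "notificationInterval": "5s"
  }
}
```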
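
The command checks mentioned in the table can be run regularly as part of monitoring. The following is a sketch that assumes the operating commands are executed on a node with an administrator user admin/admin; the grep filters are only a convenience for picking the relevant items out of the JSON output.

```sh
# Total memory reserved for communication messages; if this value keeps growing
# or never decreases, suspect a transaction overload, a deadlock, or a long lock wait.
gs_stat -u admin/admin --memoryDetail | grep transactionMessageTotal

# Partitions for which data loss has been detected.
gs_partition -u admin/admin --loss

# Overall partition status; REPLICA_LOSS means the cluster is running with an
# insufficient number of replicas for some partitions.
gs_stat -u admin/admin | grep partitionStatus
```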