https://github.com/voldemort/voldemort
Revision 5708c0cf9cf10a88194399542781bffc164ff27b authored by Arunachalam Thirupathi on 29 July 2014, 20:59:53 UTC, committed by Arunachalam Thirupathi on 29 July 2014, 21:17:38 UTC
1) The threshold failure detector marks a node as available when it receives a non-catastrophic error after the configured number of catastrophic errors. When a node becomes unavailable, there is generally first a burst of ConnectExceptions (catastrophic), followed by timeouts on already established connections. This causes the failure detector to treat a node that is down as up, and it hurts client latency because every get waits for the connection timeout before falling back to the serial requests.

2) The threshold failure detector marks a node as unavailable after a window rolls over. The detector tracks successes and failures in a window (default 5 minutes); if the success percentage drops below the configured value (default 95) and there are more errors than the configured minimum, it marks the node as down. When the window rolls over and the first request succeeds, or when a node comes back up, the failure count is not reset. This causes an available node to be marked as down.

3) Enhance the unit tests to cover the above 2 cases.

4) Dump the statistics when a node goes down, so the logs can be used to reason about it.

The new code is written with the following reasoning: do the bookkeeping first, make the decision next. The previous code mixed the two, which led to many issues. The new code intentionally does not reset the catastrophic errors on a window rollover, as they will be reset by the first success anyway. The node can still flap when the failure count is above the minimum and the success percentage oscillates around the configured percentage. I don't see a good workaround, and it was not an actual issue in the 20+ repetitions I ran for an internal repro, so I am leaving that as it is.

The previous code also left many loose ends: a boolean represented as an int, generic set methods where a reset was required, set methods on individual variables where only a reset method was required, and copy-pasted code.

PS: There is a more serious third issue. The Selector drains the parallel queue. When a request requires connection establishment, the Selector handles the connection too. If the node is dead, the Selector waits the configured time for the connection to time out (default 5 seconds). During this time the Selector is not pumping reads/writes, and hence all these requests eventually time out. We are discussing the potential issues with a fix for this and will address it in upcoming fixes. There are many other minor issues I uncovered as well.
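The "bookkeeping first, decision next" reasoning in this message can be illustrated with a minimal sketch. It is not the code from this commit: the class name `ThresholdSketch`, its fields, and the `record` signature are hypothetical, and the exact comparison used for the minimum failure count is an assumption based on the wording "more errors than configured".

```java
// Hypothetical sketch of a threshold-style failure detector for one node.
// Names and signatures are illustrative only, not Voldemort's actual API.
class ThresholdSketch {
    private final long windowMs;         // window length (the message cites a 5 minute default)
    private final int thresholdPercent;  // minimum success percentage (the message cites 95)
    private final int minimumFailures;   // failures tolerated before the percentage is checked
    private final int maxCatastrophic;   // catastrophic errors that mark the node down

    private long windowStart;
    private long successes;
    private long failures;
    private int catastrophicErrors;
    private boolean available = true;

    ThresholdSketch(long windowMs, int thresholdPercent, int minimumFailures,
                    int maxCatastrophic, long now) {
        this.windowMs = windowMs;
        this.thresholdPercent = thresholdPercent;
        this.minimumFailures = minimumFailures;
        this.maxCatastrophic = maxCatastrophic;
        this.windowStart = now;
    }

    boolean isAvailable() {
        return available;
    }

    // Record one request outcome observed at time 'now' (milliseconds).
    void record(long now, boolean success, boolean catastrophic) {
        // Step 1: bookkeeping only.
        if (now - windowStart >= windowMs) {
            // Window rollover resets the success/failure counts (case 2),
            // but deliberately leaves catastrophicErrors untouched.
            windowStart = now;
            successes = 0;
            failures = 0;
        }
        if (success) {
            successes++;
            catastrophicErrors = 0; // the first success clears catastrophic errors
        } else {
            failures++;
            if (catastrophic) {
                catastrophicErrors++;
            }
            // A non-catastrophic error never clears catastrophicErrors (case 1).
        }

        // Step 2: decision, derived purely from the counters above.
        available = decide();
    }

    private boolean decide() {
        if (catastrophicErrors >= maxCatastrophic) {
            return false;
        }
        if (failures <= minimumFailures) {
            return true; // assumption: the node goes down only with MORE failures than the minimum
        }
        long total = successes + failures;
        return successes * 100 >= (long) thresholdPercent * total;
    }
}
```

Because the decision is recomputed from the counters after every update, a timeout that follows a run of connection failures cannot flip the node back to available, and a success in a fresh window is judged only against that window's counts.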
1 parent 4fd732f
Tip revision: 5708c0cf9cf10a88194399542781bffc164ff27b authored by Arunachalam Thirupathi on 29 July 2014, 20:59:53 UTC
Threshold failure detector issue
File | Mode | Size |
---|---|---|
.settings | ||
bin | ||
clients | ||
config | ||
contrib | ||
docs | ||
example | ||
gradle | ||
private-lib | ||
public-lib | ||
src | ||
test | ||
voldemort-contrib | ||
.gitignore | -rw-r--r-- | 257 bytes |
CONTRIBUTORS | -rw-r--r-- | 659 bytes |
LICENSE | -rw-r--r-- | 11.1 KB |
NOTES | -rw-r--r-- | 2.5 KB |
NOTICE | -rw-r--r-- | 8.1 KB |
README.md | -rw-r--r-- | 4.5 KB |
build.gradle | -rw-r--r-- | 18.0 KB |
build.xml | -rw-r--r-- | 22.1 KB |
gradle.properties | -rw-r--r-- | 1.2 KB |
gradlew | -rwxr-xr-x | 5.0 KB |
gradlew.bat | -rw-r--r-- | 2.3 KB |
release_notes.txt | -rw-r--r-- | 36.2 KB |
settings.gradle | -rw-r--r-- | 149 bytes |
tomcat-tasks.properties | -rw-r--r-- | 420 bytes |
web.xml | -rw-r--r-- | 1.1 KB |