https://github.com/voldemort/voldemort
Revision 5708c0cf9cf10a88194399542781bffc164ff27b authored by Arunachalam Thirupathi on 29 July 2014, 20:59:53 UTC, committed by Arunachalam Thirupathi on 29 July 2014, 21:17:38 UTC
1) The threshold failure detector marks a node as available when it
receives a non-catastrophic error after the configured number of
catastrophic errors. When a node becomes unavailable, there is generally
a burst of connectExceptions (catastrophic) mixed with timeouts on
already established connections. As a result, the failure detector treats
a node that is down as up, which hurts client latency: every get waits
for the connection timeout and then falls back to the serial
requests.
2) The threshold failure detector marks a node as unavailable after a
window rolls over. It tracks successes and failures within a window
(default 5 minutes); if the success percentage drops below the
configured value (default 95) and there are more errors than the
configured minimum, it marks the node as down. When the window rolls
over and the first request succeeds, or when a node comes back up, the
failure count is not reset. This causes an available node to be marked
as down.
3) Enhance the unit tests to cover the two cases above.
4) Dump the statistics to the logs when a node goes down, so the
decision can be reasoned about afterwards.

The new code is written with the following reasoning.

Do the bookkeeping first.
Make the decision next.

The new code intentionally does not reset the catastrophic error count
on a window rollover, as the first success resets it anyway. The node
can still flap when the failure count is above the minimum and the
success percentage oscillates around the configured percentage. I don't
see a good workaround, and it was not an actual issue in the 20+
repetitions I ran for an internal repro, so I am leaving it as it is.
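
The "bookkeeping first, decision next" reasoning can be illustrated with
a minimal Java sketch. This is not the code from this commit: the class
name NodeWindowStats, its fields, and the constructor parameters are all
hypothetical, and the strictness of the comparisons is an assumption
based only on the description above.

```java
// Hypothetical sketch; names and structure are illustrative only and
// are not taken from Voldemort's ThresholdFailureDetector.
final class NodeWindowStats {

    private final long windowMs;          // window length, e.g. 5 minutes
    private final int thresholdPercent;   // minimum success percentage, e.g. 95
    private final int minimumFailures;    // errors required before the percentage matters
    private final int catastrophicLimit;  // catastrophic errors that mark the node down

    private long windowStartMs;
    private long successes;
    private long failures;
    private int catastrophicErrors;
    private boolean available = true;

    NodeWindowStats(long windowMs, int thresholdPercent, int minimumFailures,
                    int catastrophicLimit, long nowMs) {
        this.windowMs = windowMs;
        this.thresholdPercent = thresholdPercent;
        this.minimumFailures = minimumFailures;
        this.catastrophicLimit = catastrophicLimit;
        this.windowStartMs = nowMs;
    }

    synchronized void recordSuccess(long nowMs) {
        // Bookkeeping first.
        rollWindowIfNeeded(nowMs);
        successes++;
        catastrophicErrors = 0; // only a success clears catastrophic errors
        // Decision next.
        decide();
    }

    synchronized void recordFailure(boolean catastrophic, long nowMs) {
        // Bookkeeping first.
        rollWindowIfNeeded(nowMs);
        failures++;
        if (catastrophic) {
            catastrophicErrors++;
        }
        // Decision next.
        decide();
    }

    synchronized boolean isAvailable() {
        return available;
    }

    // Resets both counters on rollover (issue 2). The catastrophic error
    // count is deliberately left alone; the first success resets it.
    private void rollWindowIfNeeded(long nowMs) {
        if (nowMs - windowStartMs >= windowMs) {
            windowStartMs = nowMs;
            successes = 0;
            failures = 0;
        }
    }

    // Pure decision over the current counters. A non-catastrophic error
    // cannot flip the node back to available (issue 1), because it never
    // lowers catastrophicErrors.
    private void decide() {
        if (catastrophicErrors >= catastrophicLimit) {
            available = false;
            return;
        }
        long total = successes + failures;
        long successPercent = (total == 0) ? 100 : (successes * 100) / total;
        // Assumption: "more errors than configured" is a strict comparison.
        available = !(failures > minimumFailures && successPercent < thresholdPercent);
    }
}
```

Because every record method updates the counters and only then calls
decide(), the availability flag is always derived from a consistent set
of counters, which is the separation the reasoning above asks for.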

The previous code mixed bookkeeping and decision making, which led to
many issues. It also left many loose ends (a boolean represented as an
int, generic set methods where a reset was required, set methods on
individual variables where only a reset method is required, and
copy-pasted code).

PS: There is a more serious third issue: the Selector drains the
parallel queue. When a request requires connection establishment, the
Selector handles the connection too. If the node is dead, the Selector
waits the configured time for the connection to time out (default 5
seconds). During this time the Selector is not pumping reads/writes, and
hence all these requests eventually time out. We are discussing the
potential issues with a fix for this and will address it in the next
fixes. I uncovered many other minor issues as well.
1 parent 4fd732f
Threshold failure detector issue
File Mode Size
.settings
bin
clients
config
contrib
docs
example
gradle
private-lib
public-lib
src
test
voldemort-contrib
.gitignore -rw-r--r-- 257 bytes
CONTRIBUTORS -rw-r--r-- 659 bytes
LICENSE -rw-r--r-- 11.1 KB
NOTES -rw-r--r-- 2.5 KB
NOTICE -rw-r--r-- 8.1 KB
README.md -rw-r--r-- 4.5 KB
build.gradle -rw-r--r-- 18.0 KB
build.xml -rw-r--r-- 22.1 KB
gradle.properties -rw-r--r-- 1.2 KB
gradlew -rwxr-xr-x 5.0 KB
gradlew.bat -rw-r--r-- 2.3 KB
release_notes.txt -rw-r--r-- 36.2 KB
settings.gradle -rw-r--r-- 149 bytes
tomcat-tasks.properties -rw-r--r-- 420 bytes
web.xml -rw-r--r-- 1.1 KB
