Revision e95584a889e1902fdf1ded9712e2c3c3083baf96 authored by Tuong Lien on 02 October 2019, 11:49:43 UTC, committed by David S. Miller on 02 October 2019, 15:02:05 UTC
We have identified a problem with the "oversubscription" policy in the
link transmission code.

When small messages are transmitted, and the sending link has reached
the transmit window limit, those messages will be bundled and put into
the link backlog queue. However, a bundle of data messages is counted
at the 'CRITICAL' level, so the counter for that level is increased
instead of the counter for the real level of the bundled messages.
Subsequent, to-be-bundled data messages at non-CRITICAL levels continue
to be tested against the unchanged counter for their own level, while
contributing to an unrestrained increase at the CRITICAL backlog level.
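
The accounting gap described above can be reproduced in a small
user-space model. This is only a sketch: the struct, the function names
and the limits are hypothetical and are not the kernel's data
structures. It assumes every call opens a new bundle, and that a
congested sender is simply refused.

```c
#include <assert.h>

/* Hypothetical, simplified model; names are illustrative only. */
enum { LOW, MEDIUM, HIGH, CRITICAL, NUM_LEVELS };

struct backlog_model {
	int len[NUM_LEVELS];   /* queued buffers counted per level */
	int limit[NUM_LEVELS]; /* congestion threshold per level */
};

/*
 * Old accounting as described in the text: a small message is admitted
 * based on the counter of its own level, but the bundle it opens is
 * charged to CRITICAL.
 * Returns 0 if admitted, -1 if the sender's level is congested.
 */
static int old_queue_new_bundle(struct backlog_model *b, int level)
{
	assert(level >= 0 && level < NUM_LEVELS);
	if (b->len[level] >= b->limit[level])
		return -1;
	b->len[CRITICAL]++; /* wrong counter whenever level != CRITICAL */
	return 0;
}

/* Open n bundles from a sender at 'level'; return how many got in. */
static int old_flood(struct backlog_model *b, int level, int n)
{
	int admitted = 0;

	for (int i = 0; i < n; i++)
		if (old_queue_new_bundle(b, level) == 0)
			admitted++;
	return admitted;
}
```

In this model, a 'LOW' sender is never stopped because its own counter
stays at zero, while the CRITICAL counter grows past its limit and a
genuine CRITICAL sender is refused.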

This leaves a gap in the congestion control algorithm for small
messages that can result in starvation of other users or of a "real"
CRITICAL user. Eventually, it can even lead to buffer exhaustion and a
link reset.

We fix this by keeping a 'target_bskb' buffer pointer for each level,
so that when bundling, we only bundle messages of the same importance
level. This way, we know exactly how many slots a certain level has
occupied in the queue, and can manage level congestion accurately.
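
A minimal user-space sketch of the per-level idea follows. Apart from
the notion of one bundling target per level, everything here is an
assumption made for illustration: the struct layout, the 1500-byte MTU,
the fixed pool, and the choice to refuse a message outright when its
level is at the limit. How the real link reacts to congestion is
outside this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; only the per-level target idea is from the text. */
enum { LOW, MEDIUM, HIGH, CRITICAL, NUM_LEVELS };
enum { MODEL_MTU = 1500, BUNDLE_HDR = 40, POOL_SIZE = 64 };

struct bundle_model {
	int used; /* bytes in the bundle, header included */
	int msgs; /* payload messages carried */
};

struct level_model {
	int len;                     /* queue slots held by this level */
	int limit;                   /* congestion threshold */
	struct bundle_model *target; /* open bundle for this level, or NULL */
};

struct link_model {
	struct level_model lvl[NUM_LEVELS];
	struct bundle_model pool[POOL_SIZE];
	int pool_used;
};

/* Queue one small message. Returns 0 if queued, -1 if level congested. */
static int queue_small(struct link_model *l, int level, int size)
{
	struct level_model *lv;

	assert(level >= 0 && level < NUM_LEVELS);
	assert(size > 0 && size + BUNDLE_HDR <= MODEL_MTU);
	lv = &l->lvl[level];

	if (lv->len >= lv->limit)
		return -1;

	/* Join the open bundle of the same level if there is room. */
	if (lv->target && lv->target->used + size <= MODEL_MTU) {
		lv->target->used += size;
		lv->target->msgs++;
		return 0;
	}

	/* Otherwise open a new bundle and charge it to this level. */
	assert(l->pool_used < POOL_SIZE);
	lv->target = &l->pool[l->pool_used++];
	lv->target->used = BUNDLE_HDR + size;
	lv->target->msgs = 1;
	lv->len++;
	return 0;
}
```

Because a bundle is only ever charged to the level of the messages it
carries, a flood of 'LOW' messages in this model stops at the 'LOW'
limit and leaves the CRITICAL counter untouched.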

Bundling messages at the same level brings a further benefit. Consider
this scenario:
- One socket sends 64-byte messages at the 'CRITICAL' level;
- Another sends 4096-byte messages at the 'LOW' level.

When a 64-byte message arrives and is bundled for the first time, we
pay the bundling overhead for it (+ 40-byte header, data copy, etc.) in
anticipation of later messages, but the next message can be a 4096-byte
one that cannot be added to that bundle. This means the bundle carries
only one payload message, which is totally inefficient, for the
receiver as well! Later on, another 64-byte message arrives, we make a
new bundle, and the same story repeats...

With the new bundling algorithm, this will not happen: the 64-byte
messages will be bundled together even when 4096-byte message(s) come
in between. However, if the 4096-byte messages are sent at the same
level, i.e. 'CRITICAL', the bundling algorithm will again cause the
same overhead.
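
The difference in bundle counts can be checked with a small comparison.
This is a sketch under stated assumptions: the "tail-only" function is a
reading of the earlier behaviour as described in this message, the
1500-byte MTU is assumed, and all names are hypothetical. Any message
too large for a bundle is treated as queued on its own.

```c
#include <assert.h>

/* Hypothetical model; names, MTU and policies are illustrative only. */
enum { LOW, MEDIUM, HIGH, CRITICAL, NUM_LEVELS };
enum { MODEL_MTU = 1500, BUNDLE_HDR = 40 };

struct msg_model {
	int level;
	int size; /* bytes, assumed positive */
};

/* A small message may only join a bundle sitting at the queue tail. */
static int bundles_tail_only(const struct msg_model *m, int n)
{
	int bundles = 0;
	int tail_room = -1; /* free bytes in tail bundle; -1 if none */

	for (int i = 0; i < n; i++) {
		if (m[i].size + BUNDLE_HDR > MODEL_MTU) {
			tail_room = -1; /* large message becomes the tail */
			continue;
		}
		if (tail_room >= m[i].size) {
			tail_room -= m[i].size;
			continue;
		}
		bundles++;
		tail_room = MODEL_MTU - BUNDLE_HDR - m[i].size;
	}
	return bundles;
}

/* A small message joins the open bundle of its own level, if any. */
static int bundles_per_level(const struct msg_model *m, int n)
{
	int bundles = 0;
	int room[NUM_LEVELS];

	for (int k = 0; k < NUM_LEVELS; k++)
		room[k] = -1;

	for (int i = 0; i < n; i++) {
		int lv = m[i].level;

		assert(lv >= 0 && lv < NUM_LEVELS);
		if (m[i].size + BUNDLE_HDR > MODEL_MTU) {
			room[lv] = -1; /* closes only this level's bundle */
			continue;
		}
		if (room[lv] >= m[i].size) {
			room[lv] -= m[i].size;
			continue;
		}
		bundles++;
		room[lv] = MODEL_MTU - BUNDLE_HDR - m[i].size;
	}
	return bundles;
}
```

Feeding both functions an alternating sequence of 64-byte 'CRITICAL'
and 4096-byte 'LOW' messages shows the per-level policy reusing a single
bundle where the tail-only policy opens one per small message. Moving
the large messages to 'CRITICAL' makes the per-level policy open one
bundle per small message again, matching the caveat above.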

Also, the same will happen even with only one socket sending small
messages at a rate close to the link's transmit rate, so that when one
message is bundled, it is transmitted shortly afterwards. Then another
message comes, a new bundle is created, and so on...

We will solve this issue fundamentally in another patch.

Fixes: 365ad353c256 ("tipc: reduce risk of user starvation during link congestion")
Reported-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
1 parent a761129
File Mode Size
9p
adfs
affs
afs
autofs
befs
bfs
btrfs
cachefiles
ceph
cifs
coda
configfs
cramfs
crypto
debugfs
devpts
dlm
ecryptfs
efivarfs
efs
erofs
exportfs
ext2
ext4
f2fs
fat
freevxfs
fscache
fuse
gfs2
hfs
hfsplus
hostfs
hpfs
hugetlbfs
iomap
isofs
jbd2
jffs2
jfs
kernfs
lockd
minix
nfs
nfs_common
nfsd
nilfs2
nls
notify
ntfs
ocfs2
omfs
openpromfs
orangefs
overlayfs
proc
pstore
qnx4
qnx6
quota
ramfs
reiserfs
romfs
squashfs
sysfs
sysv
tracefs
ubifs
udf
ufs
unicode
verity
xfs
Kconfig -rw-r--r-- 7.6 KB
Kconfig.binfmt -rw-r--r-- 7.6 KB
Makefile -rw-r--r-- 4.4 KB
aio.c -rw-r--r-- 56.0 KB
anon_inodes.c -rw-r--r-- 4.6 KB
attr.c -rw-r--r-- 9.6 KB
bad_inode.c -rw-r--r-- 5.3 KB
binfmt_aout.c -rw-r--r-- 8.3 KB
binfmt_elf.c -rw-r--r-- 63.3 KB
binfmt_elf_fdpic.c -rw-r--r-- 47.1 KB
binfmt_em86.c -rw-r--r-- 2.8 KB
binfmt_flat.c -rw-r--r-- 28.0 KB
binfmt_misc.c -rw-r--r-- 18.5 KB
binfmt_script.c -rw-r--r-- 4.4 KB
block_dev.c -rw-r--r-- 55.9 KB
buffer.c -rw-r--r-- 90.2 KB
char_dev.c -rw-r--r-- 16.5 KB
compat.c -rw-r--r-- 3.2 KB
compat_binfmt_elf.c -rw-r--r-- 3.2 KB
compat_ioctl.c -rw-r--r-- 31.0 KB
coredump.c -rw-r--r-- 22.1 KB
d_path.c -rw-r--r-- 11.3 KB
dax.c -rw-r--r-- 45.8 KB
dcache.c -rw-r--r-- 83.9 KB
dcookies.c -rw-r--r-- 7.1 KB
direct-io.c -rw-r--r-- 40.8 KB
drop_caches.c -rw-r--r-- 1.8 KB
eventfd.c -rw-r--r-- 11.1 KB
eventpoll.c -rw-r--r-- 64.5 KB
exec.c -rw-r--r-- 46.9 KB
fcntl.c -rw-r--r-- 23.3 KB
fhandle.c -rw-r--r-- 6.8 KB
file.c -rw-r--r-- 24.2 KB
file_table.c -rw-r--r-- 10.2 KB
filesystems.c -rw-r--r-- 6.4 KB
fs-writeback.c -rw-r--r-- 74.3 KB
fs_context.c -rw-r--r-- 18.1 KB
fs_parser.c -rw-r--r-- 11.0 KB
fs_pin.c -rw-r--r-- 1.9 KB
fs_struct.c -rw-r--r-- 3.3 KB
fs_types.c -rw-r--r-- 2.5 KB
fsopen.c -rw-r--r-- 11.2 KB
inode.c -rw-r--r-- 60.7 KB
internal.h -rw-r--r-- 5.1 KB
io_uring.c -rw-r--r-- 94.1 KB
ioctl.c -rw-r--r-- 17.7 KB
libfs.c -rw-r--r-- 32.7 KB
locks.c -rw-r--r-- 78.9 KB
mbcache.c -rw-r--r-- 12.0 KB
mount.h -rw-r--r-- 4.0 KB
mpage.c -rw-r--r-- 21.1 KB
namei.c -rw-r--r-- 122.8 KB
namespace.c -rw-r--r-- 97.0 KB
no-block.c -rw-r--r-- 478 bytes
nsfs.c -rw-r--r-- 6.1 KB
open.c -rw-r--r-- 30.2 KB
pipe.c -rw-r--r-- 27.7 KB
pnode.c -rw-r--r-- 15.1 KB
pnode.h -rw-r--r-- 1.9 KB
posix_acl.c -rw-r--r-- 21.5 KB
proc_namespace.c -rw-r--r-- 7.8 KB
read_write.c -rw-r--r-- 51.6 KB
readdir.c -rw-r--r-- 11.3 KB
select.c -rw-r--r-- 34.2 KB
seq_file.c -rw-r--r-- 24.7 KB
signalfd.c -rw-r--r-- 9.0 KB
splice.c -rw-r--r-- 40.2 KB
stack.c -rw-r--r-- 2.5 KB
stat.c -rw-r--r-- 19.4 KB
statfs.c -rw-r--r-- 9.9 KB
super.c -rw-r--r-- 47.8 KB
sync.c -rw-r--r-- 10.4 KB
timerfd.c -rw-r--r-- 13.5 KB
userfaultfd.c -rw-r--r-- 51.2 KB
utimes.c -rw-r--r-- 7.3 KB
xattr.c -rw-r--r-- 23.5 KB