Revision 1758bde2e4aa5ff188d53e7d9d388bbb7e12eebb authored by Lukas Wunner on 28 June 2022, 10:15:08 UTC, committed by Jakub Kicinski on 30 June 2022, 03:38:52 UTC
Upon system sleep, mdio_bus_phy_suspend() stops the phy_state_machine(),
but subsequent interrupts may retrigger it:

They may have been left enabled to facilitate wakeup and are not
quiesced until the ->suspend_noirq() phase.  Unwanted interrupts may
hence occur between mdio_bus_phy_suspend() and dpm_suspend_noirq(),
as well as between dpm_resume_noirq() and mdio_bus_phy_resume().

Retriggering the phy_state_machine() through an interrupt is not only
undesirable for the reason given in mdio_bus_phy_suspend() (freezing it
midway with phydev->lock held), but also because the PHY may be
inaccessible after it's suspended:  Accesses to USB-attached PHYs are
blocked once usb_suspend_both() clears the can_submit flag and PHYs on
PCI network cards may become inaccessible upon suspend as well.

Amend phy_interrupt() to avoid triggering the state machine if the PHY
is suspended.  Signal wakeup instead if the attached net_device or its
parent has been configured as a wakeup source.  (Those conditions are
identical to mdio_bus_phy_may_suspend().)  Postpone handling of the
interrupt until the PHY has resumed.

Before stopping the phy_state_machine() in mdio_bus_phy_suspend(),
wait for a concurrent phy_interrupt() to run to completion.  That is
necessary because phy_interrupt() may have checked the PHY's suspend
status before the system sleep transition commenced and it may thus
retrigger the state machine after it was stopped.

Likewise, after re-enabling interrupt handling in mdio_bus_phy_resume(),
wait for a concurrent phy_interrupt() to complete to ensure that
interrupts which it postponed are properly rerun.

The issue was exposed by commit 1ce8b37241ed ("usbnet: smsc95xx: Forward
PHY interrupts to PHY driver to avoid polling"), but has existed since
forever.

Fixes: 541cd3ee00a4 ("phylib: Fix deadlock on resume")
Link: https://lore.kernel.org/netdev/a5315a8a-32c2-962f-f696-de9a26d30091@samsung.com/
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Tested-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: stable@vger.kernel.org # v2.6.33+
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/b7f386d04e9b5b0e2738f0125743e30676f309ef.1656410895.git.lukas@wunner.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1 parent e65af54
Raw File
rcuref.rst
.. SPDX-License-Identifier: GPL-2.0

====================================================================
Reference-count design for elements of lists/arrays protected by RCU
====================================================================


Please note that the percpu-ref feature is likely your first
stop if you need to combine reference counts and RCU.  Please see
include/linux/percpu-refcount.h for more information.  However, in
those unusual cases where percpu-ref would consume too much memory,
please read on.

------------------------------------------------------------------------

Reference counting on elements of lists which are protected by traditional
reader/writer spinlocks or semaphores are straightforward:

CODE LISTING A::

    1.					    2.
    add()				    search_and_reference()
    {					    {
	alloc_object				read_lock(&list_lock);
	...					search_for_element
	atomic_set(&el->rc, 1);			atomic_inc(&el->rc);
	write_lock(&list_lock);			 ...
	add_element				read_unlock(&list_lock);
	...					...
	write_unlock(&list_lock);	   }
    }

    3.					    4.
    release_referenced()		    delete()
    {					    {
	...					write_lock(&list_lock);
	if(atomic_dec_and_test(&el->rc))	...
	    kfree(el);
	...					remove_element
    }						write_unlock(&list_lock);
						...
						if (atomic_dec_and_test(&el->rc))
						    kfree(el);
						...
					    }

If this list/array is made lock free using RCU as in changing the
write_lock() in add() and delete() to spin_lock() and changing read_lock()
in search_and_reference() to rcu_read_lock(), the atomic_inc() in
search_and_reference() could potentially hold reference to an element which
has already been deleted from the list/array.  Use atomic_inc_not_zero()
in this scenario as follows:

CODE LISTING B::

    1.					    2.
    add()				    search_and_reference()
    {					    {
	alloc_object				rcu_read_lock();
	...					search_for_element
	atomic_set(&el->rc, 1);			if (!atomic_inc_not_zero(&el->rc)) {
	spin_lock(&list_lock);			    rcu_read_unlock();
						    return FAIL;
	add_element				}
	...					...
	spin_unlock(&list_lock);		rcu_read_unlock();
    }					    }
    3.					    4.
    release_referenced()		    delete()
    {					    {
	...					spin_lock(&list_lock);
	if (atomic_dec_and_test(&el->rc))	...
	    call_rcu(&el->head, el_free);	remove_element
	...					spin_unlock(&list_lock);
    }						...
						if (atomic_dec_and_test(&el->rc))
						    call_rcu(&el->head, el_free);
						...
					    }

Sometimes, a reference to the element needs to be obtained in the
update (write) stream.	In such cases, atomic_inc_not_zero() might be
overkill, since we hold the update-side spinlock.  One might instead
use atomic_inc() in such cases.

It is not always convenient to deal with "FAIL" in the
search_and_reference() code path.  In such cases, the
atomic_dec_and_test() may be moved from delete() to el_free()
as follows:

CODE LISTING C::

    1.					    2.
    add()				    search_and_reference()
    {					    {
	alloc_object				rcu_read_lock();
	...					search_for_element
	atomic_set(&el->rc, 1);			atomic_inc(&el->rc);
	spin_lock(&list_lock);			...

	add_element				rcu_read_unlock();
	...				    }
	spin_unlock(&list_lock);	    4.
    }					    delete()
    3.					    {
    release_referenced()			spin_lock(&list_lock);
    {						...
	...					remove_element
	if (atomic_dec_and_test(&el->rc))	spin_unlock(&list_lock);
	    kfree(el);				...
	...					call_rcu(&el->head, el_free);
    }						...
    5.					    }
    void el_free(struct rcu_head *rhp)
    {
	release_referenced();
    }

The key point is that the initial reference added by add() is not removed
until after a grace period has elapsed following removal.  This means that
search_and_reference() cannot find this element, which means that the value
of el->rc cannot increase.  Thus, once it reaches zero, there are no
readers that can or ever will be able to reference the element.	 The
element can therefore safely be freed.	This in turn guarantees that if
any reader finds the element, that reader may safely acquire a reference
without checking the value of the reference counter.

A clear advantage of the RCU-based pattern in listing C over the one
in listing B is that any call to search_and_reference() that locates
a given object will succeed in obtaining a reference to that object,
even given a concurrent invocation of delete() for that same object.
Similarly, a clear advantage of both listings B and C over listing A is
that a call to delete() is not delayed even if there are an arbitrarily
large number of calls to search_and_reference() searching for the same
object that delete() was invoked on.  Instead, all that is delayed is
the eventual invocation of kfree(), which is usually not a problem on
modern computer systems, even the small ones.

In cases where delete() can sleep, synchronize_rcu() can be called from
delete(), so that el_free() can be subsumed into delete as follows::

    4.
    delete()
    {
	spin_lock(&list_lock);
	...
	remove_element
	spin_unlock(&list_lock);
	...
	synchronize_rcu();
	if (atomic_dec_and_test(&el->rc))
	    kfree(el);
	...
    }

As additional examples in the kernel, the pattern in listing C is used by
reference counting of struct pid, while the pattern in listing B is used by
struct posix_acl.
back to top