Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC : empty fifos + hardware mailboxes when re-enabling CAN ? #137

Open
fenugrec opened this issue Nov 16, 2022 · 6 comments
Open

RFC : empty fifos + hardware mailboxes when re-enabling CAN ? #137

fenugrec opened this issue Nov 16, 2022 · 6 comments

Comments

@fenugrec
Copy link
Collaborator

Scenario :

  • device is connected and communicating on bus
  • disconnect bus; this could leave some frames pending transmission in the software fifo and hardware mailboxes.
  • if anything but a hard reset is done, the remaining frames in the fifo and mboxes will be sent on the next bus that we connect to. That is, even ip link set ... down , or a usb-reset, will not clear mboxes etc. May be an undesirable behaviour.

It is not as simple as just modifying can_disable() since this func is also called on USB suspend . IMO a USB suspend+resume cycle should not empty buffers; current behaviour seems ok.

Comments, ideas, counter-examples ?

@marckleinebudde
Copy link
Collaborator

marckleinebudde commented Nov 16, 2022

A ip link set $IFACE down should shut down the CAN controller and leave the bus. A ip link set $IFACE up should reset the CAN controller, program it's registers and join the bus.

During down I would reset the TX FIFO and discard all frames that are queued for CAN TX. Probably the same for frames queued for going to the host via USB.

Shutting down properly down during high bus load is an interesting task...

@marckleinebudde
Copy link
Collaborator

https://github.com/torvalds/linux/blob/master/drivers/net/can/spi/mcp251xfd/mcp251xfd-core.c#L1648

For a Linux driver the (important) code during ifdown is:

static int mcp251xfd_stop(struct net_device *ndev)
{
	struct mcp251xfd_priv *priv = netdev_priv(ndev);

	netif_stop_queue(ndev);                          // Stops the TX queue from the kernel into the driver
	mcp251xfd_chip_interrupts_disable(priv);
	free_irq(ndev->irq, priv);
	mcp251xfd_chip_stop(priv, CAN_STATE_STOPPED);    // Shuts down CAN controller, as much as possible, reset
	mcp251xfd_transceiver_disable(priv);

	pm_runtime_put(ndev->dev.parent);                // power CAN controller down

	return 0;
}

@fenugrec
Copy link
Collaborator Author

Shutting down properly down during high bus load is an interesting task...

Agreed.
I think on the CAN side it's fairly "straightforward" on F0 :

  • make sure USB can't pump more frames into firmware fifo (can't just set EP to NAK since that would kill everything on multichannel)
  • disable CAN RX
  • set ABRQ bit on all mboxes (delay should be deterministic, a function of bitrate ; worst case is a frame that just started transmitting. bxCAN will complete the transmission then empty the mbx)
  • empty firmware fifos
  • option 1: poll all mboxes with a maximum timeout (either fixed, or determined by bitrate ) - not a huge fan, could disrupt USB comms ?
  • option 2: reset bxCAN peripheral, accept possible corrupt frames on the bus.

@marckleinebudde
Copy link
Collaborator

looks good - usually the CAN controllers have different modes, e.g. config mode where you can configure everything and they are not on the bus and a normal mode, where they CAN RX and TX on the bus. joining and leaving the bus takes some time, the controller takes care not to leave the bus during an ongoing TX, you usually have to poll some bit.

@ryedwards
Copy link
Contributor

An example from the HAL...not sure if even they are doing it correctly:

/**
  * @brief  Stop the FDCAN module and enable access to configuration registers.
  * @param  hfdcan pointer to an FDCAN_HandleTypeDef structure that contains
  *         the configuration information for the specified FDCAN.
  * @retval HAL status
  */
HAL_StatusTypeDef HAL_FDCAN_Stop(FDCAN_HandleTypeDef *hfdcan)
{
  uint32_t Counter = 0U;

  if (hfdcan->State == HAL_FDCAN_STATE_BUSY)
  {
    /* Request initialisation */
    SET_BIT(hfdcan->Instance->CCCR, FDCAN_CCCR_INIT);

    /* Wait until the INIT bit into CCCR register is set */
    while ((hfdcan->Instance->CCCR & FDCAN_CCCR_INIT) == 0U)
    {
      /* Check for the Timeout */
      if (Counter > FDCAN_TIMEOUT_VALUE)
      {
        /* Update error code */
        hfdcan->ErrorCode |= HAL_FDCAN_ERROR_TIMEOUT;

        /* Change FDCAN state */
        hfdcan->State = HAL_FDCAN_STATE_ERROR;

        return HAL_ERROR;
      }

      /* Increment counter */
      Counter++;
    }

    /* Reset counter */
    Counter = 0U;

    /* Exit from Sleep mode */
    CLEAR_BIT(hfdcan->Instance->CCCR, FDCAN_CCCR_CSR);

    /* Wait until FDCAN exits sleep mode */
    while ((hfdcan->Instance->CCCR & FDCAN_CCCR_CSA) == FDCAN_CCCR_CSA)
    {
      /* Check for the Timeout */
      if (Counter > FDCAN_TIMEOUT_VALUE)
      {
        /* Update error code */
        hfdcan->ErrorCode |= HAL_FDCAN_ERROR_TIMEOUT;

        /* Change FDCAN state */
        hfdcan->State = HAL_FDCAN_STATE_ERROR;

        return HAL_ERROR;
      }

      /* Increment counter */
      Counter++;
    }

    /* Enable configuration change */
    SET_BIT(hfdcan->Instance->CCCR, FDCAN_CCCR_CCE);

    /* Reset Latest Tx FIFO/Queue Request Buffer Index */
    hfdcan->LatestTxFifoQRequest = 0U;

    /* Change FDCAN peripheral state */
    hfdcan->State = HAL_FDCAN_STATE_READY;

    /* Return function status */
    return HAL_OK;
  }
  else
  {
    /* Update error code */
    hfdcan->ErrorCode |= HAL_FDCAN_ERROR_NOT_STARTED;

    return HAL_ERROR;
  }
}

@marckleinebudde
Copy link
Collaborator

FYI: on the RX-side, we have the same problem and it even crashes the Linux driver if HW Timestamps are enabled.

On the Linux side we first shoot down all pending bulk transfers and then send the reset command to the candlelight. If the CAN bus is busy, this will lead to RX'ed CAN frames in the "to host" queue. All queued RX'ed frames will be send to the host as soon as the driver submits the new bulk urbs.

I'll fix the crash on the Linux side. A proper reset for the RX (and TX) queues should be implemented in the FW, too.

Whissi pushed a commit to Whissi/linux-stable that referenced this issue Jul 27, 2023
commit 5886e4d upstream.

If the gs_usb device driver is unloaded (or unbound) before the
interface is shut down, the USB stack first calls the struct
usb_driver::disconnect and then the struct net_device_ops::ndo_stop
callback.

In gs_usb_disconnect() all pending bulk URBs are killed, i.e. no more
RX'ed CAN frames are send from the USB device to the host. Later in
gs_can_close() a reset control message is send to each CAN channel to
remove the controller from the CAN bus. In this race window the USB
device can still receive CAN frames from the bus and internally queue
them to be send to the host.

At least in the current version of the candlelight firmware, the queue
of received CAN frames is not emptied during the reset command. After
loading (or binding) the gs_usb driver, new URBs are submitted during
the struct net_device_ops::ndo_open callback and the candlelight
firmware starts sending its already queued CAN frames to the host.

However, this scenario was not considered when implementing the
hardware timestamp function. The cycle counter/time counter
infrastructure is set up (gs_usb_timestamp_init()) after the USBs are
submitted, resulting in a NULL pointer dereference if
timecounter_cyc2time() (via the call chain:
gs_usb_receive_bulk_callback() -> gs_usb_set_timestamp() ->
gs_usb_skb_set_timestamp()) is called too early.

Move the gs_usb_timestamp_init() function before the URBs are
submitted to fix this problem.

For a comprehensive solution, we need to consider gs_usb devices with
more than 1 channel. The cycle counter/time counter infrastructure is
setup per channel, but the RX URBs are per device. Once gs_can_open()
of _a_ channel has been called, and URBs have been submitted, the
gs_usb_receive_bulk_callback() can be called for _all_ available
channels, even for channels that are not running, yet. As cycle
counter/time counter has not set up, this will again lead to a NULL
pointer dereference.

Convert the cycle counter/time counter from a "per channel" to a "per
device" functionality. Also set it up, before submitting any URBs to
the device.

Further in gs_usb_receive_bulk_callback(), don't process any URBs for
not started CAN channels, only resubmit the URB.

Fixes: 45dfa45 ("can: gs_usb: add RX and TX hardware timestamp support")
Closes: candle-usb/candleLight_fw#137 (comment)
Cc: [email protected]
Cc: John Whittington <[email protected]>
Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-2-9017cefcd9d5@pengutronix.de
Signed-off-by: Marc Kleine-Budde <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
maryamtahhan pushed a commit to maryamtahhan/linux that referenced this issue Aug 3, 2023
If the gs_usb device driver is unloaded (or unbound) before the
interface is shut down, the USB stack first calls the struct
usb_driver::disconnect and then the struct net_device_ops::ndo_stop
callback.

In gs_usb_disconnect() all pending bulk URBs are killed, i.e. no more
RX'ed CAN frames are send from the USB device to the host. Later in
gs_can_close() a reset control message is send to each CAN channel to
remove the controller from the CAN bus. In this race window the USB
device can still receive CAN frames from the bus and internally queue
them to be send to the host.

At least in the current version of the candlelight firmware, the queue
of received CAN frames is not emptied during the reset command. After
loading (or binding) the gs_usb driver, new URBs are submitted during
the struct net_device_ops::ndo_open callback and the candlelight
firmware starts sending its already queued CAN frames to the host.

However, this scenario was not considered when implementing the
hardware timestamp function. The cycle counter/time counter
infrastructure is set up (gs_usb_timestamp_init()) after the USBs are
submitted, resulting in a NULL pointer dereference if
timecounter_cyc2time() (via the call chain:
gs_usb_receive_bulk_callback() -> gs_usb_set_timestamp() ->
gs_usb_skb_set_timestamp()) is called too early.

Move the gs_usb_timestamp_init() function before the URBs are
submitted to fix this problem.

For a comprehensive solution, we need to consider gs_usb devices with
more than 1 channel. The cycle counter/time counter infrastructure is
setup per channel, but the RX URBs are per device. Once gs_can_open()
of _a_ channel has been called, and URBs have been submitted, the
gs_usb_receive_bulk_callback() can be called for _all_ available
channels, even for channels that are not running, yet. As cycle
counter/time counter has not set up, this will again lead to a NULL
pointer dereference.

Convert the cycle counter/time counter from a "per channel" to a "per
device" functionality. Also set it up, before submitting any URBs to
the device.

Further in gs_usb_receive_bulk_callback(), don't process any URBs for
not started CAN channels, only resubmit the URB.

Fixes: 45dfa45 ("can: gs_usb: add RX and TX hardware timestamp support")
Closes: candle-usb/candleLight_fw#137 (comment)
Cc: [email protected]
Cc: John Whittington <[email protected]>
Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-2-9017cefcd9d5@pengutronix.de
Signed-off-by: Marc Kleine-Budde <[email protected]>
orangecms pushed a commit to orangecms/linux that referenced this issue Aug 20, 2023
commit 5886e4d upstream.

If the gs_usb device driver is unloaded (or unbound) before the
interface is shut down, the USB stack first calls the struct
usb_driver::disconnect and then the struct net_device_ops::ndo_stop
callback.

In gs_usb_disconnect() all pending bulk URBs are killed, i.e. no more
RX'ed CAN frames are send from the USB device to the host. Later in
gs_can_close() a reset control message is send to each CAN channel to
remove the controller from the CAN bus. In this race window the USB
device can still receive CAN frames from the bus and internally queue
them to be send to the host.

At least in the current version of the candlelight firmware, the queue
of received CAN frames is not emptied during the reset command. After
loading (or binding) the gs_usb driver, new URBs are submitted during
the struct net_device_ops::ndo_open callback and the candlelight
firmware starts sending its already queued CAN frames to the host.

However, this scenario was not considered when implementing the
hardware timestamp function. The cycle counter/time counter
infrastructure is set up (gs_usb_timestamp_init()) after the USBs are
submitted, resulting in a NULL pointer dereference if
timecounter_cyc2time() (via the call chain:
gs_usb_receive_bulk_callback() -> gs_usb_set_timestamp() ->
gs_usb_skb_set_timestamp()) is called too early.

Move the gs_usb_timestamp_init() function before the URBs are
submitted to fix this problem.

For a comprehensive solution, we need to consider gs_usb devices with
more than 1 channel. The cycle counter/time counter infrastructure is
setup per channel, but the RX URBs are per device. Once gs_can_open()
of _a_ channel has been called, and URBs have been submitted, the
gs_usb_receive_bulk_callback() can be called for _all_ available
channels, even for channels that are not running, yet. As cycle
counter/time counter has not set up, this will again lead to a NULL
pointer dereference.

Convert the cycle counter/time counter from a "per channel" to a "per
device" functionality. Also set it up, before submitting any URBs to
the device.

Further in gs_usb_receive_bulk_callback(), don't process any URBs for
not started CAN channels, only resubmit the URB.

Fixes: 45dfa45 ("can: gs_usb: add RX and TX hardware timestamp support")
Closes: candle-usb/candleLight_fw#137 (comment)
Cc: [email protected]
Cc: John Whittington <[email protected]>
Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-2-9017cefcd9d5@pengutronix.de
Signed-off-by: Marc Kleine-Budde <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
tuxedo-bot pushed a commit to tuxedocomputers/linux that referenced this issue Oct 4, 2023
BugLink: https://bugs.launchpad.net/bugs/2037005

commit 5886e4d upstream.

If the gs_usb device driver is unloaded (or unbound) before the
interface is shut down, the USB stack first calls the struct
usb_driver::disconnect and then the struct net_device_ops::ndo_stop
callback.

In gs_usb_disconnect() all pending bulk URBs are killed, i.e. no more
RX'ed CAN frames are send from the USB device to the host. Later in
gs_can_close() a reset control message is send to each CAN channel to
remove the controller from the CAN bus. In this race window the USB
device can still receive CAN frames from the bus and internally queue
them to be send to the host.

At least in the current version of the candlelight firmware, the queue
of received CAN frames is not emptied during the reset command. After
loading (or binding) the gs_usb driver, new URBs are submitted during
the struct net_device_ops::ndo_open callback and the candlelight
firmware starts sending its already queued CAN frames to the host.

However, this scenario was not considered when implementing the
hardware timestamp function. The cycle counter/time counter
infrastructure is set up (gs_usb_timestamp_init()) after the USBs are
submitted, resulting in a NULL pointer dereference if
timecounter_cyc2time() (via the call chain:
gs_usb_receive_bulk_callback() -> gs_usb_set_timestamp() ->
gs_usb_skb_set_timestamp()) is called too early.

Move the gs_usb_timestamp_init() function before the URBs are
submitted to fix this problem.

For a comprehensive solution, we need to consider gs_usb devices with
more than 1 channel. The cycle counter/time counter infrastructure is
setup per channel, but the RX URBs are per device. Once gs_can_open()
of _a_ channel has been called, and URBs have been submitted, the
gs_usb_receive_bulk_callback() can be called for _all_ available
channels, even for channels that are not running, yet. As cycle
counter/time counter has not set up, this will again lead to a NULL
pointer dereference.

Convert the cycle counter/time counter from a "per channel" to a "per
device" functionality. Also set it up, before submitting any URBs to
the device.

Further in gs_usb_receive_bulk_callback(), don't process any URBs for
not started CAN channels, only resubmit the URB.

Fixes: 45dfa45 ("can: gs_usb: add RX and TX hardware timestamp support")
Closes: candle-usb/candleLight_fw#137 (comment)
Cc: [email protected]
Cc: John Whittington <[email protected]>
Link: https://lore.kernel.org/all/20230716-gs_usb-fix-time-stamp-counter-v1-2-9017cefcd9d5@pengutronix.de
Signed-off-by: Marc Kleine-Budde <[email protected]>
Signed-off-by: Greg Kroah-Hartman <[email protected]>
Signed-off-by: Kamal Mostafa <[email protected]>
Signed-off-by: Roxana Nicolescu <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants