Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bring hardware_check.rs back into service #107

Merged

Conversation

sirhcel
Copy link
Contributor

@sirhcel sirhcel commented Jun 12, 2023

I did not manage to run hardware_check.rs on Linux and macOS because the tests failed due to missing or wrong received data. This happened due to reading with a timeout of zero and not clearing already received data from previous test cases. This is one small step for #106.

This PR addresses these two by:

  • Clearing send and receive buffer at the beginning of each test case
  • Explicitly flushing output data before reading to ensure that all data has been on the wire at this point in time
  • Setting a timeout greater than zero (with 10 ms the tests still felt flaky, and 1,000 ms should give enough timing safety margin as we still have no way to disable timeouts completely yet)

It also includes the following additions:

  • Factoring out the actual test routine into an individual function for reuse
  • Making assertions on buffer data/bytes instead of strings for always showing the received data in case of an assertion error and not bailing out from invalid UTF-8
  • Using the assert_hex crate for assertions on buffer contents
  • Transmitting the test message back and forth instead only in one direction

Successfully ran the tests on Linux, macOS, and Windows.

@jerrywrice
Copy link

jerrywrice commented Jun 12, 2023

Sirhcel - I see that our results are quite similar ...

I've been running/testing example app 'hardware_check.rs' on Windows, and have the following update:

  1. Per Microsoft's documentation, Windows supports a maximum serial port baud rate of 250_000. 'Hardware_check'
    currently uses a top baud rate setting of 2_000_000 in dual port mode, which results in a Windows hangup
    on the corresponding read_exact() invocation following the write() at this invalid baud rate. I revised my version
    of 'hardware_check' to conditionally (for Windows only) limit the max tested baud rate to 250_000 - which resolved the issue.
  2. Related to item 1, when setting the baud rate, it might be useful (in the next major version?) to return an error or
    panic if the specified baud rate is known to be invalid for the target platform.
  3. 'Hardware_check' uses a default flow control port setting of 'SOFTWARE' in function
    'fn set_defaults(port: &mut dyn serialport::SerialPort)'. This leads to potential issues when binary data is transmitted, so I
    changed my version to use 'NONE' as the default flow control.
  4. I also found that when 'Hardware_check' is performed with a connected second com port (Loopback_Port mode), the
    flow-control sub-test initial write/recv_exact sequence in function 'test_dual_ports()' panics/errors due to a data mismatch
    on its 'recv_exact()' invocation. This is because an earlier invocation of function 'test_single_port()' writes port data,
    but can't read that data from its attached secondary port since it has no knowledge/access to the secondary port. I fixed
    this in my version by clearing the buffers at the start of 'test_dual_ports()'.
  5. 'Hardware_check' currently transfers only the fixed text buffer "Test Message". This data pattern contains only printable
    ASCII (Utf-8) u8 values (12 bytes in total) and induces minimal stress on the serial sub-system. I changed
    'test_dual_ports()' to use definable length buffers of internally generated data, and in my version added a new
    application command line argument "--XfrLen=nnn" which permits the test operator to specify a size for the
    'test_dual_ports()' transfer buffers. The default buffer length is 128. The actual data buffer is automatically populated
    with a sequence of repeating bytes from 0..255. The caveat tot his is whenever flow control is set to SOFTWARE, the
    bytes '0x11' and '0x13' (XON and XOFF) are replaced with '0x20' to avoid hanging up the port.
  6. The current 'hardware_check' application performs serial data transfers by first executing a 'write' followed inline from
    the same thread with a corresponding port 'recv_exact'. I understand this approach was straightforward, but leaves open
    the possibility of a run-time deadlock when using larger transfer buffer sizes. Fortunately on Windows I find that buffers
    up to 5K or so did not block (on my laptop). I would like to enhance this test (or create a new test) so each distinct
    port-to-port data transfer executes one or both of its write/recv_exact methods in separate threads. This approach will
    allow configurable length transfer buffers with the expectation that there should never be a application induced deadlock
    due to a blocking write. Furthermore, if the distinct write/read methods are in separate threads, it's certainly feasible to
    have another (separate) monitor thread which uses the calculated transfer duration to detect if the transfer has in-fact
    hung or crashed - and then report this as an error message.
  7. When a serial port's flow-control is set to 'SOFTWARE', as in the latter stages of the current 'hardware_check' example,
    if a 0x13 u8 byte (XOFF) is received on the serial port, the receiving serial driver internally treats this as a special control
    code and temporarily disables/suspends any ongoing transmission of serial data. The receiving port's transmit operation
    is disabled until receipt of a subsequent 0x11 (XON) byte from the remote end-point. The current 'hardware_check'
    example does not in-fact truly test the SOFTWARE flow control setting's correct behavior. Instead it enables it but only
    sends non XON/XOFF data bytes. Fully/correctly testing this feature is somewhat complicated, likely involving
    paired threads and custom communication between these test threads. Note that SOFTWARE flow control
    originated many decades ago with serially attached consoles and remote slave terminals connected to a host computer
    (i.e. multi-user time-shared computers), and allows a terminal operator to press their keyboard's CNTL^S
    (XOFF) key to pause the attached serial ports transmission of text data, and later press the CNTL^Q (XON)
    key to command the host computer to resume sending display data. I understand this 'use case' to be quite rare, and
    certainly for contemporary and new software development projects to in-fact have an actual requirement for this.
    I suggest we simply enhance the documentation with respect to the SOFTWARE flow control settings and clearly state
    it should not be used when transferring binary (non-printable) serial data.
  8. Similar to item 7, when a serial port's flow-control is set to 'HARDWARE', I'm unaware if 'hardware_check' is designed to
    or can reasonable attempt to verify it operates as expected??? I suppose the fact that (large) transfers are successful might be
    good enough?
  9. Although 'Hardware_check' sets the serial port parameter 'timeout' value, it doesn't actually test this to assure it
    functions as expected. I would like to incorporate a timeout setting operational test. This might require prompting
    the test operator to unplug the dual connected ports???

I'm currently awaiting delivery of a loop-back connector so I can execute the application in single port in true loop-back mode.
To this time I've executed it with two serial ports and cables + cross-over.

I expect to have additional crate related test details to discuss when we meet. Once it's decided which testing improvements are appropriate (and how to package them), I'll plan to provide a push request with the associated new code or code updates.

@sirhcel
Copy link
Contributor Author

sirhcel commented Jun 13, 2023

Thank you for your extensive reply @jerrywrice!

I'm happy that we are seeing similar issues and thank you for digging deeper into it! I will digest your response the next days and likely give multiple answers. Just as a starter:

  1. Per Microsoft's documentation, Windows supports a maximum serial port baud rate of 250_000. 'Hardware_check'
    currently uses a top baud rate setting of 2_000_000 in dual port mode, which results in a Windows hangup
    on the corresponding read_exact() invocation following the write() at this invalid baud rate. I revised my version
    of 'hardware_check' to conditionally (for Windows only) limit the max tested baud rate to 250_000 - which resolved the issue.

That's an interesting point. I will have a look with an ocscilloscope at the actual baud rate on the wire. I was running my tests on Windows 10 on real hardware (no VM) with two SiLabs CP2102N USB serial adapters back to back and my Windows survived this tests without showing any issues.

Could you please share a link to this piece of documentation from Microsoft? Which serial hardware are you using? I'm wondering whether this might be caused by a driver when 2 Mbaud are apparently out of spec on Windows.

2. Related to item 1, when setting the baud rate, it might be useful (in the next major version?) to return an error or
panic if the specified baud rate is known to be invalid for the target platform.

I agree with you in general. I have not checked the implementations but set_baud_rate at least defines Result<()> as its return type. Shouldn't the calls down towards the OS indicate such an error condition instead of our library having a "book of acrane knowledge" to reject such cases upfront? Do you see why there is no support for baud rates above 250 kbaud on Windows? I still can hardly imagine this given that nowadays one could easily communicate with at least some Mbaud.

@jerrywrice
Copy link

jerrywrice commented Jun 13, 2023

Christian - as you know my input regarding the maximum baud rate with Windows serial ports is more nuanced. Microsoft doesn't clearly document that there is a 'maximum' (enforced) baud rate setting, rather they document their standard baud rates. I assume this gives their technical support team an 'out' if a customer files a technical incident report and encounters a failure when using a higher 'non standard' rate, such as 2_000_000.

The relevant MS link is https://learn.microsoft.com/en-us/windows/win32/api/winbase/ns-winbase-dcb?redirectedfrom=MSDN

As you know, Windows accepts user specified numeric baud rate values other than those defined in this document page (such as 2_000_000). My understanding is that the Windows OS and I/O system is not (internally) designed and coded as efficiently as Linux, and therefore will not achieve Linux like upper serial transfer speeds. That is my theory - but I don't have the means nor time to verify this.

My development laptop runs Windows x64 Professional 10.0.19045 - with an Intel Xeon W-11955M 2.6 Ghz (2611 Mhz) 8 core CPU (16 logical processors) with 64 Gb of RAM. I'm using (2) null modem connected FTDI USB serial cables with the FTDI driver version 2.12.36.4. At this time my null-modem is in-fact a cable which is about 10 feet long (not ideal!). I'll replace this with a simple connector style null-modem when I get access to one.

I can now transfer serial data successfully (in Windows) up to 550 KBaud, but when I attempt 560 KBaud the recv_exact() errors and panics. I haven't looked at this any deeper - but will likely do so after we discuss and decide what additional (example) test
application code is appropriate.

Note that it should be possible (using threads) to generate a test that iterates and determines a maximum usable baud rate on a given test computer and cable setup. Whether this is the most important thing to deal with at this time is another question ...

@mlsvrts
Copy link
Contributor

mlsvrts commented Jun 13, 2023

It's not possible for this library to determine the maximum possible baudrate for all windows serial devices, as this value depends on the driver implementation.

So I don't think it's correct for us to enforce any limitation on the baudrate value. If the driver fails to indicate when we try to set an unsupported baudrate, that is a bug in the driver -- not serialport-rs.

C#'s serialport implementation indicates much the same: https://learn.microsoft.com/en-us/dotnet/api/system.io.ports.serialport.baudrate?view=dotnet-plat-ext-7.0

Moreover, the water becomes quite muddy when considering USB serial devices. The baudrate 'on the line' and the baudrate in the driver are not comparable, as the data is being sent as USB packets, and is confounded by pesky details like the USB polling rate, packet size, if the driver is sending partial packets, the size of the USB buffer on both ends, etc.

@jerrywrice
Copy link

Agreed. Nevertheless, I suggest that crate supplied example applications avoid utilizing exceptionally high baud rates which are prone to fail with typical hardware (especially on Windows). It creates confusion with users trying the crate for the first time if one of the examples fails 'out of the box'. They'll likely submit an unnecessary PR, or else avoid the crate altogether.

Cheers

@sirhcel
Copy link
Contributor Author

sirhcel commented Jun 14, 2023

Agreed.

Great, than we are gaining the same understanding of baud rate limiting.

Nevertheless, I suggest that crate supplied example applications avoid utilizing exceptionally high baud rates which are prone to fail with typical hardware (especially on Windows). It creates confusion with users trying the crate for the first time if one of the examples fails 'out of the box'. They'll likely submit an unnecessary PR, or else avoid the crate altogether

We should go slow and sort things out first. I bear with you, that we should not ship defaults/examples which commonly fail on a certain platform. I hooked up the logic analyzer last night and can report, that on my Windows machine, the transmission with 2 Mbaud actually happened with 2 Mbaud and went perfectly fine. And if I'm reading it correctly, this works with a 50:50 chance on your systems. I would call it early to label this baud rate failing on typical hardware - especially with issues in the queue like the read timeouts which reportedly are causing more trouble.

What I would love to see in the long term would be a set of test applications for quantifying the properties/quality of a serial transmission channel. Just to type while thinking: With ramping up speeds, testing a serious amount of transmission data, checking flow control, ...

@sirhcel
Copy link
Contributor Author

sirhcel commented Jun 14, 2023

I will continue here for having this discussion in one place, but #106 looks like the better one as there are several points which are not directly related with this PR. With this PR I'm aiming at getting hardware_check.rs to work as-is again while leaving improvements to further ones.

Maybe creating a work-in-progress PR for your changes @jerrywrice might help the discussion.

  1. 'Hardware_check' uses a default flow control port setting of 'SOFTWARE' in function
    'fn set_defaults(port: &mut dyn serialport::SerialPort)'. This leads to potential issues when binary data is transmitted, so I
    changed my version to use 'NONE' as the default flow control.

This sounds reasonable. Basic communication testing should not involve flow control at all and enabling software flow control lays a potential trap for young players.

  1. I also found that when 'Hardware_check' is performed with a connected second com port (Loopback_Port mode), the
    flow-control sub-test initial write/recv_exact sequence in function 'test_dual_ports()' panics/errors due to a data mismatch
    on its 'recv_exact()' invocation. [...]

Agreed and also done in this PR. I also included not converting the received data to UTF-8 to avoid panicking without printing the data actually received in case of bad UTF-8.

  1. 'Hardware_check' currently transfers only the fixed text buffer "Test Message". This data pattern contains only printable
    ASCII (Utf-8) u8 values (12 bytes in total) and induces minimal stress on the serial sub-system. [...]

Agreed. I would leave this to your or further PRs. Once we are opening this field, we should also think about how to test communication between different platforms as this might point out faulty flow control.

  1. The current 'hardware_check' application performs serial data transfers by first executing a 'write' followed inline from
    the same thread with a corresponding port 'recv_exact'. [...]

This is surely a limitation of hardware_check.rs But at this time we are talking about a fixed and known buffer which gets transmitted. For me this also falls into the category multi process/multi platform testing which I would love to tackle in the future but we should leave out for a first step to get going again at all.

  1. When a serial port's flow-control is set to 'SOFTWARE', as in the latter stages of the current 'hardware_check' example,
    if a 0x13 u8 byte (XOFF) is received on the serial port, the receiving serial driver internally treats this as a special control
    code and temporarily disables/suspends any ongoing transmission of serial data. [...]

Looks to me like we are getting into the realm of advanced multi process/multi platform_ testing again.

I'm thinking about this since a while and having a remotely controllable communication pair could help out here. Maybe pyserial's/a RFC 2271 proxy/server/... could perform this task and we don't have to write our own.

  1. Similar to item 7, when a serial port's flow-control is set to 'HARDWARE', I'm unaware if 'hardware_check' is designed to
    or can reasonable attempt to verify it operates as expected??? I suppose the fact that (large) transfers are successful might be
    good enough?

hardware_check.rs does not necessarily needs to be the place. We could and maybe should split up tests among different test cases/test drivers. Hardware flow control could be tested for example with a second remote controlled comunication pair and the test checks the amount of data it still receives after requesting a pause, the continuation after requesting to resume, ... But again, this is something for our shiny future.

@jerrywrice
Copy link

Per your feedback, new comments related to additional example testing will be on #106

@sirhcel
Copy link
Contributor Author

sirhcel commented Aug 6, 2023

There was a lot of general discussion going on here recently. The update to hardware_check.rs is still relevant. Any objections against merging it @eldruin?

Copy link
Contributor

@eldruin eldruin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is great to improve on the tests, thank you!

@eldruin eldruin merged commit 22f854d into serialport:main Aug 7, 2023
@sirhcel sirhcel deleted the bringing-hardware-check-back-to-service branch September 3, 2023 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants