CTCE links fail using TSAF under VM #640
Hi Jeff, Interesting, I never tried VM's TSAF, but spent quite some effort before I got SSI to work with CTCE's. Am I correct to assume that even the newer z/VM's also still support TSAF? I would certainly like to make TSAF work as well, and actually already started investigating the logs you sent. The first pair, JS01 <-> JS04, already shows that the TSAF startup does not work, which in the past for me was always a big stumbling block, as just about every CTC link CCW program solves the startup problem differently. Solving this TSAF - CTCE problem will take some time, also because I'll be offline soon until the second half of March. So please bear with me. Cheers, Peter |
Thanks for looking at this, Peter. I'll write up a doc about how I have TSAF set up between two z/VMs; hopefully that will be enough to get you started. One thing to be aware of: the TSAF on SP5 is incompatible with the TSAF on the other versions, so don't compare those logs (JS02 and JS10) hoping to find a fix for ESA and z/VM. When I run SP5 as a guest of VM/ESA, the VCTC links come up, but the SP5 side immediately drops with a message:
I don't know how much their CTC drivers differ, but given that SP5 is the only 370 mode OS in the mix, it could be a lot. I'm also working to get a VM trace of the VCTC when TSAF comes up successfully with JS01 running as a guest of JS04. I should have more on that soon; I just need to do some research first. I've done a VM trace before, but it's been a while and I didn't really know what I was doing then... Don't worry about the timing. I've got plenty to do until this gets going. Just be sure to let me know if there's anything I can do to help it along. Jeff |
Hey Peter, Here's version 0.1 of the document on running TSAF over a VCTCA. I have a section in there for running it over a CTCE, but I'll flesh that out when we get further into this process and I can get past the first step without it failing. FYI, this is rough and rambles a bit, but I'm hoping you'll be able to follow it and recreate what I've done. I figured it was more important to get it to you sooner and work on polishing it later. Jeff |
I'm researching the VM I/O TRACE options and looking at what's available, with an eye toward getting it as close as possible to what Hercules provides so it'll be easier to compare and contrast. z/VM provides the ability to trace the following I/O "types": SIO, SIOF, SSCH, RIO, RSCH
I was guessing that the SSCH would be most comparable to what you already have, but maybe it would be SIO? What do you think? From what I've read so far, I believe I can trace multiple types. So, if you want SSCH and RSCH for example, that should be doable too. Or if you're like me, maybe you want to trace them all in the hopes you'll recognize something? Jeff |
After a bit more reading in the manuals, I found the "trace i/o ccw" command that looks like a good match for what we need. I created the attached trace and log files for TSAFVM on both the host and guest VMs. Just a reminder, JS16 is the host VM. VCTC 807 in JS16's TSAFVM couples to VCTC 816 on the JS07 (guest) VM. In the below attached zip, I have included the following files:
Jeff |
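For reference, the command family mentioned above is CP's class G TRACE facility; a CCW-level trace of a virtual CTC is requested via its I/O and CCW operands. The operand order and options shown below are assumptions from memory rather than a verified recipe, and device 807 is simply the address used in this thread, so check the CP Commands and Utilities Reference for your VM level before relying on it:

#CP TRACE I/O 807 CCW RUN
 ... recreate the TSAF link startup here ...
#CP TRACE END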
Interesting Jeff! The traces show how easy the TSAF startup CCW's apparently are using VCTC's: No OPeration (NOP), Set Extended Mode (SEM), REaD, Sense Adapter State (SAS) on one side only, and then just REaD's and WRiTes. So what's my CTCE code doing wrong? I'll try to set up TSAF between two z/VM's, as per usual to first reproduce the problem, so I can start debugging for real. But, as I said, I'm going to be offline for almost 2 weeks, for a little winter vacation over here. So please bear with me. Cheers, Peter |
Great, I'm glad it's useful. Let me know if I can gather any more information or if I can help you recreate the environment for testing. Don't worry about the downtime; it'll take me a day just to reorganize my To Do list and see which of the many things I have listed I want to play with while you're out! Jeff |
Hi Jeff, This is just a short heads-up message, to confirm that I've been able to reproduce the problem between two z/VM 7.3 systems, and that effectively my debugging has started. So please bear with me. Cheers, Peter |
Thanks. If there's any way I can help, just let me know. Jeff |
Hi Jeff, I think I'm getting close to where the problem lies. Between my two z/VM 7.3 systems I got the TSAF CTCE connection working by adding the additional keyword FICON as a trailing option on the CTCE configuration statement. This might not work on older VM's, but I'd like to know if that's the case. The main difference -- and I hope that's all that is needed -- is that a FICON CTC starts in the so-called extended mode, whereas the other CTC emulations start in basic mode. If the FICON option fixes TSAF on older VM's as well, then we could consider the problem fixed. If not, then I could add another option, say SEM (for Set Extended Mode, this is also the CTC CCW command with the same function). I'm quite interested in how that FICON option works for you. Cheers, Peter |
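For anyone wanting to try the same thing, a CTCE device statement carrying the trailing FICON keyword might look roughly like the sketch below. The device number, IP address, port numbers and operand order here are illustrative assumptions only; the authoritative syntax is in the Hercules CTCE documentation, so check that before copying anything:

* Hypothetical CTCE definition: local device 0807 talking to a peer Hercules
* at 192.168.1.16 (operand order and values are assumptions - see the CTCE docs)
0807  CTCE  30880  192.168.1.16  30880  FICON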
Hi Peter, I've got some quick results for you; I'll have a more detailed report with any-any tests later. VM/SP5 - As feared, no joy. Same error messages as before:
VM/ESA - moderate success: I had to do a
<start TSAF...>
z/VM 4.0 - z/VM 6.3 - almost total success! The CTCAs came online automatically at IPL time. When I started the links in TSAF there was an error message about setting 370 mode, but then the links connected and the collection formed:
Like I said, I'll do an any-any test tonight, just to make sure ESA can talk to z/VM and vice-versa, but I wanted to get you some quick results to show the progress you've made. To be honest, if there are no problems with those tests, I can live with the current results for VM/ESA and z/VM. VM/SP is more of an issue but less of a need, since I don't have any TSAF clients that can talk over the link yet. Thanks! |
Hi Peter, OK, this was more fun than I expected... I took this opportunity to get current on Hercules on my Pis, since I was running last October's code there until now. Once I did, I found that the VM SP images would connect with FICON as well. Once I discovered that, I re-upgraded my Windows box to the Hercules level for the Pis (listed below) and retested there, but it still fails. If you'd like, I can capture some CCW traces for this link on the Pi and then on Windows for comparison.

Here are the test results for all my images on Pis:

Notes:

JS02-JS10 (the VM/SP 5s) connected to each other and stayed up for over 2 hours, then initially failed during my "formal" testing. I had to shutdown the VMs, devinit and IPL to get them reconnected, and even then I had to add and delete links a couple of times to get them to connect. JS02 and JS10 connected to other peers and then dropped immediately due to TSAF version mismatches (marked "drop" above). I counted that a success, since the link had to connect for the software to detect the mismatch.

JS07 connected to JS01 during my initial testing. Thereafter, it failed until I did a shutdown, devinit, IPL on Wednesday.

JS10-JS16: On Tuesday, I was unable to get this to connect, even though JS02 (the other VM/SP) connects to JS16. On Wednesday, like JS07, it came right up after a shutdown, devinit, IPL.

On Wednesday, after some additional testing, I was able to get all four non-SP nodes in the same collection. My configuration is a hub and spoke, with JS01 as the hub and the other three nodes connected to it. Prior to this, I had all 4 nodes connected in various configurations, but the hub and spoke hadn't been successful yet, though now I think I had it right once or twice and just wasn't patient enough to allow it to resolve the application issues, thinking they were link problems.

After having the hub and spoke up for a while, "something" (apparently unrelated to the links) happened on the Pi running JS01 (the hub) and JS04 (a spoke). I had to reboot the Pi. As an experiment, I left the other two nodes running, untouched. When I brought JS01 back up, they reconnected immediately and everything took off. Things were a little unstable with TSAF for a few minutes, but those were application issues (authentication failures) rather than problems with the links. For now, I'm going to keep this scenario up and start passing some files back and forth over it, to see if that detects any issues.

The two SP nodes were connected in their own collection (after some effort, but I got there). Instead of the "shutdown, devinit, IPL" sequence, I tried varying the device offline on both ends and devinit. It seemed to have worked in getting the two SP5 images connected. Unfortunately, the TSAF collection was not stable. It looked like the nodes could connect but then couldn't talk over the link. The link timed out and reconnected and tried to reform the collection, just to fail again. I shutdown both systems, including restarting Hercules. Once I re-IPLed I was able to reconnect them and everything was stable again. I'll keep working on trying to get a client for this scenario, but for now, I don't have any additional testing I can do to cause a load on the link.

I am getting a lot of

Thanks, |
Hi Jeff, Thanks for your testing feedback! I think we may have things working with the following fix, even without using the FICON option on the CTCE configuration statement. The corrections are just these (the corrected ctcadpt.c, shown as a diff):

--- ctcadpt_47.c 2024-03-16 17:10:50.632114881 +0100
+++ SDL-hyperion/ctcadpt.c 2024-03-28 10:22:50.986839075 +0100
@@ -1874,12 +1874,13 @@
// We merge a Unit Check in case the Y state is Not Ready.
// But only when pUnitStat is still 0 or Unit Check or Busy (no Attn).
- // SENSE bit 1 for Intervention Required will be set as well.
+ // sense byte 0 bit 1 (Intervention Required) will be set,
+ // and also bit 7 (Interface Discconect / Operation Check).
if( IS_CTCE_YNR( pDEVBLK -> ctceyState ) &&
( ( *pUnitStat & (~ ( CSW_BUSY | CSW_UC ) ) ) == 0 ) )
{
*pUnitStat |= CSW_UC;
- pDEVBLK->sense[0] = SENSE_IR;
+ pDEVBLK->sense[0] = ( SENSE_IR | SENSE_OC );
}
// Produce a CTCE Trace logging if requested, noting that for the
@@ -1956,7 +1957,9 @@
// SetSIDInfo( dev, 0x3088, 0x61, ... ); CISCO 7206 CLAW protocol ESCON connected
// SetSIDInfo( dev, 0x3088, 0x62, ... ); OSA/D device
// But the orignal CTCX_init had this :
- SetSIDInfo( dev, 0x3088, 0x08, 0x3088, 0x01 );
+// SetSIDInfo( dev, 0x3088, 0x08, 0x3088, 0x01 );
+// Which is what we used until we made the VM TSAF connection work as well which needed :
+ SetSIDInfo( dev, 0x3088, 0x08, 0x0000, 0x01 );
dev->numsense = 2;
// A version 4 only feature ...
@@ -3082,10 +3085,11 @@
CTCE_Info.state_y_prev = pDEVBLK->ctceyState;
// A system reset at power up initialisation must result in
- // sense byte 0 bit 4 set showing Intervention Required.
+ // sense byte 0 bit 1 (Intervention Required) to be set,
+ // and also bit 7 (Interface Discconect / Operation Check).
if( pDEVBLK->ctce_system_reset )
{
- pDEVBLK->sense[0] = SENSE_IR;
+ pDEVBLK->sense[0] = ( SENSE_IR | SENSE_OC );
}
// Reset the y-command register to 0, clear any WEOF state,

The changes include the addition, in two places, of the setting of Sense byte 0 bit 7, the Operation Check. I had observed TSAF somehow complaining about the sense byte being x'40' whereas your VCTC trace always showed x'41'. So now CTCE also produces x'41', assuming that the CTC resets are so-called "selective resets". The more important change is the setting of the "Device Type" to x'0000' instead of x'3088' in the SetSIDInfo() call.

By the way, this TSAF experience is an extra for me, as I think it has the simplest and most straightforward CTC CCW program I've come across, as you can see from the Hercules logs I've included (EULERR_843.log and EULERS_812.log). These show only the

On the z/VM OPERATOR console on one of the two systems I tested with (nodename EULER73S, link address at its side is 0812; the 0800 link addresses are from earlier tests), I observe a delay of more than a minute prior to it successfully connecting to the other side:
This can also be seen in the above Hercules log.

I've tested numerous times, but I'll await your feedback on this before committing this. Cheers, Peter |
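To make the sense-byte arithmetic behind the fix above concrete: Intervention Required is bit 1 of sense byte 0 (x'40') and Interface Disconnect / Operation Check is bit 7 (x'01'), so OR-ing the two yields the x'41' seen in the VCTC trace. Here is a minimal stand-alone C sketch; the #define values mirror the usual Hercules sense-bit definitions, but the snippet is illustrative rather than lifted from ctcadpt.c:

#include <stdio.h>

/* Sense byte 0 bit assignments (bit 0 = 0x80 ... bit 7 = 0x01),
   mirroring the flag values used in the Hercules source. */
#define SENSE_IR  0x40   /* bit 1: Intervention Required             */
#define SENSE_OC  0x01   /* bit 7: Interface Disconnect / Op. Check  */

int main(void)
{
    unsigned char sense0 = (SENSE_IR | SENSE_OC);  /* what the fixed CTCE code now sets */
    printf("sense[0] = x'%02X'\n", sense0);        /* prints x'41', matching the VCTC trace */
    return 0;
}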
Hi Peter, Here are my results from Thursday's initial testing. My initial testing was more general, setting up my desired solution and seeing how close we are to having it work. A few things came up that I documented below to give you something to look at while I concentrate on specific, end-to-end tests for all nodes and platforms. If there's anything else specific you'd like me to test, please let me know.

I updated ctcadpt.c and rebuilt Hercules on all my platforms except Pi1 (it's not used in testing, but I kept it for comparison if needed later) and I removed FICON from all CTCE definitions. I'm seeing a lot fewer HHC05079I messages, but I'm still seeing some, mostly WAITs. For examples, I've attached JS01 hercules log.txt and JS02 hercules log.txt. It appears JS01 (the "hub") logs messages when talking to other VM levels (SP and z/VM) but not when talking to another VM/ESA (JS04). You can tell by the device numbers in the log. In the SP systems, I'm using devices in the 7xx range; in the non-SP systems, in the 8xx range. The "nn" is the number of the node being connected to (i.e. 804 on JS01 connects to 801 on JS04). Just a reminder:

General

I noticed that the SP systems only brought online CTCs at IPL that had something on the other end. I was able to vary them on though, without having to start anything else. JS02 - only 701 came online at IPL. It was the only other running system at the time. I did not have to do any "set rdevices" for the CTCs to come online to any system.

Windows

I started each link on JS01 and then went to the other node and attempted to connect to it. JS01-JS02: I could not get them connected at all. SP-SP test JS02-JS10: no joy, tried various combinations with no connection.

Notes

I varied the CTC online to JS01 (the CTC came online automatically on JS04 after the devinit) and attached them to TSAFVM. When I added the links to TSAF, it came up after about a minute and JOINed the collection. Two hours later, all the non-SP system links were still up and in the collection. At that point, I shut down the Windows images to test on the Pis. When shutting down the systems in reverse sequence, I saw the following. I'm guessing the unit checks began when I shut down the remote Hercules. I didn't notice these until I got to JS01. So, the '40'X bit is still getting through somewhere, apparently related to the other end of the link going away.
As verification, a minute after I shutdown TSAF on JS16 on the Pi on Friday morning, I saw the following at JS01:
Nothing changed when I shut down z/VM, but about a minute after shutting down Hercules, I started receiving:
These messages repeat every minute. Deleting the link on JS01 stops them from recurring. I did not see these messages when shutting down the SP images.

PIs

I started each link on JS01 and then went to the other node and attempted to connect to it. JS01-JS02: after a couple of minutes the link came up by itself and then dropped due to incompatible software levels.

SP-SP: The JS02-JS10 link eventually came up after bouncing the CTCs several times and waiting over 5:00. At one point TSAFVM on JS02 abended trying to manage storage (i.e. in DMSFREE), but I just bumped the VM size up and restarted it. I figured that was an application error and probably not related to the CTCs. I'm sure we're one PUT tape away from that being fixed... :)

On Friday, I stopped TSAF on JS10 and then waited to see how long it would take the other end to notice. After 10 minutes of nothing on JS02, I logged off the TSAFVM virtual machine. Then JS02 immediately noticed the link go down.

Notes

I left the PIs up overnight and when I checked them today, I found repetitions of the following on the consoles:
From looking at the Hercules log, some of these appear to correlate to

Well, hopefully that will give you something to look at while I do my in-depth testing tonight. I'll send out a status message tomorrow with more updates. Thanks, |
Hi Jeff, Thanks for your feedback and all your tests! A few minutes ago I committed a291e7e to the development branch which will also avoid the TSAF messages concerning sense '40'X. As to the SP testing: did you also try without the ATTNDELAY 200 option on the CTCE configuration? Cheers, Peter |
Peter, Good news on the commit. I'll update my images soon, so my version doesn't show "-modified." I had not done any testing with ATTNDELAY removed. I took it off today and retested the failing Windows links. They all now work. JS01->JS02 failed the first time I tried it, but when I went back afterwards and tried to get a trace for you, it worked 3 times in a row. It has since also worked with trace off. Now, off to my PI testing... :) Thanks, |
Hi Peter, Looks like good news on the PIs as well. I removed ATTNDELAY from the SP images and retested everything with the modified ctcadpt.c module, and everything worked. I got some HHC05079I error messages, but nowhere near as many as before. Some of them were for links not currently in use, so they will go away when this goes into my "production" network. In almost 5 hours of Hercules uptime, the following HHC05079I message counts were logged:
After my any-any test, I linked the two SP systems in one collection and the other 4 systems in another. I also started VTAM on 4 of the images (JS01, JS02, JS04, JS16). I left all the images up for about 3 hours when I took time off for dinner. When I came back, they were all still connected, with no VTAM or TSAF error messages. Thanks, |
Peter,
Unfortunately, this did not hold overnight for the non-SP collection. The SP5-SP5 collection stayed up all night and was fine when I shut it down this morning, logging only 1 HHC05079I message all night. However, none of the non-SP links survived the night. By morning, all the links had failed and each system's collection had reverted to only the local host. So far, I'm not seeing an obvious correlation between any Hercules and OPERATOR log events. There are a lot of HHC05079I messages and some of them seem to be the obvious trigger to TSAF events, like this one.
Then there are other HHC05079Is that seemingly go unnoticed...
I.e. nothing logged for almost an hour after this HHC message. And other TSAF events with no obvious causes.
I.e. separated by time and system from the previous message. By the way, I have not seen any HHC WAIT messages with a Busy_Waits value under 3. Is it possible they're just not being reported but still impacting TSAF? Maybe that could be the explanation for some of the "untriggered" TSAF events. In the attached zip file (Logs0401.zip), I have included Hercules and Operator logs for each system in the collection. I have trimmed the files down to data from midnight to 13:45 on Monday, April 1. The raw logs included data from my any-any tests and the system shutdowns that just added confusion. I have also included:
If I could find a way to recreate one of these events, I could easily capture a trace for you to look at, but at this point, unless you have a better idea, all I can think of is to start the trace and let it run for possibly hours, until the event recurred. Then we can trim the files to maybe the most recent 1/2 hour of activity for comparison. Is there anything else you might want to see in the trace, like the start up sequence for each link? While you get a chance to review this and respond, I'll try a first trace attempt. I'll limit it to two systems, JS01 as the hub and JS16. I'll start that up now and see what I get in a few hours. Thanks, |
Hi Peter, Well, it's been several hours and I haven't seen an error more significant than this synchronization message, though it has happened multiple times. Obviously something happened to make TSAF resynchronize, but I don't know what. No link errors have happened and neither peer has dropped from the collection. I went ahead and grep'ed out 1/2 hour that includes the event (from 00:20:00 to 00:49:00) in the Hercules trace log. Hopefully, that will be enough to give you an idea of what was going on at the time. If you need more data, just ask. In the meantime, I'll leave this scenario running overnight and see if we catch anything more interesting. If we don't see something more impactful by tomorrow, I'll probably try a different "remote" node (e.g. switch JS16 out for JS07). I don't want to have to put all 4 systems back up and manage all that trace data if we don't have to. JS01 - Operator log
JS16 - Operator log
As for the trace logs, they're over 1.5M each, so I have zipped them up: Traces.zip Thanks, |
Hi Jeff, Thanks again for your testing and feedback! The

I am not sure if these Busy_Wait cases are related to the TSAF link outages / synchronization issues you experienced. What I encountered in all of my years of CTCE bug hunting is that on at least two occasions I eventually found out I had intermittent problems with network interface cards (NICs). In both cases it was with a NIC on a motherboard, which I fixed by purchasing an additional new NIC card. The last time it happened was 2 months ago, and I replaced that guilty NIC with a USB-plug 2.5 Gbit/sec NIC device. In both cases inspecting the packet retransmission counters on the routers showed that the faulty NIC devices indeed had had problems. Certainly my first experience cost me a lot of time lost searching for a cause that did not apply at all. Frustrating, these not-solid-broken but just intermittent NIC problems! Networking people always suggest checking the packet error counters whenever there is trouble.

Your TSAF query link status shows delays of 109 and 62 msecs, which seems a lot. Mine show around 15 msecs.

Concerning CTCE tracing: the most efficient one for me is produced with

Cheers, Peter |
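As a practical aside on Peter's suggestion to check the packet error counters: on the Pis themselves, the standard Linux tools already expose per-interface drop and error totals, for example (eth0 being the interface name already used elsewhere in this thread):

ip -s link show eth0     # per-interface RX/TX packets, errors, dropped, overruns
ethtool -S eth0          # driver-level NIC counters, where the driver supports them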
Hi Peter, I'm ashamed to say, as a former network engineer, I had not considered the real physical hardware in my PD efforts until now. :) I was so tunnel-visioned on the CTCs presented to the OSes by Hercules, I hadn't even thought of the "real" network under them. That would help to explain why there were no errors between JS01 and JS04, since they both resided on one Pi and errors were occurring between JS01-JS07 and JS01-JS16, since JS07 and JS16 reside on a different Pi. So, today, I've got two paths going. In an effort to eliminate hardware as a contributor, I have updated my Windows box to the latest commit with the ctcadpt.c changes and I started all 4 non-SP images (1,4,7,16) there. I changed my configuration statements to use 127.0.0.1 as the IP address on these images, so this traffic should never even touch hardware. Once I got the images all up and the links started, I saw the link delay of 15 that you mentioned, as well. Debugging is on and I'll leave this running overnight to see if we catch anything. On my Pis, I have updated Hercules to the latest commit on all of them. I also rebooted them, to reset the error counters. Some of them had been up almost two months, so I can't say how recent the drops they recorded are. Part of the problem may be performance-related as well. Running multiple VM/ESA or z/VM images on a Pi 4 may tax it to the point where it can't quite keep up with the network and fails intermittently. I'll try running only one VM image on each Pi. That'll only give me three nodes in the collection, but every one of them will be traversing a physical network link. For now, I'll forego any real testing with SP. I'm working on a copy of VM/SP 6, to get TSAF working and test with that along with SP 5. TSAF on SP 5 generates a bunch of blank messages which may or may not indicate an issue. Maybe TSAF on SP 6 will be more informative. Thanks, |
Hi Peter, Here are the logs from the 4-node collection on Windows. These images were all running on the same host, so there should be no physical network issues to deal with. As you'll see from the logs, I started the first system (JS01, the hub) around 15:45 yesterday. By 16:00, all nodes were up and in the collection. JS01 logged a "Timeliness check failed" message at 18:57 that affected the JS01-JS16 link and then apparently caused a waterfall of JOINs and collection synchronization that lasted a few minutes, until 19:06.
I'm assuming that indicates a drop or delay in receiving traffic from the CTC, but I'll let you decode the debug information to find out. It looks like things were then stable until 07:59:54 this morning, when JS16 for some reason logged a message indicating JS07 had been dropped from the collection, even though the hub, JS01, had not logged a similar message. JS16 then logged some authentication checks and finally a completion time expired message, at which point it reset and began attempting to rejoin the collection. It was never successful, even though it tried several times during the day. At 08:00, the other nodes began logging errors and eventually all the nodes ended up disconnected and in their own 1-node collections. At approximately 14:20, I logged in and noticed all the links were down. Before shutting down the nodes, I tried restarting TSAF on JS01 and when that didn't recover the links, I restarted a couple of them individually. I had to recycle TSAF on JS16 to get it reconnected. FYI, I looked up the ATSMRX520I message and confirmed it is non-impactful. It looks like it's generated when any node joins or leaves the collection and then approximately every 2 hours (apparently based on "Sync period: 7200 seconds" from the "q status" display) during stable operations.
Please take a look and see if you find anything anomalous. I'll start collecting the logs for the PI-based collection. Thanks! Thanks, |
Hi Peter, Here are the logs from the 3-node, PI-based collection. Again, JS01 is the hub and this time there are two remote nodes, JS07 and JS16. I started bringing up the collection at 17:10 and it was completely up and synchronized by 17:16. Link delays were mostly 31 with one 15, so greatly improved over what we've seen in the past on this platform. By 20:06 we had our first issue, with JS07 being deleted from the collection due to an authentication check failure. JS07 rejoined the collection and then 4 minutes later... We have a new error message today! ATSNRC601E Frame discarded. Hop-count limit reached.
I'll be interested to hear your evaluation of what might be happening with this. Typically, I would expect to see hop-count limit issues with a much larger network. This is 3 nodes; you can't get any smaller and still have message routing vs. just talking to your peer. This is certainly the problem of the day. We got many of these and the collection never really recovered until I logged on at about 16:30 today and restarted JS01 and the links. As for the packet drops on the interface, all the PIs experienced them. Interestingly, all 3 of the PIs dropped 90 packets. This implies to me that the problem may be with my Netgear switch instead of the individual PIs. In each case, soon after I reloaded the PIs yesterday, they all had either 5 (1 PI) or 6 (2 PIs) drops. Today, after shutting down Hercules on them, they had either 95 (the same PI that had 5) or 96 drops. Fortunately, TCP is a "lossless" protocol and retransmits the dropped packets; unfortunately, the time it takes to do that may cause issues for CTCE and/or TSAF. For example, interface eth0 on Pi3 (hosting JS01): Before -
After -
shows it experienced 90 drops. FYI, you'll see greatly increased TX bytes on JS16 and RX bytes on JS07. That was caused by me moving JS07 from PI4, which was originally hosting JS07 and JS16, to PI2 so it could run on its own platform. Please take a look when you get a chance and if you have any questions, give me a shout. Thanks! Thanks, |
Forgive me for treading into unfamiliar territory, but just out of curiosity, are the two nodes in question (JS01 and JS16) on different physical machines? Even though I know absolutely nothing about TSAF, I'm wondering if guest TOD clock synchronization might be an issue here? Another thought (question for Peter): do your communications links disable Nagle? If not, might that be something else that could cause such problems? (i.e. a delay in timely transmission of short messages due to Nagle?) Just some thoughts. Please ignore if unwarranted. |
Hi Fish,
No, in this scenario, all the nodes in the collection were running on one system, my Windows box. Thanks, |
Just out of curiosity, the next time you run your tests, issue the Hercules clocks command on both systems. Given that they're both running on the same physical system, I would expect that they both should be within a few microseconds of each other, but definitely within at least a millisecond of each other. Note however that this might prove to be difficult (if not impossible!) to prove, given that you can't exactly issue both commands individually at the exact same time! I wish there was some way to do that, but I can't for the life of me think of a way to do it. (Batch file maybe??) Still, it might prove to be interesting?? (Or, ... it could be a complete waste of time too!) |
Hi Fish, It turns out that the best I could do manually with the clocks command was a couple seconds. However, TSAF does have a "query status" command that shows various information about the collection.
Unfortunately, there are no units indicated for the clock deviation numbers, so I don't know if those are microseconds, milliseconds, or God help us, seconds. :) The IBM manual for present-day z/VM is not very helpful. Its description of the Query Status command includes:
Thanks, |
Hi Fish, Concerning your Nagle question :
Yes, CTCE disables Nagle, and if disabling Nagle wouldn't work, message

All my TSAF testing was using z/VM 7.3 on two hosts, a regular i7-6700K 4-core 32GB PC with Ubuntu 22.04, and an 8GB RPI4 with RPI OS "bullseye" (the official RPI Debian derivative). The

I suspect TSAF nowadays is just left as is, as I think z/VM's SSI kind of replaces TSAF. CTCE supports SSI quite reliably, but SSI uses a more sophisticated CTCE setup, using the recommended 2 CTC's per link, so that each direction (send / receive) gets its own dedicated CTC. My VTAM setup between 4 z/VM 7.2 SSI systems uses 4 CTC's per link, using VTAM MPC's, i.e. 2 send/receive pairs for additional redundancy should a connection problem develop.

Nevertheless, I think the current CTCE fixes make VM's TSAF facility work fine, but within the limitations of what might be temporary intermittent communication problems. CTCE, using TCP sockets to supply the basic communication, cannot really match the original native channel-to-channel devices when it comes to error recovery from intermittent communication problems. That's a restriction we may have to live with. Cheers, Peter |
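For readers not familiar with how Nagle gets switched off: it is the standard TCP_NODELAY socket option. The snippet below is a generic POSIX C sketch of that call, not code taken from CTCE itself:

#include <netinet/in.h>
#include <netinet/tcp.h>      /* TCP_NODELAY */
#include <sys/socket.h>

/* Disable Nagle on a connected TCP socket so that small writes (such as
 * short CTC control exchanges) are transmitted immediately instead of
 * being coalesced.  Returns 0 on success, -1 on error with errno set. */
static int disable_nagle(int sockfd)
{
    int one = 1;
    return setsockopt(sockfd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}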
Some additional clock timing info which I use. Cheers, Peter |
Hi Peter, Some good news, some remaining questions...
It seems we're pretty similarly equipped. My Windows box is an i7-6700K 4-core CPU @ 4.00 GHz with 48GB. Instead of Ubuntu, I'm running Windows 10 Pro. The 3 PIs I've been testing with are all 8 GB PI 4's running bullseye.
I have a PI 2 that I use for network services. It gets time from the Internet and my other PIs get their time from it, so while it's not a stratum-1 source, it is at least common to all the PIs.
In my initial testing, I was getting much higher delay times, and my main tests 2 days ago were a lot better, with most being 31 and one being 15. Over last night, I was able to retest everything (see below) and I got consistent 15s on all links.
I haven't looked into SSI, but it appears VM/ESA also supports it, since occasionally I'll mistype a "q collect" command for TSAF and get a CP response instead. If it supports file-sharing, it may be a good fit for my ESA and z/VM requirements. Then I'll only have TSAF on the VM/SPs to deal with.
I would agree with that based on further PI testing I did last night, but I don't see how there would have been any network/communication errors on the collection that all ran on Windows.

Last night

Overnight, I restarted both the test scenarios I reported on 4/3. To put it briefly, they both ran very well.

A - the Windows scenario

The Windows-based 4-node collection came up cleanly and stayed up with no issues. I didn't make any changes to the environment or Hercules configurations, but I got much better results than last time. Honestly, I just thought I was going to play with the "clocks" command and once I got the Windows systems up to test it, I figured, "what the heck, let's see what happens overnight." One HHC05079I message was logged:
but it had no visible effect on TSAF and everything is still up and clean. I brought up SFSPOOL1 (the shared file server) today and I'll try passing some files around to see if that causes any issues.

B - the PI scenario

I moved the PIs to a Cisco 3560 switch. The 3560 is only 100Mbps and it's very noisy, so this is not a long-term solution, but it proved to be a good test. The PIs ran all night with no drops or network errors of any kind and TSAF was stable. There was an issue initially with authentication errors that prevented me from bringing JS07 (z/VM 4.4) up with JS01 (VM/ESA 2.4) as the hub, but when I moved it to JS16 (z/VM 6.3), everything came up and stayed up all night. (Maybe there are some TSAF fixes on z/VM that came in after VM/ESA?) Today, I dynamically moved JS07 back to JS01 (i.e. add link to JS01, remove link from JS16), effectively making JS01 the hub again, and I'll see how that runs. So far, no problems. The link delay times are all 15. Initially, one of the links was 31, but after a short time that changed to 15 as well. As long as this stays stable, I'll bring up SFSPOOL1 here, too, and play with it to see what happens.

In this scenario, more HHC05079I messages were logged, but none had any visible effect on TSAF. JS01 logged 12, with one at 00:10:46 being seemingly unrelated to any TSAF messages and the rest appearing to be the result of me starting the link to JS07 at 13:44 to make JS01 the new hub. JS16 logged 3 messages. One occurred during the start up of the link to JS07, another coincides with a "Synchronization is now NORMAL" message, and the third doesn't match any Operator messages.

Summary

Note, in neither scenario was any logging or debugging turned on. I was able to capture Hercules and Operator logs, but not the extended information that would have been provided by the debug command.

Questions

I guess my remaining question for this testing is why was the Windows scenario successful this time and not last? Is it possible the debug command introduces delay or some other factor and that this throws off CTCE? I'm assuming you tested with debug on as well, so that seems unlikely or you'd have seen it too. Beyond that, I can't come up with anything.

Also, since the VM/ESA and z/VM situations seem to be getting resolved, are there additional tests you'd like to see with the VM/SP systems? I've got TSAF available on VM/SP 6 now, so the blank messages that SP5 produced will hopefully be correlatable to something visible on SP6. Also, I'll soon have a shared file system and another SP6 image, so we could concentrate on testing with them, if there's something remaining to look into.

Thanks, |
Hi,

Windows

PIs

370-mode

SSI (ISFC)

Per https://www.vm.ibm.com/ssi/
Based on this, I might try playing with ISFC on the VM/ESA and z/VM OSes next, since Peter reports good results using SSI on z/VM. If that works, I could then limit my TSAF efforts to VM/SP. Thanks, |
Hi, Here are summaries of my weekend testing.
The TSAF links on Windows did not even stay up for an hour before they started dropping and needed Operator intervention to reconnect. The TSAF VM even abended, requiring all the links to be restarted. After that, the links stayed up for approximately 10 hours before the collection fell apart. In addition, at one point overnight, there was an error on a VTAM CTC link, causing it to fail.
The PI-based tests were up longer, but still eventually failed. First, there were issues getting JS16 to join the collection on JS01, so I made JS16 the hub and had the others connect to it. It continued to run this way for over 12 hours before I added link 807 to JS01 and deleted it from JS16 to see if I could move the hub back to VM/ESA. The transition went smoothly and TSAF stayed up in this configuration for about 36 hours before the collection fell apart and all 3 nodes became isolated. No network drops occurred during any of this testing.

Summary

FYI, I have ordered a 3560cx switch to resolve the packet drops. It's a fanless 3560 that has gigabit ports, so it should be a great replacement for my Netgear switch. Having an intelligent switch as the core of my "production" network will be nice, too.

Peter, thanks for all your help. I'm glad you were able to get an enhancement to the CTCE adapter driver out of all this, at least. If there is anything else you'd like to try, please let me know. If and when I try TSAF on the VM/SPs and run into any issues, I'll let you know. Thanks, |
Hi Jeff, Thanks! I understand my hopes for a perfectly stable TSAF-over-CTCE are not fulfilled yet, but I nevertheless think the outcome of this issue was a positive one, in that at least TSAF connections are now possible. I will leave this issue open for a few more days, but unless there are serious objections, I'd propose to close it after that. Cheers, Peter |
Hi, FYI, after about 22 hours, the two VM/SP 5 systems are the only ones left in the collection. All the links have failed except for one link between the 2 SP5 nodes. I'm going to put my production network back up and just keep TSAF around for short-term use when needed. I'll close this issue now. Thanks! Thanks, |
Hi,
I've run into a couple issues with CTCEs running with TSAF (Transparent Services Access Facility) on VM. Could someone take a look at these for me and help me find a way to get this working? Thanks!
Please note:
When using VM OSes more advanced than VM/SP 5, TSAF fails to open CTCE links. The same links work when a virtual CTCA is used.
When the link fails to come up, the following messages are generated (where "801" is the link address):
OSes tested include:
I have included configurations as well as Hercules, TSAFVM and OPERATOR logs for 2 VM/ESA systems trying to establish a link in the attached zip file. The Hercules logs include CTC debugging information.
JS04 and JS01 are VM/ESA 2.4.0 systems, connected by a CTCE link. Address 801 on JS04 connects to address 804 on JS01.
Files in the zip file:
Additionally, when connecting two VM/SP 5 images, TSAF worked when I was first playing with it on a year-old version of Hercules. However, when I updated to the current dev-level code to submit this issue, that broke the VM/SP 5 links. When attempting to bring up the link, TSAF reported the following errors (where 710 is the link address); however, this appears to be cosmetic, since the link came up on the old version of Hercules with these errors:
In the attached zip file, I have included Hercules configurations as well as Hercules, TSAFVM and OPERATOR logs for 2 VM/SP systems trying to establish a link, both in the old version of Hercules that worked and the new version that failed. The Hercules logs include CTCE debugging information.
JS02 and JS10 are VM/SP 5 systems, connected by a CTCE link. Address 710 on JS02 connects to address 702 on JS10.
Files in the zip file: