Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

initial implementation of runUntilPrompt goroutine #11

Closed
wants to merge 6 commits into from

Conversation

hellt
Copy link
Contributor

@hellt hellt commented Apr 15, 2021

@ogenstad this PR addresses the issue of having a custom crypto/ssh package

Wanted let you have a look at a draft version of a goroutine-based approach

@hellt hellt marked this pull request as ready for review April 15, 2021 13:13
@hellt
Copy link
Contributor Author

hellt commented Apr 15, 2021

@ogenstad I wonder what your tests will indicate to the proposed changes.
I have also changed the timeout test to make sure the code doesn't have a chance to finish reading from in-mem reader

@ogenstad
Copy link
Member

Thanks! I've done a few tests and it seems to work fine. As I understand it the started go routine should end when the readUntilPrompt method is returned regardless off if it's stuck or not, is that correct according to you?

Unfortunately I noticed that I'd forgotten to add the linting job to the CI, would you mind running it locally with mage and fix the linting issues? There's only two minor issues.

At the same time you can perhaps delete the section in the README that mentions the previous ssh fork?

@hellt
Copy link
Contributor Author

hellt commented Apr 15, 2021

Thanks! I've done a few tests and it seems to work fine. As I understand it the started go routine should end when the readUntilPrompt method is returned regardless off if it's stuck or not, is that correct according to you?

hm, it may be that the goroutine may actually hang...
Let's see if my way more savvy Go colleague knows the answer to the following q, since we discussed this piece of code.

@karimra, does it seem to you as well, that in this part of the code if the Read blocks then the goroutine will leak by hanging indefinitely.

The main readUntilPrompt will not suffer from it, and the execution will end because ctx.Done will kick in, but the goroutine will not be cancelled.

I am not sure that this is too bad, because it seems the goroutine will end once the transport SSH session will be closed, si likely mem leaking will not happen?

ADD1: I haven't read it yet, but maybe there is a workaround to that https://benjamincongdon.me/blog/2020/04/23/Cancelable-Reads-in-Go/

@hellt
Copy link
Contributor Author

hellt commented Apr 15, 2021

That article doesn't solve the problem it seems.

I wonder if changing io.Reader interface to io.ReadCloser is something that will actually solve it?

Essentially what we pass as a reader is the SSH Channel, which implements Close() https://github.com/golang/crypto/blob/4f45737414dc36c02b62c3adac99418bd1fa70db/ssh/session.go#L133

So I wonder if by changing the interface to ReadCloser we can just call Close() when the context signals us and that will close the channel and unblock the reader and goroutine will finish

@karimra
Copy link
Contributor

karimra commented Apr 16, 2021

@hellt, @ogenstad is right that goroutine will hang since the contextReader.Read() will be already on the default section of select.

I tried to come up with an alternative solution by modifying the contextReader.Read implementation and avoid select default :

type readResult struct {
	n   int
	err error
}

func (c *contextReader) read(p []byte) *readResult {
	n, err := c.r.Read(p)
	return &readResult{n, err}
}

func (c *contextReader) Read(p []byte) (int, error) {
	ctx, cancel := context.WithCancel(c.ctx) // this can also be a timeoutContext to create a local read timeout
	defer cancel()
	rrCh := make(chan *readResult)

	go func() {
		select {
		case rrCh <- c.read(p):
		case <-ctx.Done():
		}
	}()

	select {
	case <-ctx.Done():
		return 0, ctx.Err()
	case rr := <-rrCh:
		return rr.n, rr.err
	}
}

This creates a cancellable context local to the Read method, the started goroutine selects on sending a readResult or the timeout of the main context (so the local ctx as well) which should stop the goroutine.
I believe this goroutine always stops at the end of Read

@ogenstad
Copy link
Member

Thanks @karimra! I would assume that the block would occur anytime if the final io.Reader that's used is blocking. As you indicate @hellt, I'd also assume that any leaking go routines would end once the ssh session is terminated.

However, I did some tests where I ran some commands in a way that would block the reader since the expected prompt wouldn't be found.

og-ap-01#ping        <-- First command, blocks when the device stops sending output after 'Protocol [ip]'
Protocol [ip]:           <- Second command just send '' for [enter]
Target IP address: 8.8.8.8  <- Third command send '8.8.8.8'
Repeat count [5]:  <-- continue sending enter
Datagram size [100]:
Timeout in seconds [2]:
Extended commands [n]:
Sweep range of sizes [n]: <-- final command finally finds the prompt and returns the expected output
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 8.8.8.8, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/4 ms
og-ap-01#

At this stage the assumption would be that we have a number of started go routines that are in blocking state. However, if they remain in that blocking state I'd also assume that they would come to life again once there's more data to be read from the buffer. Once the buffer is read by some function that data isn't available any more so I think it's reasonable that multiple go routines trying to read from the same buffer would get in each others way, this doesn't seem to be happening.

I modified the anonymous go routine in the PR and changed it so that it only read one byte at a time and then slept for a while, i.e.

buffer := make([]byte, 5)
time.Sleep(1 * time.Millisecond)

At this stage when running additional commands I'd expect the go routines to get in each others way as they'd read a byte each causing the output to be garbled. This doesn't happen though. The output from the main program still looks good after sending that ping command followed by a few others.

unable to run command: time out waiting to find prompt

unable to run command: time out waiting to find prompt

unable to run command: time out waiting to find prompt

unable to run command: time out waiting to find prompt

unable to run command: time out waiting to find prompt

unable to run command: time out waiting to find prompt

unable to run command: time out waiting to find prompt

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 8.8.8.8, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/1/4 ms

# show arp
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  172.29.50.1             0   18e8.2942.2aa7  ARPA   BVI1
Internet  172.29.50.4             -   f4cf.e2d8.3a14  ARPA   BVI1

# show inventory
NAME: "AP2700", DESCR: "Cisco Aironet 2700 Series (IEEE 802.11n) Access Point"
PID: AIR-AP2702I-UXK9  , VID: V01, SN: FCW1907NEXN



# show arp
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  172.29.50.1             0   18e8.2942.2aa7  ARPA   BVI1
Internet  172.29.50.4             -   f4cf.e2d8.3a14  ARPA   BVI1

# show inventory
NAME: "AP2700", DESCR: "Cisco Aironet 2700 Series (IEEE 802.11n) Access Point"
PID: AIR-AP2702I-UXK9  , VID: V01, SN: FCW1907NEXN



# show arp
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  172.29.50.1             0   18e8.2942.2aa7  ARPA   BVI1
Internet  172.29.50.4             -   f4cf.e2d8.3a14  ARPA   BVI1

I couldn't say if that this means that the go routines are actually ended or if they get stuck in some other zombie state.

Still I'm inclined to go with this solution regardless. I'd originally incorrectly assumed that the replace statement in this projects go.mod file would get used automatically by others using this project. Since that's not the case it would be weird to use the package, coupled with the fact that I also don't like using a modified fork of the crypto/ssh package even if it's just to "fix" one line in the code. It would be nice if it was possible to just have a setting in crypto/ssh to change this behaviour, perhaps it will be an option in the future.

Under normal circumstances the Read call should return the expected prompt, or fail because the connection has ended. With this in mind even if there was a leak here I don't think it's something that would occur often for normal usage. The above example with ping could also have been avoided if the program used RunUntil instead of Run when we knew that the device prompt wouldn't have been found.

What are your thoughts?

@karimra
Copy link
Contributor

karimra commented Apr 16, 2021

The proper way to know what happening is to use a pprof server to see what is happening internally.

The not so proper way is to abuse prometheus, by instrumenting your program:

add this goroutine to your main:

	go func() {
		http.Handle("/metrics", promhttp.Handler())
		http.ListenAndServe(":8888", nil)
	}()

run your example and continuously get the number of running goroutines with: curl localhost:8888/metrics | grep ^go_goroutines.
The number of goroutines is a gauge, so it should decrease at some point.

@hellt
Copy link
Contributor Author

hellt commented Apr 16, 2021

@ogenstad I want to create another branch with @karimra proposal for comparison
Maybe what you can do also is to use runtime.NumGoroutine() func that will return the number of goroutines used. If we call this before readUntilPrompt and after, we will see if number is not matching and goroutines stale

@ogenstad
Copy link
Member

Thanks for the tips!

Here's a snippet that connects to Cisco's sandbox machines, as they are open to the internet with full access they aren't always accessible.

package main

import (
	"context"
	"fmt"
	"log"
	"time"
	"runtime"

	"github.com/networklore/netrasp/pkg/netrasp"
)

func run(d netrasp.Platform, command string, comment string) {
	fmt.Println("################################################")
	fmt.Println("Sending command:", command)
	if comment != "" {
		fmt.Println("Comment: ", comment)
	}
	fmt.Println("Number of go routines", runtime.NumGoroutine())
	ctx, cancelRun := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancelRun()

	fmt.Println("Output:")
	output, err := d.Run(ctx, command)
	if err != nil {
		fmt.Printf("unable to run command: %v\n", err)
	}
	fmt.Println(output)
	fmt.Println("Number of go routines", runtime.NumGoroutine())
}

func main() {
	device, err := netrasp.New("ios-xe-mgmt.cisco.com",
		netrasp.WithUsernamePassword("developer", "C1sco12345"),
		netrasp.WithDriver("ios"),
		netrasp.WithInsecureIgnoreHostKey(),
		netrasp.WithSSHPort(8181),
	)
	if err != nil {
		log.Fatalf("unable to create client: %v", err)
	}
	fmt.Println("Number of go routines", runtime.NumGoroutine())

	ctx, cancelOpen := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancelOpen()
	err = device.Dial(ctx)
	if err != nil {
		fmt.Printf("unable to connect: %v\n", err)

		return
	}
	fmt.Println("################################################")
	fmt.Println("### Connected")
	run(device, "show arp", "hello")
	run(device, "ping", "Starting ping for the first blocking command")
	run(device, "", "First enter")
	run(device, "8.8.8.8", "Ping target")
	run(device, "", "Next")
	run(device, "", "Next")
	run(device, "", "Next")
	run(device, "", "Next")
	run(device, "", "Final command that will execute ping then return to the prompt")

	run(device, "show arp","")
	run(device, "show inventory","")
	run(device, "show arp","")
	run(device, "show inventory","")

	fmt.Println("################################################")
	fmt.Println("### Closing connection")
	_ = device.Close(context.Background())
	fmt.Println("Number of go routines", runtime.NumGoroutine())
}

The output when running the program:

Number of go routines 1
################################################
### Connected
################################################
Sending command: show arp
Comment:  hello
Number of go routines 9
Output:
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.10.20.48             -   0050.56bb.e14e  ARPA   GigabitEthernet1
Internet  10.10.20.253            1   0896.ad9e.444c  ARPA   GigabitEthernet1
Internet  10.10.20.254           22   0008.e3ff.fd90  ARPA   GigabitEthernet1

Number of go routines 9
################################################
Sending command: ping
Comment:  Starting ping for the first blocking command
Number of go routines 9
Output:
unable to run command: time out waiting to find prompt

Number of go routines 10
################################################
Sending command:
Comment:  First enter
Number of go routines 10
Output:
unable to run command: time out waiting to find prompt

Number of go routines 10
################################################
Sending command: 8.8.8.8
Comment:  Ping target
Number of go routines 10
Output:
unable to run command: time out waiting to find prompt

Number of go routines 10
################################################
Sending command:
Comment:  Next
Number of go routines 10
Output:
unable to run command: time out waiting to find prompt

Number of go routines 10
################################################
Sending command:
Comment:  Next
Number of go routines 10
Output:
unable to run command: time out waiting to find prompt

Number of go routines 10
################################################
Sending command:
Comment:  Next
Number of go routines 10
Output:
unable to run command: time out waiting to find prompt

Number of go routines 10
################################################
Sending command:
Comment:  Next
Number of go routines 10
Output:
unable to run command: time out waiting to find prompt

Number of go routines 10
################################################
Sending command:
Comment:  Final command that will execute ping then return to the prompt
Number of go routines 10
Output:
Sending 5, 100-byte ICMP Echos to 8.8.8.8, timeout is 2 seconds:
!!.!!
Success rate is 80 percent (4/5), round-trip min/avg/max = 26/26/27 ms

Number of go routines 9
################################################
Sending command: show arp
Number of go routines 9
Output:
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.10.20.48             -   0050.56bb.e14e  ARPA   GigabitEthernet1
Internet  10.10.20.253            1   0896.ad9e.444c  ARPA   GigabitEthernet1
Internet  10.10.20.254           23   0008.e3ff.fd90  ARPA   GigabitEthernet1

Number of go routines 9
################################################
Sending command: show inventory
Number of go routines 9
Output:
NAME: "Chassis", DESCR: "Cisco CSR1000V Chassis"
PID: CSR1000V          , VID: V00  , SN: 9YM0YVI0F9D

NAME: "module R0", DESCR: "Cisco CSR1000V Route Processor"
PID: CSR1000V          , VID: V00  , SN: JAB1303001C

NAME: "module F0", DESCR: "Cisco CSR1000V Embedded Services Processor"
PID: CSR1000V          , VID:      , SN:



Number of go routines 9
################################################
Sending command: show arp
Number of go routines 9
Output:
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.10.20.48             -   0050.56bb.e14e  ARPA   GigabitEthernet1
Internet  10.10.20.253            1   0896.ad9e.444c  ARPA   GigabitEthernet1
Internet  10.10.20.254           23   0008.e3ff.fd90  ARPA   GigabitEthernet1

Number of go routines 9
################################################
Sending command: show inventory
Number of go routines 9
Output:
NAME: "Chassis", DESCR: "Cisco CSR1000V Chassis"
PID: CSR1000V          , VID: V00  , SN: 9YM0YVI0F9D

NAME: "module R0", DESCR: "Cisco CSR1000V Route Processor"
PID: CSR1000V          , VID: V00  , SN: JAB1303001C

NAME: "module F0", DESCR: "Cisco CSR1000V Embedded Services Processor"
PID: CSR1000V          , VID:      , SN:



Number of go routines 9
################################################
### Closing connection
Number of go routines 9

When the Read call gets blocked after the first ping command we can see that the number of go routines go from 9 to 10, but it doesn't climb any more. And once the issue is resolved it goes back to 9. So worst case scenario it might be that there's one leaked go routine per device for some time. Would you agree?

@hellt
Copy link
Contributor Author

hellt commented Apr 16, 2021

@ogenstad yes, that seems correct as when I run this branch against a non-blocking reads, the num of goroutines stays at 9 consistently.

I am curious if you already run the same with #12 to see if it doesn't have a leaked goroutine

@ogenstad
Copy link
Member

Thanks! I checked with #12 and the number of go routines look exactly the same, so there's no difference. Perhaps the code in this PR is a bit easier to read though.

@hellt
Copy link
Contributor Author

hellt commented Apr 16, 2021

I have tested #12 against CSR1000V and indeed the blocking goroutine is there, since I suppose here it blocks the same way

@karimra
Copy link
Contributor

karimra commented Apr 16, 2021

For the Read in #12 not to block, u have to give a context.WithTimeout, with a shorter duration than the main context

@hellt
Copy link
Contributor Author

hellt commented Apr 16, 2021

similarly to what I did here

#12 (comment)

I added debug print statements to see what is happening. And this approach spawns less goroutines, with the same pattern of a single read going to block:

// readUntilPrompt reads until the specified prompt is found and returns the read data.
func readUntilPrompt(ctx context.Context, r io.Reader, prompt *regexp.Regexp) (string, error) {
	var output string
	bufCh := make(chan string)
	errCh := make(chan error)

	go func() {
		id := rand.Intn(300)
		dead, _ := ctx.Deadline()
		fmt.Printf("goroutine %v started, context expires in %v\n", id, time.Until(dead))
		i := 0
		for {
			i += 1
			select {
			case <-ctx.Done():
				fmt.Printf("goroutine %v ended on ctx timeout\n", id)
				return
			default:
				buffer := make([]byte, 10000)
				fmt.Printf("read %v is called %d time\n", id, i)
				bytes, err := r.Read(buffer)
				if err != nil {
					errCh <- fmt.Errorf("error reading output from device %w: %v", errRead, err)
				}
				fmt.Printf("read %v returned result\n", id)
				latestOutput := string(buffer[:bytes])

				output += latestOutput

				workingOutput := output
				workingOutput = strings.ReplaceAll(workingOutput, "\r\n", "\n")
				workingOutput = strings.ReplaceAll(workingOutput, "\r", "\n")
				lines := strings.Split(workingOutput, "\n")
				matches := prompt.FindStringSubmatch(lines[len(lines)-1])
				if len(matches) != 0 {
					bufCh <- output
					fmt.Printf("goroutine %v ended on matched prompt\n", id)
					return
				}
			}
		}
	}()
❯ go run main.go
Number of go routines 1
goroutine 281 started, context expires in 4.909480251s
read 281 is called 1 time
read 281 returned result
goroutine 281 ended on matched prompt
<SNIP>


################################################
### Connected
################################################
Sending command: show arp
Comment:  hello
Number of go routines 9
Output:
goroutine 47 started, context expires in 2.999942166s
read 47 is called 1 time
read 47 returned result
read 47 is called 2 time
read 47 returned result
read 47 is called 3 time
read 47 returned result
read 47 is called 4 time
read 47 returned result
read 47 is called 5 time
read 47 returned result
read 47 is called 6 time
read 47 returned result
read 47 is called 7 time
read 47 returned result
read 47 is called 8 time
read 47 returned result
read 47 is called 9 time
read 47 returned result
read 47 is called 10 time
read 47 returned result
goroutine 47 ended on matched prompt
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  10.0.0.2              221   5255.0a00.0002  ARPA   GigabitEthernet1
Internet  10.0.0.15               -   5254.0023.4700  ARPA   GigabitEthernet1




Number of go routines 9
################################################
Sending command: ping
Comment:  Starting ping for the first blocking command
Number of go routines 9
Output:
goroutine 59 started, context expires in 2.999954932s
read 59 is called 1 time
read 59 returned result
read 59 is called 2 time
read 59 returned result
read 59 is called 3 time
read 59 returned result
read 59 is called 4 time
read 59 returned result
read 59 is called 5 time
read 59 returned result
read 59 is called 6 time
read 59 returned result
read 59 is called 7 time
unable to run command: context timeout

Number of go routines 10
################################################
Sending command: 
Comment:  First enter
Number of go routines 10
Output:
goroutine 181 started, context expires in 2.999831854s
read 181 is called 1 time
read 59 returned result
goroutine 59 ended on ctx timeout
unable to run command: context timeout

Number of go routines 10
################################################
### Closing connection
Number of go routines 10

@hellt
Copy link
Contributor Author

hellt commented Apr 17, 2021

superseded by #12

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants