ContainerClient + akka http alternative to HttpUtils #3812

tysonnorris · 2018-06-26T23:50:23Z

Description

HttpUtils (the http client for invoker -> action container) uses the org.apache.http client, which is synchronous and performs poorly for concurrent requests. I ran into problems using it with concurrent activation support. Instead of trying to force that client to work, this PR works towards replacing it (or re-replacing it) with an akka http based client.

This PR provides:

add a ContainerClient trait to define the http client interface
refactoring of HttpUtils to implement this trait (HttpUtils remains default for now)
add a PoolingContainerClient to provide an akka http based impl (based on PoolingRestClient)

Initial tests using this client to support apache/openwhisk-runtime-nodejs#41 (adding concurrency support in the nodejs container) were good: tests requiring coordinated completion of 128 concurrent requests succeeded, while the same tests with the existing HttpUtils/org.apache.http client failed with as few as 3 concurrent requests.

More tests will be done, but wanted to get early feedback on this in general (wip labelled).

codecov-io · 2018-06-29T00:29:25Z

Codecov Report

Merging #3812 into master will decrease coverage by 4.62%.
The diff coverage is 54.11%.

@@            Coverage Diff             @@
##           master    #3812      +/-   ##
==========================================
- Coverage   75.69%   71.07%   -4.63%     
==========================================
  Files         145      146       +1     
  Lines        6930     6983      +53     
  Branches      423      431       +8     
==========================================
- Hits         5246     4963     -283     
- Misses       1684     2020     +336

Impacted Files	Coverage Δ
...containerpool/kubernetes/KubernetesContainer.scala	`91.66% <ø> (ø)`	⬆️
...la/whisk/core/containerpool/ContainerFactory.scala	`100% <ø> (ø)`	⬆️
...la/src/main/scala/whisk/core/mesos/MesosTask.scala	`86.11% <ø> (ø)`	⬆️
...sk/core/containerpool/docker/DockerContainer.scala	`75.94% <0%> (-1.98%)`	⬇️
.../src/main/scala/whisk/http/PoolingRestClient.scala	`90% <100%> (+0.71%)`	⬆️
...n/scala/whisk/core/database/CouchDbRestStore.scala	`73.23% <100%> (ø)`	⬆️
...on/scala/src/main/scala/whisk/common/Logging.scala	`86.95% <100%> (+0.28%)`	⬆️
...ain/scala/whisk/core/containerpool/Container.scala	`74.24% <20%> (-1.95%)`	⬇️
.../containerpool/ApacheBlockingContainerClient.scala	`70.9% <30%> (ø)`
...whisk/core/containerpool/AkkaContainerClient.scala	`72.72% <72.72%> (ø)`
... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

markusthoemmes

Great we're getting this back. It's a delicate change though, so let's be extra careful on reviews.

markusthoemmes · 2018-06-27T06:05:02Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+import whisk.http.PoolingRestClient
+
+trait ContainerClient {
+  def close(): Unit


Could we implement java.lang.AutoCloseable instead? Gives you the niceness of integrating into the try-with-resources world (although that's not needed here).

Should we remove the close method from the trait then and instead

trait ContainerClient extends AutoCloseable { ...

?
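
A minimal sketch of what extending AutoCloseable would buy, using only the standard library. `FakeClient`, `post` and `withResource` are hypothetical names for illustration and are not taken from the PR; the helper uses try/finally so that it does not depend on any particular Scala version.

```scala
object AutoCloseableSketch {
  // Hypothetical trait mirroring the suggestion: close() is inherited from AutoCloseable.
  trait ContainerClient extends AutoCloseable {
    def post(endpoint: String, body: String): String
  }

  // Hypothetical stand-in implementation, used only for illustration.
  class FakeClient extends ContainerClient {
    @volatile var closed = false
    def post(endpoint: String, body: String): String = s"$endpoint <- $body"
    override def close(): Unit = { closed = true }
  }

  // Loan helper: works for any AutoCloseable and closes it even if `use` throws.
  def withResource[R <: AutoCloseable, A](resource: R)(use: R => A): A =
    try use(resource)
    finally resource.close()
}
```

With this shape, any caller that owns a client can write `withResource(new FakeClient)(_.post("/run", "{}"))` and never forget the close.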

markusthoemmes · 2018-06-27T06:13:25Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+    // Timeout includes all retries.
+    as.scheduler.scheduleOnce(timeout) {
+      promise.tryFailure(new TimeoutException(s"Request to ${endpoint} could not be completed in time."))
+    }


As scala futures are not abortable, even though you're finishing the Promise here, the underlying HTTP request might still be in flight. Should we instead extend the PoolingRestClient to take timeout values as well? I believe you can configure the underlying connection pool to have connections timeout.

That'd be in line with what we have today wrt. timeout handling.

I've tried adding .completionTimeout() stage at the pool and the queue with no luck (was expecting a failure in that case, but don't get any success or failure...). I expected it to work at the pool. Any tips here?
EDIT: OK, I guess completionTimeout is stream level, and we need timeout per event...

So I tried to do this using https://github.com/paypal/squbs/blob/master/squbs-ext/src/main/scala/org/squbs/streams/Timeout.scala#L268

I'm not wild about:

dragging in squbs artifacts (seems heavy handed for "just" adding a timeout)

the Try[HttpResponse] is wrapped as a Try[Try[HttpResponse]], which seems awkward.

ok idle-timeout is working, removed these changes and the promise.tryFailure

markusthoemmes · 2018-06-27T06:22:43Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+        }
+      }
+
+    tryOnce()


If the above gets implemented (timeout on the connections themselves rather than enforced by the promise), you can drop the Promise here (hard to reason about) and implement the retry like this:

private def retryingRequest(req: Future[HttpRequest], retry: Boolean): Future[HttpResponse] = {
  request(req).recoverWith {
    case _: akka.stream.StreamTcpException if retry =>
      akka.pattern.after(retryInterval, as.scheduler)(retryingRequest(req, retry))
    case t => Future.failed(t)
  }
}

much nicer! Thanks!
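
The recoverWith-based retry can be tried out without akka. The sketch below is a standard-library analogue, not the PR's code: `after` is a hypothetical stand-in for akka.pattern.after built on a ScheduledExecutorService, java.net.ConnectException stands in for StreamTcpException, and a retry counter replaces the Boolean flag (an assumption made so the example terminates on its own).

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}
import scala.concurrent.{ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object RetrySketch {
  // Daemon scheduler thread so the sketch never keeps the JVM alive.
  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "retry-sketch-scheduler")
      t.setDaemon(true)
      t
    }
  })

  // Hypothetical stand-in for akka.pattern.after: start `body` once `delay` has elapsed.
  def after[T](delay: FiniteDuration)(body: => Future[T]): Future[T] = {
    val p = Promise[T]()
    val task = new Runnable {
      def run(): Unit = p.completeWith(try body catch { case NonFatal(e) => Future.failed(e) })
    }
    scheduler.schedule(task, delay.toMillis, TimeUnit.MILLISECONDS)
    p.future
  }

  // Retry only on connection failures. Any other failure propagates unchanged,
  // because recoverWith leaves unmatched exceptions alone.
  def retrying[T](interval: FiniteDuration, retriesLeft: Int)(request: () => Future[T])(
    implicit ec: ExecutionContext): Future[T] =
    request().recoverWith {
      case _: java.net.ConnectException if retriesLeft > 0 =>
        after(interval)(retrying(interval, retriesLeft - 1)(request))
    }
}
```

Because no Promise is shared between attempts, each retry is just another Future in the chain, which is what makes this version easier to reason about.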

markusthoemmes · 2018-07-15T11:23:17Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+    //map the HttpResponse to ContainerResponse
+    val r = promise.future
+      .flatMap({ response =>
+        val contentLength = response.entity.contentLengthOption.getOrElse(0l)


Is this a behavioral change? I think HttpUtils handles an unknown contentLength as NoResponseReceived.

Yes good catch; fixed

markusthoemmes · 2018-07-15T11:24:06Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+        if (contentLength <= maxResponse.toBytes) {
+          Unmarshal(response.entity.withSizeLimit(maxResponse.toBytes)).to[String].map { o =>
+            //handle 204 as NoResponseReceived for parity with HttpUtils client
+            if (response.status == StatusCodes.NoContent) {


Is this needed? HttpUtils doesn't have that extra clause. It does implement however the case of an absent Content-Length as noted above.

This is to satisfy a test case that I transferred from ContainerConnectionTests to PoolingContainerClientTests - see handle empty entity response.

Now this test is arguably wrong, compare to not truncate responses within limit (or one of them, at least), which returns a null and empty string as test responses (with a 200, not 204).

HttpUtils (or org.apache.http) seems to vary from akka http in its handling for this case, where the response.getEntity is null on HttpUtils, but only when there is a 204 (not when there is a null sent as the response entity...)

I agree this is weird, but wanted to keep the tests at parity between HttpUtils and PoolingContainerClient at least for now.

markusthoemmes · 2018-07-15T11:26:51Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+          //ignore the tail (MUST CONSUME ENTIRE ENTITY!)
+          tail.runWith(Sink.ignore)
+          //captured string MAY be larger than the max response, so take only maxResponse bytes to get the exact length
+          Future.successful(truncatedResponse.take(maxResponse.toBytes.toInt).utf8String)


Both ignore cases need to wait on the stream to be consumed, like:

tail.runWith(Sink.ignore).map(_ => truncatedResponse.take(maxResponse.toBytes.toInt).utf8String)

chetanmeh · 2018-07-17T05:15:43Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+      .recover {
+        case t: StreamTcpException => Left(Timeout(t))
+        case t: TimeoutException   => Left(Timeout(t))
+        case t: Throwable          => Left(ConnectionError(t))


Maybe use case NonFatal(t) => Left(ConnectionError(t)) to avoid handling fatal errors

sounds good!
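
A small sketch of the NonFatal suggestion, assuming hypothetical `Timeout` and `ConnectionError` case classes that only mimic the shape of the ones in the diff. NonFatal is the standard extractor from scala.util.control; it declines to match errors such as OutOfMemoryError, so those are not converted into a Left.

```scala
import scala.concurrent.{ExecutionContext, Future, TimeoutException}
import scala.util.control.NonFatal

object RecoverSketch {
  // Hypothetical failure types standing in for the PR's Timeout / ConnectionError.
  sealed trait Failure
  case class Timeout(t: Throwable) extends Failure
  case class ConnectionError(t: Throwable) extends Failure

  // Map a failed Future to a Left, but only for non-fatal throwables.
  def toEither[T](f: Future[T])(implicit ec: ExecutionContext): Future[Either[Failure, T]] =
    f.map(v => Right(v): Either[Failure, T]).recover {
      case t: TimeoutException => Left(Timeout(t))
      case NonFatal(t)         => Left(ConnectionError(t))
    }
}
```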

tysonnorris · 2018-07-17T05:43:20Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+            Future { Left(NoResponseReceived()) }
+      })
+      .recover {
+        case t: StreamTcpException => Left(Timeout(t))


BTW - this is also parity with HttpUtils, but seems wrong. If there are retries on StreamTcpException, and the request is still failing once the timeout period is reached, we should propagate the same exception, e.g. as Left(ConnectionError(t)), right?

AFAIK this is not checked anywhere, so I would prefer to change it

chetanmeh · 2018-07-17T05:19:30Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+          logging.warn(this, s"POST failed with $t - no retry because timeout exceeded.")
+          Future.failed(t)
+        }
+      case t => Future.failed(t)


Not required

chetanmeh · 2018-07-17T05:20:48Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+          logging.warn(this, s"POST failed with $t - no retry because timeout exceeded.")
+          Future.failed(t)
+        }
+      case t => Future.failed(t)


Maybe we could also track retryCount and include it in the logging for both failure cases, and also add success case logging (if retryCount > 0) to get a sense of how many retries are being performed

chetanmeh · 2018-07-17T05:31:11Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+        } else {
+          //ignore the tail (MUST CONSUME ENTIRE ENTITY!)
+          //captured string MAY be larger than the max response, so take only maxResponse bytes to get the exact length
+          tail.runWith(Sink.ignore).map(_ => truncatedResponse.take(maxResponse.toBytes.toInt).utf8String)


Would it be safe to convert a byte stream truncated at an arbitrary boundary to a string?

HttpUtils also used the same approach, so behavior-wise it's compatible

It is as safe as truncation can be - client may get an error, but the ActivationResponse will end up with some info regarding the truncation.
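
To illustrate what "as safe as truncation can be" means at the byte level, here is a standard-library sketch. It uses the JDK String decoder rather than akka's ByteString, so it only shows the general behavior under that assumption: cutting in the middle of a multi-byte UTF-8 character does not throw, and the incomplete character decodes to the replacement character U+FFFD. `truncateUtf8` is a hypothetical helper name.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object TruncationSketch {
  // Cut the UTF-8 encoding of `s` to at most `maxBytes` bytes and decode it again.
  // The JDK decoder substitutes U+FFFD for malformed input instead of throwing.
  def truncateUtf8(s: String, maxBytes: Int): String =
    new String(s.getBytes(UTF_8).take(maxBytes), UTF_8)
}
```

For pure ASCII payloads the cut is always clean; only a multi-byte character straddling the limit is affected.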

chetanmeh · 2018-07-17T05:35:19Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+object PoolingContainerClient {
+
+  /** A helper method to post one single request to a connection. Used for container tests. */
+  def post(host: String, port: Int, endPoint: String, content: JsValue, timeout: Duration = 30.seconds)(


Is this method currently being used? If not then we can probably drop it or move it to some utility in test

No, but I will update it so that it is used in place of HttpUtils.post during tests when pooling-client == true

chetanmeh · 2018-07-17T05:35:59Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+    tid: TransactionId): (Int, Option[JsObject]) = {
+    val connection = new PoolingContainerClient(host, port, 90.seconds, 1.MB, 1)
+    val response = executeRequest(connection, endPoint, content)
+    connection.close()


Should connection be closed after await is done?

These post and concurrentPost functions maintain test compatibility with HttpUtils, they are used in synchronous fashion so should be closed after the await.

Need to move below the await then?
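
A sketch of the ordering being asked for, with a hypothetical `FakeClient` whose in-flight requests fail once it has been closed. The point is only that close() belongs in a finally that runs after Await.result has returned; none of these names come from the PR.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object CloseAfterAwaitSketch {
  // Hypothetical client: a request observes whether close() was already called.
  class FakeClient(implicit ec: ExecutionContext) extends AutoCloseable {
    @volatile private var open = true
    def post(body: String): Future[String] = Future {
      Thread.sleep(50) // simulate an in-flight request
      if (open) s"echo:$body" else throw new IllegalStateException("client closed")
    }
    override def close(): Unit = { open = false }
  }

  // Synchronous helper: block for the response first, then release the client.
  def postOnce(body: String, timeout: FiniteDuration)(implicit ec: ExecutionContext): String = {
    val client = new FakeClient
    try Await.result(client.post(body), timeout)
    finally client.close()
  }
}
```

Calling close() before the await in this sketch would make the request fail with "client closed", which is the hazard the review comment points at.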

chetanmeh · 2018-07-17T05:44:42Z

common/scala/src/main/scala/whisk/core/database/CouchDbRestStore.scala

@@ -513,7 +512,7 @@ class CouchDbRestStore[DocumentAbstraction <: DocumentSerializer](dbProtocol: St
      .getOrElse(Future.successful(true)) // For CouchDB it is expected that the entire document is deleted.

  override def shutdown(): Unit = {
-    Await.ready(client.shutdown(), 1.minute)
+    client.shutdown()


Is this change required? client.shutdown() returns a Future so needs Await

chetanmeh · 2018-07-17T05:55:16Z

common/scala/src/main/scala/whisk/http/PoolingRestClient.scala

+      .withConnectionSettings(if (timeout.isDefined) {
+        ClientConnectionSettings(system.settings.config)
+          .withIdleTimeout(timeout.get)
+      } else { ClientConnectionSettings(system.settings.config) })


Alternative

private val timeoutSettings = {
  val ps = ConnectionPoolSettings(system.settings.config)
  timeout.map(t => ps.withUpdatedConnectionSettings(_.withIdleTimeout(t))).getOrElse(ps)
}

chetanmeh · 2018-07-17T05:57:10Z

common/scala/src/main/scala/whisk/http/PoolingRestClient.scala

  }

-  // Additional queue in case all connections are busy. Should hardly ever be
-  // filled in practice but can be useful, e.g., in tests starting many
-  // asynchronous requests in a very short period of time.


Docs can be retained

chetanmeh · 2018-07-17T06:07:22Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+    }
+  }
+
+  private def truncated(responseBytes: Source[ByteString, _],


Maybe move it to object PoolingContainerClient and then add a test for this logic. Per the current test coverage, some flows are not covered.

There is at least one test already that covers truncation: https://github.com/apache/incubator-openwhisk/pull/3812/files#diff-70e8c471d9056bda26d602a05f7ad091R180

Is codecov.io updated on each build? Can you tell why it wouldn't show in coverage? These tests (PoolingContainerClientTests) use ContainerClient directly, no mocks etc, so I'm not sure why the coverage would not reflect them.

I added tests, I think coverage is better, will look again after next run;
I also removed the case (Nil, tail) => it is not clear when this would ever come into play, or how to test for it working properly.

chetanmeh · 2018-07-17T06:13:02Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+    with ContainerClient
+    with AutoCloseable {
+
+  def close() = shutdown()


PoolingRestClient.shutdown returns a Future while ContainerClient.close is defined to return a Unit. Should we wait for the result completion or change contract for ContainerClient?

@markusthoemmes WDYT? This may affect use of AutoCloseable - for now I will return a Unit

markusthoemmes · 2018-07-18T07:07:58Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+          case _ =>
+            //handle missing Content-Length as NoResponseReceived
+            //also handle 204 as NoResponseReceived, for parity with HttpUtils client
+            Future { Left(NoResponseReceived()) }


Never use Future.apply if you already have the value of the Future at hand. It will schedule the value to the ExecutionContext unnecessarily.

Use Future.successful(Left(NoResponseReceived())) instead (note how it doesn't require an ExecutionContext)
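
The difference shows up directly in the signatures. In this standard-library sketch (the String/Int payload types are placeholders, not the PR's types), the Future.successful variant compiles without any ExecutionContext and is already completed when it is returned, while the Future.apply variant has to submit its body as a task.

```scala
import scala.concurrent.{ExecutionContext, Future}

object FutureSuccessfulSketch {
  // The value is already at hand: no ExecutionContext needed, nothing is scheduled.
  def known(): Future[Either[String, Int]] =
    Future.successful(Left("no response"))

  // Future.apply requires an ExecutionContext because the body is run as a task on it.
  def scheduled()(implicit ec: ExecutionContext): Future[Either[String, Int]] =
    Future(Left("no response"))
}
```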

markusthoemmes · 2018-07-18T07:08:12Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+        }
+      })
+      .recover {
+        case t: StreamTcpException => Left(Timeout(t))


Is a StreamTcpException always a timeout?

No, but this is parity with HttpUtils - when timeout after retries, the Timeout response is used. I agree this is wrong, but wasn't sure how to otherwise make it compatible. If this compatibility is not a problem, I would change it?

So I changed this so that on retry timeout, we don't rethrow a StreamTcpException, but rather a TimeoutException (with the message from StreamTcpException); this way we can project the Timeout consistency, but not imply that a mid-flight StreamTcpException is any indication of a timeout - WDYT?

markusthoemmes · 2018-07-18T07:11:46Z

common/scala/src/main/scala/whisk/core/containerpool/HttpUtils.scala

    val entity = new StringEntity(body.compactPrint, StandardCharsets.UTF_8)
    entity.setContentType("application/json")

    val request = new HttpPost(baseUri.setPath(endpoint).build)
    request.addHeader(HttpHeaders.ACCEPT, "application/json")
    request.setEntity(entity)

-    execute(request, timeout, maxConcurrent, retry)
+    Future { execute(request, timeout, maxConcurrent, retry) }


Should add blocking here as well. This is using sync IO.

Instead of Future.successful?

Yes, in this case you actually want another thread to take over, so you use Future(), but you also include blocking to denote that this is a blocking operation that might hold a thread indefinitely. It gives the ExecutionContext the chance to adapt accordingly (create more threads).
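
A minimal sketch of that combination. `executeSync` is a hypothetical stand-in for the synchronous Apache client call; `blocking` is the standard marker from scala.concurrent, which an ExecutionContext that supports it (such as the global one) can use as a hint to compensate with extra threads.

```scala
import scala.concurrent.{blocking, ExecutionContext, Future}

object BlockingSketch {
  // Hypothetical synchronous call standing in for the blocking client's execute().
  def executeSync(body: String): String = {
    Thread.sleep(20) // simulate synchronous IO
    body.reverse
  }

  // Run on another thread via Future(...), and flag the body as blocking.
  def post(body: String)(implicit ec: ExecutionContext): Future[String] =
    Future {
      blocking {
        executeSync(body)
      }
    }
}
```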

markusthoemmes

A few nits on organization and documentation of the code mostly. Getting there, well done 🎉 . Will do a deeper pass shortly.

markusthoemmes · 2018-07-19T21:38:03Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+
+    //create the request
+    val req = Marshal(body).to[MessageEntity].map { b =>
+      //DO NOT reuse the connection (in case of paused containers)


In all cases actually, not just "in case of paused containers".

markusthoemmes · 2018-07-19T21:38:15Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+ * content type and the accept headers are both 'application/json.
+ * The reason we still use this class for the action container is a mysterious hang
+ * in the Akka http client where a future fails to properly timeout and we have not
+ * determined why that is.


Please update the ScalaDoc.

markusthoemmes · 2018-07-19T21:38:53Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+import whisk.http.PoolingRestClient
+
+trait ContainerClient {
+  def close(): Unit


Should we remove the close method from the trait then and instead

trait ContainerClient extends AutoCloseable { ...

?

markusthoemmes · 2018-07-19T21:40:00Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+        if (r._2 > 0) {
+          logging.info(this, s"completed after ${r._2} retries")
+        }
+        val response = r._1


You can unpack response and retries directly in the flatMap, like:

.flatMap { case (response, retries) =>

To avoid the tuple accessors.
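
For comparison with the `r._1` / `r._2` version, a standard-library sketch of the pattern-matching form. The `(String, Int)` pair and the `describe` helper are placeholders for the PR's `(HttpResponse, retries)` tuple and are assumptions for illustration only.

```scala
import scala.concurrent.{ExecutionContext, Future}

object UnpackSketch {
  // Destructure the tuple in the lambda instead of using ._1 / ._2.
  def describe(f: Future[(String, Int)])(implicit ec: ExecutionContext): Future[String] =
    f.flatMap {
      case (response, retries) =>
        val suffix = if (retries > 0) s" (completed after $retries retries)" else ""
        Future.successful(response + suffix)
    }
}
```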

markusthoemmes · 2018-07-19T21:41:26Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+            }
+          case _ =>
+            //handle missing Content-Length as NoResponseReceived
+            //also handle 204 as NoResponseReceived, for parity with HttpUtils client


I think we still need to drain the entity, like:

response.discardEntityBytes().future.map(_ => Left(NoResponseReceived()))

since this case is also reached by an unknown Content-Length.

markusthoemmes · 2018-07-19T21:43:57Z

common/scala/src/main/scala/whisk/core/containerpool/Container.scala

@@ -166,16 +170,20 @@ trait Container {
    implicit transid: TransactionId): Future[RunResult] = {
    val started = Instant.now()
    val http = httpConnection.getOrElse {
-      val conn = new HttpUtils(s"${addr.host}:${addr.port}", timeout, 1.MB)
+      val conn = if (config.poolingClient) {


poolingClient seems an odd name since HttpUtils does pool as well. Should we rename to newContainerClient? (akka themselves did the same when they implemented the new ConnectionPool)

markusthoemmes · 2018-07-19T21:45:10Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+ * @param queueSize once all connections are used, how big of queue to allow for additional requests
+ * @param retryInterval duration between retries for TCP connection errors
+ */
+protected class PoolingContainerClient(


Should we name this AkkaContainerClient or NewContainerClient for more clarity? Can we also rename HttpUtils to something more meaningful now, like ApacheBlockingContainerClient?

markusthoemmes · 2018-07-19T21:47:12Z

common/scala/src/main/scala/whisk/core/containerpool/ContainerClient.scala

+import whisk.core.entity.size.SizeLong
+import whisk.http.PoolingRestClient
+
+trait ContainerClient {


The trait should be placed in its own file (or both ContainerClients should be placed in this one file, but I'd prefer one file for the trait, one for the akka based and one for the apache based)

markusthoemmes · 2018-07-19T21:48:14Z

common/scala/src/main/scala/whisk/http/PoolingRestClient.scala

+  private val timeoutSettings = {
+    val ps = ConnectionPoolSettings(system.settings.config)
+    timeout.map(t => ps.withUpdatedConnectionSettings(_.withIdleTimeout(t))).getOrElse(ps)
+  }


Thanks @chetanmeh !

markusthoemmes · 2018-07-19T21:48:31Z

core/invoker/src/main/resources/application.conf

@@ -29,6 +29,7 @@ whisk {
  container-pool {
    num-core: 4      # used for computing --cpushares, and max number of containers allowed
    core-share: 2    # used for computing --cpushares, and max number of containers allowed
+    pooling-client:  true # if true, use PoolingContainerClient for HTTP from invoker to action container (otherwise use HttpUtils)


Should we default to false for now? For safety?

tysonnorris · 2018-07-20T03:45:43Z

@chetanmeh this is close - let me know if you think I missed any of your comments? I think I got them all.

RE codecov - still some mysteries in the summary at top of PR conversation, but the report on codecov.io looks right to me. One anomaly in the PR conversation, I'm not sure where this is coming from?

...core/database/cosmosdb/RxObservableImplicits.scala | 0% <0%> (-100%)

markusthoemmes

Last round of comments from my side. I'm okay merging when these changes are made, since it's behind a toggle anyway.

Great job, thank you very much 🎉

markusthoemmes · 2018-07-20T08:38:15Z

common/scala/src/main/scala/whisk/core/containerpool/AkkaContainerClient.scala

+      .flatMap {
+        case (response, retries) => {
+          if (retries > 0) {
+            logging.info(this, s"completed after ${retries} retries")


Does it make sense to write a metric here rather than printing this per request? Or maybe move the logline to debug and write a metric additionally?

markusthoemmes · 2018-07-20T08:39:37Z

common/scala/src/main/scala/whisk/core/containerpool/AkkaContainerClient.scala

+  private def truncated(responseBytes: Source[ByteString, _],
+                        previouslyCaptured: ByteString = ByteString.empty): Future[String] = {
+    responseBytes.prefixAndTail(1).runWith(Sink.head).flatMap {
+      case (Seq(prefix), tail) =>


Does this case match all possible outcomes? Wasn't there a Nil case here before? I guess we won't reach the Nil case in runtime because we check earlier if contentLength < maxBytes. Might still make sense to include it for good measure?

I removed it since I was not able to establish a test that actually exercised that code path; can add it back for defense 👍

markusthoemmes · 2018-07-20T09:00:59Z

core/invoker/src/main/scala/whisk/core/containerpool/docker/DockerContainer.scala

+      val conn = new ApacheBlockingContainerClient(
+        s"${addr.host}:${addr.port}",
+        timeout,
+        ActivationEntityLimit.MAX_ACTIVATION_ENTITY_LIMIT)


Shouldn't this also check the feature toggle and use the correct client as configured?

Also noticed that ActivationEntityLimit.MAX_ACTIVATION_ENTITY_LIMIT is not used in Container.scala... These are both 1mb, but docs for MAX_ACTIVATION_LIMIT says This refers to the invoke-time parameters - but in this case we are limiting the response size (and I don't see any assertion of limit on the request entity size in former HttpUtils?).
WDYT?

markusthoemmes · 2018-07-20T09:04:29Z

tests/src/test/scala/whisk/core/containerpool/test/ContainerPoolTests.scala

@@ -296,7 +301,7 @@ class ContainerPoolTests
    val (containers, factory) = testContainers(2)
    val feed = TestProbe()

-    val pool = system.actorOf(ContainerPool.props(factory, ContainerPoolConfig(2, 2), feed.ref))
+    val pool = system.actorOf(ContainerPool.props(factory, ContainerPoolConfig(2, 2, false), feed.ref))


I think we should externalize building of the ContainerPoolConfig into a method so we don't need to continually adjust values that are not needed for these tests. Check https://github.com/apache/incubator-openwhisk/pull/3767/files#diff-d00e1ef9ea3255332a28c35676361e29 for an impl I did in another PR for exactly the same issue. I think it keeps the testcases clearer and reduces the diff in future PRs.
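
One possible shape for such a helper, as a sketch only. The `ContainerPoolConfig` below is a hypothetical three-field case class modelled on the constructor call in the diff, and the field names and the `poolConfig` helper are assumptions rather than the implementation in the linked PR.

```scala
object PoolConfigSketch {
  // Hypothetical config with the three positional values visible in the diff.
  case class ContainerPoolConfig(numCore: Int, coreShare: Int, akkaClient: Boolean)

  // Tests state only what they care about; the rest falls back to a default,
  // so adding a field later means touching this one method instead of every test.
  def poolConfig(numCore: Int, coreShare: Int, akkaClient: Boolean = false): ContainerPoolConfig =
    ContainerPoolConfig(numCore, coreShare, akkaClient)
}
```

A test would then call `poolConfig(2, 2)` and stay unchanged when another flag is added to the config.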

markusthoemmes · 2018-07-20T09:06:41Z

tests/src/test/scala/whisk/core/containerpool/kubernetes/test/KubernetesClientTests.scala

@@ -188,6 +189,7 @@ object KubernetesClientTests {
  implicit def strToInstant(str: String): Instant =
    strToDate(str).get

+  implicit val as = ActorSystem("kubernetes-client-tests-actor-system")


This actorSystem is leaked I think. Could you make the TestKubernetesClient take the actorSystem as an implicit parameter, so you can use the one imported (and closed) by the tests above?

Or maybe even move the class into the testclass. Makes it even easier?

I will give it a try, but it isn't clear why this test was set up this way? @dgrove-oss @jcrossley3 ?

The TestKubernetesClient is shared with KubernetesContainerTests - so for now, changed to implicit ActorSystem (and left object/classes in current places). Good?

I don't remember any particular reason; I think it may be as simple as that since we hadn't needed to have our hands on an ActorSystem before in the stubbed out test client it wasn't plumbed through.

markusthoemmes · 2018-07-20T09:38:25Z

...src/test/scala/whisk/core/containerpool/docker/test/ApacheBlockingContainerClientTests.scala

+      .asInstanceOf[Timeout]
+      .t
+      .asInstanceOf[RetryableConnectionError]
+      .t shouldBe a[HttpHostConnectException]


This could be rewritten a little less verbose like:

result match {
  case Left(Timeout(RetryableConnectionError(_: HttpHostConnectException))) => // all good
  case _ => fail(s"$result was not a Timeout(RetryableConnectionError(HttpHostConnectException)))")
}

I was not able to reproduce this locally though, the test always failed with a ConnectError in both implementations. Is this intermittent?

nice - much better! (worked for me)

markusthoemmes · 2018-07-20T09:40:30Z

tests/src/test/scala/whisk/core/containerpool/docker/test/AkkaContainerClientTests.scala

+    //seems like this varies, but often is ~64k or ~128k
+    val limit = 300.KB
+    val connection = new AkkaContainerClient(httpHost.getHostName, httpHost.getPort, timeout.millis, limit, 100)
+    Seq(true, false).foreach { code =>


Could you rename code to success or similar? Threw me off quite a bit when reading through (also in other occurrences please).

markusthoemmes · 2018-07-20T09:42:29Z

tests/src/test/scala/whisk/core/containerpool/docker/test/AkkaContainerClientTests.scala

+    val waited = end.toEpochMilli - start.toEpochMilli
+    result should be('left)
+    result.left.get shouldBe a[Timeout]
+    result.left.get.asInstanceOf[Timeout].t shouldBe a[TimeoutException]


Same as below, could be rewritten to:

result match {
  case Left(Timeout(_: TimeoutException)) => // good
  case _ => fail(...)
}

IMO that pronounces the nesting of the exceptions more and is clearer in what you're actually testing. WDYT?
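
A self-contained sketch of the nested-pattern style, assuming hypothetical `Timeout` and `ConnectionError` case classes that merely resemble the ones under test. A single pattern states the whole expected nesting, where the original needed one cast per level.

```scala
import java.util.concurrent.TimeoutException

object NestedMatchSketch {
  // Hypothetical failure types standing in for the PR's response types.
  sealed trait Failure
  case class Timeout(t: Throwable) extends Failure
  case class ConnectionError(t: Throwable) extends Failure

  // True only for a Left wrapping a Timeout wrapping a TimeoutException.
  def isTimeoutOfTimeoutException(result: Either[Failure, String]): Boolean =
    result match {
      case Left(Timeout(_: TimeoutException)) => true
      case _                                  => false
    }
}
```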

markusthoemmes · 2018-07-20T09:44:02Z

tests/src/test/scala/whisk/core/containerpool/docker/test/AkkaContainerClientTests.scala

+    val limit = 300.KB
+    val connection = new AkkaContainerClient(httpHost.getHostName, httpHost.getPort, timeout.millis, limit, 100)
+    Seq(true, false).foreach { code =>
+      Seq("0123456789" * 100000).foreach { r =>


Why even use Seq.foreach if you only pass in one value?

Please add a comment on what that value is supposed to be, i.e.

// Generate a response that's 1MB
val response = "0123456789" * 1024 * 1024

To make the numbers less magic.
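
One way to name the numbers, as a sketch with hypothetical value names. Since the chunk is 10 characters long, the repeat count is derived from the target size instead of being written out by hand; the 1,000,000-byte target matches the `"0123456789" * 100000` in the diff.

```scala
object ResponseSizeSketch {
  // 10 ASCII characters, so 10 bytes when encoded as UTF-8.
  val chunk = "0123456789"
  // Target payload size in bytes (roughly 1 MB, well above the 300 KB limit in the test).
  val targetBytes = 1000 * 1000
  // Repeat the chunk as many times as needed to reach the target size.
  val response: String = chunk * (targetBytes / chunk.length)
}
```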

chetanmeh · 2018-07-20T12:16:19Z

One anomaly in the PR conversation, I'm not sure where this is coming from?

@tysonnorris Yeah, that is a confusing part. It happens because the CosmosDB tests run properly on master builds but do not run for PR builds. Hence codecov shows a drop in coverage for each PR. I do not have a good solution for it. One way may be to skip coverage calculation for such code paths.

tysonnorris · 2018-07-20T20:37:03Z

@chetanmeh @markusthoemmes I think I have addressed all comments

chetanmeh · 2018-07-21T14:52:51Z

common/scala/src/main/scala/whisk/core/database/CouchDbRestStore.scala

@@ -513,7 +514,7 @@ class CouchDbRestStore[DocumentAbstraction <: DocumentSerializer](dbProtocol: St
      .getOrElse(Future.successful(true)) // For CouchDB it is expected that the entire document is deleted.

  override def shutdown(): Unit = {
-    Await.ready(client.shutdown(), 1.minute)
+    Await.result(client.shutdown(), 30.seconds)


Do we need to change this for this PR?

chetanmeh

LGTM. This should now enable a single Invoker to handle lot more concurrent connections to containers!

tysonnorris · 2018-07-21T17:52:05Z

@markusthoemmes I added one more forgotten required change: enable ActionContainer based tests to explicitly use akka vs apache http client. While it is true that this could be accomplished by coercing test's akka config to include akka-client: true, I think that until akka is the default client, tests should not be forced to run with multiple configs, but we still need to be able to force tests to run via a specific client (e.g. concurrency tests will fail on apache client, but not the akka client...).
For example forcing use of akka client will be done in container tests by changing:
withContainer(nodejsContainerImageName, env)(code) to
withContainer(nodejsContainerImageName, env, true)(code)
WDYT?

rabbah · 2018-07-21T18:50:07Z

If the additional parameter doesn’t cause breaking downstream changes 👍 otherwise a separate method that tests can opt into.

tysonnorris · 2018-07-23T15:26:01Z

Yeah the client is changed by setting the additional param in downstream tests, and default value is false (use old client). It does mean that there needs to be 2 separate tests to run tests with both clients.

rabbah · 2018-07-23T15:28:46Z

@tysonnorris the runtime tests now inherit properly from a common parent (the basic action runner tests) and that might provide a way for you to hide testing both old and new clients without requiring changes upstream... i didn't look too closely but mentioning it as it may be relevant.

markusthoemmes · 2018-07-23T15:36:16Z

I'm a bit unsure about a "per-test" flag here. I'd be okay to make the new akka-client the default for all those tests straight away (after a decent amount of local test runs to squash out obvious heisenbugs). No need to gradually enable them one after the other. WDYT?

tysonnorris · 2018-07-23T15:46:12Z

@rabbah we would need to run the same test cases twice for each test, with a different config each run. I'm not sure how to simply enable this structure without rewriting the tests?

Separately, if that is possible, we still need a way to disable some clients from being used in some tests - the reason I arrived at creating the new client is the concurrency tests simply don't work with the old apache client (at least they don't work in travis).

tysonnorris · 2018-07-23T16:15:00Z

@markusthoemmes @rabbah I'm ok to switch all the tests to use akka http (and not apache http) after running them locally to verify, if that works? (we can either run all akka, or a mix of akka and apache, but NOT all apache)

rabbah · 2018-07-23T16:21:32Z

i'd say run them all with the new client -- if it can't handle the unit tests, we have a problem ;)

markusthoemmes · 2018-07-26T21:22:13Z

PG2 3412 🔵

tysonnorris added invoker wip labels Jun 26, 2018

tysonnorris force-pushed the invoker-http-client branch from 29960b2 to 52fd033 Compare July 11, 2018 22:55

tysonnorris requested a review from markusthoemmes July 14, 2018 15:49

markusthoemmes requested changes Jul 15, 2018

tysonnorris force-pushed the invoker-http-client branch from af68ed1 to 5011dea Compare July 17, 2018 05:04

chetanmeh reviewed Jul 17, 2018

tysonnorris commented Jul 17, 2018

chetanmeh requested changes Jul 17, 2018

markusthoemmes reviewed Jul 18, 2018

markusthoemmes reviewed Jul 19, 2018

markusthoemmes requested changes Jul 20, 2018

tysonnorris removed the wip label Jul 20, 2018

chetanmeh reviewed Jul 21, 2018

chetanmeh approved these changes Jul 21, 2018

markusthoemmes self-assigned this Jul 23, 2018

tysonnorris added 25 commits July 26, 2018 12:07

ContainerClient + akka http alternative to HttpUtils

e16c3bb

HttpUtils is default client for now

c7a3813

fixing tests

b9642d9

added Connection: close header

92c526d

updates to PoolingRestClient; added ContainerClientTests

9d192f9

updates to PoolingRestClient for parity with HttpUtils response handling (truncation etc)

ebda873

updates to PoolingRestClient for parity with HttpUtils response handling (truncation etc)

ea5aa79

enable pooling client (akka http) by default

fe73553

use squbs Timeout flow

3de9811

revert use of squbs; use ClientConnectionSettings.idleTimeout

0601fba

add test for retries on HttpHostConnectException

d7af161

cleanup unused import

18387b1

cleanup

1c11f93

review feedback and cleanup

e6c840b

review feedback

8b44627

include retry count in logging

c94a929

review comments

52d0356

review comments

24bf541

added test for truncating large responses

f3f62d3

code review feedback; renaming HttpUtils -> ApacheBlockingContainerClient

15ec749

code review feedback

0e47ed2

code review feedback - don't create a separate test ActorSystem

3166332

allow ActionContainer tests to run explicitly with akka client vs apache client

71d847c

use akka client for all ActionContainer tests

059b4a7

cleanup; properly pass timeouts to test functions

cb3a7da

tysonnorris force-pushed the invoker-http-client branch from 1df04f3 to cb3a7da Compare July 26, 2018 19:07

markusthoemmes merged commit 15bb04a into apache:master Jul 26, 2018

ddragosd mentioned this pull request Jan 18, 2019

recreate http client on resume() #4185

Merged
