This repository has been archived by the owner on Sep 19, 2024. It is now read-only.

[RTC-230] Net condition simulation POC #280

Merged
sgfn merged 9 commits into master from sgfn/RTC-230-simulate-net-condition on Jun 29, 2023

Conversation

sgfn
Member

@sgfn sgfn commented Jun 14, 2023

POC: right now, the test doesn't check anything.

Current issues:

  • The container images are massive -- server: 500MB, browser: 2GB
  • Several hacks here and there -- lots of room for improvement
  • Needs a test scenario, perhaps a more sophisticated and realistic one than simply "50% chance to drop each packet" (other options: here)
  • Needs to be integrated with CircleCI

Resources for future reference:

@sgfn sgfn requested review from Rados13 and LVala June 14, 2023 13:24
@sgfn sgfn requested a review from mickel8 as a code owner June 14, 2023 13:24
@codecov

codecov bot commented Jun 14, 2023

Codecov Report

Merging #280 (20d777f) into master (c21eeb9) will decrease coverage by 0.29%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #280      +/-   ##
==========================================
- Coverage   63.08%   62.79%   -0.29%     
==========================================
  Files          44       44              
  Lines        2129     2129              
==========================================
- Hits         1343     1337       -6     
- Misses        786      792       +6     

see 3 files with indirect coverage changes



x-browser-template: &browser-template
  image: test_videoroom_browser_image
  environment:
    ERL_COOKIE: "panuozzo-pollo-e-pancetta"
Contributor

😄

integration_test/test_browser/docker-entrypoint.sh (outdated; resolved)
integration_test/docker-compose.yml (outdated; resolved)
@@ -0,0 +1,49 @@
#!/bin/bash
Contributor

What about managing the packet loss, starting the Docker containers etc. from Elixir? I don't like the idea of creating a file to trigger packet loss.

Member Author

You mean create another TestController Elixir app instead of the shell script? I guess that could be done, but I don't think it'd solve the issue you're referring to.

Some means of synchronisation (i.e. the creation of a file) is necessary because we don't know how long the browser action :get_stats takes to complete. Before applying packet loss, there are supposed to be 15 such actions at 1-second intervals -- however, they usually take around 20-30 seconds to complete. Therefore, the test controller can't know in advance when it's supposed to apply the network effect, as it MUST be done after the first batch of stat collection and before the second one.

There are limited ways to accomplish this -- having one of the containers create a file at the right moment seemed like the easiest one. I can change the controller code to sync in another way (e.g. keep polling the containers for certain logs and enable the loss whenever they appear; or possibly another solution) if you believe that to be a better option...
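For reference, a minimal Elixir sketch of the file-based sync described above (the path and module name are made up for illustration, not taken from this PR): the app in the container touches a marker file after the first batch of stats, and the controller waits for it before applying the loss.

defmodule SyncMarker do
  @marker "/shared/apply_packet_loss"

  # called by the test app once the first batch of :get_stats actions is done
  def signal_ready, do: File.touch!(@marker)

  # controller-side wait (if it were also Elixir): poll once per second
  # until the marker file shows up
  def await_ready do
    unless File.exists?(@marker) do
      Process.sleep(1_000)
      await_ready()
    end
  end
end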

Member

@LVala LVala Jun 20, 2023

I guess that applying the packet loss is not possible from inside of the container? If so, the thing that comes to my mind is to create another Elixir node, something like a test runner, outside of the containers, but connected to the nodes in the containers via distributed Erlang (like the containers are between themselves). That node would be responsible for running the test and applying the packet loss when notified via a simple message from the server node/container. I'm not sure if that's doable with the docker-compose network etc., and it might be overkill, just something to consider.
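A rough sketch of that idea, purely for illustration (the node names, registered name and message shape are assumptions; only the cookie comes from the compose file above):

defmodule TestRunner do
  def run do
    # assumes this node was started with a name, so distribution is up
    Node.set_cookie(:"panuozzo-pollo-e-pancetta")
    true = Node.connect(:"server@server")
    Process.register(self(), :test_runner)

    # block until the server node tells us to apply the loss
    receive do
      {:apply_packet_loss, duration_ms} ->
        # here the runner would shell out to pumba netem or similar
        IO.puts("applying packet loss for #{duration_ms} ms")
    after
      60_000 -> raise "server never requested packet loss"
    end
  end
end

# on the server node, once the first batch of stats is collected:
#   send({:test_runner, :"runner@host"}, {:apply_packet_loss, 15_000})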

Member Author

Well, technically, we could run pumba netem in one of the containers (the one to apply loss on, or maybe even the server), but that would require us to expose the Docker daemon socket to the container itself, which to me seems like an anti-pattern at best, and a major security risk at worst...

I've thought about the approach with nodes; perhaps it really is the best of the bunch. There'd need to be some fiddling around with exposing epmd ports and such, but I've no doubt it's doable. I'll try to implement that.

Contributor

First of all, we can move this to another PR.

I think I would create a central point responsible for managing the whole test. This point would also be responsible for asking the browsers for statistics 🤔

We can discuss this in more detail before implementing.
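To make the discussion concrete, a very rough sketch of what such a central point might look like (all names, intervals and message shapes are assumptions, not from this PR):

defmodule TestOrchestrator do
  use GenServer

  def start_link(browser_nodes),
    do: GenServer.start_link(__MODULE__, browser_nodes, name: __MODULE__)

  def init(browser_nodes) do
    send(self(), :collect_stats)
    {:ok, %{browsers: browser_nodes, batch: 0}}
  end

  def handle_info(:collect_stats, state) do
    # ask every browser node for its current stats (call shape is assumed)
    Enum.each(state.browsers, &GenServer.call({:stats_server, &1}, :get_stats))

    # after the first 15 collections, it's time to apply the packet loss
    if state.batch == 14, do: send(self(), :apply_packet_loss)

    Process.send_after(self(), :collect_stats, 1_000)
    {:noreply, %{state | batch: state.batch + 1}}
  end

  def handle_info(:apply_packet_loss, state) do
    # placeholder: in practice this would notify whatever runs pumba/netem
    IO.puts("applying packet loss now")
    {:noreply, state}
  end
end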

Member Author

Fine by me, we can tackle this problem later on.

Member

What comes to my mind is to create a release with mix release and use a multi-stage Docker image like we do e.g. here. I'm not sure how it would play with Playwright and the browsers (or if it's possible at all), but it might make the image a little bit smaller, and you could use the new jammy membrane_docker in the build stage and not bother with installing some of the dependencies yourself. Similar story with the server Dockerfile.
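For context, the mix release half of that suggestion could look roughly like this in the test app's mix.exs (a hypothetical fragment, not the file from this PR); the Dockerfile's build stage would run mix release and a slim runtime stage would copy only the release out.

defmodule TestVideoroom.MixProject do
  use Mix.Project

  def project do
    [
      app: :test_videoroom,
      version: "0.1.0",
      elixir: "~> 1.14",
      # a release bundles the app, its deps and ERTS into one directory,
      # which is all the runtime image needs to contain
      releases: [
        test_videoroom: [
          include_executables_for: [:unix],
          strip_beams: true
        ]
      ]
    ]
  end
end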

Member Author

Yeah, I've tried doing that... obviously didn't succeed, seeing as I left the 'bloated' image Dockerfiles. Unfortunately, when installing Playwright, there are a lot of moving parts and a lot of places where things can break.

For now, I think I'd rather leave them as-is, and the container images will have to be made slimmer sometime in the future.


integration_test/test_videoroom/mix.exs (resolved)
@@ -0,0 +1,21 @@
# TestBrowser

**TODO: Add description**
Contributor

Probably something should be added here 😅

@sgfn sgfn force-pushed the sgfn/RTC-230-simulate-net-condition branch from 41f9c8a to 5157189 on June 26, 2023 13:35
@sgfn sgfn requested review from mickel8, Rados13 and LVala June 26, 2023 13:36
Contributor

@Rados13 Rados13 left a comment

I have a question: when I run the script locally, it finishes with logs like this:

FATA[0060] error running netem loss command: error running chaos command: failed to add packet loss for one or more containers: failed to stop netem container: failed to stop netem: failed to create exec configuration to check if command exists: Error response from daemon: Container f4f93f5eecf6bff6de163c295153f3390cdef1142ac7f38a593e89162c13567f is not running 
Network condition simulation over. Waiting for the docker-compose job to complete...

Doesn't that mean we stop the containers too early?
Besides that, the PR looks good.

@sgfn
Member Author

sgfn commented Jun 27, 2023

@Rados13 Well, depends on how you look at it 😄... It's unfortunate, but we'd need another synchronisation point to know when to stop/kill the netem command. This is what the script comment was referring to:

# The netem command will return an error when a container is stopped before the packet loss duration
# is up. This means we either need to kill it (and know when to do that), or ignore the error

It's possible that this issue could be resolved by running the netem command as a job (along with other modifications), but I fear that it would make the script even more convoluted and confusing.
Alternatively, we could simply 2>/dev/null the command, but that could obscure other issues, should they occur.

I sincerely hope a better solution can be found once we rewrite the test controller in Elixir...
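For what it's worth, once the controller is Elixir, the wrapper could tolerate that expected failure explicitly. A hypothetical sketch (the pumba flags are assumptions based on the 50%-loss scenario above, so double-check them against the pumba docs):

defmodule NetemRunner do
  # applies packet loss to the given container via pumba and treats the
  # "container stopped before the duration was up" failure as a normal end
  def apply_loss(container, duration \\ "60s", percent \\ 50) do
    args = ["netem", "--duration", duration, "loss", "--percent", to_string(percent), container]

    case System.cmd("pumba", args, stderr_to_stdout: true) do
      {_output, 0} -> :ok
      {output, _nonzero} -> {:container_stopped, output}
    end
  end
end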

Comment on lines 19 to 22
spawn_link(fn ->
Process.sleep(@max_test_duration)
raise("Test duration exceeded!")
end)
Contributor

I would use Process.send_after
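For comparison, a minimal sketch of the Process.send_after variant (assuming the owning process is a GenServer that can handle the message; the timeout value is a placeholder):

defmodule TimeoutSketch do
  use GenServer

  @max_test_duration :timer.minutes(5)

  def init(state) do
    # schedule a message instead of spawning a linked sleeper process;
    # the returned ref could be cancelled with Process.cancel_timer/1
    # if the test finishes in time
    Process.send_after(self(), :test_timeout, @max_test_duration)
    {:ok, state}
  end

  def handle_info(:test_timeout, _state) do
    raise "Test duration exceeded!"
  end
end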

@sgfn sgfn merged commit b15437f into master Jun 29, 2023
@sgfn sgfn deleted the sgfn/RTC-230-simulate-net-condition branch June 29, 2023 13:27