This repository has been archived by the owner on Sep 19, 2024. It is now read-only.

[RTC-230] Net condition simulation POC #280

Merged
sgfn merged 9 commits into master from sgfn/RTC-230-simulate-net-condition on Jun 29, 2023

Conversation

sgfn
Member

@sgfn sgfn commented Jun 14, 2023

POC: right now, the test doesn't check anything.

Current issues:

  • The container images are massive -- server: 500MB, browser: 2GB
  • Several hacks here and there -- lots of room for improvement
  • Needs a test scenario, perhaps a more sophisticated and realistic one than simply "50% chance to drop each packet" (other options: here)
  • Needs to be integrated with CircleCI

Resources for future reference:

@sgfn sgfn requested review from Rados13 and LVala June 14, 2023 13:24
@sgfn sgfn requested a review from mickel8 as a code owner June 14, 2023 13:24
@codecov

codecov bot commented Jun 14, 2023

Codecov Report

Merging #280 (20d777f) into master (c21eeb9) will decrease coverage by 0.29%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master     #280      +/-   ##
==========================================
- Coverage   63.08%   62.79%   -0.29%     
==========================================
  Files          44       44              
  Lines        2129     2129              
==========================================
- Hits         1343     1337       -6     
- Misses        786      792       +6     

see 3 files with indirect coverage changes



x-browser-template: &browser-template
  image: test_videoroom_browser_image
  environment:
    ERL_COOKIE: "panuozzo-pollo-e-pancetta"
Contributor

😄

integration_test/test_browser/docker-entrypoint.sh (outdated; resolved)
integration_test/docker-compose.yml (outdated; resolved)
@@ -0,0 +1,49 @@
#!/bin/bash
Contributor

What about managing the packet loss, starting the Docker containers etc. from Elixir? I don't like the idea of creating a file to trigger packet loss.

Member Author

You mean create another TestController Elixir app instead of the shell script? I guess that could be done, but I don't think it'd solve the issue you're referring to.

Some means of synchronisation (i.e. the creation of a file) is necessary because we don't know how long the browser action :get_stats takes to complete. Before applying packet loss, there are supposed to be 15 such actions at 1-second intervals -- however, they usually take around 20-30 seconds to complete. Therefore, the test controller can't know in advance when it's supposed to apply the network effect, as it MUST be done after the first batch of stat collection and before the second one.

There are limited ways to accomplish this -- having one of the containers create a file at the right moment seemed like the easiest one. I can change the controller code to sync in another way (e.g. keep polling the containers for certain logs and enable the loss whenever they appear; or possibly another solution) if you believe that to be a better option...
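For reference, a minimal Elixir sketch of the file-based sync described above (the path and module name are made up for illustration, not taken from this PR): the app in the container touches a marker file after the first batch of stats, and the controller waits for it before applying the loss.

defmodule SyncMarker do
  @marker "/shared/apply_packet_loss"

  # called by the test app once the first batch of :get_stats actions is done
  def signal_ready, do: File.touch!(@marker)

  # controller-side wait (if it were also Elixir): poll once per second
  # until the marker file shows up
  def await_ready do
    unless File.exists?(@marker) do
      Process.sleep(1_000)
      await_ready()
    end
  end
end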

Member

@LVala LVala Jun 20, 2023

I guess that applying the packet loss is not possible from inside of the container? If so, the thing that comes to my mind is to create another Elixir node, something like a test runner, outside of the containers, but connected to the nodes in the containers via distributed Erlang (like the containers are between themselves). That node would be responsible for running the test and applying the packet loss when notified via a simple message from the server node/container. I'm not sure if that's doable with the docker-compose network etc., and it might be overkill, just something to consider.
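A rough sketch of that idea, purely for illustration (the node names, registered name and message shape are assumptions; only the cookie comes from the compose file above):

defmodule TestRunner do
  def run do
    # assumes this node was started with a name, so distribution is up
    Node.set_cookie(:"panuozzo-pollo-e-pancetta")
    true = Node.connect(:"server@server")
    Process.register(self(), :test_runner)

    # block until the server node tells us to apply the loss
    receive do
      {:apply_packet_loss, duration_ms} ->
        # here the runner would shell out to pumba netem or similar
        IO.puts("applying packet loss for #{duration_ms} ms")
    after
      60_000 -> raise "server never requested packet loss"
    end
  end
end

# on the server node, once the first batch of stats is collected:
#   send({:test_runner, :"runner@host"}, {:apply_packet_loss, 15_000})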

Member Author

Well, technically, we could run pumba netem in one of the containers (the one to apply loss on, or maybe even the server), but that would require us to expose the Docker daemon socket to the container itself, which to me seems like an anti-pattern at best, and a major security risk at worst...

I've thought about the approach with nodes; perhaps it really is the best of the bunch. There'd need to be some fiddling around with exposing epmd ports and such, but I've no doubt it's doable. I'll try to implement that.

Contributor

First of all, we can move this to another PR.

I think I would create a central point responsible for managing the whole test. This point would also be responsible for asking the browsers for statistics 🤔

We can discuss this in more detail before implementing.
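To make the discussion concrete, a very rough sketch of what such a central point might look like (all names, intervals and message shapes are assumptions, not from this PR):

defmodule TestOrchestrator do
  use GenServer

  def start_link(browser_nodes),
    do: GenServer.start_link(__MODULE__, browser_nodes, name: __MODULE__)

  def init(browser_nodes) do
    send(self(), :collect_stats)
    {:ok, %{browsers: browser_nodes, batch: 0}}
  end

  def handle_info(:collect_stats, state) do
    # ask every browser node for its current stats (call shape is assumed)
    Enum.each(state.browsers, &GenServer.call({:stats_server, &1}, :get_stats))

    # after the first 15 collections, it's time to apply the packet loss
    if state.batch == 14, do: send(self(), :apply_packet_loss)

    Process.send_after(self(), :collect_stats, 1_000)
    {:noreply, %{state | batch: state.batch + 1}}
  end

  def handle_info(:apply_packet_loss, state) do
    # placeholder: in practice this would notify whatever runs pumba/netem
    IO.puts("applying packet loss now")
    {:noreply, state}
  end
end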

Member Author

Fine by me, we can tackle this problem later on.

Member

What comes to my mind is to create a release with mix release and use a multi-stage Docker image like we do e.g. here. I'm not sure how it would play with Playwright and the browsers (or if it's possible at all), but it might make the image a little bit smaller, and you could use the new jammy membrane_docker in the build stage and not bother with installing some of the dependencies yourself. Similar story with the server Dockerfile.
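For context, the mix release half of that suggestion could look roughly like this in the test app's mix.exs (a hypothetical fragment, not the file from this PR); the Dockerfile's build stage would run mix release and a slim runtime stage would copy only the release out.

defmodule TestVideoroom.MixProject do
  use Mix.Project

  def project do
    [
      app: :test_videoroom,
      version: "0.1.0",
      elixir: "~> 1.14",
      # a release bundles the app, its deps and ERTS into one directory,
      # which is all the runtime image needs to contain
      releases: [
        test_videoroom: [
          include_executables_for: [:unix],
          strip_beams: true
        ]
      ]
    ]
  end
end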

Member Author

Yeah, I've tried doing that... obviously didn't succeed, seeing as I left the 'bloated' image Dockerfiles. Unfortunately, when installing Playwright, there are a lot of moving parts and a lot of places where things can break.

For now, I think I'd rather leave them as-is, and the container images will have to be made slimmer sometime in the future.


integration_test/test_videoroom/mix.exs (resolved)
@@ -0,0 +1,21 @@
# TestBrowser

**TODO: Add description**
Contributor

Probably something should be added here 😅

@sgfn sgfn force-pushed the sgfn/RTC-230-simulate-net-condition branch from 41f9c8a to 5157189 on June 26, 2023 13:35
@sgfn sgfn requested review from mickel8, Rados13 and LVala June 26, 2023 13:36
Contributor

@Rados13 Rados13 left a comment

I have a question: when I run the script locally, it finishes with logs like this:

FATA[0060] error running netem loss command: error running chaos command: failed to add packet loss for one or more containers: failed to stop netem container: failed to stop netem: failed to create exec configuration to check if command exists: Error response from daemon: Container f4f93f5eecf6bff6de163c295153f3390cdef1142ac7f38a593e89162c13567f is not running 
Network condition simulation over. Waiting for the docker-compose job to complete...

Doesn't that mean we stop the containers too early?
Besides that, the PR looks good.

@sgfn
Member Author

sgfn commented Jun 27, 2023

@Rados13 Well, depends on how you look at it 😄... It's unfortunate, but we'd need another synchronisation point to know when to stop/kill the netem command. This is what the script comment was referring to:

# The netem command will return an error when a container is stopped before the packet loss duration
# is up. This means we either need to kill it (and know when to do that), or ignore the error

It's possible that this issue could be resolved by running the netem command as a job (along with other modifications), but I fear that it would make the script even more convoluted and confusing.
Alternatively, we could simply 2>/dev/null the command, but that could obscure other issues, should they occur.

I sincerely hope a better solution can be found once we rewrite the test controller in Elixir...
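For what it's worth, once the controller is Elixir, the wrapper could tolerate that expected failure explicitly. A hypothetical sketch (the pumba flags are assumptions based on the 50%-loss scenario above, so double-check them against the pumba docs):

defmodule NetemRunner do
  # applies packet loss to the given container via pumba and treats the
  # "container stopped before the duration was up" failure as a normal end
  def apply_loss(container, duration \\ "60s", percent \\ 50) do
    args = ["netem", "--duration", duration, "loss", "--percent", to_string(percent), container]

    case System.cmd("pumba", args, stderr_to_stdout: true) do
      {_output, 0} -> :ok
      {output, _nonzero} -> {:container_stopped, output}
    end
  end
end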

Comment on lines 19 to 22
spawn_link(fn ->
Process.sleep(@max_test_duration)
raise("Test duration exceeded!")
end)
Contributor

I would use Process.send_after
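For comparison, a minimal sketch of the Process.send_after variant (assuming the owning process is a GenServer that can handle the message; the timeout value is a placeholder):

defmodule TimeoutSketch do
  use GenServer

  @max_test_duration :timer.minutes(5)

  def init(state) do
    # schedule a message instead of spawning a linked sleeper process;
    # the returned ref could be cancelled with Process.cancel_timer/1
    # if the test finishes in time
    Process.send_after(self(), :test_timeout, @max_test_duration)
    {:ok, state}
  end

  def handle_info(:test_timeout, _state) do
    raise "Test duration exceeded!"
  end
end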

@sgfn sgfn merged commit b15437f into master Jun 29, 2023
@sgfn sgfn deleted the sgfn/RTC-230-simulate-net-condition branch June 29, 2023 13:27