broken test_ipc_max_dgram_size test needs to be reviewed #234

Open
jnpkrn opened this issue Nov 23, 2016 · 14 comments

Comments

@jnpkrn
Contributor

jnpkrn commented Nov 23, 2016

https://travis-ci.org/ClusterLabs/libqb/jobs/178242766#L2722

../../tests/check_ipc.c:1498:F:ipc_max_dgram_size:test_ipc_max_dgram_size:0: Assertion 'init==try' failed: init==331264, try==425728

...triggered intermittently, only with clang (3.4), upon an unrelated change.

@jnpkrn
Contributor Author

jnpkrn commented Nov 23, 2016

Build image provisioning date and time
Thu Feb  5 15:09:33 UTC 2015

Operating System Details

Distributor ID:	Ubuntu
Description:	Ubuntu 12.04.5 LTS
Release:	12.04
Codename:	precise

Linux Version
3.13.0-29-generic

Cookbooks Version
a68419e https://github.com/travis-ci/travis-cookbooks/tree/a68419e

@jnpkrn
Contributor Author

jnpkrn commented Nov 23, 2016

Verified this is indeed intermittent: the above link now points to a restarted run, which passed. (Note that the offset of the respective line is +3. Unfortunately, I didn't grab those 3 extra lines while it was still possible; they might have shed more light on this, supposing they were related error messages.)

@jnpkrn
Contributor Author

jnpkrn commented Nov 23, 2016

One possibility that is hard to rule out is that parallel matrix
builds (e.g., multiple compilers) share the same /dev/shm path
(containers set up like that?), which doesn't play well in some
rare circumstances when similar pseudorandom paths are being
accessed...

@chrissie-c
Contributor

Very odd. I'm not going to worry about it short-term, though it would be useful to know how the test systems are set up. Can we reproduce it with clang ourselves?

@jnpkrn
Contributor Author

jnpkrn commented Nov 24, 2016

No cycles to spend on trying to reproduce that, though we are now aware of this tendency in Travis CI, so we'll have at least some clues when/if this recurs.

@jnpkrn
Contributor Author

jnpkrn commented Nov 24, 2016

Some archeology:

@jnpkrn
Contributor Author

jnpkrn commented Nov 24, 2016

One more relevant hit: http://lists.corosync.org/pipermail/discuss/2013-May/002573.html

One quick thing to check is the location of your shared memory
I use travis ci for libqb and travis uses ubuntu vm's and I
know I had to do a workaround for the shared memory location
being moved from /dev/shm to /run/shm.

See: https://github.com/asalkeld/libqb/blob/master/.travis.yml

I'd suggest have a look at the output of:
mount | grep shm
tmpfs on /dev/shm type tmpfs (rw,nosuid,nodev,seclabel)

df -h | grep shm
tmpfs                    3.9G  2.9M  3.9G   1% /dev/shm

and see if you need to run that workaround. (libqb tries /dev/shm
first).

jnpkrn added a commit to jnpkrn/libqb that referenced this issue Nov 24, 2016
@jnpkrn
Contributor Author

jnpkrn commented Nov 24, 2016

Regarding the relevance to Python implied by the cookbook references
above, http://stackoverflow.com/a/30175343 seems to suggest it was to
solve some kind of issue with the multiprocessing module in Python's
standard library.

chrissie-c added a commit that referenced this issue Nov 24, 2016
CI: make travis watch for the issue #234
@jnpkrn
Contributor Author

jnpkrn commented Nov 28, 2016

(see also #238)

chrissie-c added a commit that referenced this issue Nov 29, 2016
Continue with investigation of intermittent failures in Travis CI (#234)
@jnpkrn
Contributor Author

jnpkrn commented Nov 29, 2016

Diagnostic enhancement from #238 shed some more light here:

../../tests/check_ipc.c:1506:F:ipc_max_dgram_size:test_ipc_max_dgram_size:0: Assertion 'init==try' failed: init==0x50e00, try==0x67f00, i=28, errno=90

where errno of 90 means EMSGSIZE (Message too long).

One of the possibilities is that some assumption that has held so far
(per the previous successful test runs) is actually unreliable in practice,
and some factors of the Travis environment just make it easier to expose.
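For comparing this report with the first one, it may help to decode the hexadecimal sizes. The sketch below only does the arithmetic on the numbers quoted in this thread; the helper name `decode` is made up for illustration, and the errno lookup assumes a Linux host (errno numbering differs on other systems).

```python
import errno
import os

def decode(hex_text):
    """Convert a hex string such as '0x50e00' to an integer."""
    return int(hex_text, 16)

init_size = decode("0x50e00")  # 331264, same as init in the first report
try_size = decode("0x67f00")   # 425728, same as try in the first report

print("init =", init_size)
print("try  =", try_size)
print("gap  =", try_size - init_size, "bytes")

# On Linux, errno 90 is EMSGSIZE; other platforms number errnos differently.
print("errno 90 ->", errno.errorcode.get(90, "unknown"), "/", os.strerror(90))
```

So the failure reported here involves the same pair of sizes as the original one, just printed in hexadecimal.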

@jnpkrn
Contributor Author

jnpkrn commented Dec 12, 2016

Another hit:

init==0x50e00, try==0x67f00, i=40, errno=90

From the diagnostics added so far, it seems that /dev/shm mounted
as tmpfs is quite small, just 64 MB, which could be the culprit.
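A quick way to check the shared memory size on a given builder, as a minimal sketch: the helper name `tmpfs_total_mib` is hypothetical, and the two candidate paths are simply the ones mentioned earlier in this thread (/dev/shm and /run/shm).

```python
import os
import shutil

def tmpfs_total_mib(path="/dev/shm"):
    """Return the total size in MiB of the filesystem at `path`, or None if absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).total / (1024 * 1024)

# Report both locations discussed in this thread.
for candidate in ("/dev/shm", "/run/shm"):
    size = tmpfs_total_mib(candidate)
    print(candidate, "missing" if size is None else f"{size:.0f} MiB")
```

This reports only the size of the mount, not how much of it is in use by concurrently running jobs.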

@jnpkrn
Contributor Author

jnpkrn commented Dec 12, 2016

... PR #242 might help regarding this hypothesis.

@jnpkrn
Contributor Author

jnpkrn commented Apr 6, 2017

Just got a report of this issue occurring on virtualized s390x:

ipc_max_dgram_size:test_ipc_max_dgram_size:0: Assertion 'init==try' failed: init==331264, try==331776

A mere 495M was allocated to /dev/shm.

@chrissie-c
Contributor

It's testing socket buffers rather than SHM arenas, so it might be a ulimit issue. Odd that it failed there, though, because that check compares the reported maximum with what was actually allocated!
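To illustrate the difference between a reported buffer size and what can actually be sent, here is a rough sketch. It is not libqb's test code: the function names `fits` and `probe_max_dgram` are invented for this example, and it assumes a Unix-like system with AF_UNIX datagram sockets. It requests a send buffer size, reads back what the kernel reports, and then searches for the largest single datagram that really goes through.

```python
import errno
import socket

def fits(sender, receiver, size):
    """Return True if one datagram of `size` bytes can be sent and received."""
    try:
        sender.send(b"\0" * size)
    except OSError as exc:
        # EMSGSIZE is the error seen in this issue; some systems report ENOBUFS.
        if exc.errno in (errno.EMSGSIZE, errno.ENOBUFS):
            return False
        raise
    receiver.recv(size)  # drain the queue so the next attempt starts clean
    return True

def probe_max_dgram(requested):
    """Return (reported SO_SNDBUF, largest datagram that actually fit)."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
        receiver.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)
        reported = sender.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
        # Search up to an arbitrary upper bound chosen for this sketch.
        lo, hi = 0, 2 * max(reported, requested)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if fits(sender, receiver, mid):
                lo = mid
            else:
                hi = mid - 1
        return reported, lo
    finally:
        sender.close()
        receiver.close()

reported, largest = probe_max_dgram(64 * 1024)
print("reported SO_SNDBUF:", reported, "largest datagram sent:", largest)
```

The two numbers generally differ, since the kernel may adjust the requested value and reserve overhead per message. The search also assumes that if a given size fits, every smaller size fits too.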
