Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change int64_t to uint32_t used by flatbuffers for Ray event logging #1126

Closed
wants to merge 1 commit into from

Conversation

atumanov
Copy link
Contributor

@atumanov atumanov commented Oct 14, 2017

this PR addresses a local scheduler SIGSEGV on 32bit linux that occurs inside RayLogger_log_event due to the use of 64bit (long long) data types.

Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
50	../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S: No such file or directory.
(gdb) bt
#0  __strlen_sse2_bsf () at ../sysdeps/i386/i686/multiarch/strlen-sse2-bsf.S:50
#1  0x080a201e in redisvFormatCommand (target=0xbfeceb58, format=0x80ab841 "ZADD %b %s %b", ap=0xbfecebc8 "") at hiredis.c:262
#2  0x080a736c in redisvAsyncCommand (ac=0x90723f0, fn=0x0, privdata=0x0, format=0x80ab841 "ZADD %b %s %b", ap=0xbfecebc0 "xe\a\t\036") at async.c:654
#3  0x080a73ec in redisAsyncCommand (ac=0x90723f0, fn=0x0, privdata=0x0, format=0x80ab841 "ZADD %b %s %b") at async.c:669
#4  0x08078081 in RayLogger_log_event (db=0x9071e10, key=0x9076578 "event_log:\037\300\345\027v\r\300!>}\034\372\242\035}\376\320+^:", key_length=30,
    value=0x90761c8 "[[1507965381.056622, \"ray:get_task\", 1, {}], [1507965578.319059, \"ray:import_remote_function\", 1, {}], [1507965578.320148, \"ray:import_remote_function\", 2, {}], [1507965593.372685, \"ray:get_task\", 2, "..., value_length=938, timestamp=1507965593.3735039)
    at /home/atumanov/ray/src/common/logging.cc:100
#5  0x08062375 in process_message (loop=0x9068a38, client_sock=22, context=0x90738b8, events=1)
    at /home/atumanov/ray/src/local_scheduler/local_scheduler.cc:975
#6  0x0808770d in aeProcessEvents (eventLoop=0x9068a38, flags=3) at /home/atumanov/ray/src/common/thirdparty/ae/ae.c:412
#7  0x08087beb in aeMain (eventLoop=0x9068a38) at /home/atumanov/ray/src/common/thirdparty/ae/ae.c:455
#8  0x0806b408 in event_loop_run (loop=0x9068a38) at /home/atumanov/ray/src/common/event_loop.cc:58
#9  0x0806189e in start_server (node_ip_address=0xbfed1617 "127.0.0.1", socket_name=0xbfed15e0 "/tmp/scheduler17985046",
    redis_primary_addr=0xbfecf6ac "127.0.0.1", redis_primary_port=54774, plasma_store_socket_name=0xbfed15fa "/tmp/plasma_store75833613",
    plasma_manager_socket_name=0xbfed1629 "/tmp/plasma_manager4733243", plasma_manager_address=0xbfed1789 "127.0.0.1:25654", global_scheduler_exists=true,
    static_resource_conf=0xbfecf298,
    start_worker_command=0xbfed1647 "/home/atumanov/miniconda2/bin/python /home/atumanov/ray/python/ray/workers/default_worker.py --node-ip-address=127.0.0.1 --object-store-name=/tmp/plasma_store75833613 --object-store-manager-name=/tmp/"..., num_workers=2)
    at /home/atumanov/ray/src/local_scheduler/local_scheduler.cc:1298
#10 0x08057b09 in main (argc=19, argv=0xbfecf784) at /home/atumanov/ray/src/local_scheduler/local_scheduler.cc:1442

@AmplabJenkins
Copy link

Merged build finished. Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2124/
Test PASSed.

@robertnishihara
Copy link
Collaborator

Interesting, if that's the fix, I would have expected the cast to size_t in #1122 to fix it.

@pcmoritz
Copy link
Contributor

There seem to be places inside of hiredis that might get called which mix int and size_t, maybe this causes some problem, maybe that's a problem?
https://github.com/redis/hiredis/blob/97d611e9b58ea5d26a203aafe5a586a00d1e380c/hiredis.c#L544

I agree that casting to size_t should have fixed it, any other ideas?

@atumanov
Copy link
Contributor Author

Just to make it clear, I was able to reproduce the SIGSEGV in the local scheduler and confirmed that this PR gets rid of the problem. The local scheduler on i686 no longer segfaults with this patch applied. It would be good to confirm that it also works on the ODROID platform.

@atumanov
Copy link
Contributor Author

the casting fix in #1122 is similar, but not the same. If you examine @arvindc95's seg_fault_add_casts.txt log carefully, you will notice ZADD %b %b %b, which causes the code to take a different path through redisvFormatCommand and segfaults on a different line. It is likely that the casting fix alone may have the same result as this PR.

@robertnishihara robertnishihara mentioned this pull request Nov 9, 2017
@robertnishihara
Copy link
Collaborator

Closing this for now since it has been a while. We'll likely still need to do something like this though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants