SRS crash caused the operating system to shut down other programs. #536

Closed
Ji-Yuhang opened this issue Nov 25, 2015 · 15 comments

Comments


Ji-Yuhang commented Nov 25, 2015

http://pan.baidu.com/s/1ntq3tzV
These are the SRS binary and the dump files.
At some point while I was testing live streaming, SRS crashed, and CentOS appears to have shut down other programs on the server (MySQL, PHP, etc.). It was triggered by accident, and I am still trying to reproduce it.

Below is a part of /var/log/messages.

Nov 25 13:55:01 centos kernel: traps: srs[9768] general protection ip:493b24 sp:7f71c25fac90 error:0 in srs[400000+28e000]
Nov 25 13:55:03 centos abrt-hook-ccpp: Saved core dump of pid 9768 (/usr/local/srs/objs/srs) to /var/spool/abrt/ccpp-2015-11-25-13:55:01-9768 (27168768 bytes)
Nov 25 13:55:03 centos systemd-logind: Removed session 6.
Nov 25 13:55:03 centos abrt-server: Executable '/usr/local/srs/objs/srs' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Nov 25 13:55:03 centos abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2015-11-25-13:55:01-9768' exited with 1
Nov 25 13:55:03 centos abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2015-11-25-13:55:01-9768'

Below is my configuration file. The key change is the added engine mp, which saves the live stream as an MP4 file.

# the config for srs demo
# @see https://github.com/simple-rtmp-server/srs/wiki/v1_CN_SampleDemo
# @see full.conf for detail config.


listen              1935;
max_connections     1000;
http_server {
    enabled         on;
    listen          8080;
    dir             ./objs/nginx/html;
}
vhost __defaultVhost__ {


   enabled         on;
   gop_cache       off;
   tcp_nodelay on;
   queue_length 10;
   min_latency on;
   dvr {
       enabled             on;
       dvr_path            ./objs/nginx/html/rtmp/[app]/[stream].flv;
       #dvr_plan            segment;
       #dvr_duration        1800;
       #dvr_wait_keyframe   on;
       dvr_plan          session;
       #time_jitter         full;
   }
   hls {
       enabled         on;
       hls_path        ./objs/nginx/html/hls;
       hls_m3u8_file   [app]/[stream].m3u8;
       hls_ts_file     [app]/[stream]-[seq].ts;
       hls_fragment    10;
       hls_window      60;
       hls_storage     both;
       hls_acodec      aac;
       hls_vcodec      h264;
       hls_cleanup     on;
       hls_dispose     0;
   }


   transcode{
       enabled         on;
       ffmpeg          ./objs/ffmpeg/bin/ffmpeg;
       engine ld {
           enabled         on;
           vfilter {
               i               ./doc/srs-logo.png;
               filter_complex      'overlay=10:10';
               v                   quiet;
           }
           vcodec          libx264;
           vbitrate        500;
           vfps            30;
           vwidth          768;
           vheight         320;
           vthreads        16;
           vprofile        high;
           vpreset         superfast;
           vpreset         medium;
           vparams {     
           }
           acodec          libfdk_aac;
           abitrate        70;
           asample_rate    44100;
           achannels       2;
           aparams {
           }
           output          rtmp://127.0.0.1:[port]/[app]?vhost=[vhost]/[stream]_[engine];
       }
       engine mp {
           enabled         on;
           iformat         flv;
           oformat         mp4;
           vfilter {
               i               ./doc/srs-logo.png;
               filter_complex      'overlay=10:10';
               v                   quiet;
           }
           vcodec          libx264;
           vbitrate        500;
           vfps            30;
           vwidth          768;
           vheight         320;
           vthreads        16;
           vprofile        high;
           vpreset         superfast;
           vpreset         medium;
           vparams {     
           }
           acodec          libfdk_aac;
           abitrate        70;
           asample_rate    44100;
           achannels       2;
           aparams {
           }
           #output          rtmp://127.0.0.1:[port]/[app]?vhost=[vhost]/[stream]_[engine];
           output          ./objs/nginx/html/mp4/[app]_[stream].mp4;
       }


   }


}


Member

winlinvip commented Nov 25, 2015

I don't see where the log shows that the SRS crash caused other programs to close.

Author

Ji-Yuhang commented Nov 26, 2015

I don't understand it either. At some point we suddenly discovered that all the services on the server had been shut down, and when we looked at /var/log/messages we found that SRS had crashed. I have seen SRS crash before, but never with other services on the system going down at the same time. I apologize: "SRS crash caused the operating system to shut down other programs" is only my speculation.

Member

winlinvip commented Nov 26, 2015

This is impossible. If other services were shut down, check their own logs. Linux is not that fragile.


@winlinvip winlinvip added the Bug It might be a bug. label Nov 26, 2015
@winlinvip winlinvip added this to the srs 2.0 release milestone Nov 26, 2015
Member

winlinvip commented Nov 26, 2015

Can you paste the stack trace from the SRS core dump? Then I can look into this issue.

Author

Ji-Yuhang commented Nov 30, 2015

I'm not sure how to check this. I only have the dump file, which is in the Baidu Cloud Drive link at the top of this issue. Do you need me to compile a debug version and try to reproduce the problem? By the way, is there a way to save a stream as an MP4 file?

Member

winlinvip commented Nov 30, 2015

The default build is already a debug build, so with the core file you can find out where the crash occurred.

Contributor

haofz commented Dec 2, 2015

Could you tell me how to set up SRS so that it generates a core dump file when it crashes? Is the file not generated when SRS is started in non-daemon mode?

Member

winlinvip commented Dec 2, 2015

Search for "core dump" on Baidu. It is a standard facility for Linux programs, and it becomes available after setting the environment variables.
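
On Linux, whether a crashing process may write a core file is typically governed by the RLIMIT_CORE resource limit (the value that `ulimit -c` adjusts in a shell), while the location of the file depends on the system's core pattern (on the reporter's CentOS host, abrt picked it up). A minimal sketch, assuming the limit is raised from inside the process at startup; the helper name enable_core_dumps is hypothetical and is not an SRS function:

```cpp
#include <sys/resource.h>

// Hypothetical helper (not part of SRS): raise the soft core-file size
// limit to the hard limit so a crash is allowed to write a core file.
// Returns true on success.
bool enable_core_dumps()
{
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;
    }

    // An unprivileged process may always raise its soft limit up to the hard limit.
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

If the hard limit itself is 0, this call succeeds but still yields no core file; in that case the limit has to be raised by the shell or service manager that launches the program.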

Author

Ji-Yuhang commented Dec 11, 2015

Core was generated by `./objs/srs -c ./conf/srs.conf'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000493b1a in SrsEncoder::on_thread_stop (this=0x176dc10)
at src/app/srs_app_encoder.cpp:128
128 ffmpeg->stop();
Missing separate debuginfos, use: debuginfo-install glibc-2.17-78.el7.x86_64 libgcc-4.8.3-9.el7.x86_64 libstdc++-4.8.3-9.el7.x86_64
(gdb) where
#0 0x0000000000493b1a in SrsEncoder::on_thread_stop (this=0x176dc10)
at src/app/srs_app_encoder.cpp:128
#1 0x00000000004a3201 in SrsReusableThread::on_thread_stop (this=0x176dc50)
at src/app/srs_app_thread.cpp:465
#2 0x00000000004a2707 in internal::SrsThread::thread_cycle (this=0x176dc70)
at src/app/srs_app_thread.cpp:232
#3 0x00000000004a2769 in internal::SrsThread::thread_fun (arg=0x176dc70)
at src/app/srs_app_thread.cpp:244
#4 0x000000000051643e in _st_thread_main () at sched.c:327
#5 0x0000000000516bae in st_thread_create (start=0xffffffffffffffff,
arg=0xffffffffffffffff, joinable=0, stk_size=33) at sched.c:591


The above is the output of 'gdb ./objs/srs core.23738'.

Author

Ji-Yuhang commented Dec 11, 2015

After checking the source code, I found this:

void SrsFFMPEG::stop()
{
    if (!started) {
        return;
    }


    // kill the ffmpeg,
    // when rewind, upstream will stop publish(unpublish),
    // unpublish event will stop all ffmpeg encoders,
    // then publish will start all ffmpeg encoders.
    int ret = srs_kill_forced(pid);
    if (ret != ERROR_SUCCESS) {
        srs_warn("ignore kill the encoder failed, pid=%d. ret=%d", pid, ret);
        return;
    }


    // terminated, set started to false to stop the cycle.
    started = false;
}

I found the corresponding log entry, where pid = -1:
[2015-12-11 16:22:52.583][warn][23738][198][10] ignore kill the encoder failed, pid=-1. ret=1058

Is the program crashing because of the pid?

Author

Ji-Yuhang commented Dec 11, 2015

I added a log output at the location of ffmpeg->stop.

void SrsEncoder::on_thread_stop()
{
    // kill ffmpeg when finished and it alive
    std::vector<SrsFFMPEG*>::iterator it;


    for (it = ffmpegs.begin(); it != ffmpegs.end(); ++it) {
        SrsFFMPEG* ffmpeg = *it;
        srs_trace("ffmpeg->stop(), ffmpeg = %p, %d",ffmpeg,ffmpeg);
        ffmpeg->stop();
    }
}

Line 128 is the ffmpeg->stop() call.

[2015-12-11 16:51:58.058][trace][24074][150] send SIGTERM to pid=24160
[2015-12-11 16:51:58.587][trace][24074][151] ffmpeg->stop(), ffmpeg = 0x11f81e0, 18842080
[2015-12-11 16:51:58.587][trace][24074][151] send SIGTERM to pid=24160
[2015-12-11 16:51:58.588][warn][24074][154][104] client disconnect peer. ret=1004
[2015-12-11 16:51:58.591][trace][24074][150] SIGTERM stop process pid=24160 ok.
[2015-12-11 16:51:58.591][trace][24074][150] cleanup when unpublish
[2015-12-11 16:51:58.591][trace][24074][150] control message(unpublish) accept, retry stream service.
[2015-12-11 16:51:58.591][warn][24074][150][32] client disconnect peer. ret=1004
[2015-12-11 16:51:58.597][warn][24074][151][10] ignore kill the encoder failed, pid=0. ret=1058
[2015-12-11 16:51:58.597][trace][24074][151] ffmpeg->stop(), ffmpeg = 0x11cf158, 18674008

The log lines from the second through the last were emitted between the two ffmpeg->stop calls, which shows that the program crashed while executing the second ffmpeg->stop. I have reviewed the code of srs_kill_forced, but I am not able to debug further and find the specific cause of the error.

Member

winlinvip commented Jan 11, 2017

Killing pid 0 or -1 is abnormal. The SRS function that kills processes checks the pid and does nothing if it is <= 0.

int srs_kill_forced(int& pid)
{
    int ret = ERROR_SUCCESS;
    
    if (pid <= 0) {
        return ret;
    }

Your log says:


[2015-12-11 16:51:58.587][trace][24074][151] ffmpeg->stop(), ffmpeg = 0x11f81e0, 18842080
[2015-12-11 16:51:58.587][trace][24074][151] send SIGTERM to pid=24160
[2015-12-11 16:51:58.597][warn][24074][151][10] ignore kill the encoder failed, pid=0. ret=1058
[2015-12-11 16:51:58.597][trace][24074][151] ffmpeg->stop(), ffmpeg = 0x11cf158, 18674008

And errno=10 means:

#define ECHILD      10  /* No child processes */
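
ECHILD is the error waitpid reports when the given pid is no longer a child that can be waited for, for example because it has already been reaped. A small sketch of that behaviour, independent of SRS; the function name errno_of_second_wait is made up for illustration:

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>

// Illustrative only (not SRS code): fork a child that exits at once,
// reap it, then try to reap the same pid a second time.
// Returns the errno seen on the second waitpid, or 0 if it unexpectedly succeeded.
int errno_of_second_wait()
{
    pid_t child = fork();
    if (child < 0) {
        return errno;
    }
    if (child == 0) {
        _exit(0); // child: terminate immediately
    }

    int status = 0;
    if (waitpid(child, &status, 0) != child) {
        return errno; // the first wait should reap the child
    }

    errno = 0;
    if (waitpid(child, &status, 0) == -1) {
        return errno; // the child is already reaped, so this wait fails
    }
    return 0;
}
```

This matches the pattern in the log above: one caller reaps the process successfully, and a later wait on the same pid fails with errno 10.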

The two stop calls are on different FFMPEG objects. As for the "ignore kill the encoder failed, pid=0" warning: a pid of 0 is only ever seen in the child process (fork returns 0 there). In the parent process the pid cannot be 0 unless the object has already been released.

The execution path is as follows:

The encoder-150 thread kills the process with pid=24160.
[2015-12-11 16:51:58.058][trace][24074][150] send SIGTERM to pid=24160
When encoder-150 is waiting for the process to exit, it switches to other threads.

The encoder-151 thread also kills the process with pid=24160.
[2015-12-11 16:51:58.587][trace][24074][151] ffmpeg->stop(), ffmpeg = 0x11f81e0, 18842080
[2015-12-11 16:51:58.587][trace][24074][151] send SIGTERM to pid=24160
While waiting for the process to exit, the encoder-151 thread switches to another thread.

[2015-12-11 16:51:58.588][warn][24074][154][104] client disconnect peer. ret=1004

The encoder-150 thread switches back and successfully kills process 24160.
[2015-12-11 16:51:58.591][trace][24074][150] SIGTERM stop process pid=24160 ok.
[2015-12-11 16:51:58.591][trace][24074][150] cleanup when unpublish
[2015-12-11 16:51:58.591][trace][24074][150] control message(unpublish) accept, retry stream service.
[2015-12-11 16:51:58.591][warn][24074][150][32] client disconnect peer. ret=1004

The encoder-151 thread switches back and finds that process 24160 has been killed, so it returns an error.
[2015-12-11 16:51:58.597][warn][24074][151][10] ignore kill the encoder failed, pid=0. ret=1058

The encoder-151 thread then enters stop again, which is not right. It is possible that the stack has been corrupted.
[2015-12-11 16:51:58.597][trace][24074][151] ffmpeg->stop(), ffmpeg = 0x11cf158, 18674008

One thing is certain: encoder-150 and encoder-151 both hold the same pid 24160, which should not happen.
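
One common way to keep two callers from killing and reaping the same process is to make stop idempotent: copy the pid, clear the stored pid before any call that may block or yield, and only then kill and wait. The sketch below illustrates that idea under simplifying assumptions. ChildProcess is a hypothetical stand-in rather than the SRS SrsFFMPEG class, it uses SIGKILL and a blocking waitpid instead of the SIGTERM and coroutine-friendly wait that SRS uses, and it is not the fix that was applied in SRS.

```cpp
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical stand-in for an encoder's child-process handle (not SRS code).
class ChildProcess
{
public:
    ChildProcess() : pid_(-1) {}

    // Fork a child that simply sleeps until it is signalled.
    bool start()
    {
        pid_t p = fork();
        if (p < 0) {
            return false;
        }
        if (p == 0) {
            for (;;) {
                pause(); // child: wait for a signal
            }
        }
        pid_ = p;
        return true;
    }

    // Idempotent stop: returns 0 when the child was stopped or when there
    // was nothing to stop, otherwise the errno of the failing call.
    int stop()
    {
        if (pid_ <= 0) {
            return 0; // never started, or another caller already claimed it
        }

        // Claim the pid before any call that may block, so a second caller
        // entering stop() in the meantime sees pid_ <= 0 and returns.
        const pid_t target = pid_;
        pid_ = -1;

        if (kill(target, SIGKILL) != 0) {
            return errno;
        }

        int status = 0;
        while (waitpid(target, &status, 0) < 0) {
            if (errno != EINTR) {
                return errno;
            }
        }
        return 0;
    }

    bool running() const { return pid_ > 0; }

private:
    pid_t pid_;
};
```

With this shape, calling stop twice on the same object is harmless: the second call returns immediately instead of waiting on a pid that has already been reaped. It does not help if two different objects hold the same pid, which is the situation the log suggests.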

Member

winlinvip commented Jan 16, 2017

ffmpeg->stop is called repeatedly only if source->unpublish() is invoked concurrently. Another possibility is a reload, which also triggers ffmpeg->stop; if source->unpublish happens at that moment, it causes this problem.

Did you reload? A reload seems the most likely cause, since the source->unpublish race condition described above should not occur on its own.

Member

winlinvip commented Jan 16, 2017

I have opened a new bug for the source race condition: #742.

winlinvip added a commit that referenced this issue Jan 16, 2017
winlinvip added a commit that referenced this issue Jan 16, 2017
Member

winlinvip commented Jan 17, 2017

I have closed this bug for now, as it might be caused by the 'fly fd'. If you can provide steps to reproduce it, I will reopen it. At present I cannot reproduce it on my end.


@winlinvip winlinvip self-assigned this Sep 24, 2021
@winlinvip winlinvip changed the title from "SRS崩溃导致操作系统关闭了其他程序" to "SRS crash caused the operating system to shut down other programs." Jul 27, 2023
@winlinvip winlinvip added the TransByAI Translated by AI/GPT. label Jul 27, 2023