serverd.pid under /opt/fastcfs/fstore/ goes missing #41

Open
Monpati opened this issue Mar 12, 2024 · 9 comments

Comments

@Monpati

Monpati commented Mar 12, 2024

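As the title says: in a three-replica setup, faststore on nodes 01 and 02 suddenly went offline. The logs show the cause is that the file /opt/fastcfs/fstore/serverd.pid cannot be found. If I create the file manually, the restart still fails and the file is deleted again. How can this be resolved?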

@happyfish100
Owner

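Do not mix the two startup methods: starting through systemd and starting directly from the command line.
Also, check what errors the faststore server log reports.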

@Monpati
Author

Monpati commented Mar 12, 2024

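Do not mix the two startup methods: starting through systemd and starting directly from the command line. Also, check what errors the faststore server log reports.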

[2024-03-12 17:08:29] ERROR - file: connection_pool.c, line: 227, connect to fstore server 192.168.3.51:21014 fail, errno: 111, error info: Connection refused
[2024-03-12 17:08:29] WARNING - file: cluster_relationship.c, line: 1304, round 1th select leader, alive server count: 2 < server count: 3, try again after 1 seconds.
[2024-03-12 17:08:29] INFO - file: cluster_relationship.c, line: 998, the leader server id: 3, ip 192.168.3.52:21014, retry count: 1, time used: 740 ms
[2024-03-12 17:08:30] INFO - file: cluster_relationship.c, line: 1318, abort election because the leader exists, leader id: 3, ip 192.168.3.52:21014, election time used: 1s
[2024-03-12 17:08:30] ERROR - file: replication/replication_processor.c, line: 200, 1th connect to replication peer: 2, 192.168.3.51:21015 fail, time used: 0s, errno: 111, error info: Connection refused
[2024-03-12 17:08:30] ERROR - file: replication/replication_processor.c, line: 200, 1th connect to replication peer: 2, 192.168.3.51:21015 fail, time used: 0s, errno: 111, error info: Connection refused
[2024-03-12 17:08:30] INFO - file: replication/replication_processor.c, line: 260, connect to replication peer id: 3, 192.168.3.52:21015 successfully
[2024-03-12 17:08:30] INFO - file: replication/replication_processor.c, line: 260, connect to replication peer id: 3, 192.168.3.52:21015 successfully
[2024-03-12 17:08:31] INFO - file: cluster_relationship.c, line: 2251, connect to leader id: 3, 192.168.3.52:21014 successfully
[2024-03-12 17:08:33] ERROR - file: recovery/binlog_fetch.c, line: 315, fstore server 192.168.3.52:21015 response message: data group id: 6, slave id: 1, the replica connection NOT established!
[2024-03-12 17:08:33] WARNING - file: recovery/binlog_fetch.c, line: 551, data group id: 6, waiting count: 0, result: 16, time used: 0 ms
[2024-03-12 17:08:33] ERROR - file: recovery/binlog_fetch.c, line: 315, fstore server 192.168.3.52:21015 response message: data group id: 3, slave id: 1, the replica connection NOT established!
[2024-03-12 17:08:33] WARNING - file: recovery/binlog_fetch.c, line: 551, data group id: 3, waiting count: 0, result: 16, time used: 1 ms
[2024-03-12 17:08:33] ERROR - file: recovery/binlog_fetch.c, line: 315, fstore server 192.168.3.52:21015 response message: data group id: 8, slave id: 1, the replica connection NOT established!
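These are the errors in fs_serverd.log. I also tried starting from the command line, and it still fails.
I have confirmed that firewalld and SELinux are both disabled and that the keys have been added. fdir is working normally.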
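
A quick way to see which peer is unreachable in a log like the one above is to count the "Connection refused" errors per endpoint. The following is a minimal sketch, assuming every line follows the `[timestamp] LEVEL - file: ..., line: ..., message` layout visible in this thread; the function and variable names are illustrative and not part of FastCFS.

```python
import re
from collections import Counter

# Matches lines shaped like the fs_serverd.log excerpts in this thread:
# [2024-03-12 17:08:29] ERROR - file: connection_pool.c, line: 227, <message>
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>[A-Z]+) - "
    r"file: (?P<file>[^,]+), line: (?P<line>\d+), (?P<msg>.*)$"
)
ENDPOINT = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def refused_endpoints(lines):
    """Count ERROR lines reporting 'Connection refused', keyed by ip:port."""
    counts = Counter()
    for raw in lines:
        m = LOG_LINE.match(raw.strip())
        if not m or m.group("level") != "ERROR":
            continue
        msg = m.group("msg")
        if "Connection refused" not in msg:
            continue
        ep = ENDPOINT.search(msg)
        if ep:
            counts[f"{ep.group(1)}:{ep.group(2)}"] += 1
    return counts


# Lines copied from the log posted above.
sample = [
    "[2024-03-12 17:08:29] ERROR - file: connection_pool.c, line: 227, connect to fstore server 192.168.3.51:21014 fail, errno: 111, error info: Connection refused",
    "[2024-03-12 17:08:30] ERROR - file: replication/replication_processor.c, line: 200, 1th connect to replication peer: 2, 192.168.3.51:21015 fail, time used: 0s, errno: 111, error info: Connection refused",
    "[2024-03-12 17:08:30] ERROR - file: replication/replication_processor.c, line: 200, 1th connect to replication peer: 2, 192.168.3.51:21015 fail, time used: 0s, errno: 111, error info: Connection refused",
    "[2024-03-12 17:08:30] INFO - file: replication/replication_processor.c, line: 260, connect to replication peer id: 3, 192.168.3.52:21015 successfully",
]

for endpoint, n in refused_endpoints(sample).items():
    print(endpoint, n)
```

In this excerpt every refused connection targets 192.168.3.51, which suggests the fstore server on that host is the one not listening.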

@happyfish100
Owner

Judging from the log, one fstore server is not running.
Run ps and check whether the fs_serverd process exists.
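
Alongside ps, the pid file named in the issue title can be checked directly. This is a sketch only, assuming the pid file holds a single decimal process ID (not confirmed in this thread) and a POSIX system where signal 0 performs an existence check; `pid_file_status` is a hypothetical helper name.

```python
import os

PID_FILE = "/opt/fastcfs/fstore/serverd.pid"  # path quoted in the issue


def pid_file_status(path=PID_FILE):
    """Return 'missing', 'invalid', 'stale', or 'running' for a pid file.

    Assumption: the file contains one decimal process ID.
    """
    try:
        with open(path) as f:
            text = f.read().strip()
    except FileNotFoundError:
        return "missing"
    try:
        pid = int(text)
    except ValueError:
        return "invalid"
    if pid <= 0:
        return "invalid"
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence (POSIX)
    except ProcessLookupError:
        return "stale"  # file exists, but no process has that ID
    except PermissionError:
        return "running"  # process exists but belongs to another user
    return "running"


print(pid_file_status())
```

A "stale" result means the file was left behind by a process that has since exited. The check does not confirm that the live process is actually fs_serverd, so ps remains the authoritative test.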

@Monpati
Author

Monpati commented Mar 12, 2024

Judging from the log, one fstore server is not running. Run ps and check whether the fs_serverd process exists.

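It is a three-replica setup. Node 03 is healthy, while the other two kill themselves a few dozen seconds after starting. The start commands were run on all nodes at the same time. Could the cause be that the stored data is out of sync? I tried to recover the data with /usr/bin/fdir_serverd --data-rebuild /data/storage1,/data/storage2 /etc/fastcfs/fdir/server.conf restart, but it reported that this option no longer exists.
192.168.3.52 is the healthy node; 2.50 and 2.51 are the nodes that fail to start.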
[2024-03-12 17:42:20] INFO - file: replication/replication_processor.c, line: 260, connect to replication peer id: 2, 192.168.3.51:21015 successfully after 3 retries
[2024-03-12 17:42:21] INFO - file: replica_handler.c, line: 929, replication peer id: 3, 192.168.3.52:21015 join in
[2024-03-12 17:42:21] INFO - file: replica_handler.c, line: 929, replication peer id: 3, 192.168.3.52:21015 join in
[2024-03-12 17:42:21] ERROR - file: dio/trunk_write_thread.c, line: 405, [fstore] open file "/data/storage2/0003/000007" fail, errno: 2, error info: No such file or directory
[2024-03-12 17:42:21] WARNING - file: /usr/include/sf/sf_func.h, line: 42, kill myself from caller {file: dio/trunk_write_thread.c, line: 722, func: batch_write}
[2024-03-12 17:42:21] CRIT - file: sf_service.c, line: 710, catch signal 3, program exiting...
[2024-03-12 17:42:22] WARNING - file: recovery/binlog_replay.c, line: 561, data group id: 132, is_online: 0, block {oid: 9007199257741014, offset: 0}, slice {offset: 32, length: 131040}, read bytes: 65504 != slice length, maybe delete later?
[2024-03-12 17:42:22] ERROR - file: dio/trunk_write_thread.c, line: 405, [fstore] open file "/data/storage1/0002/000005" fail, errno: 2, error info: No such file or directory
[2024-03-12 17:42:22] WARNING - file: /usr/include/sf/sf_func.h, line: 42, kill myself from caller {file: dio/trunk_write_thread.c, line: 722, func: batch_write}
[2024-03-12 17:42:23] INFO - file: fs_serverd.c, line: 483, program exit normally.
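
The log above shows fs_serverd killing itself right after an `open file ... fail` error, so listing the trunk files it could not open is a natural next step. The following is a minimal sketch that pulls those paths out of the log text and reports whether each exists on the local disk; the regular expression is derived only from the two lines shown here, and `failed_opens` is a hypothetical helper name.

```python
import os
import re

# Derived from lines such as:
# ... [fstore] open file "/data/storage2/0003/000007" fail, errno: 2, ...
OPEN_FAIL = re.compile(r'open file "([^"]+)" fail, errno: (\d+)')


def failed_opens(lines):
    """Return unique (path, errno) pairs from 'open file ... fail' lines, in order."""
    seen = []
    for line in lines:
        m = OPEN_FAIL.search(line)
        if m:
            item = (m.group(1), int(m.group(2)))
            if item not in seen:
                seen.append(item)
    return seen


# Lines copied from the log posted above.
sample = [
    '[2024-03-12 17:42:21] ERROR - file: dio/trunk_write_thread.c, line: 405, [fstore] open file "/data/storage2/0003/000007" fail, errno: 2, error info: No such file or directory',
    '[2024-03-12 17:42:22] ERROR - file: dio/trunk_write_thread.c, line: 405, [fstore] open file "/data/storage1/0002/000005" fail, errno: 2, error info: No such file or directory',
]

for path, err in failed_opens(sample):
    print(path, "|", os.strerror(err), "| exists now:", os.path.exists(path))
```

Running this on each failing node against its own fs_serverd.log would show whether the trunk files are really absent under /data/storage1 and /data/storage2.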

@happyfish100
Owner

Judging from the log, one fstore server is not running.
Run ps and check whether the fs_serverd process exists.

@Monpati
Author

Monpati commented Mar 13, 2024

Judging from the log, one fstore server is not running. Run ps and check whether the fs_serverd process exists.

On 01 and 02, ps shows no fs_serverd process. After fs_serverd is started, the process is killed automatically a few dozen seconds later.

@happyfish100
Owner

Check the Linux system log to see how fs_serverd was killed.
There are three possibilities:

  1. killed by systemd
  2. killed by Linux due to OOM
  3. fs_serverd coredump

You can search the system log for the keyword fs_serverd.
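
The keyword search and the three possibilities above can be combined into a rough triage script. This is a sketch under loud assumptions: the substring hints reflect common Linux log wording rather than anything documented by FastCFS, and the sample lines are made up for illustration, not taken from the reporter's machines.

```python
# Substring hints per cause. ASSUMPTION: these reflect typical Linux log
# wording; adjust them to what the system log on your distribution prints.
HINTS = {
    "oom": ("Out of memory", "oom-kill", "oom_reaper"),
    "coredump": ("segfault", "core dumped", "dumped core"),
    "systemd": ("systemd[",),
}


def classify(line):
    """Return 'oom', 'coredump', 'systemd', or None for one log line."""
    for cause, needles in HINTS.items():
        if any(n in line for n in needles):
            return cause
    return None


def scan(lines, keyword="fs_serverd"):
    """Yield (cause, line) for every line that mentions the keyword."""
    for line in lines:
        if keyword in line:
            yield classify(line), line.rstrip("\n")


# Made-up sample lines for illustration only.
sample = [
    "kernel: Out of memory: Killed process 4321 (fs_serverd)",
    "kernel: fs_serverd[4321]: segfault at 0 ip 0000000000000000",
    "systemd[1]: Stopping fs_serverd (hypothetical unit text)",
    "sshd[99]: unrelated line",
]

for cause, line in scan(sample):
    print(cause, "|", line)
```

On a real host the input would be the lines of the system log; where that log lives depends on the distribution. A result of None for a matching line means the line mentions fs_serverd but fits none of the three hints and needs reading by hand.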

@Monpati
Author

Monpati commented Mar 13, 2024

3. fs_serverd coredump

The attached image shows the cause I found earlier in fs_serverd.log for it being removed. Do you know why this happens? I could not find the file dio/trunk_write_thread.c that the log mentions.

@happyfish100
Owner

  3. fs_serverd coredump

The attached image shows the cause I found earlier in fs_serverd.log for it being removed. Do you know why this happens? I could not find the file dio/trunk_write_thread.c that the log mentions.

That file is in the libdiskallocator library.
