Jobs getting enqueued but not processed #373
Comments
There's not a lot to go on here. I need more data.
The only unusual thing I see there is a panic recovery (maybe your NewRelic layer tries to recover?). Otherwise I don't see any smoking gun. You are using Faktory client 1.3.0, but I don't see any bug fixes that would definitely help. I would recommend upgrading both Faktory and the client version if possible, just so you are on the latest code and we can rule out any old, fixed bugs.
@mperham this has happened twice in the last week. Output from the /debug page:
Debugging:
Screenshot (Screen Shot 2022-04-13 at 12 04 31 PM): https://user-images.githubusercontent.com/92764/163222819-13280ed5-de19-4034-b957-e51822675d1d.png
Redis info:
# Server
redis_version:5.0.7
redis_git_sha1:bed89672
redis_git_dirty:0
redis_build_id:575dc7b6da705497
redis_mode:standalone
os:Linux 4.14.158-129.185.amzn2.x86_64 x86_64
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:9.2.0
process_id:10
run_id:840f9e5c395f841819460a39d1c9c8f9cf38c7f8
tcp_port:0
uptime_in_seconds:22380553
uptime_in_days:259
hz:10
configured_hz:10
lru_clock:5698488
executable:/usr/bin/redis-server
config_file:/tmp/redis.conf
# Clients
connected_clients:3
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:2
# Memory
used_memory:789480
used_memory_human:770.98K
used_memory_rss:3846144
used_memory_rss_human:3.67M
used_memory_peak:3098629
used_memory_peak_human:2.96M
used_memory_peak_perc:25.48%
used_memory_overhead:699998
used_memory_startup:550324
used_memory_dataset:89482
used_memory_dataset_perc:37.42%
allocator_allocated:755718
allocator_active:3808256
allocator_resident:3808256
total_system_memory:8053796864
total_system_memory_human:7.50G
used_memory_lua:37888
used_memory_lua_human:37.00K
used_memory_scripts:0
used_memory_scripts_human:0B
number_of_cached_scripts:0
maxmemory:0
maxmemory_human:0B
maxmemory_policy:noeviction
allocator_frag_ratio:5.04
allocator_frag_bytes:3052538
allocator_rss_ratio:1.00
allocator_rss_bytes:0
rss_overhead_ratio:1.01
rss_overhead_bytes:37888
mem_fragmentation_ratio:5.09
mem_fragmentation_bytes:3090426
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:83538
mem_aof_buffer:0
mem_allocator:libc
active_defrag_running:0
lazyfree_pending_objects:0
# Persistence
loading:0
rdb_changes_since_last_save:2
rdb_bgsave_in_progress:0
rdb_last_save_time:1649865643
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:487424
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
# Stats
total_connections_received:3015
total_commands_processed:231609196
instantaneous_ops_per_sec:1
total_net_input_bytes:31338644618
total_net_output_bytes:1584414740
instantaneous_input_kbps:0.17
instantaneous_output_kbps:0.01
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:0
expired_stale_perc:0.00
expired_time_cap_reached_count:0
evicted_keys:0
keyspace_hits:5367534
keyspace_misses:7711769
pubsub_channels:0
pubsub_patterns:0
latest_fork_usec:116
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
# Replication
role:master
connected_slaves:0
master_replid:a5f09e94d96c9a74e05086116c8f27b5e64e4671
master_replid2:0000000000000000000000000000000000000000
master_repl_offset:0
second_repl_offset:-1
repl_backlog_active:0
repl_backlog_size:1048576
repl_backlog_first_byte_offset:0
repl_backlog_histlen:0
# CPU
used_cpu_sys:10506.235012
used_cpu_user:9990.253246
used_cpu_sys_children:75.834772
used_cpu_user_children:403.868009
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=1243,expires=0,avg_ttl=0
Disk Usage:
> df -h
Filesystem Size Used Available Use% Mounted on
overlay 98.3G 4.8G 93.1G 5% /
tmpfs 64.0M 0 64.0M 0% /dev
tmpfs 3.8G 0 3.8G 0% /sys/fs/cgroup
/dev/nvme0n1p1 98.3G 4.8G 93.1G 5% /etc/resolv.conf
/dev/nvme0n1p1 98.3G 4.8G 93.1G 5% /etc/hostname
/dev/nvme0n1p1 98.3G 4.8G 93.1G 5% /etc/hosts
shm 64.0M 0 64.0M 0% /dev/shm
fs-006ac882.efs.us-east-1.amazonaws.com:/
8.0E 0 8.0E 0% /var/lib/faktory
tmpfs 3.8G 0 3.8G 0% /proc/acpi
tmpfs 64.0M 0 64.0M 0% /proc/kcore
tmpfs 64.0M 0 64.0M 0% /proc/keys
tmpfs 64.0M 0 64.0M 0% /proc/latency_stats
tmpfs 64.0M 0 64.0M 0% /proc/timer_list
tmpfs 64.0M 0 64.0M 0% /proc/sched_debug
tmpfs 3.8G 0 3.8G 0% /sys/firmware
You're on a version, 1.4.1, that is a bit old. Can you upgrade to see if that fixes the issue?
@mperham not quickly, but we'll get that work planned.
@mperham OK, just ran into this again after upgrading to the latest and greatest Faktory. Here's the relevant info:
Thread dump:
Debug page info:
Looks like a ton of goroutines are trying to enqueue a job and all of them are stuck waiting for the response. I need to put a read timeout on the I/O so that this issue doesn't leave the process completely deadlocked.
^^^^^ You can try compiling your worker against main and see if it works better.
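For readers following along, here is a minimal sketch of the read-timeout idea mentioned above, assuming a plain net.Conn to the server. It is not the actual patch in the contribsys repositories; the function name, the `PUSH` wire format, and the timeout value are placeholders for illustration.

```go
// Sketch: bound a blocking socket round trip with deadlines so a silent server
// fails the call with a timeout error instead of hanging the calling goroutine
// (and, eventually, every goroutine sharing the client) forever.
package faktorysketch

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

// pushWithTimeout writes one command and reads one reply line, refusing to
// block longer than timeout on either side of the round trip.
func pushWithTimeout(conn net.Conn, payload string, timeout time.Duration) (string, error) {
	// Bound the write too; a full socket buffer can also block indefinitely.
	if err := conn.SetWriteDeadline(time.Now().Add(timeout)); err != nil {
		return "", err
	}
	if _, err := fmt.Fprintf(conn, "PUSH %s\r\n", payload); err != nil {
		return "", err
	}

	// Without this deadline, ReadString blocks forever if the server never
	// answers; with it, the read returns a net.Error whose Timeout() is true.
	if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
		return "", err
	}
	return bufio.NewReader(conn).ReadString('\n')
}
```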
@mperham I don't know if it helps and/or is relevant, but the few times we've run into this, it always seems to get stuck when jobs are enqueuing other jobs (that is, pushing new jobs from within a running job).
The one thing I noticed with that code is your use of the pool. I would urge you to use helper.With instead. I traced through the codepath for Push on the server side and I don't see anything obviously wrong or anything that would block infinitely unless Redis is not responding for some reason. It's not clear to me which push middleware, if any, are running; that could be another possible source of trouble.
Duly noted, will try switching over to helper.With.
@mperham this issue is absolutely killing us. Above you suggested using helper.With. Which is the preferred way of having a job enqueue another job? If it's with helper.With, the FAQ should probably be updated. Thanks.
helper.With is the preferred way since you can get access to it easily within a job func. Maybe the FAQ needs to be updated.
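A minimal sketch of that pattern, assuming faktory_worker_go's HelperFor/With API. The job type names ("ParentJob", "ChildJob"), arguments, and queue are made up for illustration; error handling is abbreviated.

```go
// Sketch: a job that enqueues a follow-up job from inside its perform func
// using the worker's helper, rather than opening its own client or pool.
package jobs

import (
	"context"

	faktory "github.com/contribsys/faktory/client"
	worker "github.com/contribsys/faktory_worker_go"
)

func ParentJob(ctx context.Context, args ...interface{}) error {
	// HelperFor exposes the client connection already managed by the worker
	// process for this job's context.
	help := worker.HelperFor(ctx)

	return help.With(func(cl *faktory.Client) error {
		// Hypothetical child job; reuses the parent's args for illustration.
		job := faktory.NewJob("ChildJob", args...)
		job.Queue = "default"
		return cl.Push(job)
	})
}
```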
Use TTIN to get stacks from Faktory and worker processes. I'm not sure what else to do.
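For anyone unfamiliar with the TTIN trick: Faktory and faktory_worker_go handle the signal themselves, but this standalone sketch (Unix-only, hypothetical example program) shows the underlying mechanism of dumping all goroutine stacks when SIGTTIN arrives.

```go
// Sketch: dump all goroutine stacks to stderr when the process receives
// SIGTTIN, which is how you can see where a stuck process is blocked.
package main

import (
	"os"
	"os/signal"
	"runtime/pprof"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTTIN)

	go func() {
		for range sigs {
			// debug=2 prints every goroutine with a full stack trace.
			pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		}
	}()

	// Stand-in for real work; run `kill -TTIN <pid>` against this process
	// to trigger a dump.
	time.Sleep(time.Hour)
}
```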
Docker container contribsys/faktory:1.4.1
faktory_worker_go v1.4.0
Are you using an old version?
Have you checked the changelogs to see if your issue has been fixed in a later version?
https://github.com/contribsys/faktory/blob/master/Changes.md
https://github.com/contribsys/faktory/blob/master/Pro-Changes.md
https://github.com/contribsys/faktory/blob/master/Ent-Changes.md
Yes, using an old version, nothing in changelog of note as far as I can tell.
We've been running Faktory in production with no issues for over a year. All of a sudden today, we got an alert that the number of enqueued jobs was high. We saw the Enqueued size keep growing but nothing was getting processed. I restarted the worker process, which processed through the backlog, then everything got stuck again.
I sent the TTIN signal as noted here contribsys/faktory_worker_go#33 but I can't make heads or tails of the thread dump.
We killed the Faktory server itself and that seems to have fixed the issue. Unfortunately I did not get a thread dump on the server itself, but it had been running since 2020-08-28 09:49:22 -0400.
Unfortunately, when you kill the server you lose all of the enqueued jobs. I copied all of the enqueued jobs and their arguments from the Faktory UI and re-enqueued them manually in the new Faktory server, but is there a better way to gracefully shut down Faktory and bring it back up with the same data? Ironically, all of the jobs in the Dead queue and the old metrics (number of jobs run, etc.) survived into the new instance, so I assume that means it is correctly using the volume mounted into the container to persist the Redis data?