[BUG] Libdatadog error messages repeat the error message multiple times #283

Open
ivoanjo opened this issue Dec 1, 2023 · 2 comments

ivoanjo commented Dec 1, 2023

Describe the bug

Libdatadog error messages repeat the error message multiple times. For instance, I've started the Ruby profiler without a Datadog agent, and this is the libdatadog error message I see:

failed ddog_prof_Exporter_send: error trying to connect: tcp connect error: Connection refused (os error 111): tcp connect error: Connection refused (os error 111): Connection refused (os error 111)

Since these are user-visible error messages, it's not great to have them be somewhat confusing.
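
One plausible way a message of this shape arises (an assumption for illustration, not confirmed from the libdatadog source) is that each error in the chain already embeds its cause's message, and the formatter then also walks the chain and appends every level. A minimal Ruby sketch of that effect, where `WrappedError`, `full_chain_message`, and `outermost_message` are hypothetical names:

```ruby
# Hypothetical error class: its own message already embeds the message of its cause.
class WrappedError < StandardError
  attr_reader :inner

  def initialize(prefix, inner)
    @inner = inner
    super("#{prefix}: #{inner.message}")
  end
end

# Walks the whole chain and joins every level's message.
# Because each level already contains the levels below it, the text repeats.
def full_chain_message(error)
  messages = []
  while error
    messages << error.message
    error = error.respond_to?(:inner) ? error.inner : nil
  end
  messages.join(": ")
end

# Reports only the outermost message, which already includes the inner ones.
def outermost_message(error)
  error.message
end

os_error = StandardError.new("Connection refused (os error 111)")
tcp_error = WrappedError.new("tcp connect error", os_error)
connect_error = WrappedError.new("error trying to connect", tcp_error)

puts full_chain_message(connect_error) # repeated, like the message above
puts outermost_message(connect_error)  # each part appears once
```

Under that assumption, the fix is to pick one strategy: either each level prints only itself and the formatter walks the chain, or the formatter prints only the outermost message.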

To Reproduce
Steps to reproduce the behavior:

  1. Try to report a profile with the exporter, and do not start the agent
  2. Observe the error message from the exporter

Expected behavior

An error message that's not repeated.

mantrala commented Dec 7, 2023

I've been seeing this issue since I upgraded to Rails 7 a couple of weeks ago on Heroku. I have the Datadog agent and it still throws this error multiple times a day.

This is the complete error:

/app/vendor/bundle/ruby/3.2.0/gems/ddtrace-1.11.1/lib/datadog/profiling/http_transport.rb:54:in `export') Failed to report profiling data: failed ddog_prof_Exporter_send: error trying to connect: tcp connect error: Connection refused (os error 111): tcp connect error: Connection refused (os error 111): Connection refused (os error 111)

ivoanjo commented Dec 8, 2023

Hey @mantrala, sorry that you're running into the issue! It looks like something may be misconfigured with the Heroku integration. Could you reach out to support via https://docs.datadoghq.com/help/ so we can investigate the issue further?

ivoanjo added a commit to DataDog/dd-trace-rb that referenced this issue Jan 2, 2025
**What does this PR do?**

This PR builds atop #4237 and fixes a similar-ish issue in the profiler
caused by the same mishandling of ipv6 addresses.

In particular, when provided with an ipv6 address in the agent url,
the profiler would fail with an exception:

```
$ env DD_AGENT_HOST=2001:db8:1::2 DD_PROFILING_ENABLED=true \
bundle exec ddprofrb exec ruby -e "sleep 2"

dd-trace-rb/lib/datadog/profiling/http_transport.rb:27:in `initialize':
Failed to initialize transport: invalid authority (ArgumentError)
```

**Motivation:**

Luckily we didn't have any customers using this, as it fails immediately
and loudly, but it's still a bug on a configuration that should be
supported.

**Additional Notes:**

Since we had similar buggy logic copy-pasted in crashtracking and
profiling (crashtracking had been fixed in #4237) I chose to extract
out the relevant logic into the `AgentSettings` class, so that
both can reuse it.
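
As a rough illustration of the kind of URL building involved (a sketch under assumptions, not the actual `AgentSettings` code), an IPv6 literal has to be wrapped in square brackets before it is joined with a port. The helper names `agent_url` and `bracket_if_ipv6` are hypothetical:

```ruby
require "ipaddr"

# Hypothetical helper: wraps IPv6 literals in brackets so that
# "host:port" stays a valid URL authority.
def bracket_if_ipv6(hostname)
  return hostname if hostname.start_with?("[") # already bracketed

  IPAddr.new(hostname).ipv6? ? "[#{hostname}]" : hostname
rescue IPAddr::Error
  # Not an IP literal (e.g. a DNS name), so use it unchanged.
  hostname
end

# Hypothetical helper: builds the agent base URL from its parts.
def agent_url(hostname:, port:, ssl: false)
  scheme = ssl ? "https" : "http"
  "#{scheme}://#{bracket_if_ipv6(hostname)}:#{port}/"
end

puts agent_url(hostname: "2001:db8:1::2", port: 8126)
puts agent_url(hostname: "localhost", port: 8126)
```

Without the brackets, the colons of the address run into the port separator, which is consistent with the `invalid authority` failure shown above.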

**How to test the change?**

I've added unit test coverage for this issue to profiling, and
the snippet above can be used to end-to-end test it's working fine.

Here's how it looks on my machine now:

```
E, [2025-01-02T17:32:32.398756 #359317] ERROR -- datadog: [datadog]
(dd-trace-rb/lib/datadog/profiling/http_transport.rb:68:in `export')
Failed to report profiling data (agent: http://[2001:db8:1::2]:8126/):
failed ddog_prof_Exporter_send: error trying to connect: tcp connect
error: Network is unreachable (os error 101): tcp connect error:
Network is unreachable (os error 101): Network is unreachable (os error 101)
```

I.e. we correctly try to connect to the dummy address, and fail :)

(Note: The error message is a bit ugly AND repeats itself a bit.
That's being tracked separately
in DataDog/libdatadog#283 )
quinna-h pushed a commit to DataDog/dd-trace-rb that referenced this issue Jan 8, 2025
marcotc pushed a commit to DataDog/dd-trace-rb that referenced this issue Jan 8, 2025
* add supported version script and table

* update script

* rubocop lint

* modify script locations, add description to md table

* improve table output

* wip

* refactor code

* wip

* add supported versions

* add branch for testing

* remove json to avoid merge conflict issues

* update PR body

* Update .github/scripts/generate_table_versions.rb

Co-authored-by: Steven Bouwkamp

* switch to use gem declarations instead of hardcoded mappings

* linting checks

* cleanup comments

* refactor code

* cleanup code

* Combine duplicate option table rows

The documentation for instrumenting rake had two rows for the same option key. This consolidates those entries into a single row.

* Enable type checking for AgentSettingsResolver/AgentSettings

Steep doesn't seem to be a big fan of Structs so I just went ahead
and turned the `AgentSettings` into a regular class that's equivalent
to the struct we had before.

(In particular, I decided to still keep every field as optional).

Ideally this would be a `Data` class, but we're far from dropping
support for Rubies that don't have it.

* Move url building behavior from `AgentBaseUrl` to `AgentSettings`

This is preparation to also share this behavior with profiling.

* Refactor crashtracking to use `AgentSettings#url`

The behavior from the old `AgentBaseUrl` is now contained in
`AgentSettings` so we can clean up the extra logic.

* [PROF-11078] Fix profiling exception when agent url is an ipv6 address

* Implement `==` for new `AgentSettings` class

Forgot this one, some of our tests relied on it!
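
To illustrate why this is needed: a `Struct` provides field-by-field `==` for free, while a regular class falls back to identity comparison unless `==` is defined. A minimal sketch, assuming a reduced, hypothetical set of fields (`AgentSettingsSketch` is not the real class):

```ruby
# Hypothetical, simplified stand-in for a struct-like settings class.
class AgentSettingsSketch
  attr_reader :hostname, :port, :ssl

  # Every field is optional, mirroring the struct-like behavior.
  def initialize(hostname: nil, port: nil, ssl: nil)
    @hostname = hostname
    @port = port
    @ssl = ssl
    freeze
  end

  # Field-by-field equality, which Struct would otherwise provide.
  def ==(other)
    self.class == other.class &&
      hostname == other.hostname &&
      port == other.port &&
      ssl == other.ssl
  end
end

a = AgentSettingsSketch.new(hostname: "localhost", port: 8126, ssl: false)
b = AgentSettingsSketch.new(hostname: "localhost", port: 8126, ssl: false)
puts a == b      # equal by value
puts a.equal?(b) # but not the same object
```

Tests that compare two separately built settings objects depend on this value-based comparison.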

* use Ruby 3.4.1 for test-memcheck GHA

* Update exceptions file with another variant of thread creation memory leak

Since our exceptions match on the stack, they are affected by internal
naming changes, and it looks like a new `ruby_xcalloc_body` function
is now showing up in the stack.

* Introduce Ruby 3.5 gemfile variant for testing with dev builds

This is waaay incomplete in terms of adding support for Ruby 3.5 but
should get us going for ASAN testing for now.

* Update list of files used to compute cache checksum

In practice this shouldn't make a difference, since the final lockfiles
are supposed to be a superset of the root-level gemfile. But the
`Appraisals` file no longer exists, and "just in case" let's
have it anyway as it seems more correct.

* Bump Ruby 3.4 integration image to stable version

* Remove workaround for strscan issue

This is not expected to be an issue in 3.5 (and is probably fixed for
3.4 as well, but I'll leave that for a separate PR to not affect the
appraisals).

* Add unsafe api calls checker to track down issues such as #4195

This checker is used to detect accidental thread scheduling switching
points happening during profiling sampling.

See the bigger comment in unsafe_api_calls_check.h .

I was able to check that this checker correctly triggers for the bug
in #4195, and also the bug I'm going to fix next, which is the
use of `rb_hash_lookup` in the otel context reading code.

* Fix going into Ruby code when looking up otel context

`rb_hash_lookup` calls `#hash` on the key being looked up, so it's
not safe to use during sampling.

This can cause the same issue as we saw in #4195, leading to:

```
[BUG] unexpected situation - recordd:1 current:0

-- C level backtrace information -------------------------------------------
ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820
ruby(rb_vm_bugreport) vm_dump.c:1151
ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042
ruby(rb_bug_without_die) error.c:1042
ruby(die+0x0) [0x55ba03ac0998] error.c:1050
ruby(rb_bug) error.c:1052
ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226
ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144
ruby(rb_ec_tag_state) eval_intern.h:155
ruby(rb_vm_exec) vm.c:2484
ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509
ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728
ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598
ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615
ruby(thread_start_func_2) thread.c:672
ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187
/lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353]
```

* Avoid trying to sample allocations when VM is raising exception

During my experiments to reproduce issues around allocation profiling,
I've noted that the VM is in an especially delicate state during
exception raising, so let's just decline to sample in this situation.

* Update tests with new signatures for test methods

* Check if symbol is static before calling SYM2ID on it

It occurs to me that if a symbol is dynamic, we were causing it to
become a static symbol (i.e. making it impossible to ever garbage
collect).

This can be very bad! And also, we know the symbol we're looking for
must be a static symbol because if nothing else, our initialization
caused it to become a static symbol.

Thus, if we see a dynamic symbol, we can stop there, since by
definition it won't be the symbol we're looking for.

This is... really awkward to add a specific unit test for, so
I've just relied on our existing test coverage to show that this
has not affected the correctness of our otel code.

* Document that unsafe api calls checker is only for test code

* Add 3.4 support

* Update DevelopmentGuide

* Remove `racc` gem from 3.3 and 3.4 appraisal files

* [🤖] Lock Dependency: https://github.com/DataDog/dd-trace-rb/actions/runs/12595964519

* Remove strscan specification in 3.4 gemfile

* [🤖] Lock Dependency: https://github.com/DataDog/dd-trace-rb/actions/runs/12595969993

* add hardcoded

update workflow file

---------

Co-authored-by: Steven Bouwkamp
Co-authored-by: Bradley Schaefer
Co-authored-by: Ivo Anjo
Co-authored-by: Andrey Marchenko
Co-authored-by: Sarah Chen
Co-authored-by: ivoanjo