Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TLS support + peers protocol #59

Merged
merged 4 commits into from
Mar 27, 2018
Merged

Conversation

wallin
Copy link
Contributor

@wallin wallin commented Feb 26, 2018

Implement support for TLS + use new peers protocol

@jhecking
Copy link
Contributor

Thank you! Haven't done a full review yet, but looks good at first glance.

Because of the dependency on the openssl gem, we would have to drop support for Ruby 2.2. But since official support for that version ends end of March anyway, that's probably ok.

@wallin
Copy link
Contributor Author

wallin commented Feb 26, 2018

Thanks @jhecking. Dropping Ruby 2.2 seems reasonable

I tried to align the tls parameters names with the python client
https://github.com/aerospike/aerospike-client-python/blob/master/doc/aerospike.rst

Let me know which ones would be prioritized to support in a first release.

Copy link
Contributor

@jhecking jhecking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TCP/TLS socket connection implementation looks good to me in principal. That said, these changes alone won't be enough to actually connect to a TLS enabled cluster.

The problem is that the Ruby client still uses the older "services" info protocol for node discovery. For TLS support, the newer "peers" protocol is required. The two protocols are similar and both serve the same purpose of allowing the client to discover all the nodes in the cluster. But only the "peers" protocol contains the necessary information the client needs to establish a secure connection, e.g. the port number on which the server accepts TLS connections.

The "peers" protocol also includes the server's "tlsname", which the client generally uses to validate the certificate presented by the server, rather than using the server's hostname/IP address.

To get an idea of what it would take to implement the "peers" protocol, you can take a look at the changes to the Java client, that were done to support TLS as well as the "peers" protocol: aerospike/aerospike-client-java@5b5b06b. The Ruby client was largely modelled after the Java client so the overall structure is quite similar.


def set_cert(ctx, options)
if options[:cert_file]
ctx.cert = OpenSSL::X509::Certificate.new(File.open(options[:cert_file]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use SSLContext#add_certificate instead of the cert= and key= accessor methods?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Will remove this deprecated setter in favor of add_certificate


def set_cipher_suite(ctx, options)
if options[:cipher_suite]
# TODO(wallin)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just use SSLContext#ciphers=:

ctx.ciphers = options[:cipher_suite]


def set_protocols(ctx, options)
if options[:protocols]
# TODO(wallin)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can support a list of protocol versions. The best we can do with Ruby's OpenSSL library is to set a minimum and/or maximum protocol version using SSLContext#min_version and SSLContext#max_version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should default min_version to OpenSSL::SSL::TLS1_2_VERSION. (By default, the server only supports TLS v1.2.)

attr_reader :context

def initialize(host, port, timeout, ssl_options)
@host, @port, @timeout = host, port, timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this initialization happen in the Connection class itself? Since it's the same between the TCP and SSL socket implementations and the accessor methods are defined in Connection class as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I'll refactor this

def verify_certificate!(socket)
return unless context.verify_mode == OpenSSL::SSL::VERIFY_PEER
return if OpenSSL::SSL.verify_certificate_identity(
socket.peer_cert, host
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should not use the server's host name here to verify the certificate presented by the server, but instead use a separate "tlsname". I'll comment more on this in the review summary.

@wallin
Copy link
Contributor Author

wallin commented Feb 27, 2018

Thanks @jhecking. Will have a look at the "peers" protocol, and make the appropriate changes.

private

def create_context(options)
OpenSSL::SSL::SSLContext.new().tap do |ctx|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another team member suggested, that it might be best if we just let the user pass in a pre-configured SSLContext instance, instead of creating one ourself and exposing a subset of it's config options through our own settings. I like that idea, as it gives the user full flexibility to configure the context as needed without us having to expose every single config option through our own settings. On the other hand it does require a bit more work on the user's part for a standard setup and means we can no longer set sensible defaults, like restricting the connection to TLS v1.2+.

So, how about we allow the user to pass in a pre-configured context optionally, and create one if not. E.g. just add another :context option to the ssl_options hash. If not provided, we continue to create a new context as you have already implemented. But in that case it would be ok to not expose more advanced options such as setting the desired cipher suites, etc.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jhecking that's a great idea, to have both options. By allowing people to pass a custom context we'd potentially avoid a bunch of future requests asking for less common options. And as a lazy Ruby developer I like the simple option of just quickly enable SSL with sane defaults

In your experience, which options are the most commonly used by AS users? Anything besides the ones in this PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your experience, which options are the most commonly used by AS users? Anything besides the ones in this PR?

I think we should keep it to the minimum needed to get a working connection to a standard server deployment, with sensible defaults for everything else. I would support:

  • :enable
  • :ca_file, :ca_path
  • :cert_file
  • :key_file, :key_file_pass_phrase

You can remove the :cipher_suite and :protocols options and just default SSLContext#min_version= to OpenSSL::SSL::TLS1_2_VERSION.

@wallin
Copy link
Contributor Author

wallin commented Mar 5, 2018

@jhecking I've come a bit on the way on the peer protocol. Could you just provide me with a couple of examples of, more or less complex, responses from the peer command? I tried looking for test cases for examples but couldn't find any. Some examples with combinations of multiple hosts, tls and no tls, IPv6, IPv4, would be great so I can create a complete parser

@jhecking
Copy link
Contributor

jhecking commented Mar 5, 2018

That's good to hear!

Here are a couple of example responses from a 3 node cluster. For reference, this is what the server side network services config looks like:

service {
    address 192.168.33.10
    address 10.0.2.15
    port 3000
    access-address 192.168.33.10
    alternate-access-address 10.0.2.15
    tls-port 3044
    tls-name aerospike
    tls-authenticate-client any
}

Nodes 2 and 3 use ports 3100/3144 and 3200/3244 respectively.

Here is the full response for the 4 peers-* commands:

> pp client.request_info 'peers-clear-std', 'peers-clear-alt', 'peers-tls-std', 'peers-tls-alt'
{"peers-clear-std"=>
  "3,3100,[[C1D4DC08D270008,,[192.168.33.10]],[C814DC08D270008,,[192.168.33.10:3200]]]",
 "peers-clear-alt"=>
  "3,3100,[[C1D4DC08D270008,,[10.0.2.15]],[C814DC08D270008,,[10.0.2.15:3200]]]",
 "peers-tls-std"=>
  "3,3144,[[C1D4DC08D270008,aerospike,[192.168.33.10]],[C814DC08D270008,aerospike,[192.168.33.10:3244]]]",
 "peers-tls-alt"=>
  "3,3144,[[C1D4DC08D270008,aerospike,[10.0.2.15]],[C814DC08D270008,aerospike,[10.0.2.15:3244]]]"}

Note that the server will omit the default port 3000 in the responses.

Here is the same response with TLS disabled on the cluster:

> pp client.request_info 'peers-clear-std', 'peers-clear-alt', 'peers-tls-std', 'peers-tls-alt'
{"peers-clear-std"=>
  "5,3000,[[C814DC08D270008,,[192.168.33.10:3200]],[BB94DC08D270008,,[192.168.33.10]]]",
 "peers-clear-alt"=>
  "5,3000,[[C814DC08D270008,,[10.0.2.15:3200]],[BB94DC08D270008,,[10.0.2.15]]]",
 "peers-tls-std"=>"5,,[]",
 "peers-tls-alt"=>"5,,[]"}

And now with TLS re-enabled but without alternate-access-address:

> pp client.request_info 'peers-clear-std', 'peers-clear-alt', 'peers-tls-std', 'peers-tls-alt'
{"peers-clear-std"=>
  "5,3000,[[C1D4DC08D270008,,[192.168.33.10:3100]],[BB94DC08D270008,,[192.168.33.10]]]",
 "peers-clear-alt"=>"5,,[]",
 "peers-tls-std"=>
  "5,3044,[[C1D4DC08D270008,aerospike,[192.168.33.10:3144]],[BB94DC08D270008,aerospike,[192.168.33.10]]]",
 "peers-tls-alt"=>"5,,[]"}

Unfortunately, I don't have IPv6 setup for testing on my dev environment and I'm having a bit of a problem trying to set it up. So I can't provide an example for IPv6 at the moment. But I believe it should look the same, only with the IPv4 addresses replaced with IPv6 addresses. Only I am not sure if the IPv6 addresses are enclosed in an extra set of square brackets or not.

@wallin
Copy link
Contributor Author

wallin commented Mar 6, 2018

Thanks @jhecking. I'll try and have something ready by next weekend.

As for IPv6, it seems like you need to enclose it in brackets, judging by this comment:
https://github.com/aerospike/aerospike-client-java/blob/e82206496cb3d33222315fb978f7c21ef63251d1/client/src/com/aerospike/client/Host.java#L80-L91

@codecov-io
Copy link

codecov-io commented Mar 10, 2018

Codecov Report

Merging #59 into master will increase coverage by 0.86%.
The diff coverage is 97.47%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #59      +/-   ##
==========================================
+ Coverage   93.91%   94.77%   +0.86%     
==========================================
  Files          98      140      +42     
  Lines        7375     8600    +1225     
==========================================
+ Hits         6926     8151    +1225     
  Misses        449      449
Impacted Files Coverage Δ
lib/aerospike/value/value.rb 100% <ø> (ø) ⬆️
spec/aerospike/index_spec.rb 100% <ø> (ø) ⬆️
spec/spec_helper.rb 100% <100%> (ø) ⬆️
spec/aerospike/cluster/find_node_spec.rb 100% <100%> (ø)
lib/aerospike/aerospike_exception.rb 100% <100%> (ø) ⬆️
lib/aerospike/node/refresh/info.rb 100% <100%> (ø)
lib/aerospike/node/verify/name.rb 100% <100%> (ø)
spec/aerospike/socket/base_spec.rb 100% <100%> (ø)
lib/aerospike/socket/tcp.rb 100% <100%> (ø)
spec/aerospike/node_spec.rb 100% <100%> (ø)
... and 96 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bc72c30...bf73d4e. Read the comment docs.

@wallin
Copy link
Contributor Author

wallin commented Mar 11, 2018

@jhecking I've modified to code to handle the peers protocol as well as refactored some of the code into smaller service modules. I'm going to add more unit tests, but the PR is ready for another review on a conceptual level I believe

@jhecking
Copy link
Contributor

jhecking commented Mar 11, 2018 via email

@wallin
Copy link
Contributor Author

wallin commented Mar 11, 2018

Thanks @jhecking. That will hopefully give me time to finish up the last pieces. Safe travels!

@jhecking
Copy link
Contributor

@wallin, are you still planning to do a lot of changes or does it make sense for me to start reviewing the code so far now?

@jhecking
Copy link
Contributor

@wallin, I see there are still a few test failures and a large number of tests that seem to be skipped due to some issues with detecting whether certain features are supported by the server. Anything in particular I can help you with?

@wallin
Copy link
Contributor Author

wallin commented Mar 15, 2018

@jhecking The major changes, peers protocol + TLS, are ready for review. Some notes

  • The TLS implementation hasn't been tested on an actual real server setup.
  • A fully custom SSLContext object cannot be passed yet, but this is trivial to add
  • IPv6 host parsing hasn't been implemented yet, but this is also a trivial task
  • I need help verifying that adding/removing nodes is picked up by the client, and that eg. partition map is updates correctly.
  • I suspect some old code can be removed but haven't made a huge effort investigating this, mostly focusing on adding more and extracting existing.

Looking forward to your feedback!

@wallin
Copy link
Contributor Author

wallin commented Mar 15, 2018

@jhecking Will look into why they are failing, but unfortunately can't continue until tomorrow night (PST). In the mean time, if you have time, please feel free to point out any errors you may find

@wallin
Copy link
Contributor Author

wallin commented Mar 15, 2018

@jhecking quick update. I believe I fixed the issue related to feature detection. Now remaining is the batch and udf failures

@jhecking
Copy link
Contributor

jhecking commented Mar 15, 2018 via email

Copy link
Contributor

@jhecking jhecking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Phew, this took longer than expected to review and I haven't even looked at the specs (much) yet. :-)

Overall, I am very happy with the changes you implemented! Thanks again for taking the time to do some refactoring and code clean-up along the way!

One thing I am not totally convinced of yet, is what value there is in pulling out a bunch of business logic into small (tiny in some cases!) service modules. Especially if most of what the service does is to manipulate the state of some other object.

The Node::Verify::PartitionGeneration service is a good example. Other than "parsing" the partition-generation info value, the service mainly reads and updates a Node's partion_generation and partition_changed values. This couples the service very tightly to the Node implementation. E.g. the service needs to be aware that these values are Atomic, which could change at some point.

If the main goal is to move business logic out of the Node (and Cluster, etc.) classes, so that these classes become smaller, maybe a better approach would be to create separate classes for concepts like "partition generation" instead. A hypothetical PartitonGeneration class could encapsulate the actual partition generation value as well as the concept of whether the partition generation has changed (currently represented in Node#partition_changed). It could have two methods update, to set the new value and the changed flag as needed, and confirm (better name?), to reset the changed flag.

This might make testing easier in some cases as well. Some of your service module specs rely a lot on test doubles and stubbing, which can lead to brittle tests as the code base evolves.

But since you are the one doing most of the work, I'll leave it to you to consider this alternative approach.

I haven't had much time to do actual testing of the changes today. Will continue working on that tomorrow.

end

def tls_enabled?
(ssl_options || {}).key?(:enable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would return true if ssl_options includes a key :enable with a false value. But I would expect tls_enabled? to return false if I set ssl_options = { enable: false }.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

def tend
nodes = self.nodes
cluster_config_changed = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for clarity we should still initialize cluster_config_changed to false in case we don't go into any of the branches where it is being set to true.

node.reference_count.value = 0
node.responded.value = false
node.reset_reference_count!
node.partition_changed.value = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add another Node#reset_partition_changed! method? Or combine it with Node#reset_reference_count! - though I'm struggling to come up with a good name for such a combined method...

end
end
# refresh all known nodes
nodes.each { |node| Node::Refresh::Info.(node, peers) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a stylistic note: I would like to stick to the convention to use { ... } if the primary purpose of a block is to return a value and to use do ... end if the primary purpose of a block is its side-effects. I know it's just a preference and that the existing code does not always follow this convention, but I would like for new code to follow this "Weirich convention".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware of this "Weirich convention", but I definitely makes sense. Thanks for the link. I've mostly tried to apply the "line count" convention.

module Create
class << self
def call(host, port, timeout: 30, tls_name: nil, ssl_options: {})
if !ssl_options.nil? && ssl_options[:enable] == true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should default ssl_options[:enable] to true and only fall back to a non-encrypted TCP connection if it is set to false explicitly. The primary use-case for ssl_options[:enable] is if someone has configured a TLS connection but wants to temporarily disable TLS, e.g. for debugging, without changing any of the TLS setting itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the "default on" TLS option, but could you clarify the use case with an example. I don't see the issue of just flipping the enable option while still keeping the rest of the configuration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I just didn't express myself clearly. Having TLS to "default on" and providing an easy switch to temp. disable TLS is exactly what I meant. I.e. it should be possible to set up TLS without explicitly setting enable to true:

ssl_options = {
  cert: ...,
  key: ...,
  ca_file: ...
}
policy = new Aerospike::ClientPolicy(ssl_options: ssl_options)
client = new Aerospike::Client(policy: policy)

Then, if I need to temp. disable TLS, I can just set enable to false without affecting the rest of the code:

ssl_options[:enable] = false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it, I like that. So if I understand correctly TLS should be enabled as long as ssl_options is not nil, and then explicitly disabled with enable: false option?

return unless should_refresh?(node, peers)

conn = node.tend_connection
node.cluster.update_partitions(conn, node)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are already pulling out most of the partitions refresh logic into it's own service module, then I think we can move some more logic from Cluster#update_partitions here. Maybe the partition tokenizer (old|new) should be initialized here and then passed into Cluster#update_partitions instead.

module Peers
class << self
def call(node, peers)
return if node.failures.value > 0 || !node.active?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should move this to a should_refresh? method and use node.failed?, analog to Refresh::Partitions.

::Aerospike.logger.warn("Peer node #{peer.node_name} is different than actual node #{nv.name} for host #{host}");
# Must look for new node name in the unlikely event that node names do not agree.
# Node already exists. Do not even try to connect to hosts.
break if Cluster::FindNode.(node.cluster, peers, nv.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to set node_validated to true if we find the node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missed this. Thanks


generation = gen_string.to_i

if node.partition_generation.value != generation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this whole block should be moved into a new Node#update_partition_generation(generation) method? Or create a separate PartitionGeneration class altogether - more on this later in the review summary...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I like this

Resolv.getaddresses(host.name)
end

@aliases = [].tap do |aliases|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are already rewriting this for stylistic reasons, why not go all the way? :-)

@aliases = addresses.map { |addr| Host.new(addr, host.port, host.tls_name) }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course! :)

@khaf
Copy link
Collaborator

khaf commented Mar 15, 2018

This looks really good. Thanks for your great work.

In addition to @jhecking 's concerns, there are a few more issues to consider:

  1. This library is used by others for DevOps; It should remain /mostly/ backwards compatible.
  2. A lot of peculiarities in the library were due to performance considerations. As part of validation process, we should benchmark the new code to make sure there is no major performance degradation.
  3. If code organization is changed too much, we will have trouble maintaining and developing the library in the future, since same changes in our other libraries will look completely different in this one. While this should not be a major concern regarding making progress here, it is still important, and will make adding new features much less time intensive.

@jhecking
Copy link
Contributor

Hi @khaf, thanks for your feedback!

I agree that the first 2 points you raise are important. But I do not think there is a particular issue here for the set of changes done by @wallin.

  1. There are no backward compatible API changes; in fact there are no API changes at all except for the addition of ssl_options to the client policy. The only consideration is whether we want to drop support for Ruby v2.2. I still think that's ok but I have also found a way to make the openssl gem dependency optional, since Ruby v2.2 still includes the OpenSSL library in its standard library.
  2. Will definitely run the benchmarks. But most of the changes are not in the hot path and only affect the tend operation. (Enabling TLS will obviously have some performance impact, off course.)

I am less concerned about the 3 point regarding code organization. If we truly consider this client community supported, then adhering to the particular code organization of another Aerospike language SDK should be of less importance than making it easy for other Ruby developers to contribute to the client.

@wallin
Copy link
Contributor Author

wallin commented Mar 15, 2018

Thanks @jhecking! I know it's a (too) big PR with a lot of changes. I'll address your feedback in the upcoming days

One thing I am not totally convinced of yet, is what value there is in pulling out a bunch of business logic into small (tiny in some cases!) service modules. Especially if most of what the service does is to manipulate the state of some other object.

The idea with the service modules it to extract non-trivial operations on models (such as eg. a Node). Essentially moving towards separation of concerns and SRP which also makes operations easier to test. Personally, I've found this a good way of creating maintainable code base that is also easy to navigate. But I think we're on the same page on this philosophy generally speaking.

That said. I will admit that I'm not fully content with how it's currently organised. Being new to AS in general, I was basically figuring out how everything works while implementing this, so I don't feel that I don't have a full holistic understanding of all the concepts. The service modules are implemented very much like the corresponding Java method, which is not always the right abstraction like you say.

I'm not a huge fan of how the Java implementation is organised either, but I didn't want to deviate too much from it in this first implementation. Eg. I think your idea with the PartitionGeneration class makes a lot of sense, but I'd prefer to do that in a future PR.

My focus with this PR was first and foremost to get the functionality in place while also setting a direction for organising the code, without changing to much at the same time. You can expect more PRs from me after this one, however I promise they will be smaller and more isolated :)

E.g. the service needs to be aware that these values are Atomic, which could change at some point.

I don't like this either. In upcoming PRs I want to improve the ergonomics of this by creating a mixin for declaring atomic attributes. eg. attr_atomic :features which would created helpers and hide the complexity of working with this. What do you think?

This might make testing easier in some cases as well. Some of your service module specs rely a lot on test doubles and stubbing, which can lead to brittle tests as the code base evolves.

I see your point but IMHO the purpose of unit tests is to test functionality and behavior in isolation, which is why I think as much as possibly should be stubbed and mocked. The idea is that tests should break as the code base evolves. As the abstractions become more clear, I imaging the tests will also improve.

@wallin
Copy link
Contributor Author

wallin commented Mar 15, 2018

@khaf @jhecking

  1. Echoing what @jhecking said. No changes to the external API, so we should be ok. Otherwise, I think it's a fair requirement to require a non-deprecated Ruby version. Generally, if backwards compatibility is a strict requirement, we should create a separate release branch for that imho.
  2. I've used the benchmark in examples folder with no notable difference in performance. I've also run more in depth profiling with ruby-prof, stackprof and allocation-tracer, also without notable differences. However, I did notice a couple of places that can be optimized, mostly related to allocations, but also there is a lot of string manipulation that probably can be avoided. I will submit separate PRs for these findings

I agree with @jhecking on 3. I appreciate the importance of providing consistent SDKs, but mostly from an API perspective. Eg. IMHO the Java internal implementation is not exemplary, so I don't think that structure is necessarily worth preserving for the sake of consistency

@jhecking
Copy link
Contributor

The idea with the service modules it to extract non-trivial operations on models (such as eg. a Node). Essentially moving towards separation of concerns and SRP which also makes operations easier to test. Personally, I've found this a good way of creating maintainable code base that is also easy to navigate. But I think we're on the same page on this philosophy generally speaking.

Yes! We agree on the goals. I'm just not sure whether the service modules -- in the current form anyway -- are necessarily the best way to achieve those goals.

In many cases, the current service modules are not much more than a single method of the Node or Cluster module moved into a separate file. They are often closely coupled to the model since they directly manipulate its internal state. You don't really achieve separation of concerns this way because the model and the service(s) share responsibility for a single concern. E.g. the Node model and the Node::Verify::PartitionGeneration service are both responsible for keeping track of the partition generation and whether it has changed: Node keeps the state but the PartitionGeneration service updates it. It would be better if both the state as well as the methods that manipulate it would be extracted to a separate domain model altogether. Node and this new domain model should communicate only via passing messages but should be unaware of each other's internal state.

But for now, let's focus on completing the implementation of the peers protocol and TLS support. I'm looking forward to further PRs from you to address these design issues. :-)

@jhecking
Copy link
Contributor

In upcoming PRs I want to improve the ergonomics of this by creating a mixin for declaring atomic attributes. eg. attr_atomic :features which would created helpers and hide the complexity of working with this. What do you think?

That's not what I would focus on. I actually doubt that all of these values need to be Atomic in the first place. Most of them are only accessed from the single tend thread. Likewise, I think we can probably do away with the mutex lock used by Cluster. Its main purpose is to prevent get/put/... commands from accessing the partition table while it's being updated by the tend thread. But there are better strategies to do that using copy-on-write. You'll note that the Java client hardly uses any atomics nor locks for performance reasons.

@jhecking
Copy link
Contributor

I've used the benchmark in examples folder with no notable difference in performance.

I believe you mean the benchmark in the tools folder? The examples do not include any benchmarks.

Gemfile Outdated
gem 'msgpack-jruby', :require => 'msgpack', :platforms => :jruby
gem 'msgpack', '~> 1.0', :platforms => [:mri, :rbx]
gem 'bcrypt'
gem 'openssl', platforms: :mri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok to remove Ruby 2.2 from the CI support matrix, but I don't want to break compatibility with older Ruby releases unnecessarily. So can we change this to:

install_if -> { Gem::Version.new(RUBY_VERSION) >= Gem::Version.new('2.3.0') } do
  gem 'openssl', platforms: :mri
end

This will allow the aerospike gem to still be installed on Ruby v2.2 and older. Even TLS encryption might still work since the standard library includes a version of the OpenSSL library anyway. That version probably does not support the latest TLS versions, new ciphers, etc. So upgrading to a later Ruby version is still highly recommended but at least the client is still usable on older Ruby versions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a similar note. What's your stance on supporting JRuby? I think it can be done without too much effort, but we can address that in a separate PR if you think it's a priority

tcp_sock = TCP.connect(host, port, timeout)

ctx = OpenSSL::SSL::SSLContext.new
ctx.set_params(ssl_options) if ssl_options && !ssl_options.empty?
Copy link
Contributor

@jhecking jhecking Mar 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail if ssl_options includes any keys for which there are no corresponding assignment methods on SSLContext. E.g. if ssl_options includes a key :enable, this will raise an error because set_params will attempt to call ctx.enable=.

I think we should revert back to what you had before and support only a few keys in ssl_options that we use to call ctx.add_certificate. For anything more complicated than that we allow the user to pass in a fully configured SSLContext instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option could be to only let ssl_options only represent options that can be passed to initialize the SSLContext skip the enable flag all together. If you temporarily want to disable SSL, you can just pass nil instead of the options. So:

  • if ssl_options is a Hash, we create a new context and call set_options
  • if ssl_options already is an SSLContext we just use that.

In a sense I think this is cleaner rather than maintaining a set of own options, because we can just refer to the official documentation for SSLContext. What do you think?

ssl_sock = new(tcp_sock, ctx)
ssl_sock.hostname = tls_name
ssl_sock.connect
ssl_sock.post_connection_check(host)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to check using tls_name instead of host.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed. good catch


peer.hosts.each do |host|
begin
nv = NodeValidator.new(node.cluster, node.host, node.cluster.connection_timeout, node.cluster.ssl_options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are missing a cluster_name param here.

@wallin
Copy link
Contributor Author

wallin commented Mar 19, 2018

Yes, that would be good. But if we keep growing the scope of this PR it'll never get done. :-)

My point exactly :) I just wanted to know if the current/old implementation was acceptable

If we just pass ssl_options straight through to SSLContext#set_params then I don't really see the point of having that option at all.

Point taken. I'll leave it as is, and we can extend and improve it in future PRs

I don't see a reason why we should try to keep the connection alive even longer

We can leave this as is for now. The reason I ask is that I see many other DB drivers offering configuration for this

you are also trying to use Cluster::FindNode in Node::Refresh::Friends and that's where the issue is.

Got it! Will fix this

@jhecking
Copy link
Contributor

My point exactly :) I just wanted to know if the current/old implementation [of find_nodes_to_remove] was acceptable

I'm ok either way: We can leave this for a separate PR or update to the new algorithm right away. The new implementation is definitely better!

We can leave this as is for now. The reason I ask is that I see many other DB drivers offering configuration for [TCP keep-alive]

Noted. I'll check with our other client devs what their views are on enabling TCP keep-alive.

@jhecking
Copy link
Contributor

Re. TCP keep-alive: I forgot that the server closes idle connections after 60 seconds, so TCP keep-alive is not so important. Most other Aerospike clients also close idle connections, e.g. the Java client has a maxSocketIdle limit of 55 seconds by default.

@wallin
Copy link
Contributor Author

wallin commented Mar 20, 2018

@jhecking sounds good. I'll leave everything as is so we can get this PR merged, and then go ahead and create issues for all of the updates/improvements we've talked about

I'll just spend some time debugging why the "friend" update isn't working properly, and hopefully, come up with a fix soon

@wallin
Copy link
Contributor Author

wallin commented Mar 20, 2018

@jhecking update on the issue I'm seeing with RefreshFriends. In my local setup, every node reports two IPs (eg. 192.68.50.4 and 10.0.2.15). When iterating through these the NodeValidator connects to the first IP non-blocking and shortly thereafter to the second IP, causing Errno::EALREADY to be raised and the prepare code to be skipped.

Any ideas on how to best handle this case?

@jhecking
Copy link
Contributor

I have pushed a couple more fixes to the aerospike:tls-support branch. Specifically:

  • NodeValidator didn't resolve IP addresses correctly.
  • Node::Refresh::Friends didn't add new hosts to a node's aliases list correctly.
  • I'd occasionally get exceptions when trying to close a socket that was already closed.
  • Improve the error message when unable to connect to a new host and the connection times out.

The last issue, is what you were running into, I think. Based on your IP addresses, it sounds like you are running the server in one or more Vagrant boxes, correct? I have the same setup. From my (macOS) host I am not able to connect to the 10.0.2.15 address that the server in the vagrant guest publishes. That would manifest itself in the Errno::EALREADY error. (I'm not 100% sure, why...)

After some struggles to get a second, valid IP address added to my vagrant guest, I'm now able to verify that the cluster lookup using the old services protocol (aka "friends") works as expected, even if the server publishes two separate IP addresses per node.

In a production setup, you would probably want to avoid publishing two separate IP addresses for clients, and specify the address to publish for clients using the access-address configuration. Unless you have two or more different sets of clients that need to connect via different IP addresses (e.g. internal/external IP), in which case you need to use alternate-access-address. See General Network Configuration docs for more details.

peers.peers.each do |peer|
next if ::Aerospike::Cluster::FindNode.(cluster, peers, peer.node_name)

node_validated = false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be peers_validated? (And on lines 34 and 41 as well.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,51 @@
# frozen_string_literal: true

RSpec.describe Aerospike::Peers::Parse do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move the specs under spec/lib/aerospike to just spec/aerospike instead?

@jhecking
Copy link
Contributor

I think we are getting close to the point where this PR is ready to be merged! Some outstanding items:

  • We'll unfortunately have to add the standard license boilerplate to all new files - I've already done this and committed it at ddbfccf. Please pull from my branch again.
  • There are still a number of pending specs. Are you planning to complete some more of them?
  • We should probably do a rebase against master and split up the PR into a smaller set of commits. Maybe we should even split the PR into two separate PRs to add the peers protocol and TLS support separately? What do you think?

Anything else I'm missing?

@wallin
Copy link
Contributor Author

wallin commented Mar 21, 2018

Great! Thanks for adding the license texts and making the "friend fixes"

  • I'll go ahead and add some more specs for the less entangled modules. Some, like eg. Refresh::Peers, are still a bit too complex to write good unit tests for imho, so I'll wait with that
  • I'd be glad to merge this in a cleaner way, however, I don't have a clear picture of how that would happen, so feel free to take the lead on this one, and I'd be happy to oblige. One drawback of splitting up the PR in separate new ones, if that we'd have to re-do a lot of the manual testing that has gone into this PR.
  • I'll think about if there is something else, but nothing that comes to mind right now

@jhecking
Copy link
Contributor

I'll go ahead and add some more specs for the less entangled modules. Some, like eg. Refresh::Peers, are still a bit too complex to write good unit tests for imho, so I'll wait with that

Hmm, these four files still have very low test coverage: node/refresh/friends.rb, node/refresh/peers.rb, socket/ssl.rb, and socket/base.rb. I'm a bit worried especially about Node::Refresh::Friends and Socket::SSL because these two features do not get used in the tests at all; at least Node::Refresh::Peers and Socket::Base get covered somewhat in the course of the execution of the rest of the test suite.

But I'm not going to treat this as a blocker. Other than these four files, the test coverage for your changes is very good! Thanks for that! I'd just ask that you remove the specs that are currently empty except for a pending statement, if you don't have the intention to complete them at the moment.

I'd be glad to merge this in a cleaner way, however, I don't have a clear picture of how that would happen, so feel free to take the lead on this one, and I'd be happy to oblige. One drawback of splitting up the PR in separate new ones, if that we'd have to re-do a lot of the manual testing that has gone into this PR.

Sure, I can help with that. However, were you still going to add wrappers for the Node::Refresh::* service methods to Node? That's the last pending change that I am aware of at the moment.

@jhecking
Copy link
Contributor

I added a few specs for Socket::SSL class here: f086c45. Not great, but better than nothing. ;-)

Thinking about how we can setup the CI env. so that we can actually test TLS connections end-to-end.

@wallin
Copy link
Contributor Author

wallin commented Mar 24, 2018

were you still going to add wrappers for the Node::Refresh::* service methods to Node? That's the last pending change that I am aware of at the moment.

Done!

I added a few specs for Socket::SSL class here: f086c45

Thanks for taking that off my chest! however, I took the liberty to re-write them in a more Rspec way. Sorry, being very anal about this ;)

There are still a number of pending specs. Are you planning to complete some more of them?

I added some specs for socket/base.rb and socket/tcp.rb and deleted the rest of the pending files

'm a bit worried especially about Node::Refresh::Friends

I share your concern. At some point, I'd very much like to refactor that module into smaller methods so we can do meaningful unit tests, but for now, it's best to just keep it as similar to the Java implementation as possible. Would there be any way of testing this on an integration level? Like, disable peers support on the AS cluster?

end
end

describe '::create_context' do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are there two separate describe blocks for ::create_context?

let(:pkey) { resource('ssl', 'test.key.pem') }
let(:ssl_options) { { cert_file: cert, pkey_file: pkey } }

before { allow_any_instance_of(OpenSSL::SSL::SSLContext).to receive(:add_certificate) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change before { ... } to before do ... end? 'Cause I'm anal about this... ;-)

@jhecking
Copy link
Contributor

wrappers for the Node::Refresh::* service methods ... Done!

Thanks!

I took the liberty to re-write [the specs for Socket::SSL] in a more Rspec way. Sorry, being very anal about this ;)

No problem. We all got our pet peeves. ;-)

I added some specs for socket/base.rb and socket/tcp.rb and deleted the rest of the pending files

Yay!

Node::Refresh::Friends: I share your concern. At some point, I'd very much like to refactor that module into smaller methods so we can do meaningful unit tests, but for now, it's best to just keep it as similar to the Java implementation as possible. Would there be any way of testing this on an integration level? Like, disable peers support on the AS cluster?

I don't think we can disable peers support on the server. But we could make it a client side setting - for testing only. Just pull out this one line in Cluster#refresh_nodes into a new method and override the value based on some new setting in ClientPolicy:

peers.use_peers = nodes.all? { |node| node.supports_feature?('peers') }

@jhecking
Copy link
Contributor

Pull out this one line in Cluster#refresh_nodes into a new method and override the value based on some new setting in ClientPolicy.

Or maybe better yet, stub that method to always return false for the tests and forget about adding a setting to ClientPolicy which should never be used anyway.

@wallin
Copy link
Contributor Author

wallin commented Mar 24, 2018

Or maybe better yet, stub that method to always return false for the tests and forget about adding a setting to ClientPolicy which should never be used anyway.

Great idea. I did that and it seems to work, so at least now the code path is exercised

Let me know how you think we should proceed with merging

@jhecking
Copy link
Contributor

Great idea. I did that and it seems to work, so at least now the code path is exercised

Yay! Finally codecov is happy as well. :-)

Let me know how you think we should proceed with merging

I've taken all your changes, rebased them on master and then grouped them into just 3 commits for cleanup, peers protocol support and TLS support: https://github.com/aerospike/aerospike-client-ruby/compare/tls-support. You can run a diff against your own branch and the changes should be minimal. (I decided to remove support for Ruby 2.2.)

Let me know what you think. If it looks good to you, then you can hard reset your branch to the HEAD of my own branch and force push back to GitHub. And unless you think there is anything else left to do, I would then merge your PR back to master.

wallin and others added 4 commits March 26, 2018 17:18
* Move cluster/node/node_validator
* Add helper methods to Node to access reference_count, responded, active state
* Fix URL in license boilerplace
* Remove space at end of line
* NodeValidator: Improve resolution of aliases
* Specs: Always require spec_helper
* Codecov: Ignore specs
* Improve debug logs
* Misc. minor stylistic changes
* Requires Aerospike Server v3.10 or later. Replaces use of 'services'
  info command used on older servers for cluster discovery.
* Refactor tend logic: Remove from Cluster/Node models and break up into
  several smaller service modules.
* Add specs covering both old and new protocol.
* Rename and split up Cluster::Connection into Socket::TCP and Socket::SSL
* Add new ssl-options to ClientPolicy to setup TLS connection
* Refactor create-connection logic into separate service module
* Extend peers protocol support to use peers-tls-std when using TLS
* Refactor parsing logic for hosts lists to support optional tls-name
* Add new InvalidCredentials error class
* Remove Ruby 2.2 from CI test matrix
* Add external OpenSSL gem dependency
@wallin
Copy link
Contributor Author

wallin commented Mar 27, 2018

Let me know what you think. If it looks good to you, then you can hard reset your branch to the HEAD of my own branch and force push back to GitHub. And unless you think there is anything else left to do, I would then merge your PR back to master.

Great work. Just rebased and everything looks nice and tidy. I'm not missing anything afaik. Thanks for being so responsive and thorough in your comments, it's been a pleasure implementing this together with you.

@jhecking jhecking merged commit 41718d9 into aerospike:master Mar 27, 2018
@jhecking
Copy link
Contributor

Merged!

Thanks for being so responsive and thorough in your comments, it's been a pleasure implementing this together with you.

Likewise. Looking forward to your future contributions!

@jhecking
Copy link
Contributor

Tagged v2.6.0 and released it to RubyGems.org.

@wallin wallin deleted the tls-support branch April 3, 2018 02:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants