
Networking #112

Merged
merged 14 commits into new-arch from describe-networking
Oct 31, 2024

nieznanysprawiciel
Contributor

@nieznanysprawiciel nieznanysprawiciel commented Oct 15, 2024

This is the first part about networking, describing the generalized Net module.
In the next PRs I will explain the implementation details of hybrid net and central net.


This change is Reviewable

@nieznanysprawiciel nieznanysprawiciel marked this pull request as ready for review October 16, 2024 15:12
Collaborator

@dopiera dopiera left a comment

Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 434 at r1 (raw file):
I think you skipped over one abstraction layer which to me seems essential to gain understanding of the network.

From what you write, I gather that Net's interface is defined by a bunch of GSB prefixes. Before we dive into this, however, I'd like to learn the following:

  • what kind of network-related operations are supported, e.g.
    • create a stream to node XYZ
    • create a datagram "connection" to node XYZ
    • discovery? is it Net whose responsibility it is to figure out the existence of other nodes? or is it some external component?
    • topics? is it Net's responsibility to manage their lifecycle and subscriptions?
    • broadcast to a topic (what are the guarantees here? I mean are there retries, streams, or is it fire and forget?)
  • what do addresses look like (it seems that it's a pair consisting of "default identity" and the specific "identity" within the node identified by the "default identity", right?)
  • what is transferred over the channels? are these byte streams or are these streams of messages?
  • what are the expectations on the channels? I presume that multiple channels between a pair of nodes should work more-or-less concurrently, i.e. Net interleaves buffers

If there are any mainstream technologies you can relate to in order to build intuition, they would be welcome.

I'm perfectly aware that I may sound obnoxious, but I genuinely don't know the answers to these questions and I believe I (as a test-reader) should.

Here is my dumb (and maybe incorrect) attempt on the first sentence of such an explanation:

In Golem Network nodes are identified by custom, Golem-specific identifiers. In order for those nodes to communicate with each other, they need to know how to establish direct communication. Hence in order for them to be able to communicate with each other the Net component needs to provide the following functionalities:

  • discovery (the nodes need to learn of each others' existence)
  • direct communication (creating a simple TCP/IP connection is usually not an option because both parties might not have public IP addresses, i.e. reside behind a NAT)
  • etc.

Code quote:

### Networking

arch-snapshot/arch.md line 476 at r1 (raw file):

- Receiving and processing broadcasted messages

##### Address translation

Maybe it's just me, but I got confused and thought of NATs. Can you please give this section a more distinctive name? E.g. "GSB prefix mappings"?

Code quote:

##### Address translation

arch-snapshot/arch.md line 481 at r1 (raw file):

prefixed with `/net/{NodeId}` are reserved for the Net module, where it listens for incoming messages and forwards 
them to the Golem Network. Conversely, addresses starting with `/public/...` are available for yagna modules to expose 
public methods that can be called from other Nodes.   

nit: stray whitespace

Code quote:

···
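
For readers following this thread, the prefix convention in the quoted paragraph can be sketched as below. This is only an illustration, not yagna's implementation: the helper names are hypothetical, and the assumption that a call to `/net/{NodeId}/...` ends up under `/public/...` on the remote Node is an inference from the quoted text.

```python
# Only the "/net/{NodeId}" and "/public/..." prefixes come from the quoted text.
# The helper names and the exact mapping are assumptions made for illustration.
NET_PREFIX = "/net/"
PUBLIC_PREFIX = "/public/"


def split_net_address(addr: str) -> tuple[str, str]:
    """Split '/net/{NodeId}/rest' into (NodeId, 'rest')."""
    if not addr.startswith(NET_PREFIX):
        raise ValueError(f"not a Net address: {addr!r}")
    node_id, _, rest = addr[len(NET_PREFIX):].partition("/")
    if not node_id:
        raise ValueError(f"missing NodeId in {addr!r}")
    return node_id, rest


def to_remote_public(addr: str) -> tuple[str, str]:
    """Assumed mapping: the remote Node serves the call under /public/..."""
    node_id, rest = split_net_address(addr)
    return node_id, PUBLIC_PREFIX + rest
```

Under these assumptions, `to_remote_public("/net/0xabc/market/GetOffer")` yields the target Node `0xabc` and the remote address `/public/market/GetOffer`; the Node id and method name here are made up.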

arch-snapshot/arch.md line 543 at r1 (raw file):

be sent from a specific identity on one Node to a specific identity on a remote Node.

Another important aspect is that the Net module always checks if the target identity belongs to the local Node. If 

Please avoid this boilerplate.

Code quote:

Another important aspect is that the 

arch-snapshot/arch.md line 550 at r1 (raw file):

The Net module supports multiple channels for message transmission. The basic channel provides reliable message 
delivery via GSB, which is used for most control messages between Nodes.

Who creates the channels? Are these other components? Or is there just one such channel?

Code quote:

The Net module supports multiple channels for message transmission. The basic channel provides reliable message
delivery via GSB, which is used for most control messages between Nodes.

arch-snapshot/arch.md line 557 at r1 (raw file):

performance, as this would essentially embed TCP within TCP (or another reliable protocol implemented in Net). To 
address this, the Net module also allows for sending messages in an unreliable manner without packet delivery 
guarantee.   

How is this doable? Also through GSB?

Code quote:

performance, as this would essentially embed TCP within TCP (or another reliable protocol implemented in Net). To
address this, the Net module also allows for sending messages in an unreliable manner without packet delivery
guarantee.

arch-snapshot/arch.md line 561 at r1 (raw file):

The third option is the transfer channel. Mixing transfers with GSB control messages can cause delays, as large file 
transfers can quickly fill the sender’s buffer queue. To avoid this, it is recommended to use a separate channel 
specifically for transfers.

Am I correct that this is just a reliable byte stream? Can one create many of those?

Who performs the mixing of these channels and GSB control messages? Is it Net? If so, isn't it just a shortcoming of Net's implementation that it cannot handle this gracefully? My operating system mixes interactive and high-throughput connections pretty well.

Code quote:

The third option is the transfer channel. Mixing transfers with GSB control messages can cause delays, as large file
transfers can quickly fill the sender’s buffer queue. To avoid this, it is recommended to use a separate channel
specifically for transfers.

This was referenced Oct 17, 2024
Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Reviewable status: 0 of 1 files reviewed, 6 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 476 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Maybe it's just me, but I got confused and thought of NATs. Can you please give this section a more distinctive name? E.g. "GSB prefix mappings"?

Done.


arch-snapshot/arch.md line 543 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Please avoid this boilerplate.

Done.


arch-snapshot/arch.md line 557 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

How is this doable? Also through GSB?

I listed the prefixes.


arch-snapshot/arch.md line 561 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Am I correct that this is just a reliable byte stream? Can one create many of those?

Who performs the mixing of these channels and GSB control messages? Is it Net? If so, isn't it just a shortcoming of Net's implementation that it cannot handle this gracefully? My operating system mixes interactive and high-throughput connections pretty well.

I would say you are right: this is a poor workaround for Net's inability to handle this better.

Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Reviewable status: 0 of 1 files reviewed, 3 unresolved discussions (waiting on @dopiera)

a discussion (no related file):
There are comments from the hybrid net PR that require changes here. I'm wondering what the right approach would be. We could either add the new changes there or keep working on this PR...



arch-snapshot/arch.md line 434 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think you skipped over one abstraction layer which to me seems essential to gain understanding of the network.

From what you write, I gather that Net's interface is defined by a bunch of GSB prefixes. Before we dive into this, however, I'd like to learn the following:

  • what kind of network-related operations are supported, e.g.
    • create a stream to node XYZ
    • create a datagram "connection" to node XYZ
    • discovery? is it Net whose responsibility it is to figure out the existence of other nodes? or is it some external component?
    • topics? is it Net's responsibility to manage their lifecycle and subscriptions?
    • broadcast to a topic (what are the guarantees here? I mean are there retries, streams, or is it fire and forget?)
  • what do addresses look like (it seems that it's a pair consisting of "default identity" and the specific "identity" within the node identified by the "default identity", right?)
  • what is transferred over the channels? are these byte streams or are these streams of messages?
  • what are the expectations on the channels? I presume that multiple channels between a pair of nodes should work more-or-less concurrently, i.e. Net interleaves buffers

If there are any mainstream technologies you can relate to in order to build intuition, they would be welcome.

I'm perfectly aware that I may sound obnoxious, but I genuinely don't know the answers to these questions and I believe I (as a test-reader) should.

Here is my dumb (and maybe incorrect) attempt on the first sentence of such an explanation:

In Golem Network nodes are identified by custom, Golem-specific identifiers. In order for those nodes to communicate with each other, they need to know how to establish direct communication. Hence in order for them to be able to communicate with each other the Net component needs to provide the following functionalities:

  • discovery (the nodes need to learn of each others' existence)
  • direct communication (creating a simple TCP/IP connection is usually not an option because both parties might not have public IP addresses, i.e. reside behind a NAT)
  • etc.

I added an introduction that describes the functionality provided by Net, as well as the requirements on what must be implemented internally to support it.

I didn't address these points:

  • Addresses and identities: from the interface perspective they are described in Handling identities. How the addressing works internally is an implementation detail.
  • Channels: I'm starting to think that calling them channels suggests there is more to them than there really is. I think I will rename them, maybe to transport types...
    The thing is that you can do nothing more than choose which type of transport to use. Period. The rest is implementation detail: lifetime and creation, how messages are routed, how many channels there are per identity, and so on. Generally you shouldn't make any assumptions, because channels don't give you any guarantees.

arch-snapshot/arch.md line 550 at r1 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Who creates the channels? Are these other components? Or is there just one such channel?

I renamed channels to transport types so that the name doesn't suggest more than there is.

Collaborator

@dopiera dopiera left a comment

Reviewable status: 0 of 1 files reviewed, 7 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 438 at r3 (raw file):
I think this is misleading - it seems to suggest that Net uses GSB for communication.

How about:

Developers interact with the network layer via
GSB (Golem Service Bus), allowing remote calls between Nodes to feel as seamless as local service calls.

Code quote:

Network, abstracting the complexity of underlying network operations. Communication is achieved through the
[GSB (Golem Service Bus)](#gsb), allowing remote calls between Nodes to feel as seamless as local service calls.

arch-snapshot/arch.md line 447 at r3 (raw file):
I thought I finally understood how it works, but this sentence makes me question myself.

From what I gather now:

  • the broadcast operation is not implemented in Net - it is implemented by the marketplace; this is because only the marketplace is aware of all the offers it holds and can determine what exactly will be sent to the neighbors in reaction to a broadcast message the Node receives
  • Net implements a topology and allows sending messages to the neighborhood, but not the whole broadcast protocol

Another way of looking at it: in order for an offer to be forwarded by a Node, the marketplace needs to be involved.

Hence, I think we have a leaky abstraction after all.

How about changing this point to:

  • introducing a network topology, i.e.
    • the concept of neighbors - a subset of nodes on the network which the topology considers closest
    • the ability to list those neighbors
    • the ability to send messages to the nearest neighborhood; we call those "broadcast messages" and they are used by upper layers for broadcasting information across the network; these broadcast messages are sent for opaque "topics" for convenience
    • registering handlers for incoming broadcast messages based on specified topics

Code quote:

- Sending broadcast messages on specific topics across the network (The Network module provides functionality to send
messages to a subset of Nodes. It is the responsibility of other modules to implement algorithms that ensure
network-wide message reach if required)
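
The split proposed in this comment (Net delivers to the neighborhood and dispatches by topic, while an upper layer decides whether to forward) can be made concrete with a toy model. This is a sketch under stated assumptions, not the Net module's API: `ToyNet`, `subscribe`, `broadcast` and the topic name `offers` are all hypothetical.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[str, bytes], None]  # called with (sender_id, payload)


class ToyNet:
    """Hypothetical stand-in for Net: knows its neighbors, nothing about Offers."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.neighbors: list["ToyNet"] = []
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler for incoming broadcast messages on a topic."""
        self._handlers[topic].append(handler)

    def broadcast(self, topic: str, payload: bytes) -> None:
        """Deliver to the neighborhood only; this toy does no flooding itself."""
        for neighbor in self.neighbors:
            neighbor._deliver(self.node_id, topic, payload)

    def _deliver(self, sender_id: str, topic: str, payload: bytes) -> None:
        for handler in self._handlers.get(topic, ()):
            handler(sender_id, payload)


# Chain topology a - b - c: "c" is not a neighbor of "a".
a, b, c = ToyNet("a"), ToyNet("b"), ToyNet("c")
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]

seen_by_b: list[bytes] = []
seen_by_c: list[bytes] = []
b.subscribe("offers", lambda sender, payload: seen_by_b.append(payload))
c.subscribe("offers", lambda sender, payload: seen_by_c.append(payload))

a.broadcast("offers", b"offer-1")  # reaches b only

# An upper layer on b (the marketplace, in this discussion) chooses to forward.
b.subscribe("offers", lambda sender, payload: b.broadcast("offers", payload))
a.broadcast("offers", b"offer-2")  # reaches b, and c via b's forwarding
```

In this model `offer-1` never reaches `c`, because nothing above the network layer on `b` forwarded it; `offer-2` does. A flat topology, where every node lists every other node as a neighbor, would make the first broadcast reach everyone without any forwarding.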

arch-snapshot/arch.md line 525 at r3 (raw file):

To send a broadcast message, a module must send a GSB message to the Net module on the designated topic. The Net module 
then forwards this message to the network. Depending on the network's implementation, the message may be routed 
either to neighboring Nodes or to all Nodes across the network.

Can we just say that CentralNet is a flat topology where every node is a neighbor to every other node?

Code quote:

all Nodes across the network.

arch-snapshot/arch.md line 572 at r3 (raw file):

guarantee.

The third option is the transfer transport typ. Mixing transfers with GSB control messages can cause delays, as large 

nit: type

Code quote:

typ

arch-snapshot/arch.md line 579 at r3 (raw file):

- `/net/{RemoteId}`
- `/udp/net/{RemoteId}`
- `/transfer/net/{RemoteId}`

So only one concurrent transport is possible?

Code quote:

- `/transfer/net/{RemoteId}`
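
As a reading aid for the three prefixes quoted above, the choice of transport type amounts to choosing an address prefix. The sketch below is hypothetical: the `Transport` enum and `remote_address` helper are not part of yagna, and pairing `/udp/net` with the unreliable transport is an inference from the thread rather than something the quoted text states.

```python
from enum import Enum


class Transport(Enum):
    """Transport types keyed by the GSB prefixes quoted from arch.md."""
    RELIABLE = "/net"            # default reliable delivery
    UNRELIABLE = "/udp/net"      # assumed: no delivery guarantee
    TRANSFER = "/transfer/net"   # reliable, kept apart from control messages


def remote_address(transport: Transport, remote_id: str, path: str = "") -> str:
    """Build the address a caller would use for a given transport type."""
    base = f"{transport.value}/{remote_id}"
    return f"{base}/{path.lstrip('/')}" if path else base
```

The caller picks only the prefix; connection lifetime, routing and the number of underlying connections stay inside the network layer, which matches the point made earlier in the thread that transport types give no further guarantees.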

Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Reviewable status: 0 of 1 files reviewed, 6 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 438 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think this is misleading - it seems to suggest that Net uses GSB for communication.

How about:

Developers interact with the network layer via
GSB (Golem Service Bus), allowing remote calls between Nodes to feel as seamless as local service calls.

Done.


arch-snapshot/arch.md line 447 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I thought I finally understood how it works, but this sentence makes me question myself.

From what I gather now:

  • the broadcast operation is not implemented in Net - it is implemented by the marketplace; this is because only the marketplace is aware of all the offers it holds and can determine what exactly will be sent to the neighbors in reaction to a broadcast message the Node receives
  • Net implements a topology and allows sending messages to the neighborhood, but not the whole broadcast protocol

Another way of looking at it: in order for an offer to be forwarded by a Node, the marketplace needs to be involved.

Hence, I think we have a leaky abstraction after all.

How about changing this point to:

  • introducing a network topology, i.e.
    • the concept of neighbors - a subset of nodes on the network which the topology considers closest
    • the ability to list those neighbors
    • the ability to send messages to the nearest neighborhood; we call those "broadcast messages" and they are used by upper layers for broadcasting information across the network; these broadcast messages are sent for opaque "topics" for convenience
    • registering handlers for incoming broadcast messages based on specified topics

There is no ability to list neighbors. I think this is the cause of the misunderstanding.
How is it leaky?
The basic operations are implemented at the Net level, but the decision whether to rebroadcast Offers after receiving them is the market's responsibility. Otherwise Net would need to be aware of Offers, and that would be the abstraction leak.

I changed it to your proposal, but removed listing neighbors.

I don't get the opaque "topics". What should opaque mean in this context?


arch-snapshot/arch.md line 525 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

Can we just say that CentralNet is a flat topology where every node is a neighbor to every other node?

Yes, I wanted to use this interpretation when talking about Offers propagation, to avoid making a distinction.
Would you change this sentence?


arch-snapshot/arch.md line 579 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

So only one concurrent transport is possible?

You can do concurrent transfers, but you don't get certain guarantees.
Let's look at 2 gftp transfers as an example. GFTP sends files in chunks. Let's assume that the first transfer starts earlier.
GFTP is able to push chunks until the sender buffer fills up. From that moment, pushing chunks slows down.
Now the second transfer starts. It will send its chunks, but the sender buffer stays full until all the data has been sent through the network, so it starts out under back-pressure, which slows it down.
The chunks of those 2 gftp transfers will be interleaved until one of them finishes.

But the problem is that our sender buffers are pretty large, 1MB I think. That means that in some cases you need to wait pretty long until all the data is sent. If we mixed other control messages with transfers, the control messages would probably time out very often. Thanks to those transport types, control messages are kept separate.

This whole mechanism exists only to protect control messages. It is a poor implementation and more could be desired, but it does the job.
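
The effect described above can be illustrated with a small back-of-the-envelope model. It is a sketch, not a measurement: the 1 MB buffer is the figure recalled in this comment, while the link speed, chunk size and control message size are assumptions picked for the example. The model also ignores that both queues still share one physical link.

```python
def finish_times(queue: list[tuple[str, int]], bytes_per_sec: float) -> dict[str, float]:
    """Time at which each message has fully left a FIFO sender buffer."""
    clock = 0.0
    done: dict[str, float] = {}
    for label, size in queue:
        clock += size / bytes_per_sec
        done[label] = clock
    return done


BUFFER_BYTES = 1_000_000   # sender buffer size recalled in the thread ("1MB I think")
LINK_BYTES_PER_SEC = 125_000  # assumed 1 Mbit/s uplink
CHUNK_BYTES = 40_000       # assumed transfer chunk size
CONTROL_BYTES = 200        # assumed control message size

# A transfer has already filled the sender buffer with chunks.
backlog = [(f"chunk-{i}", CHUNK_BYTES) for i in range(BUFFER_BYTES // CHUNK_BYTES)]

# One shared queue: the control message waits behind the whole backlog.
shared = finish_times(backlog + [("control", CONTROL_BYTES)], LINK_BYTES_PER_SEC)

# Separate transport type: the control message has a queue of its own.
separate = finish_times([("control", CONTROL_BYTES)], LINK_BYTES_PER_SEC)

print(f"shared queue:   control sent after {shared['control']:.4f} s")
print(f"separate queue: control sent after {separate['control']:.4f} s")
```

With these assumed numbers the control message leaves after roughly 8 seconds in the shared queue (1,000,200 bytes at 125,000 bytes per second) and after a couple of milliseconds when it has its own queue, which is the timeout risk the separate transport type is meant to avoid.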

Collaborator

@dopiera dopiera left a comment

Reviewable status: 0 of 1 files reviewed, 6 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 525 at r3 (raw file):

Previously, nieznanysprawiciel wrote…

Yes, I wanted to use this interpretation when talking about Offers propagation, to avoid making a distinction.
Would you change this sentence?

I think it's good that way.


arch-snapshot/arch.md line 579 at r3 (raw file):

Previously, nieznanysprawiciel wrote…

You can do concurrent transfers, but you don't get certain guarantees.
Let's look at 2 gftp transfers as an example. GFTP sends files in chunks. Let's assume that the first transfer starts earlier.
GFTP is able to push chunks until the sender buffer fills up. From that moment, pushing chunks slows down.
Now the second transfer starts. It will send its chunks, but the sender buffer stays full until all the data has been sent through the network, so it starts out under back-pressure, which slows it down.
The chunks of those 2 gftp transfers will be interleaved until one of them finishes.

But the problem is that our sender buffers are pretty large, 1MB I think. That means that in some cases you need to wait pretty long until all the data is sent. If we mixed other control messages with transfers, the control messages would probably time out very often. Thanks to those transport types, control messages are kept separate.

This whole mechanism exists only to protect control messages. It is a poor implementation and more could be desired, but it does the job.

So in short, the answer is yes but there is no conscious scheduling. I was under the impression that one uses this channel to stream raw bytes, not messages.

If that is true - how does one distinguish between byte streams?

If that is false - how is the stream transformed into messages or reconstructed from messages? Is this the user's responsibility?

Collaborator

@dopiera dopiera left a comment

Reviewable status: 0 of 1 files reviewed, 4 unresolved discussions (waiting on @nieznanysprawiciel)


arch-snapshot/arch.md line 447 at r3 (raw file):

Previously, nieznanysprawiciel wrote…

There is no ability to list neighbors. I think this is the cause of the misunderstanding.
How is it leaky?
The basic operations are implemented at the Net level, but the decision whether to rebroadcast Offers after receiving them is the market's responsibility. Otherwise Net would need to be aware of Offers, and that would be the abstraction leak.

I changed it to your proposal, but removed listing neighbors.

I don't get the opaque "topics". What should opaque mean in this context?

I think you're right - let's just remove the word and merge this PR.


arch-snapshot/arch.md line 579 at r3 (raw file):
OK - ignore the previous comment. I think if you added a sentence along these lines to the previous paragraph, we can merge it

Functionally, this channel is equivalent to /net/{RemoteId}

Also maybe instead of the word transfer, please use "high-bandwidth workloads" - I think it will be slightly less confusing. "Transfers" suggested to me that it is something different.

Consider this phrasing:

The third option is the transfer transport type. Functionally, this channel is equivalent to the first channel type. The only reason for its existence is prioritization of control messages. It is desirable that control messages be sent right away, while high-bandwidth workloads (e.g. transferring an image) can be delayed. By splitting the channels we avoid a situation where a control message waits to be sent behind a huge backlog of non-latency-sensitive messages. The network layer may implement this transport type by having a second TCP connection to ensure that.

Contributor Author

@nieznanysprawiciel nieznanysprawiciel left a comment

Reviewable status: 0 of 1 files reviewed, 4 unresolved discussions (waiting on @dopiera)


arch-snapshot/arch.md line 447 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

I think you're right - let's just remove the word and merge this PR.

Done.


arch-snapshot/arch.md line 579 at r3 (raw file):

Previously, dopiera (Marek Dopiera) wrote…

OK - ignore the previous comment. I think if you added a sentence along these lines to the previous paragraph, we can merge it

Functionally, this channel is equivalent to /net/{RemoteId}

Also maybe instead of the word transfer, please use "high-bandwidth workloads" - I think it will be slightly less confusing. "Transfers" suggested to me that it is something different.

Consider this phrasing:

The third option is the transfer transport type. Functionally, this channel is equivalent to the first channel type. The only reason for its existence is prioritization of control messages. It is desirable that control messages be sent right away, while high-bandwidth workloads (e.g. transferring an image) can be delayed. By splitting the channels we avoid a situation where a control message waits to be sent behind a huge backlog of non-latency-sensitive messages. The network layer may implement this transport type by having a second TCP connection to ensure that.

Done.

Collaborator

@dopiera dopiera left a comment

Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion

@nieznanysprawiciel nieznanysprawiciel merged commit 7dc6874 into new-arch Oct 31, 2024
@nieznanysprawiciel nieznanysprawiciel deleted the describe-networking branch October 31, 2024 17:01