[python] Add a transaction management layer #7987

Merged
merged 11 commits into from
Jun 8, 2023

Conversation

davidiw
Contributor

@davidiw davidiw commented Apr 30, 2023

The bulk of this work is in adding a transaction management layer with some other fixes along the way.

This provides a framework for managing as many transactions as possible from a
single account at once.

  • The AccountSequenceNumber allocates up to 100 outstanding sequence numbers to maximize the number of concurrent transactions in the happy path.
  • The transaction manager provides async workers that push a transaction from submission through to validating completion.

Together they provide the basic harness for scaling transaction submission on the Aptos blockchain from a single account.

This should be reasonably easy to port to other languages and provides a starting point for helping others build their own transaction management planes.

@jjleng
Contributor

jjleng commented May 2, 2023

if multiple SDK instances submit txns from the same account, we will see repeated 30s timeouts and seq number resyncs. The sequence number syncs do not always return the true latest sequence number due to full node sync delay, which would make things worse.

Wonder if we should start building the API with a cached state for accounts and then all clients don't need to worry about how to hack around the mempool capability as this PR did. cc @bowenyang007. Since the current mempool fails to provide a consistent view of sequence numbers and often swallows errors, adding a global cache state will simplify programming model a lot.

@davidiw
Contributor Author

davidiw commented May 3, 2023

> if multiple SDK instances submit txns from the same account, we will see repeated 30s timeouts and seq number resyncs. The sequence number syncs do not always return the true latest sequence number due to full node sync delay, which would make things worse.

No one should do that. The ASN (AccountSequenceNumber) code says it is only co-task safe, and the doc accompanying it says the same thing. I will add explicit language saying this is not intended to be shared across clients.

> Wonder if we should start building the API with a cached state for accounts and then all clients don't need to worry about how to hack around the mempool capability as this PR did. cc @bowenyang007. Since the current mempool fails to provide a consistent view of sequence numbers and often swallows errors, adding a global cache state will simplify programming model a lot.

I think there's an opportunity here to generalize this into Rust, but ultimately, if we can build one in a few hundred lines and put it into our popular languages, it is pretty useful for lightweight efforts. On the other hand, if we can get wasm out the door, that might be another way to write once and deploy everywhere.

@davidiw davidiw requested a review from igor-aptos May 3, 2023 21:08
Contributor

@gregnazario gregnazario left a comment

Most of my comments are just around lack of comments. Otherwise the logic is good

ecosystem/python/sdk/aptos_sdk/account_sequence_number.py (outdated)
Comment on lines 61 to 82
if self.last_uncommitted_number is None or self.current_number is None:
    await self.initialize()
if (
    self.current_number - self.last_uncommitted_number
    >= self.maximum_in_flight
):
    await self.__update()
    start_time = time.time()
    while (
        self.current_number - self.last_uncommitted_number
        >= self.maximum_in_flight
    ):
        if not block:
            return None
        await asyncio.sleep(self.sleep_time)
        if time.time() - start_time > self.maximum_wait_time:
            logging.warn(
                f"Waited over 30 seconds for a transaction to commit, resyncing {self.account.address().hex()}"
            )
            await self.initialize()
        else:
            await self.__update()
Contributor

Personally, I prefer comments to tell what we're checking for each of these.

It looks like:

  1. Check if the queue is initialized
  2. Check whether the number of in-flight transactions has reached the maximum
  3. Sleep if it's blocking on transactions, otherwise continue to send transactions (need to define block in docs)
  4. Poll and block to see if the mempool has cleared any transactions
  5. If it's clear (or block is false), submit the transaction, and let the next one grab the lock.
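The five steps above can be sketched with one comment per step. This is only a minimal illustration, not the SDK implementation: `SequenceNumberAllocator` and the `fetch_committed` callback are hypothetical stand-ins for `AccountSequenceNumber` and its REST lookup, and the defaults mirror the values quoted in this review (100 in flight, 30 second wait, 0.01 second sleep).

```python
import asyncio
import time
from typing import Awaitable, Callable, Optional


class SequenceNumberAllocator:
    """Hypothetical stand-in for AccountSequenceNumber (not the SDK class)."""

    def __init__(
        self,
        fetch_committed: Callable[[], Awaitable[int]],  # assumed on-chain lookup
        maximum_in_flight: int = 100,
        maximum_wait_time: float = 30.0,
        sleep_time: float = 0.01,
    ) -> None:
        self._fetch_committed = fetch_committed
        self._maximum_in_flight = maximum_in_flight
        self._maximum_wait_time = maximum_wait_time
        self._sleep_time = sleep_time
        self._lock = asyncio.Lock()
        self._current: Optional[int] = None
        self._last_uncommitted: Optional[int] = None

    def _window_full(self) -> bool:
        return self._current - self._last_uncommitted >= self._maximum_in_flight

    async def _initialize(self) -> None:
        # Full resync: forget local state and trust the chain.
        self._current = await self._fetch_committed()
        self._last_uncommitted = self._current

    async def _update(self) -> None:
        # Refresh only the committed side of the window.
        self._last_uncommitted = await self._fetch_committed()

    async def next_sequence_number(self, block: bool = True) -> Optional[int]:
        async with self._lock:
            # 1. Initialize lazily on first use.
            if self._current is None or self._last_uncommitted is None:
                await self._initialize()
            # 2. If the in-flight window looks full, refresh before deciding.
            if self._window_full():
                await self._update()
                start = time.monotonic()
                while self._window_full():
                    # 3. Non-blocking callers give up here instead of sleeping.
                    if not block:
                        return None
                    # 4. Poll until the chain has committed something.
                    await asyncio.sleep(self._sleep_time)
                    if time.monotonic() - start > self._maximum_wait_time:
                        await self._initialize()
                    else:
                        await self._update()
            # 5. Hand out the next number; the lock is released on exit.
            number = self._current
            self._current += 1
            return number
```

With `block=False` a caller gets `None` back when the window is full, which is one way to define "block" for the docs.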

Contributor Author

adding a comment.

my concern is that this is duplicating the comments in the doc block and in the document.

accurate summary though

Comment on lines 99 to 139
"""
if self.last_uncommitted_number == self.current_number:
    return
Contributor

This is a shortcut, but you may want to provide an input flag to sync regardless (for example, if something else is also handling transactions for the account).

Contributor Author

good point, just removing altogether

ecosystem/python/sdk/aptos_sdk/account_sequence_number.py (outdated)
start_time = time.time()
while self.last_uncommitted_number != self.current_number:
    print(f"{self.last_uncommitted_number} {self.current_number}")
    if time.time() - start_time > self.maximum_wait_time:
Contributor

You make this condition a lot, maybe move it into a function?
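One way to act on this, assuming the condition meant is the elapsed-time check, is a small helper object. `Deadline` is a hypothetical name for illustration and does not exist in the SDK; the sketch uses `time.monotonic()` so that wall-clock adjustments cannot affect the wait.

```python
import time


class Deadline:
    """Hypothetical helper that tracks whether a wait exceeded its budget."""

    def __init__(self, maximum_wait_time: float) -> None:
        self._maximum_wait_time = maximum_wait_time
        self._start_time = time.monotonic()

    def expired(self) -> bool:
        # Same comparison as the inline check, kept in one place.
        return time.monotonic() - self._start_time > self._maximum_wait_time
```

Each loop would then create `deadline = Deadline(self.maximum_wait_time)` before polling and test `deadline.expired()` inside it.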

Comment on lines 107 to 108
while self.last_uncommitted_number != self.current_number:
    print(f"{self.last_uncommitted_number} {self.current_number}")
Contributor

This is roughly the same logic as the other loop, probably want to combine

Contributor Author

Not similar enough that I want to refactor. I think I poked at this previously, and the logic differed slightly in how you enter the loop and the order in which you test.

ecosystem/python/sdk/examples/transaction-batching.py (outdated)
@gregnazario gregnazario enabled auto-merge (squash) May 3, 2023 21:53
ecosystem/python/sdk/aptos_sdk/account_sequence_number.py (outdated)
    self.last_uncommitted_number = None
    self.current_number = None

async def next_sequence_number(self, block: bool = True) -> Optional[int]:
Contributor

maybe call this acquire_next_sequence_number to convey the notion that this might block for a long time.

Contributor Author

I don't like long function names 😬

Comment on lines +66 to +74
sequence_number = (
    await self._account_sequence_number.next_sequence_number()
)
transaction = await self._transaction_generator(
    self._account, sequence_number
)
txn_hash_awaitable = self._rest_client.submit_bcs_transaction(
    transaction
)
Contributor

Just a thought (not sure you want to overcomplicate this): one potential fault-tolerance improvement is to have AccountSequenceNumber keep two counters, issued and submitted. next_sequence_number increments issued. After submit_bcs_transaction is called, if the transaction was potentially submitted (we received a success response, or received no response), we notify AccountSequenceNumber to increment the submitted counter.
If we received a failure response for the submission, we notify AccountSequenceNumber to decrement the issued counter.

AccountSequenceNumber is then changed so that issued is never more than 1 above submitted.
That way, a single submission failure doesn't lock up the whole stream until expiry.

Contributor Author

the issue here is that if this isn't followed carefully, you can end up in a race condition...

I get 3 sequence numbers, submit 3, and only 2 fail, which 2 failed? what am I decrementing. So you end up adding complexity to figure out do I synchronize or set -1 and try again.
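For reference, the two-counter idea from this thread might look roughly like the sketch below. It is not what the PR implements, and every name in it (`TwoCounterSequence`, `submission_possibly_succeeded`, `submission_failed`) is hypothetical. It sidesteps the ambiguity raised above only because it never has more than one unresolved number issued at a time, which also means issuance is serialized behind each submission response.

```python
import asyncio


class TwoCounterSequence:
    """Hypothetical illustration of the issued/submitted proposal."""

    def __init__(self, start: int) -> None:
        self._issued = start
        self._submitted = start
        self._changed = asyncio.Condition()

    async def next_sequence_number(self) -> int:
        async with self._changed:
            # Keep `issued` at most 1 above `submitted`: wait until the
            # previously issued number has been resolved either way.
            await self._changed.wait_for(lambda: self._issued == self._submitted)
            number = self._issued
            self._issued += 1
            return number

    async def submission_possibly_succeeded(self) -> None:
        # Success response, or no response at all: assume the number is used.
        async with self._changed:
            self._submitted += 1
            self._changed.notify_all()

    async def submission_failed(self) -> None:
        # Definite rejection: the number can be handed out again.
        async with self._changed:
            self._issued -= 1
            self._changed.notify_all()
```

Allowing a wider gap between the counters is where the "which two failed?" question comes back, since a decrement no longer identifies a unique number.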

while True:
    # Always start waiting for one
    (
        txn_awaitable,
Contributor

haven't used SDK before, is this waiting for txn hash (polling the rest endpoint for txn hash) or what is it doing?

looking at ecosystem/python/sdk/aptos_sdk/async_client.py , I see that signature is
def submit_bcs_transaction(self, signed_transaction: SignedTransaction) -> str:
so return type is str, not something you can wait on?

Contributor Author

str is the txn hash, we would benefit by typing that.
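A short sketch of both points, assuming the client method is a coroutine function (`async def`), which the signature quoted above does not show. `FakeRestClient` and the `TransactionHash` alias are hypothetical; the fake performs no network I/O and derives a made-up hash from its input.

```python
import asyncio
from typing import NewType

# Hypothetical alias so signatures say "transaction hash" instead of bare str.
TransactionHash = NewType("TransactionHash", str)


class FakeRestClient:
    """Hypothetical stand-in for the SDK REST client."""

    async def submit_bcs_transaction(self, signed_transaction: bytes) -> TransactionHash:
        await asyncio.sleep(0)  # placeholder for the HTTP round trip
        return TransactionHash("0x" + signed_transaction.hex())


async def demo() -> TransactionHash:
    client = FakeRestClient()
    # Calling an `async def` method returns a coroutine object right away;
    # the annotated return type only describes what awaiting it yields.
    txn_hash_awaitable = client.submit_bcs_transaction(b"\x01\x02")
    return await txn_hash_awaitable
```

So the annotation `-> str` and "something you can wait on" are both true: the call yields an awaitable, and the awaited value is the hash string.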

ecosystem/python/sdk/aptos_sdk/account_sequence_number.py (outdated)
Comment on lines 39 to 41
maximum_in_flight: int = 100
maximum_wait_time = 30
sleep_time = 0.01
Contributor

How come type hints in some cases but not others?

Contributor Author

brain problems of the author.

Returns the next sequence number available on this account. This leverages a lock to
guarantee first-in, first-out ordering of requests.
"""
await self.lock.acquire()
Contributor

Nit: I'd recommend using this as a context manager to make it impossible to forget to release.
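For comparison, here are the two styles side by side on a toy counter. `Counter` is a hypothetical class for illustration only; `asyncio.Lock` supports `async with`, which releases the lock on every exit path, including early returns and exceptions.

```python
import asyncio


class Counter:
    """Hypothetical example class; not part of the SDK."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._value = 0

    async def next_manual(self) -> int:
        # Manual style: correct only as long as every path reaches release().
        await self._lock.acquire()
        try:
            value = self._value
            self._value += 1
            return value
        finally:
            self._lock.release()

    async def next_with_context(self) -> int:
        # Context-manager style: the release cannot be forgotten.
        async with self._lock:
            value = self._value
            self._value += 1
            return value
```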

Contributor Author

hmm... I guess since we're not really catching anything....

while self.last_uncommitted_number != self.current_number:
    print(f"{self.last_uncommitted_number} {self.current_number}")
    if time.time() - start_time > self.maximum_wait_time:
        logging.warn(
Contributor

Could be good to let the user toggle this on or off.

Contributor Author

eventually.... logging is still very early in this sdk

ecosystem/python/sdk/aptos_sdk/account_sequence_number.py (outdated)
ecosystem/python/sdk/aptos_sdk/transaction_worker.py (outdated)
def stop(self):
    """Stop the tasks for managing transactions"""
    if not self._started:
        raise Exception("Start not yet called")
Contributor

Nit: RuntimeError or something instead?
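A minimal sketch of what that could look like. `Worker` here is a hypothetical skeleton showing only the state checks, not the PR's class; `RuntimeError` is the built-in exception for an operation attempted in the wrong state, and callers can catch it without also catching every other `Exception`.

```python
class Worker:
    """Hypothetical skeleton showing only the start/stop state checks."""

    def __init__(self) -> None:
        self._started = False

    def start(self) -> None:
        if self._started:
            raise RuntimeError("start() already called")
        self._started = True

    def stop(self) -> None:
        if not self._started:
            raise RuntimeError("start() not yet called")
        self._started = False
```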

Contributor Author

the python sdk needs more love

async def _process_transactions_task(self):
    try:
        while True:
            # Always start waiting for one
Contributor

I'm not sure I understand why we do this?

Contributor Author

lemme add a larger comment, but basically we want to wait for the first one and then we can get into the while loop... where we could have hit a herd of now pending txns....
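The pattern described here can be sketched on its own: block for the first queued item so the task sleeps while idle, then drain whatever else has piled up without blocking. `take_batch` is a hypothetical helper, and the queue item shape `(awaitable, sequence_number)` is assumed from the snippet under review.

```python
import asyncio
from typing import Awaitable, List, Tuple


async def take_batch(
    queue: "asyncio.Queue[Tuple[Awaitable[str], int]]",
) -> Tuple[List[Awaitable[str]], List[int]]:
    # Always start by waiting for one item; this is the only blocking call.
    awaitable, sequence_number = await queue.get()
    awaitables = [awaitable]
    sequence_numbers = [sequence_number]
    # Then pick up the herd of already-pending items without waiting.
    while not queue.empty():
        awaitable, sequence_number = queue.get_nowait()
        awaitables.append(awaitable)
        sequence_numbers.append(sequence_number)
    return awaitables, sequence_numbers
```

The `empty()` then `get_nowait()` pairing is safe here only under the assumption of a single consumer task per queue.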

Contributor

@0xmaayan 0xmaayan left a comment

Is HTTP/2 needed to use this layer?

account: AccountAddress
lock = asyncio.Lock

maximum_in_flight: int = 100
Contributor

100 is because of the mempool limitation?

Contributor Author

yup

Contributor Author

I need to add a config here in the future....

for (output, sequence_number) in zip(outputs, sequence_numbers):
    if isinstance(output, BaseException):
        await self._processed_transactions.put(
            (sequence_number, None, output)
Contributor

why not "put" the exception instead of None?

Contributor Author

we don't return the awkward type of (exception|string) instead we return (int, optional[string], optional[exception])
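A sketch of that result shape, assuming the outputs come from `asyncio.gather(..., return_exceptions=True)` as the `isinstance(output, BaseException)` check in the snippet suggests. `collect_results` and the `Result` alias are hypothetical names for illustration.

```python
import asyncio
from typing import Awaitable, List, Optional, Sequence, Tuple

# (sequence number, transaction hash or None, exception or None)
Result = Tuple[int, Optional[str], Optional[BaseException]]


async def collect_results(
    awaitables: Sequence[Awaitable[str]], sequence_numbers: Sequence[int]
) -> List[Result]:
    # return_exceptions=True returns failures as values instead of raising,
    # so one failed submission does not discard the rest of the batch.
    outputs = await asyncio.gather(*awaitables, return_exceptions=True)
    results: List[Result] = []
    for output, sequence_number in zip(outputs, sequence_numbers):
        if isinstance(output, BaseException):
            results.append((sequence_number, None, output))
        else:
            results.append((sequence_number, output, None))
    return results
```

Consumers can then test `error is None` rather than checking the type of a single hash-or-exception slot.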

Contributor

oh I read it wrong, output is the exception

) -> (int, typing.Optional[str], typing.Optional[Exception]):
    return await self._processed_transactions.get()

def stop(self):
Contributor

Are stop and start meant to be used anywhere other than in tests?

Contributor Author

yes, we use it in the actual transaction-batching demo

one need not call stop unless they want a clean shut down though...

Contributor

@0xmaayan 0xmaayan left a comment

beautiful

@banool banool disabled auto-merge May 18, 2023 20:01
Contributor

@banool banool left a comment

Amazing!!

One small thing to fix before landing so I disabled auto merge.


_maximum_in_flight: int = 100
_maximum_wait_time: int = 30
_sleep_time: int = 0.01
Contributor

Suggested change
- _sleep_time: int = 0.01
+ _sleep_time: float = 0.01

Contributor Author

omg

@davidiw davidiw enabled auto-merge (rebase) May 18, 2023 21:43
@sionescu sionescu requested a review from a team as a code owner May 22, 2023 20:12
@davidiw davidiw requested a review from 0xmigo as a code owner June 3, 2023 10:34
@davidiw davidiw requested a review from saharct as a code owner June 8, 2023 16:42
@github-actions
Contributor

github-actions bot commented Jun 8, 2023

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 853c642d3b08b313bb562499ecbbd019f1c7b511

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 853c642d3b08b313bb562499ecbbd019f1c7b511 (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : committed: 8855 txn/s, latency: 3721 ms, (p50: 3800 ms, p90: 5000 ms, p99: 6100 ms), latency samples: 301100
2. Upgrading first Validator to new version: 853c642d3b08b313bb562499ecbbd019f1c7b511
compatibility::simple-validator-upgrade::single-validator-upgrade : committed: 4703 txn/s, latency: 6776 ms, (p50: 7300 ms, p90: 8300 ms, p99: 8900 ms), latency samples: 178720
3. Upgrading rest of first batch to new version: 853c642d3b08b313bb562499ecbbd019f1c7b511
compatibility::simple-validator-upgrade::half-validator-upgrade : committed: 4444 txn/s, latency: 7200 ms, (p50: 7800 ms, p90: 8800 ms, p99: 9400 ms), latency samples: 168900
4. upgrading second batch to new version: 853c642d3b08b313bb562499ecbbd019f1c7b511
compatibility::simple-validator-upgrade::rest-validator-upgrade : committed: 6908 txn/s, latency: 4689 ms, (p50: 4800 ms, p90: 6300 ms, p99: 8400 ms), latency samples: 241780
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 853c642d3b08b313bb562499ecbbd019f1c7b511 passed
Test Ok

@github-actions
Contributor

github-actions bot commented Jun 8, 2023

✅ Forge suite land_blocking success on 853c642d3b08b313bb562499ecbbd019f1c7b511

performance benchmark : committed: 5695 txn/s, submitted: 5697 txn/s, expired: 1 txn/s, latency: 6951 ms, (p50: 4500 ms, p90: 14900 ms, p99: 27500 ms), latency samples: 2431953
Max round gap was 1 [limit 4] at version 1521219. Max no progress secs was 3.53012 [limit 10] at version 1521219.
Test Ok

@github-actions
Contributor

github-actions bot commented Jun 8, 2023

✅ Forge suite framework_upgrade success on aptos-node-v1.3.0_3fc3d42b6cfe27460004f9a0326451bcda840a60 ==> 853c642d3b08b313bb562499ecbbd019f1c7b511

Compatibility test results for aptos-node-v1.3.0_3fc3d42b6cfe27460004f9a0326451bcda840a60 ==> 853c642d3b08b313bb562499ecbbd019f1c7b511 (PR)
Upgrade the nodes to version: 853c642d3b08b313bb562499ecbbd019f1c7b511
framework_upgrade::framework-upgrade::full-framework-upgrade : committed: 4855 txn/s, latency: 6665 ms, (p50: 6900 ms, p90: 9400 ms, p99: 16500 ms), latency samples: 179660
5. check swarm health
Compatibility test for aptos-node-v1.3.0_3fc3d42b6cfe27460004f9a0326451bcda840a60 ==> 853c642d3b08b313bb562499ecbbd019f1c7b511 passed
Test Ok

@davidiw davidiw merged commit e5a981c into main Jun 8, 2023
@davidiw davidiw deleted the davidiw-python branch June 8, 2023 18:19
davidiw added 11 commits June 8, 2023 17:05
This provides a framework for managing as many transactions from a
single account at once
* The AccountSequenceNumber allocates up to 100 outstanding sequence numbers to maximize the number of concurrent transactions in the happy path.
* The transaction manager provides async workers that push a transaction from submission through to validating completion
Together they provide the basic harness for scaling transaction
submission on the Aptos blockchain from a single account.
this handles all the failures associated with network congestion,
meaning this is ready to ship for now... need more testing on other
failure cases.... such as intermittent network connectivity, lost
connections, bad upstreams.

6 participants