
Create addresses.info #6308

Merged
merged 108 commits into duneanalytics:main on Oct 14, 2024

Conversation

@hildobby hildobby (Collaborator) commented Jul 2, 2024

This PR creates addresses.info, which holds aggregated high-level information for all EVM addresses, with chain-specific spells culminating in a crosschain one. I have thought of use cases for all the addresses in there; it can be used:

  • to easily find a subset of addresses based on some heuristics

  • to join on as a filter so queries only touch addresses in the time ranges they appeared in, potentially making queries and downstream spells more efficient (for example, I think I can finally get Create attacks.address_poisoning #5995 down to a more reasonable runtime); see the sketch after this list

  • to do address segmentation: by which chains it shows up on, whether it's a contract on any chain, where/when it was first funded, etc.

  • I'll also be creating addresses.index using it, a spell that solely contains address and index, where index is a BIGINT (or INT256 if there are more addresses than I anticipate) created incrementally based on when an address first appeared anywhere. This can then be used:

    • to efficiently store spells with network-graph data or spells containing a lot of addresses, e.g. address 1039 sent to the distinct addresses [10, 212, 1334]
    • or even in spells with CTEs querying a lot of addresses; this makes the data more storage-efficient, allowing more complex queries while staying under cluster capacity
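
For the filtering use case in the second bullet above, here's a hypothetical sketch (the transfers table and amount_usd column are placeholders; the addresses.info columns are the ones described in this PR):

-- only scan each address inside the window where it was actually active
SELECT t."from" AS address
     , SUM(t.amount_usd) AS volume_usd   -- placeholder column
FROM transfers t                         -- placeholder source table
INNER JOIN addresses.info ai
    ON t."from" = ai.address
    AND t.block_number BETWEEN ai.first_tx_block_number AND ai.last_tx_block_number
GROUP BY 1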

For the incremental updates, the chain-specific macro keeps fetching the last block_number for every address (across txs and token transfers) so it can incrementally update based on those with no missing data. For the crosschain version, a map is created for every address holding this data for each chain where the address appeared:

map_from_entries(array[
            ('last_seen', CAST(last_seen AS varchar))
            , ('last_seen_block', CAST(last_seen_block AS varchar))
            , ('executed_tx_count', CAST(executed_tx_count AS varchar))
            , ('is_smart_contract', CAST(is_smart_contract AS varchar))
            , ('sent_count', CAST(sent_count AS varchar))
            , ('received_count', CAST(received_count AS varchar))
            ]) AS chain_stats

Those maps are then updated on incremental runs with this line, which ensures that only the chains with new data get overwritten:

map_concat(map_filter(t.chain_stats, (k, v) -> NOT contains(map_keys(nd.chain_stats), k)), nd.chain_stats) AS chain_stats
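
To see what that line does, here is a standalone Trino example (values invented; the map keys are assumed to be chain names, since only chains with new data should be overwritten):

SELECT map_concat(
           map_filter(
               MAP(ARRAY['ethereum', 'base'], ARRAY['old', 'old'])  -- t.chain_stats: the existing row
             , (k, v) -> NOT contains(ARRAY['base'], k)             -- map_keys(nd.chain_stats) inlined
           )
         , MAP(ARRAY['base'], ARRAY['new'])                         -- nd.chain_stats: the new data
       )
-- => {ethereum=old, base=new}: ethereum kept from the old row, base overwritten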

I think the spell is now ready. This PR will have most chains (and the crosschain spell) deactivated in prod so it can first build for Ethereum and a couple of others; then I'll do some follow-up PRs for the other chains and end with the crosschain spell.

@0xRobin lmk if you spot anything missing here, I've looked through extensively but a second pair of eyes might surface stuff I missed!

@hildobby hildobby added the WIP (work in progress) label Jul 2, 2024
@hildobby hildobby (Collaborator Author)

hey @jeff-dude, i'm trying to build this spell that will have one row per address with high-level info (executed_tx_count, max_nonce, is_smart_contract, namespace, name, first_funded_by, first_tx_block_time, last_tx_block_time, first_tx_block_number, last_tx_block_number, first_received_block_time, first_received_block_number, last_transfer_block_time & last_transfer_block_number)

the goal is to easily get aggregated data for any evm address, but this would also be super useful for filtering other spells and making them more efficient (i.e. filtering each address by first tx). this is just for ethereum for now, but i want to make it for all evm chains and have a crosschain version of the spell.

in order to build it in spellbook, I use the last_updated column as an incremental predicate, which for now is the last block_time that address had any in/outflow/tx, but i might switch it to MAX(block_time) in the end. on each incremental run, all new data is fetched and aggregated by address, and if that address is already there, the new data is added to the existing one. but i'm failing to get the row replacement to work properly. jeff, you mentioned that this would work, do you know what i did wrong here? i just want to take the new data, index it by address, and replace the row with that address if it already exists; a rough sketch of the pattern is below.
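
A minimal sketch of that aggregate-then-merge pattern (heavily simplified, with hypothetical columns; the real spell aggregates many more fields):

{{ config(
    materialized = 'incremental',
    file_format = 'delta',
    incremental_strategy = 'merge',
    unique_key = ['address']
) }}

WITH new_data AS (
    SELECT "from" AS address
         , COUNT(*)        AS executed_tx_count
         , MAX(block_time) AS last_updated
    FROM {{ source('ethereum', 'transactions') }}
    {% if is_incremental() %}
    -- source filter: only pull rows newer than the current target watermark
    WHERE block_time >= (SELECT MAX(last_updated) FROM {{ this }})
    {% endif %}
    GROUP BY 1
)

SELECT nd.address
     -- fold the new window's count into the existing row; the merge on
     -- unique_key = ['address'] then replaces that row wholesale
     , nd.executed_tx_count {% if is_incremental() %}+ COALESCE(t.executed_tx_count, 0){% endif %} AS executed_tx_count
     , nd.last_updated
FROM new_data nd
{% if is_incremental() %}
LEFT JOIN {{ this }} t ON nd.address = t.address
{% endif %}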

@hildobby hildobby added the question Further information is requested label Jul 22, 2024
@jeff-dude (Member)

at first glance, looks like it could be how you use timestamps in the incremental phase. for the source filter, you consistently filter on block_time, but the target filter (in model config) is based on a new timestamp field you build, where greatest is taken across various sources? it could be that the address + timestamp combo differs across tables, so when you filter both source and target on incremental runs, it doesn't properly find the row to match. that leads to rewriting the row instead, hence the duplicates after an incremental run.

i also notice you left join to {{ this }} in the incremental code, but don't actually select any fields or filter on it at all, so it won't do anything. with the block number filter in the join condition, it would need to be an inner join to work properly, i believe.
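
(For illustration: a condition inside a LEFT JOIN never drops left-side rows, it only decides whether the right side comes back NULL. To make the block-number guard actually filter, the join would look roughly like this, with hypothetical column names:)

INNER JOIN {{ this }} t
    ON nd.address = t.address
    AND nd.last_seen_block > t.last_seen_block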

@jeff-dude jeff-dude self-assigned this Jul 22, 2024
@hildobby hildobby removed the question Further information is requested label Jul 29, 2024
Comment on lines 105 to 108
FROM executed_txs
LEFT JOIN is_contract USING (address)
LEFT JOIN transfers USING (address)
LEFT JOIN {{ source('addresses_events_'~blockchain, 'first_funded_by')}} ffb USING (address)
@0xRobin (Collaborator)

Important to note that with this left join setup we will only have records for addresses that have executed transactions (EOAs).
If you want to include smart contracts and accounts, we should change the join type.

@hildobby (Collaborator Author) Oct 13, 2024

Ah thanks, I swapped it so first_funded_by is the FROM table with the rest joined onto it, and executed_txs + transfers as FULL OUTER JOINs.
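
Roughly, per that description (a sketch; whether is_contract stays a LEFT JOIN is assumed here):

FROM {{ source('addresses_events_'~blockchain, 'first_funded_by') }} ffb
FULL OUTER JOIN executed_txs USING (address)
FULL OUTER JOIN transfers USING (address)
LEFT JOIN is_contract USING (address)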

Comment on lines 227 to 230
FROM executed_txs
LEFT JOIN is_contract USING (address)
LEFT JOIN {{ source('addresses_events_'~blockchain, 'first_funded_by')}} ffb USING (address)
LEFT JOIN transfers USING (address)
@0xRobin (Collaborator)

same comment on the join type used for including all the addresses

@hildobby (Collaborator Author)

Ah thanks, I swapped it so first_funded_by is the FROM table with the rest joined onto it, and executed_txs + transfers as FULL OUTER JOINs.

file_format = 'delta',
incremental_strategy = 'merge',
unique_key = ['address'],
merge_update_columns = ['executed_tx_count', 'max_nonce', 'is_smart_contract', 'namespace', 'name', 'first_funded_by', 'first_funded_by_block_time', 'tokens_received_count', 'tokens_received_tx_count', 'tokens_sent_count', 'tokens_sent_tx_count', 'first_transfer_block_time', 'last_transfer_block_time', 'first_received_block_number', 'last_received_block_number', 'first_sent_block_number', 'last_sent_block_number', 'received_volume_usd', 'sent_volume_usd', 'first_tx_block_time', 'last_tx_block_time', 'first_tx_block_number', 'last_tx_block_number', 'last_seen', 'last_seen_block'],
@0xRobin (Collaborator) Oct 7, 2024

what's the reason for specifying the columns? this looks to be all of them.
If there are some columns you want to exclude from the incremental merge update, you can specify merge_exclude_columns instead.
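
For reference, a hypothetical config using that option (the excluded columns here are just examples; note that merge_update_columns and merge_exclude_columns are mutually exclusive in dbt):

{{ config(
    materialized = 'incremental',
    file_format = 'delta',
    incremental_strategy = 'merge',
    unique_key = ['address'],
    -- example: protect first-seen values from being overwritten on merge
    merge_exclude_columns = ['first_tx_block_time', 'first_tx_block_number']
) }}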

@hildobby (Collaborator Author)

It didn't work until those columns were added. They're specified here as the columns to be replaced. This is the first time I've made (or seen) a spell of this type; afaik all other incremental spells work as append-only with no replace component, unlike here where replacing appears to be necessary.

@0xRobin (Collaborator)

That's very strange behavior; normal incremental spells definitely work by replacing the incremental results.
I'll have a look at this through the CI.

@0xRobin (Collaborator)

I have removed these statements and everything runs fine from what I can tell.

@hildobby hildobby requested a review from 0xRobin October 13, 2024 23:30
@0xRobin 0xRobin (Collaborator) left a comment

@hildobby
I have pushed some changes.

  • removed the merge_update_columns, they don't seem needed
  • replaced greatest with a combo of array_max and filter so it can handle null values (see the example below)
  • moved everything into the new addresses sector, this deserves its own spot.
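
(Why that matters: in Trino, greatest returns NULL as soon as any argument is NULL, while array_max over a NULL-filtered array ignores the gaps. A standalone example:)

SELECT greatest(a, b)                                      -- NULL, because b is NULL
     , array_max(filter(ARRAY[a, b], x -> x IS NOT NULL))  -- 2024-07-02 00:00:00
FROM (VALUES (TIMESTAMP '2024-07-02 00:00:00', CAST(NULL AS timestamp))) AS t(a, b)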

@0xRobin 0xRobin added the ready-for-merging label and removed the in review (Assignee is currently reviewing the PR) label Oct 14, 2024
@0xRobin 0xRobin merged commit 68233ca into duneanalytics:main Oct 14, 2024
2 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Oct 14, 2024