
Create addresses.info #6308

Merged
merged 108 commits into duneanalytics:main on Oct 14, 2024

Conversation

@hildobby hildobby (Collaborator) commented Jul 2, 2024

This PR creates addresses.info, which holds aggregated high-level information for all EVM addresses, with chain-specific spells culminating in a crosschain one. I have thought of use cases for all the addresses in there; it can be used:

  • to easily find a subset of addresses based on some heuristics

  • to join on as a filter so queries only touch addresses in the time ranges they appeared in, potentially making queries and downstream spells more efficient (for example, I think I can finally get Create attacks.address_poisoning #5995 down to a more reasonable runtime); see the sketch after this list

  • to do address segmentation: by which chains it shows up on, whether it's a contract on any chain, where/when it was first funded, etc.

  • I'll also be creating addresses.index using it, a spell that solely contains address and index, where index is a BIGINT (or INT256 if there are more addresses than I anticipate) created incrementally based on when an address first appeared anywhere. This can then be used:

    • to efficiently store spells with network-graph data or spells containing a lot of addresses, e.g. address 1039 sent to the distinct addresses [10, 212, 1334]
    • or even in spells with CTEs querying a lot of addresses; this makes the data more storage-efficient, allowing more complex queries while staying under cluster capacity
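
For the filtering use case in the second bullet above, here's a hypothetical sketch (the transfers table and amount_usd column are placeholders; the addresses.info columns are the ones described in this PR):

-- only scan each address inside the window where it was actually active
SELECT t."from" AS address
     , SUM(t.amount_usd) AS volume_usd   -- placeholder column
FROM transfers t                         -- placeholder source table
INNER JOIN addresses.info ai
    ON t."from" = ai.address
    AND t.block_number BETWEEN ai.first_tx_block_number AND ai.last_tx_block_number
GROUP BY 1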

For the incremental updates, the chain-specific macro keeps fetching the last block_number for every address (across txs and token transfers) so it can incrementally update based on those with no missing data. For the crosschain version, a map is created for every address holding this data for each chain where the address appeared:

map_from_entries(array[
            ('last_seen', CAST(last_seen AS varchar))
            , ('last_seen_block', CAST(last_seen_block AS varchar))
            , ('executed_tx_count', CAST(executed_tx_count AS varchar))
            , ('is_smart_contract', CAST(is_smart_contract AS varchar))
            , ('sent_count', CAST(sent_count AS varchar))
            , ('received_count', CAST(received_count AS varchar))
            ]) AS chain_stats

Those maps are then updated on incremental runs with this line, which ensures that only the chains with new data get overwritten:

map_concat(map_filter(t.chain_stats, (k, v) -> NOT contains(map_keys(nd.chain_stats), k)), nd.chain_stats) AS chain_stats
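
To see what that line does, here is a standalone Trino example (values invented; the map keys are assumed to be chain names, since only chains with new data should be overwritten):

SELECT map_concat(
           map_filter(
               MAP(ARRAY['ethereum', 'base'], ARRAY['old', 'old'])  -- t.chain_stats: the existing row
             , (k, v) -> NOT contains(ARRAY['base'], k)             -- map_keys(nd.chain_stats) inlined
           )
         , MAP(ARRAY['base'], ARRAY['new'])                         -- nd.chain_stats: the new data
       )
-- => {ethereum=old, base=new}: ethereum kept from the old row, base overwritten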

I think the spell is now ready. This PR will have most chains (and the crosschain spell) deactivated in prod so it can first build for Ethereum and a couple of others; then I'll do some follow-up PRs for the other chains and end with the crosschain spell.

@0xRobin lmk if you spot anything missing here, I've looked through extensively but a second pair of eyes might surface stuff I missed!

@hildobby hildobby added the WIP (work in progress) label Jul 2, 2024
@hildobby hildobby (Collaborator Author)

hey @jeff-dude, i'm trying to build this spell that will have one row per address with high-level info (executed_tx_count, max_nonce, is_smart_contract, namespace, name, first_funded_by, first_tx_block_time, last_tx_block_time, first_tx_block_number, last_tx_block_number, first_received_block_time, first_received_block_number, last_transfer_block_time & last_transfer_block_number)

the goal is to easily get aggregated data for any evm address, but this would also be super useful for filtering other spells and making them more efficient (i.e. filtering each address by first tx). this is just for ethereum for now, but i want to make it for all evm chains and have a crosschain version of the spell.

in order to build it in spellbook, I use the last_updated column as an incremental predicate, which for now is the last block_time that address had any in/outflow/tx, but i might switch it to MAX(block_time) in the end. on each incremental run, all new data is fetched and aggregated by address, and if that address is already there, the new data is added to the existing one. but i'm failing to get the row replacement to work properly. jeff, you mentioned that this would work, do you know what i did wrong here? i just want to take the new data, index it by address, and replace the row with that address if it already exists; a rough sketch of the pattern is below.
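
A minimal sketch of that aggregate-then-merge pattern (heavily simplified, with hypothetical columns; the real spell aggregates many more fields):

{{ config(
    materialized = 'incremental',
    file_format = 'delta',
    incremental_strategy = 'merge',
    unique_key = ['address']
) }}

WITH new_data AS (
    SELECT "from" AS address
         , COUNT(*)        AS executed_tx_count
         , MAX(block_time) AS last_updated
    FROM {{ source('ethereum', 'transactions') }}
    {% if is_incremental() %}
    -- source filter: only pull rows newer than the current target watermark
    WHERE block_time >= (SELECT MAX(last_updated) FROM {{ this }})
    {% endif %}
    GROUP BY 1
)

SELECT nd.address
     -- fold the new window's count into the existing row; the merge on
     -- unique_key = ['address'] then replaces that row wholesale
     , nd.executed_tx_count {% if is_incremental() %}+ COALESCE(t.executed_tx_count, 0){% endif %} AS executed_tx_count
     , nd.last_updated
FROM new_data nd
{% if is_incremental() %}
LEFT JOIN {{ this }} t ON nd.address = t.address
{% endif %}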

@hildobby hildobby added the question Further information is requested label Jul 22, 2024
@jeff-dude (Member)

at first glance, looks like it could be how you use timestamps in the incremental phase. for the source filter, you consistently filter on block_time, but the target filter (in model config) is based on a new timestamp field you build, where greatest is taken across various sources? it could be that the address + timestamp combo differs across tables, so when you filter both source and target on incremental runs, it doesn't properly find the row to match. that leads to rewriting the row instead, hence the duplicates after an incremental run.

i also notice you left join to {{ this }} in the incremental code, but don't actually select any fields or filter on it at all, so it won't do anything. with the block number filter in the join condition, it would need to be an inner join to work properly, i believe.
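
(For illustration: a condition inside a LEFT JOIN never drops left-side rows, it only decides whether the right side comes back NULL. To make the block-number guard actually filter, the join would look roughly like this, with hypothetical column names:)

INNER JOIN {{ this }} t
    ON nd.address = t.address
    AND nd.last_seen_block > t.last_seen_block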

@jeff-dude jeff-dude self-assigned this Jul 22, 2024
@hildobby hildobby removed the question Further information is requested label Jul 29, 2024
Comment on lines 105 to 108
FROM executed_txs
LEFT JOIN is_contract USING (address)
LEFT JOIN transfers USING (address)
LEFT JOIN {{ source('addresses_events_'~blockchain, 'first_funded_by')}} ffb USING (address)
@0xRobin (Collaborator)

Important to note that with this left join setup we will only have records for addresses that have executed transactions (EOAs).
If you want to include smart contracts and accounts, we should change the join type.

@hildobby (Collaborator Author) Oct 13, 2024

Ah thanks, I swapped it so first_funded_by is the FROM table with the rest joined onto it, and executed_txs + transfers as FULL OUTER JOINs.
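
Roughly, per that description (a sketch; whether is_contract stays a LEFT JOIN is assumed here):

FROM {{ source('addresses_events_'~blockchain, 'first_funded_by') }} ffb
FULL OUTER JOIN executed_txs USING (address)
FULL OUTER JOIN transfers USING (address)
LEFT JOIN is_contract USING (address)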

Comment on lines 227 to 230
FROM executed_txs
LEFT JOIN is_contract USING (address)
LEFT JOIN {{ source('addresses_events_'~blockchain, 'first_funded_by')}} ffb USING (address)
LEFT JOIN transfers USING (address)
@0xRobin (Collaborator)

same comment on the join type used for including all the addresses

@hildobby (Collaborator Author)

Ah thanks, I swapped it so first_funded_by is the FROM table with the rest joined onto it, and executed_txs + transfers as FULL OUTER JOINs.

file_format = 'delta',
incremental_strategy = 'merge',
unique_key = ['address'],
merge_update_columns = ['executed_tx_count', 'max_nonce', 'is_smart_contract', 'namespace', 'name', 'first_funded_by', 'first_funded_by_block_time', 'tokens_received_count', 'tokens_received_tx_count', 'tokens_sent_count', 'tokens_sent_tx_count', 'first_transfer_block_time', 'last_transfer_block_time', 'first_received_block_number', 'last_received_block_number', 'first_sent_block_number', 'last_sent_block_number', 'received_volume_usd', 'sent_volume_usd', 'first_tx_block_time', 'last_tx_block_time', 'first_tx_block_number', 'last_tx_block_number', 'last_seen', 'last_seen_block'],
@0xRobin (Collaborator) Oct 7, 2024

what's the reason for specifying the columns? this looks to be all of them.
If there are some columns you want to exclude from the incremental merge update, you can specify merge_exclude_columns instead.
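
For reference, a hypothetical config using that option (the excluded columns here are just examples; note that merge_update_columns and merge_exclude_columns are mutually exclusive in dbt):

{{ config(
    materialized = 'incremental',
    file_format = 'delta',
    incremental_strategy = 'merge',
    unique_key = ['address'],
    -- example: protect first-seen values from being overwritten on merge
    merge_exclude_columns = ['first_tx_block_time', 'first_tx_block_number']
) }}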

@hildobby (Collaborator Author)

It didn't work until those columns were added. They're specified here as the columns to be replaced. This is the first time I've made (or seen) a spell of this type; afaik all other incremental spells work as append-only with no replace component, unlike here where replacing appears to be necessary.

@0xRobin (Collaborator)

That's very strange behavior; normal incremental spells definitely work by replacing the incremental results.
I'll have a look at this through the CI.

@0xRobin (Collaborator)

I have removed these statements and everything runs fine from what I can tell.

@hildobby hildobby requested a review from 0xRobin October 13, 2024 23:30
@0xRobin 0xRobin (Collaborator) left a comment

@hildobby
I have pushed some changes.

  • removed the merge_update_columns, they don't seem needed
  • replaced greatest with a combo of array_max and filter so it can handle null values (see the example below)
  • moved everything into the new addresses sector, this deserves its own spot.
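
(Why that matters: in Trino, greatest returns NULL as soon as any argument is NULL, while array_max over a NULL-filtered array ignores the gaps. A standalone example:)

SELECT greatest(a, b)                                      -- NULL, because b is NULL
     , array_max(filter(ARRAY[a, b], x -> x IS NOT NULL))  -- 2024-07-02 00:00:00
FROM (VALUES (TIMESTAMP '2024-07-02 00:00:00', CAST(NULL AS timestamp))) AS t(a, b)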

@0xRobin 0xRobin added the ready-for-merging label and removed the in review (Assignee is currently reviewing the PR) label Oct 14, 2024
@0xRobin 0xRobin merged commit 68233ca into duneanalytics:main Oct 14, 2024
2 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Oct 14, 2024