This repository has been archived by the owner on Oct 4, 2019. It is now read-only.

tx/address indexing as optional feature, with separate db #475

Merged
merged 47 commits into from
Apr 13, 2018

Contributor

@whilei whilei commented Jan 18, 2018

geth_getAddressTransactions

Returns transactions for an address.

Usage requires address-transaction indexes: run geth --atxi to enable indexing and create indexes during chain sync/import, and optionally run geth atxi-build to index pre-existing chain data.

Parameters
  1. DATA, 20 Bytes - address to check for transactions
  2. QUANTITY - integer block number; floor (earliest block) for filtering transactions
  3. QUANTITY - integer block number; ceiling (latest block) for filtering transactions. Use 0 for "undefined", i.e. eth.blockNumber.
  4. STRING - [t|f|tf|], use t for transactions to the address, f for from, or tf/'' for both
  5. STRING - [s|c|sc|], use s for standard transactions, c for contracts, or sc/'' for both
  6. QUANTITY - integer index at which to begin pagination. Using -1 is equivalent to 0.
  7. QUANTITY - integer index at which to end pagination. Using -1 is equivalent to the last transaction n.
  8. BOOL - whether to return transactions ordered oldest first. The default, false, returns transaction hashes ordered newest first.
params: [
   '0x407d73d8a49eeb85d32cf465507dd71d507100c1',
   123, // earliest block
   456, // latest block, use 0 for "undefined", ie. eth.blockNumber
   't', // only transactions to this address
   '', // both standard and contract transactions
   -1, // do not trim transactions for pagination (start)
   -1, // do not trim transactions for pagination (end)
   false // do not reverse order ('true' will reverse order to be oldest first)
]
Returns

Array - Array of transaction hashes, or an empty array if no transactions found

Example
// Request
curl -X POST --data '{"jsonrpc":"2.0","method":"geth_getAddressTransactions","params":["0xb5C694a4cDbc1820Ba4eE8fD6f5AB71a25782534", 5000000, 0, "tf", "sc", -1, -1, false],"id":1}' :8545

// Result
{"jsonrpc":"2.0","id":1,"result":["0xbdaa803ec661db62520eab4aed8854fdea7e04b716849cc67ee7d1c9d94db2d3","0x886e2197a1a703bfed97a39b627f40d8f8beed1fc4814fe8a9618281450f1046","0x4b7f948442732719b31d35139f4269ad021984975c23c35190ac89ef225e95eb","0x35aec85ad9718e937c4e7c11b6f47eebd557cc31b46afc7e19ac888e57e6cdcc","0x0cc2cd8e2b79ef43f441666c0f9de1f06e3690dc3fe64b6fe5d41976115f9184","0x0a06510426a311056e093d1b7a9aabafcb8ce723a6c5c40a9e02824db565844a"]}
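For callers that prefer Go over curl, the request body above could be assembled roughly as follows. This is only a sketch: the helper name buildAddressTxsRequest and the rpcRequest type are hypothetical, and sending the payload (for example with net/http) is left to the caller.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a minimal JSON-RPC 2.0 envelope (hypothetical helper type).
type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
	ID      int           `json:"id"`
}

// buildAddressTxsRequest places the eight positional parameters in the order
// documented above and returns the encoded request body.
func buildAddressTxsRequest(addr string, floor, ceiling uint64, direction, kind string, pageStart, pageEnd int, oldestFirst bool) ([]byte, error) {
	req := rpcRequest{
		JSONRPC: "2.0",
		Method:  "geth_getAddressTransactions",
		Params:  []interface{}{addr, floor, ceiling, direction, kind, pageStart, pageEnd, oldestFirst},
		ID:      1,
	}
	return json.Marshal(req)
}

func main() {
	body, err := buildAddressTxsRequest("0xb5C694a4cDbc1820Ba4eE8fD6f5AB71a25782534", 5000000, 0, "tf", "sc", -1, -1, false)
	if err != nil {
		panic(err)
	}
	// Prints the same JSON document that the curl example sends.
	fmt.Println(string(body))
}
```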

geth atxi-build

Builds address-transactions indexes for existing chaindata. The command is idempotent.

Running a complete chain index from block 0 -> 5220000 on a 2009 Macbook Pro with 1GB cache took 3h30m, averaging ~300 blocks/second and ~1800 txs/second, and the /indexes database is 2.0GB.

$ geth atxi-build
Parameters
  1. --start=QUANTITY - custom floor at which to begin indexing. If unset, the build command will use its persistent placeholder if the command has been run before.
  2. --stop=QUANTITY - custom ceiling at which to finish indexing. If unset, default value is blockchain head.
  3. --step=QUANTITY - custom increment for batching writes to db and setting persistent progress placeholder. Default value is 10000.

geth --atxi

Flag required to enable address-transaction indexing during sync and import, and to enable associated API for a geth instance.


Implementation

Creates a new database <chaindir>/indexes which holds key indexes of the form below, where each unit is a definite length.

atx-<common.Address (address)><8-byte block number uint64><t|f (to|from)><s|c (standard|contract)><common.Hash (tx hash)>
  • atx- = 4 bytes
  • address = 20 bytes
  • blockNumber uint64 = 8 bytes
  • direction = 1 byte
  • kindof = 1 byte
  • txhash = 32 bytes

The key index can then be resolved to individual values using these known lengths, eg. address = key[4:24]

Lookups by address use a prefix iterator on address.

[WIP] - Known issues currently:

  • Case not handled: delete/reorg blocks. If there is a chain reorg, and a once-canonical
    block is relegated to side chain, we should remove associated atxi's, which is to
    say the atx index should only reflect the known canonical chain.

  • ethdb.Database interface doesn't include iterator, so storage testing with MemDatabase
    will require changes to Database interface or some other kind of stub. Use level db in tmp dir

  • Index-build performance might be improved significantly by using a memory cache with Batch writes.

  • Does not use a bloom filter. Undecided. Unnecessary.

- api: debug_getAddressTransactions(<0xaddress>, <startBlockN>, <endBlockN>, <to|from|>)
- cli cmd: 'geth atxi-build [--start=number] [--stop=number]'
- cli flag: 'geth [--atxi|--add-tx-index]'

The api returns an array of 0x-prefixed transaction hashes.

Creates a new database <chaindir>/indexes which holds key indexes of the form
txa-<common.Address (address)><8-byte block number uint64><t|f (to|from)><common.Hash (tx hash)>

Lookups by address use a prefix iterator on address, eg. txa-<my address>...

Known issues currently:
- Case not handled: delete/reorg blocks. If there is a chain reorg, and a once-canonical
  block is relegated to side chain, we should remove associated atxi's, which is to
  say the atx index should only reflect the known canonical chain.
- ethdb.Database interface doesn't include iterator, so storage testing with MemDatabase
  will require changes to Database interface or some other kind of stub
- Index-build performance might be improved significantly by using a memory cache with Batch writes.
@whilei
Contributor Author

whilei commented Jan 18, 2018

Note that relevant changes are here: a778e22 , with the other 11,200 lines being from the dependencies (71c0b77). Which AFAIK are only for the progress bar, which could be removed/swapped, or whatever. 10k+ lines seems extravagant for a progress indicator.

return err
}
}
if block.NumberU64()%10000 == 0 {
Contributor

Un-hardcode this, use step parameter.
And maybe introduce a processed block counter, instead of using block numbers? This is only my pedantic suggestion ;)
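To make the suggestion concrete, flushing on a processed-block counter rather than on block numbers might look like the sketch below. The function flushEvery and its flush callback are hypothetical stand-ins for the batch write and placeholder update; the default of 10000 mirrors the documented --step default.

```go
package main

import "fmt"

// flushEvery walks block numbers in [start, stop] and calls flush after every
// `step` processed blocks, then once more for any remainder.
func flushEvery(start, stop, step uint64, flush func(lastBlock uint64) error) error {
	if step == 0 {
		step = 10000 // documented default for --step
	}
	var processed uint64
	for n := start; n <= stop; n++ {
		// ... add the transactions of block n to the pending batch here ...
		processed++
		if processed%step == 0 {
			if err := flush(n); err != nil {
				return err
			}
		}
		if n == stop {
			break // avoids wrap-around if stop is the maximum uint64
		}
	}
	if processed%step != 0 {
		return flush(stop)
	}
	return nil
}

func main() {
	var flushedAt []uint64
	err := flushEvery(3, 27, 10, func(last uint64) error {
		flushedAt = append(flushedAt, last)
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(flushedAt) // [12 22 27]
}
```

With a start of 3, a check on block.NumberU64()%step would first flush at block 10 after only 8 blocks, whereas the counter flushes after exactly 10 processed blocks regardless of where indexing starts.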

Contributor Author

@whilei whilei Jan 19, 2018

Yea -- you're right. Will do. Just a relic of fast-and-furious sketching.

return putBatch.Write()
}

func WriteBlockAddTxIndexes(indexDb ethdb.Database, block *types.Block) error {
Contributor

This could be batched.

Contributor

Actually we may extract a function that adds all transactions from block to a batch, and then use it here, and inside the loop in AddTxIndexesBatch above.

func WriteBlockAddTxIndexes(indexDb ethdb.Database, block *types.Block) error {
    putBatch := indexDb.NewBatch()
    addTransactions(block, putBatch)
    return putBatch.Write()
}
func (self *BlockChain) AddTxIndexesBatch(indexDb ethdb.Database, startBlockN, stopBlockN uint64) (err error) {
    ...
    for block != nil && block.NumberU64() <= stopBlockN {
        addTransactions(block, putBatch)
        // write each "step"...
    }
}

Contributor Author

I like it.

}

// GetAddrTxs gets the indexed transactions for a given account address.
func GetAddrTxs(db ethdb.Database, address common.Address, blockStartN uint64, blockEndN uint64, toFromOrBoth string) []string {
Contributor

Maybe rename toFromOrBoth to direction?

Contributor Author

Nice, I like it. Will do.

}
}
if blockEndN > 0 {
txaI := new(big.Int).SetUint64(binary.LittleEndian.Uint64(blockNum))
Contributor

Can't we just use plain uint64s and compare directly with blockStartN and blockEndN?
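A plain uint64 comparison could be as small as the sketch below. The helper name inBlockRange is hypothetical, and treating both bounds as inclusive is an assumption; an end of 0 is taken to mean "no ceiling", in line with the documented use of 0 for an undefined latest block.

```go
package main

import "fmt"

// inBlockRange reports whether block number n passes the floor/ceiling filter.
// Both bounds are treated as inclusive (an assumption); end == 0 disables the
// ceiling check.
func inBlockRange(n, start, end uint64) bool {
	if n < start {
		return false
	}
	return end == 0 || n <= end
}

func main() {
	fmt.Println(inBlockRange(5000001, 5000000, 0)) // true: no ceiling
	fmt.Println(inBlockRange(100, 123, 456))       // false: below the floor
	fmt.Println(inBlockRange(457, 123, 456))       // false: above the ceiling
}
```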

Contributor Author

⚡️

continue
}
}
if toFromOrBoth == "to" {
Contributor

This can be simplified a bit: before the loop, map "to" -> 't', "from" -> 'f', ""/"both" -> 'b'

var direction byte = 'b'
if len(toFromOrBoth) > 0 {
    direction = toFromOrBoth[0]
}

and then just do:

if direction != 'b' && direction != torf[0] {
    continue
}

Gopkg.lock Outdated
@@ -13,6 +13,12 @@
revision = "2f1ce7a837dcb8da3ec595b1dac9d0632f0f99e8"
version = "v1.3.1"

[[projects]]
Contributor

I can't see the usages of newly vendored packages, and many seem very unrelated to this task.

Contributor Author

yea -- I used a progress bar for a while for the atxi-build command, then removed it and haven't pushed the commit that removed the deps (oops)


indexDb := MakeIndexDatabase(ctx)
if indexDb == nil {
panic("indexdb is nil")
Contributor

Shouldn't you use glog.Fatal?

Contributor Author

Yep.

}

// resolveAddrTxBytes resolves the index key to individual []byte values
func resolveAddrTxBytes(key []byte) (address, blockNumber, toOrFrom, txhash []byte) {
Contributor

The first returned value is used only in tests ;) But from logical point of view, returning everything seems right.

Contributor Author

Yea, I noticed that too, but agree that it seems right to have it. And it's worth it if only for tests.

@whilei
Contributor Author

whilei commented Jan 19, 2018

It seems kind of awkward to put the api in debug_ but I don't know where else to put it... Parity uses a parity_ namespace for custom calls like this, maybe worth considering implementing a classic_ namespace?

@whilei
Contributor Author

whilei commented Jan 19, 2018

Also, what do you think about adding another parameter contract to be able to filter for/against/agnostic on contract txs?

implemented in a7dbbb5

g

bug was with the address field for 'to' indexes being messed up cuz unnecessary pointer
solution: use key instead of tx hash (typo)
solution: rename keep -> diff

because it returns the difference, see usage in bv.reorg
... with shared but postponed txs (same tx, later block)

solution: rm all txs from old chain, add all txs for new chain
in case of reorg. atxi should only reflect canonical chain.
if err != nil || v == nil {
return 0
}
s := string(v)
Contributor Author

@whilei whilei Jan 21, 2018

no big deal, but could use binary package Uint64 instead
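For reference, storing the value as fixed-width bytes with encoding/binary instead of a string might look like this sketch. The function names are hypothetical and the big-endian byte order is an assumption; returning 0 for a missing or malformed value mirrors the early return in the snippet under review.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodeBookmark stores a block number as 8 fixed-width bytes.
func encodeBookmark(n uint64) []byte {
	b := make([]byte, 8)
	binary.BigEndian.PutUint64(b, n) // byte order is an assumption
	return b
}

// decodeBookmark reads the value back, returning 0 when nothing usable is stored.
func decodeBookmark(v []byte) uint64 {
	if len(v) != 8 {
		return 0
	}
	return binary.BigEndian.Uint64(v)
}

func main() {
	stored := encodeBookmark(5220000)
	fmt.Println(len(stored), decodeBookmark(stored), decodeBookmark(nil)) // 8 5220000 0
}
```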

solution: implement that nonexclusively

cc @pyskell
@tzdybal
Contributor

tzdybal commented Apr 3, 2018

Please re-format files, as there are some inconsistent parts.


Reviewed 1 of 135 files at r1, 1 of 127 files at r3, 1 of 8 files at r6, 2 of 6 files at r7, 6 of 9 files at r8, 4 of 4 files at r9.
Review status: 14 of 15 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.


cmd/geth/build_atxi_cmd.go, line 3 at r9 (raw file):

package main

import (

Formatting required.


cmd/geth/build_atxi_cmd.go, line 16 at r9 (raw file):

func buildAddrTxIndexCmd(ctx *cli.Context) error {

	ethdb.SetCacheRatio("chaindata", 0.5)

I don't like this hardcode. Do you think that introducing separate commandline parameter for indexing cache is reasonable?


cmd/geth/flag.go, line 818 at r9 (raw file):

	var (
		chaindir = MustMakeChainDataDir(ctx)
		cache    = ctx.GlobalInt(aliasableName(CacheFlag.Name, ctx))

I need clarification - you are using same cache size parameter for both databases, but set 'cache ratio' to 0.5 for both of them, to keep the global memory consumption within limits?


common/types.go, line 128 at r9 (raw file):

func HexToAddress(s string) Address    { return BytesToAddress(FromHex(s)) }

func EmptyAddress(a Address) bool {

Is this function necessary? It seems that there is only one usage, just below.


core/blockchain.go, line 798 at r9 (raw file):

		removals := [][]byte{}
		removeRemovals := func(removals [][]byte) {

I don't like the removals name collision.


core/database_util.go, line 220 at r7 (raw file):

Previously, whilei (ia) wrote…

it might be interesting to allow to.IsEmpty(), since doing so would let you grab just contracts (deployed by anyone) on the chain by passing the empty hash address 0x000...

# grab the latest contracts
⟠ curl -X POST --data '{"jsonrpc":"2.0","method":"debug_getAddressTransactions","params":["0x0000...", 5200000, 0, ""],"id":67}' :8545

SGTM


core/database_util.go, line 156 at r9 (raw file):

// for example.
func formatAddrTxBytesIndex(address, blockNumber, direction, kindof, txhash []byte) (key []byte) {
	key = txAddressIndexPrefix

Maybe it's premature optimization, but we can pre-allocate the slice (make([]byte, 0, 34+len(txhash)), replacing 34+len(txhash) with the actual value), because the size is fixed and we know it.


core/database_util.go, line 217 at r9 (raw file):

			to = &common.Address{}
		}
		if to.IsEmpty() {

Is it possible that to.IsEmpty() is true but to was not nil? Maybe both if blocks can be merged?


eth/backend.go, line 231 at r9 (raw file):

	if config.UseAddrTxIndex {
		// TODO: these are arbitrary numbers I just made up. Optimize?
		ethdb.SetCacheRatio("chaindata", 0.95)

Those arbitrary numbers are different in different places of the code.
Setting cache of size of 5% of global cache is not the best idea, especially in case of default of 128MB cache - 6.4MB of cache is useless.


Comments from Reviewable

@tzdybal
Contributor

tzdybal commented Apr 3, 2018

Ah, now I understand why you split cache differently. For building index, it's big (for performance), and for normal "appending" it's small, because you don't want to waste memory that can be used by chain data? Am I right?

@tzdybal
Contributor

tzdybal commented Apr 3, 2018

I have one more idea (definitely for another PR) - parallelize!

@whilei
Contributor Author

whilei commented Apr 5, 2018

cmd/geth/build_atxi_cmd.go, line 16 at r9 (raw file):

Previously, tzdybal (Tomasz Zdybał) wrote…

I don't like this hardcode. Do you think that introducing separate commandline parameter for indexing cache is reasonable?

I think it may be reasonable, though I'm hesitant because it may also be overkill: the global --cache flag can still be used, and in a few limited experiments with cache ratio (between chaindata and addr-tx-data databases) I didn't see any obvious gains beyond the currently hardcoded 50/50 sharing of the globally-configured cache value.

Open to further configuration/un-hardcoding, but would suggest to delegate to follow-up PR as a nice-to-have.



@whilei
Contributor Author

whilei commented Apr 5, 2018

cmd/geth/flag.go, line 818 at r9 (raw file):

Previously, tzdybal (Tomasz Zdybał) wrote…

I need clarification - you are using same cache size parameter for both databases, but set 'cache ratio' to 0.5 for both of them, to keep the global memory consumption within limits?

Yes -- just to keep the system adherent to global --cache value, since it will be divided by both databases.



@whilei
Contributor Author

whilei commented Apr 5, 2018

cmd/geth/flag.go, line 818 at r9 (raw file):

Previously, whilei (ia) wrote…

Yes -- just to keep the system adherent to global --cache value, since it will be divided by both databases.

And the 50/50 ratio is arbitrary and based only on my own empirical experiments.



whilei added 7 commits April 5, 2018 13:11
solution: refactor to use common addr == zero-value fn
solution: removeRemovals -> deleteRemovalsFn
solution: implement pagination and reverse params for atxi api

This seems like a useful thing for most applications...
- reverse because most relevant transactions are latest txs
- pagination because some accounts have tens or hundreds of thousands of txs
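The pagination semantics described for the API (-1 for "no trimming" at either end) could be sketched as below. The function paginate is hypothetical; treating the end index as exclusive and clamping out-of-range values are assumptions, not confirmed behaviour of the PR.

```go
package main

import "fmt"

// paginate trims an already-ordered list of transaction hashes.
// start == -1 behaves like 0; end == -1 behaves like the last index n.
// Assumptions: end is exclusive and out-of-range values are clamped.
func paginate(hashes []string, start, end int) []string {
	if start < 0 {
		start = 0
	}
	if end < 0 || end > len(hashes) {
		end = len(hashes)
	}
	if start >= end {
		return []string{}
	}
	return hashes[start:end]
}

func main() {
	hashes := []string{"0xa", "0xb", "0xc", "0xd", "0xe"}
	fmt.Println(paginate(hashes, -1, -1)) // [0xa 0xb 0xc 0xd 0xe]
	fmt.Println(paginate(hashes, 1, 3))   // [0xb 0xc]
}
```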
@tzdybal
Contributor

tzdybal commented Apr 5, 2018

Reviewed 42 of 42 files at r10.
Review status: all files reviewed at latest revision, 15 unresolved discussions.


cmd/geth/build_atxi_cmd.go, line 16 at r9 (raw file):

Previously, whilei (ia) wrote…

I think it may be reasonable, though I'm hesitant because it may also be overkill: the global --cache flag can still be used, and in a few limited experiments with cache ratio (between chaindata and addr-tx-data databases) I didn't see any obvious gains beyond the currently hardcoded 50/50 sharing of the globally-configured cache value.

Open to further configuration/un-hardcoding, but would suggest to delegate to follow-up PR as a nice-to-have.

This is quite complex - leave it as is for now.
My thoughts:

  • Currently full index database is about 4GB. There is no sense in setting cache for greater values.
  • Logically, caching of chaindata seems pointless, because every block is processed only once. On the other hand, setting some cache may help on database layer (because of data representation on disk).
  • Separate configuration option for atxi-cache is another solution (with pros and cons).

core/blockchain.go, line 798 at r9 (raw file):

Previously, tzdybal (Tomasz Zdybał) wrote…

I don't like the removals name collision.

Actually I didn't like the conflict between removals local variable and removals goroutine parameter ;)



@tzdybal
Contributor

tzdybal commented Apr 5, 2018

:lgtm:

I like the pagination

Reviewed 6 of 6 files at r11.
Review status: all files reviewed at latest revision, 15 unresolved discussions.



solution: sort.Sort interface Less method should be backwards

since we want higher block numbers first
@whilei
Contributor Author

whilei commented Apr 5, 2018

@tzdybal One last open question -- do we want to stick with debug_ namespace for debug_getAddressTransactions API method?

Or maybe establish a new etc_ namespace?...

This creates a new public RPC/JS API module 'geth_'.
3 participants