tx/address indexing as optional feature, with separate db #475

whilei · 2018-01-18T04:51:17Z

geth_getAddressTransactions

Returns transactions for an address.

Usage requires address-transaction indexes. Run geth with --atxi to enable the API and create indexes during chain sync/import; optionally run geth atxi-build to index pre-existing chain data.

Parameters

DATA, 20 Bytes - address to check for transactions
QUANTITY - integer block number to filter transactions floor
QUANTITY - integer block number to filter transactions ceiling
STRING - [t|f|tf|], use t for transactions to the address, f for from, or tf/'' for both
STRING - [s|c|sc|], use s for standard transactions, c for contracts, or sc/'' for both
QUANTITY - integer of index to begin pagination. -1 is equivalent to 0.
QUANTITY - integer of index to end pagination. -1 is equivalent to the last transaction n.
BOOL - whether to return transactions in order of oldest first. The default, false, returns transaction hashes ordered newest first.

params: [
   '0x407d73d8a49eeb85d32cf465507dd71d507100c1',
   123, // earliest block
   456, // latest block, use 0 for "undefined", ie. eth.blockNumber
   't', // only transactions to this address
   '', // both standard and contract transactions
   -1, // do not trim transactions for pagination (start)
   -1, // do not trim transactions for pagination (end)
   false // do not reverse order ('true' will reverse order to be oldest first)
]

Returns

Array - Array of transaction hashes, or an empty array if no transactions found

Example

// Request
curl -X POST --data '{"jsonrpc":"2.0","method":"geth_getAddressTransactions","params":["0xb5C694a4cDbc1820Ba4eE8fD6f5AB71a25782534", 5000000, 0, "tf", "sc", -1, -1, false],"id":1}' :8545

// Result
{"jsonrpc":"2.0","id":1,"result":["0xbdaa803ec661db62520eab4aed8854fdea7e04b716849cc67ee7d1c9d94db2d3","0x886e2197a1a703bfed97a39b627f40d8f8beed1fc4814fe8a9618281450f1046","0x4b7f948442732719b31d35139f4269ad021984975c23c35190ac89ef225e95eb","0x35aec85ad9718e937c4e7c11b6f47eebd557cc31b46afc7e19ac888e57e6cdcc","0x0cc2cd8e2b79ef43f441666c0f9de1f06e3690dc3fe64b6fe5d41976115f9184","0x0a06510426a311056e093d1b7a9aabafcb8ce723a6c5c40a9e02824db565844a"]}
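For callers working in Go rather than curl, the request body could be assembled roughly as below. This is only a sketch: the `rpcRequest` type and `newAtxiRequest` helper are hypothetical names, and only the method name and parameter order are taken from the description above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcRequest is a minimal JSON-RPC 2.0 envelope (hypothetical helper type).
type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
	ID      int           `json:"id"`
}

// newAtxiRequest builds the body for geth_getAddressTransactions, keeping the
// parameter order documented in the Parameters list.
func newAtxiRequest(addr string, fromBlock, toBlock uint64, direction, kind string,
	pageStart, pageEnd int, oldestFirst bool) ([]byte, error) {
	req := rpcRequest{
		JSONRPC: "2.0",
		Method:  "geth_getAddressTransactions",
		Params: []interface{}{
			addr, fromBlock, toBlock, direction, kind, pageStart, pageEnd, oldestFirst,
		},
		ID: 1,
	}
	return json.Marshal(req)
}

func main() {
	body, err := newAtxiRequest("0xb5C694a4cDbc1820Ba4eE8fD6f5AB71a25782534",
		5000000, 0, "tf", "sc", -1, -1, false)
	if err != nil {
		panic(err)
	}
	// The body would then be POSTed to the node's RPC endpoint.
	fmt.Println(string(body))
}
```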

geth atxi-build

Builds address-transactions indexes for existing chaindata. The command is idempotent.

Running a complete chain index from block 0 -> 5220000 on a 2009 Macbook Pro with 1GB cache took 3h30m, averaging ~300 blocks/second and ~1800 txs/second, and the /indexes database is 2.0GB.

$ geth atxi-build

Parameters

--start=QUANTITY - custom floor at which to begin indexing. If unset, the build command will use its persistent placeholder if the command has been run before.
--stop=QUANTITY - custom ceiling at which to finish indexing. If unset, default value is blockchain head.
--step=QUANTITY - custom increment for batching writes to db and setting persistent progress placeholder. Default value is 10000.

geth --atxi

Flag required to enable address-transaction indexing during sync and import, and to enable associated API for a geth instance.

Implementation

Creates a new database <chaindir>/indexes which holds key indexes of the form below, where each unit is a definite length.

atx-<common.Address (address)><8-byte block number uint64><t|f (to|from)><s|c (standard|contract)><common.Hash (tx hash)>

atx- (prefix) = 4 bytes
address = 20 bytes
blockNumber uint64 = 8 bytes
direction = 1 byte
kindof = 1 byte
txhash = 32 bytes

The key index can then be resolved to individual values using these known lengths, eg. address = key[4:24]
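To make the offsets concrete, here is a rough sketch of resolving a key with those fixed lengths. The `atxiKey` type, `parseKey` function and constant names are hypothetical, and reading the block number as little-endian is an assumption based on the `binary.LittleEndian` call quoted in the review of core/database_util.go.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Field widths from the layout above; the constant names are illustrative.
const (
	prefixLen = 4
	addrLen   = 20
	blockLen  = 8
	hashLen   = 32
	keyLen    = prefixLen + addrLen + blockLen + 1 + 1 + hashLen // 66 bytes
)

// atxiKey holds the decoded fields of one index key (hypothetical type).
type atxiKey struct {
	Address   [addrLen]byte
	Block     uint64
	Direction byte // 't' or 'f'
	Kind      byte // 's' or 'c'
	TxHash    [hashLen]byte
}

// parseKey slices a raw key at the fixed offsets.
// Assumption: the block number is stored little-endian.
func parseKey(key []byte) (atxiKey, error) {
	var k atxiKey
	if len(key) != keyLen {
		return k, errors.New("unexpected key length")
	}
	copy(k.Address[:], key[4:24])
	k.Block = binary.LittleEndian.Uint64(key[24:32])
	k.Direction = key[32]
	k.Kind = key[33]
	copy(k.TxHash[:], key[34:66])
	return k, nil
}

func main() {
	key := make([]byte, keyLen)
	copy(key, "atx-")
	key[4] = 0xb5 // first byte of the address
	binary.LittleEndian.PutUint64(key[24:32], 5000000)
	key[32], key[33] = 't', 's'

	k, err := parseKey(key)
	fmt.Println(k.Block, string(k.Direction), string(k.Kind), err)
}
```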

Lookups by address use a prefix iterator on address.
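The idea of a prefix scan can be shown without a database. In this sketch a sorted in-memory slice stands in for the LevelDB key space, and `scanPrefix` is a hypothetical helper that imitates an iterator's seek-then-walk behaviour; it is not the code from this PR.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// scanPrefix returns every key that starts with prefix. keys must be sorted
// bytewise; the slice is a stand-in for a real database iterator.
func scanPrefix(keys [][]byte, prefix []byte) [][]byte {
	// Seek to the first key >= prefix, as an iterator's Seek would.
	i := sort.Search(len(keys), func(i int) bool {
		return bytes.Compare(keys[i], prefix) >= 0
	})
	var out [][]byte
	// Walk forward until the keys stop sharing the prefix.
	for ; i < len(keys) && bytes.HasPrefix(keys[i], prefix); i++ {
		out = append(out, keys[i])
	}
	return out
}

func main() {
	// Shortened placeholder keys; real keys carry 20-byte addresses.
	keys := [][]byte{
		[]byte("atx-AAAA1"),
		[]byte("atx-AAAA2"),
		[]byte("atx-BBBB1"),
	}
	for _, k := range scanPrefix(keys, []byte("atx-AAAA")) {
		fmt.Println(string(k))
	}
}
```

Because the address sits directly after the 4-byte prefix, all keys for one address are contiguous, which is what makes this kind of scan cheap.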

[WIP] - Known issues currently:

- Case not handled: delete/reorg blocks. If there is a chain reorg, and a once-canonical block is relegated to a side chain, we should remove the associated atxis, which is to say the atx index should only reflect the known canonical chain.
- ethdb.Database interface doesn't include an iterator, so storage testing with MemDatabase will require changes to the Database interface or some other kind of stub. Use LevelDB in a tmp dir.
- Index-build performance might be improved significantly by using a memory cache with Batch writes.
- Does not use a bloom filter. ~~Undecided.~~ Unnecessary.

- api: debug_getAddressTransactions(<0xaddress>, <startBlockN>, <endBlockN>, <to|from|>)
- cli cmd: 'geth atxi-build [--start=number] [--stop=number]'
- cli flag: 'geth [--atxi|--add-tx-index]'

The api returns an array of 0x-prefixed transaction hashes.

Creates a new database <chaindir>/indexes which holds key indexes of the form

txa-<common.Address (address)><8-byte block number uint64><t|f (to|from)><common.Hash (tx hash)>

Lookups by address use a prefix iterator on address, eg. txa-<my address>...

Known issues currently:
- Case not handled: delete/reorg blocks. If there is a chain reorg, and a once-canonical block is relegated to side chain, we should remove associated atxi's, which is to say the atx index should only reflect the known canonical chain.
- ethdb.Database interface doesn't include iterator, so storage testing with MemDatabase will require changes to Database interface or some other kind of stub
- Build the indexes performance might be improved significantly by using a memory cache with Batch writes.

whilei · 2018-01-18T05:05:39Z

Note that the relevant changes are in a778e22; the other 11,200 lines come from the dependencies (71c0b77), which AFAIK are only for the progress bar and could be removed or swapped. 10k+ lines seems extravagant for a progress indicator.

tzdybal · 2018-01-18T22:19:54Z

core/blockchain.go

+				return err
+			}
+		}
+		if block.NumberU64()%10000 == 0 {


Un-hardcode this, use step parameter.
And maybe introduce a processed block counter, instead of using block numbers? This is only my pedantic suggestion ;)

Yea -- you're right. Will do. Just a relic of fast-and-furious sketching.
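A minimal sketch of the suggestion, assuming a flush callback in place of the real batch write. `indexBlocks` and its parameters are hypothetical names; the 10000 default is the one documented for --step.

```go
package main

import "fmt"

// indexBlocks walks blocks in order and calls flush every `step` processed
// blocks, then once more for any remainder. Counting processed blocks (rather
// than testing blockNumber%step) keeps batch sizes even when start is not a
// multiple of step.
func indexBlocks(start, stop, step uint64, flush func(lastBlock uint64)) {
	if step == 0 {
		step = 10000 // default documented for --step
	}
	var processed uint64
	for n := start; n <= stop; n++ {
		// ... add this block's transactions to the pending batch ...
		processed++
		if processed%step == 0 {
			flush(n)
		}
	}
	// Write whatever is left after the last full step.
	if processed%step != 0 {
		flush(stop)
	}
}

func main() {
	indexBlocks(5, 29, 10, func(last uint64) {
		fmt.Println("flushed through block", last)
	})
}
```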

tzdybal · 2018-01-18T22:22:18Z

core/blockchain.go

+	return putBatch.Write()
+}
+
+func WriteBlockAddTxIndexes(indexDb ethdb.Database, block *types.Block) error {


This could be batched.

Actually we may extract a function that adds all transactions from block to a batch, and then use it here, and inside the loop in AddTxIndexesBatch above.

func WriteBlockAddTxIndexes(indexDb ethdb.Database, block *types.Block) error {
	putBatch := indexDb.NewBatch()
	addTransactions(block, putBatch)
	return putBatch.Write()
}

func (self *BlockChain) AddTxIndexesBatch(indexDb ethdb.Database, startBlockN, stopBlockN uint64) (err error) {
	...
	for block != nil && block.NumberU64() <= stopBlockN {
		addTransactions(block, putBatch)
		// write each "step"...
	}
}

tzdybal · 2018-01-18T22:46:30Z

core/database_util.go

+}
+
+// GetAddrTxs gets the indexed transactions for a given account address.
+func GetAddrTxs(db ethdb.Database, address common.Address, blockStartN uint64, blockEndN uint64, toFromOrBoth string) []string {


Maybe rename toFromOrBoth to direction?

Nice, I like it. Will do.

tzdybal · 2018-01-18T22:49:08Z

core/database_util.go

+			}
+		}
+		if blockEndN > 0 {
+			txaI := new(big.Int).SetUint64(binary.LittleEndian.Uint64(blockNum))


Can't we just use plain uint64s and compare directly with blockStartN and blockEndN?
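Something along these lines, as a sketch. `inRange` is a hypothetical helper; treating an end of 0 as "no ceiling" follows the `blockEndN > 0` guard in the diff and the API note that 0 means an undefined latest block.

```go
package main

import "fmt"

// inRange reports whether blockN lies within [start, end] using plain uint64
// comparisons. An end of 0 means "no ceiling".
func inRange(blockN, start, end uint64) bool {
	if blockN < start {
		return false
	}
	if end > 0 && blockN > end {
		return false
	}
	return true
}

func main() {
	fmt.Println(inRange(300, 123, 456))
	fmt.Println(inRange(500, 123, 456))
	fmt.Println(inRange(500, 123, 0))
}
```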

tzdybal · 2018-01-18T23:07:00Z

core/database_util.go

+				continue
+			}
+		}
+		if toFromOrBoth == "to" {


This can be simplified a bit: before the loop, map "to" -> 't', "from" -> 'f', ""/"both" -> 'b'

var direction byte = 'b'
if len(toFromOrBoth) > 0 {
	direction = toFromOrBoth[0]
}

and then just do:

if direction != 'b' && direction != torf[0] {
	continue
}

tzdybal · 2018-01-18T23:14:40Z

Gopkg.lock

@@ -13,6 +13,12 @@
  revision = "2f1ce7a837dcb8da3ec595b1dac9d0632f0f99e8"
  version = "v1.3.1"

+[[projects]]


I can't see any usages of the newly vendored packages, and many seem unrelated to this task.

yea -- I used a progress bar for a while for the atxi-build command, then removed it and haven't pushed the commit that removed the deps (oops)

tzdybal · 2018-01-18T23:16:56Z

cmd/geth/build_atxi_cmd.go

+
+	indexDb := MakeIndexDatabase(ctx)
+	if indexDb == nil {
+		panic("indexdb is nil")


Shouldn't you use glog.Fatal?

tzdybal · 2018-01-18T23:24:59Z

core/database_util.go

+}
+
+// resolveAddrTxBytes resolves the index key to individual []byte values
+func resolveAddrTxBytes(key []byte) (address, blockNumber, toOrFrom, txhash []byte) {


The first returned value is used only in tests ;) But from logical point of view, returning everything seems right.

Yea, I noticed that too, but agree that it seems right to have it. And it's worth it if only for tests.

whilei · 2018-01-19T03:45:01Z

It seems kind of awkward to put the api in debug_ but I don't know where else to put it... Parity uses a parity_ namespace for custom calls like this, maybe worth considering implementing a classic_ namespace?

whilei · 2018-01-19T05:08:02Z

Also, what do you think about adding another parameter contract to be able to filter for/against/agnostic on contract txs?

implemented in a7dbbb5

whilei · 2018-01-21T04:14:08Z

core/database_util.go

+	if err != nil || v == nil {
+		return 0
+	}
+	s := string(v)


no big deal, but could use binary package Uint64 instead
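For illustration, a fixed-width encoding of the placeholder with the standard encoding/binary package might look like this. The function names are hypothetical and big-endian is an arbitrary choice for the sketch; any byte order works as long as reads and writes agree.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodePlaceholder stores a block number as 8 fixed-width bytes.
func encodePlaceholder(n uint64) []byte {
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, n)
	return buf
}

// decodePlaceholder returns 0 for a missing or malformed value, mirroring the
// early return in the diff above.
func decodePlaceholder(v []byte) uint64 {
	if len(v) != 8 {
		return 0
	}
	return binary.BigEndian.Uint64(v)
}

func main() {
	stored := encodePlaceholder(5220000)
	fmt.Println(len(stored), decodePlaceholder(stored))
}
```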

tzdybal · 2018-04-03T21:50:10Z

Please re-format files, as there are some inconsistent parts.

Reviewed 1 of 135 files at r1, 1 of 127 files at r3, 1 of 8 files at r6, 2 of 6 files at r7, 6 of 9 files at r8, 4 of 4 files at r9.
Review status: 14 of 15 files reviewed at latest revision, 8 unresolved discussions, some commit checks failed.

cmd/geth/build_atxi_cmd.go, line 3 at r9 (raw file):

package main

import (

Formatting required.

cmd/geth/build_atxi_cmd.go, line 16 at r9 (raw file):

func buildAddrTxIndexCmd(ctx *cli.Context) error {

	ethdb.SetCacheRatio("chaindata", 0.5)

I don't like this hardcode. Do you think that introducing separate commandline parameter for indexing cache is reasonable?

cmd/geth/flag.go, line 818 at r9 (raw file):

	var (
		chaindir = MustMakeChainDataDir(ctx)
		cache    = ctx.GlobalInt(aliasableName(CacheFlag.Name, ctx))

I need clarification - you are using same cache size parameter for both databases, but set 'cache ratio' to 0.5 for both of them, to keep the global memory consumption within limits?

common/types.go, line 128 at r9 (raw file):

func HexToAddress(s string) Address    { return BytesToAddress(FromHex(s)) }

func EmptyAddress(a Address) bool {

Is this function necessary? It seems that there is only one usage, just below.

core/blockchain.go, line 798 at r9 (raw file):

		removals := [][]byte{}
		removeRemovals := func(removals [][]byte) {

I don't like the removals name collision.

core/database_util.go, line 220 at r7 (raw file):

Previously, whilei (ia) wrote…

it might be interesting to allow to.IsEmpty(), since doing so would let you grab just contracts (deployed by anyone) on the chain by passing the empty hash address 0x000...
# grab the latest contracts
⟠ curl -X POST --data '{"jsonrpc":"2.0","method":"debug_getAddressTransactions","params":["0x0000...", 5200000, 0, ""],"id":67}' :8545

SGTM

core/database_util.go, line 156 at r9 (raw file):

// for example.
func formatAddrTxBytesIndex(address, blockNumber, direction, kindof, txhash []byte) (key []byte) {
	key = txAddressIndexPrefix

Maybe it's premature optimization, but we can pre-allocate the slice with make([]byte, 0, 34+len(txhash)) (replace 34+len(txhash) with the actual value), because the size is fixed and we know it.
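A sketch of what that could look like. The parameter list follows the diff above, but the body and the name `formatKey` are illustrative, and the "atx-" value for `txAddressIndexPrefix` is assumed from the key layout in the PR description.

```go
package main

import "fmt"

// Assumed value: the "atx-" prefix from the key layout in the PR description.
var txAddressIndexPrefix = []byte("atx-")

// formatKey concatenates the key parts into a slice allocated once at its
// final size, so none of the appends has to grow the backing array.
func formatKey(address, blockNumber, direction, kindof, txhash []byte) []byte {
	size := len(txAddressIndexPrefix) + len(address) + len(blockNumber) +
		len(direction) + len(kindof) + len(txhash)
	key := make([]byte, 0, size)
	key = append(key, txAddressIndexPrefix...)
	key = append(key, address...)
	key = append(key, blockNumber...)
	key = append(key, direction...)
	key = append(key, kindof...)
	key = append(key, txhash...)
	return key
}

func main() {
	key := formatKey(make([]byte, 20), make([]byte, 8), []byte("t"), []byte("s"), make([]byte, 32))
	fmt.Println(len(key), cap(key))
}
```

A side benefit of starting from a fresh slice: the appends can never write into the backing array of the shared prefix variable.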

core/database_util.go, line 217 at r9 (raw file):

			to = &common.Address{}
		}
		if to.IsEmpty() {

Is it possible that to.IsEmpty() is true when to was not nil? Maybe both if blocks can be merged?

eth/backend.go, line 231 at r9 (raw file):

	if config.UseAddrTxIndex {
		// TODO: these are arbitrary numbers I just made up. Optimize?
		ethdb.SetCacheRatio("chaindata", 0.95)

Those arbitrary numbers are different in different places of the code.
Setting the index cache to 5% of the global cache is not the best idea, especially with the default 128MB cache: 6.4MB of cache is useless.
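The arithmetic behind that number, as a small sketch. `splitCache` is a hypothetical helper, not part of ethdb; it assumes the ratio passed for "chaindata" is the chaindata share and the index database gets the rest.

```go
package main

import "fmt"

// splitCache divides a global cache budget (in MB) between chaindata and the
// index database, given the chaindata share.
func splitCache(totalMB, chaindataRatio float64) (chaindataMB, indexMB float64) {
	chaindataMB = totalMB * chaindataRatio
	return chaindataMB, totalMB - chaindataMB
}

func main() {
	// 0.95 is the sync-time ratio from eth/backend.go, 0.5 the build-time one.
	for _, ratio := range []float64{0.95, 0.5} {
		c, i := splitCache(128, ratio)
		fmt.Printf("ratio %.2f: chaindata %.1f MB, indexes %.1f MB\n", ratio, c, i)
	}
}
```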

Comments from Reviewable

tzdybal · 2018-04-03T21:52:49Z

Ah, now I understand why you split cache differently. For building index, it's big (for performance), and for normal "appending" it's small, because you don't want to waste memory that can be used by chain data? Am I right?

tzdybal · 2018-04-03T22:18:06Z

I have one more idea (definitely for another PR) - parallelize!

whilei · 2018-04-05T18:09:28Z

cmd/geth/build_atxi_cmd.go, line 16 at r9 (raw file):

Previously, tzdybal (Tomasz Zdybał) wrote…

I don't like this hardcode. Do you think that introducing separate commandline parameter for indexing cache is reasonable?

I think it may be reasonable, though hesitant because I think it may also be overkill, since global --cache flag can still be used, and in a few limited experiments with cache ratio (between chaindata and addr-tx-data databases) I didn't see any obvious gains beyond the currently hardcoded 50/50 sharing of globally-configured cache value.

Open to further configuration/un-hardcoding, but would suggest to delegate to follow-up PR as a nice-to-have.

whilei · 2018-04-05T18:10:40Z

cmd/geth/flag.go, line 818 at r9 (raw file):

Previously, tzdybal (Tomasz Zdybał) wrote…

I need clarification - you are using same cache size parameter for both databases, but set 'cache ratio' to 0.5 for both of them, to keep the global memory consumption within limits?

Yes -- just to keep the system adherent to global --cache value, since it will be divided by both databases.

whilei · 2018-04-05T18:11:23Z

cmd/geth/flag.go, line 818 at r9 (raw file):

Previously, whilei (ia) wrote…

Yes -- just to keep the system adherent to global --cache value, since it will be divided by both databases.

And the 50/50 ratio is arbitrary and based only on my own empirical experiments.

solution: implement pagination and reverse params for atxi api

This seems like a useful thing for most applications...
- reverse because most relevant transactions are latest txs
- pagination because some accounts have tens or hundreds of thousands of txs
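A rough sketch of how the two parameters could behave, using the -1 conventions from the parameter list in the PR description. `paginate` is a hypothetical helper; treating the end index as exclusive and applying the window before reordering are assumptions of this sketch, not confirmed behaviour of the API.

```go
package main

import "fmt"

// paginate trims hashes (already ordered newest first) to [start, end) and
// optionally flips the page to oldest first. -1 means "from the beginning"
// for start and "through the last transaction" for end.
func paginate(hashes []string, start, end int, oldestFirst bool) []string {
	if start < 0 {
		start = 0
	}
	if end < 0 || end > len(hashes) {
		end = len(hashes)
	}
	if start >= end {
		return []string{} // empty array rather than nil, as the API documents
	}
	// Copy so the caller's slice is not reordered in place.
	page := append([]string(nil), hashes[start:end]...)
	if oldestFirst {
		for i, j := 0, len(page)-1; i < j; i, j = i+1, j-1 {
			page[i], page[j] = page[j], page[i]
		}
	}
	return page
}

func main() {
	hashes := []string{"0xc", "0xb", "0xa"} // newest first
	fmt.Println(paginate(hashes, -1, -1, false))
	fmt.Println(paginate(hashes, 0, 2, true))
}
```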

tzdybal · 2018-04-05T20:22:01Z

Reviewed 42 of 42 files at r10.
Review status: all files reviewed at latest revision, 15 unresolved discussions.

cmd/geth/build_atxi_cmd.go, line 16 at r9 (raw file):

Previously, whilei (ia) wrote…

I think it may be reasonable, though hesitant because I think it may also be overkill, since global --cache flag can still be used, and in a few limited experiments with cache ratio (between chaindata and addr-tx-data databases) I didn't see any obvious gains beyond the currently hardcoded 50/50 sharing of globally-configured cache value.

Open to further configuration/un-hardcoding, but would suggest to delegate to follow-up PR as a nice-to-have.

This is quite complex - leave it as is for now.
My thoughts:

Currently full index database is about 4GB. There is no sense in setting cache for greater values.
Logically, caching of chaindata seems pointless, because every block is processed only once. On the other hand, setting some cache may help on database layer (because of data representation on disk).
Separate configuration option for atxi-cache is another solution (with pros and cons).

core/blockchain.go, line 798 at r9 (raw file):

Previously, tzdybal (Tomasz Zdybał) wrote…

I don't like the removals name collision.

Actually I didn't like the conflict between removals local variable and removals goroutine parameter ;)

tzdybal · 2018-04-05T21:59:12Z

I like the pagination

Reviewed 6 of 6 files at r11.
Review status: all files reviewed at latest revision, 15 unresolved discussions.

whilei · 2018-04-05T22:13:47Z

@tzdybal One last open question -- do we want to stick with debug_ namespace for debug_getAddressTransactions API method?

Or maybe establish a new etc_ namespace?...

whilei added 2 commits January 18, 2018 13:35

dep ensure

71c0b77

whilei added 2 commits January 18, 2018 16:55

init batching sketch

d87a807

use delegated batching for atxi cmd

ffaed6a

tzdybal reviewed Jan 18, 2018

whilei added 11 commits January 19, 2018 09:32

dep ensure: remove progress bar deps

3a39d4c

Rename AddTxIndexesBatch->WriteBlockAddrTxIndexesBatch

031c198

write atxis for bc insertreceipts and reorg

63f5c83

makes compatible with --fast

remove atxis given bc rollback or reorg

4b45f42

improvements based on @tzdybal feedback

e09bd1a

- use step param for batch atxi fn
- use blockprocessed counter for batch atxi fn instead of block number
- use glog.Fatal instead of panic

rename toOrFromOrBoth -> direction

0fd5868

compare uint64s blockstart/stopNs instead of converting to bigints

ec6f4e6

simplify t/f direction check for iterator

674da81

rename more toOrFrom -> direction

cd429f7

refactor putBlockAddrTxsToBatch and atxi-build cmd

0c746d3

Add comments and remove unused fn

0c38d84

whilei added 2 commits January 19, 2018 13:06

a few minor tweaks and improvements, comments, etc

67769f6

move atxi placeholder to db k/v, polish cmd ui

4fc155f

whilei added Type: Feature PR: On Ice labels Jan 19, 2018

whilei added 6 commits January 20, 2018 13:45

write test for atxi, refactor, and fix bug

74d7fe0

bug was with the address field for 'to' indexes being messed up cuz unnecessary pointer

test for fast/full sync implements atxi if enabled

cdb4055

problem: atxi not removing indexes

2ef0157

solution: use key instead of tx hash (typo)

problem: weirdly named signature var for tx Diff fn

258d418

solution: rename keep -> diff because it returns the difference, see usage in bv.reorg

problem: duplicate txs hashes return when chain reorg

2644af4

... with shared but postponed txs (same tx, later block)
solution: rm all txs from old chain, add all txs for new chain in case of reorg. atxi should only reflect canonical chain.
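The remove-then-add approach can be sketched with a map standing in for the index database. The `block` type, `applyReorg` function and the string keys are all hypothetical simplifications, not the code in this commit.

```go
package main

import "fmt"

// block is a pared-down stand-in for a chain block (hypothetical type).
type block struct {
	Number uint64
	Keys   []string // index keys derived from the block's transactions
}

// applyReorg drops every key contributed by the old chain and then writes
// every key from the new chain, so a transaction that reappears in a later
// block ends up indexed once, under its new block number.
func applyReorg(index map[string]struct{}, oldChain, newChain []block) {
	for _, b := range oldChain {
		for _, k := range b.Keys {
			delete(index, k)
		}
	}
	for _, b := range newChain {
		for _, k := range b.Keys {
			index[k] = struct{}{}
		}
	}
}

func main() {
	index := map[string]struct{}{"addr|10|tx1": {}, "addr|10|tx2": {}}
	oldChain := []block{{Number: 10, Keys: []string{"addr|10|tx1", "addr|10|tx2"}}}
	newChain := []block{
		{Number: 10, Keys: []string{"addr|10|tx2"}},
		{Number: 11, Keys: []string{"addr|11|tx1"}}, // tx1 postponed by one block
	}
	applyReorg(index, oldChain, newChain)
	fmt.Println(len(index))
}
```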

problem: no tests for atxi rm or reorg

4bb36a1

solution: write 'em

whilei commented Jan 21, 2018

problem: api should use human-friendly t|f|tf/ft interface

f2eb1ba

solution: implement that nonexclusively cc @pyskell

whilei added 4 commits April 5, 2018 12:39

Merge master and resolve conflicts

e4f15d2

solution: remove redundant horizontal rule from usage output

513ad88

gofmt: project-wide

02efe7f

problem: merge missed atxi command/flag additions

a29921a

solution: add them

whilei added 7 commits April 5, 2018 13:11

solution: add comment to explain cache ratio sharing

54773f6

problem: unused addr.IsEmpty func

f5adee2

solution: refactor to use common addr == zero-value fn

problem: (refactor syntax) ugly 'removals' name collision

e465063

solution: removeRemovals -> deleteRemovalsFn

solution: optimize key slice with known length

4837014

solution: merge to addr check conditional for eloquence

42bf795

solution: add comment explaining atxi db cache ratio for sync (vs build)

95f000f

whilei added 2 commits April 5, 2018 16:15

problem: must ensure atxis are ordered by block number

d69ac3b

solution: implement sort.Sort interface for custom sortable struct

solution: fix variable shadowing and empty slice by struct init

6ac5a9a

problem: atxis returned are backward

efc2a42

solution: sort.Sort interface Less method should be backwards since we want higher block numbers first
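For reference, an inverted Less with the standard sort package looks roughly like this. The `atxi` struct, `byNewest` type and `sortNewestFirst` helper are hypothetical names, not the ones used in the commit.

```go
package main

import (
	"fmt"
	"sort"
)

// atxi pairs a transaction hash with its block number (hypothetical type).
type atxi struct {
	Block uint64
	Hash  string
}

// byNewest implements sort.Interface; Less is inverted so that higher block
// numbers sort first.
type byNewest []atxi

func (s byNewest) Len() int           { return len(s) }
func (s byNewest) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }
func (s byNewest) Less(i, j int) bool { return s[i].Block > s[j].Block }

// sortNewestFirst orders txs in place by descending block number.
func sortNewestFirst(txs []atxi) {
	sort.Sort(byNewest(txs))
}

func main() {
	txs := []atxi{{Block: 5, Hash: "0xa"}, {Block: 9, Hash: "0xb"}, {Block: 7, Hash: "0xc"}}
	sortNewestFirst(txs)
	fmt.Println(txs)
}
```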

solution: move API debug_ -> geth_

2ec2a97

This creates a new public RPC/JS API module 'geth_'.

tzdybal approved these changes Apr 11, 2018

whilei merged commit 2ec2a97 into master Apr 13, 2018

shanejonas mentioned this pull request Apr 30, 2018

address page speed boost using compound index ethereumclassic/explorer#125

Closed

whilei deleted the feat/addr-tx-index-proto branch May 1, 2018 23:17

shanejonas mentioned this pull request May 7, 2018

Missing transactions emeraldpay/emerald-wallet#459

Closed
