exp/ingest: Ingest pipeline prototype #1264
Conversation
…ts for Sequence() and Read()
Looks great 🎉! I added a few minor comments.

	"github.com/stellar/go/xdr"
)

const bufferSize = 50000
I'm guessing you're planning on making this configurable later? Definitely not a blocker for the prototype.
Yes, we can make it configurable in the future. However, in a healthy pipeline the buffer holds only a few elements most of the time, or the number of queued elements stays roughly constant.
	entry, more := <-b.buffer
	if more {
		b.readEntries++
It seems like incrementing `b.readEntries` may not be threadsafe? What if two goroutines call `Read()` at the same time?
OK, added a mutex protecting this variable. It's displayed in stats only (which are also not super accurate for multiple reasons) but this one was easy to fix.
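For reference, a minimal sketch of the guarded counter pattern discussed here; the struct name and fields are illustrative, not the PR's exact code:

```go
package main

import (
	"fmt"
	"sync"
)

// readStats is an illustrative stand-in for the stats kept by
// bufferedStateReadWriteCloser; only the counter logic is shown.
type readStats struct {
	mutex       sync.Mutex
	readEntries int
}

// incrementRead is safe to call from multiple goroutines calling Read().
func (s *readStats) incrementRead() {
	s.mutex.Lock()
	s.readEntries++
	s.mutex.Unlock()
}

func main() {
	var s readStats
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.incrementRead()
		}()
	}
	wg.Wait()
	fmt.Println(s.readEntries) // always 10
}
```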
	return nil
}

var _ io.StateReadCloser = &bufferedStateReadWriteCloser{}
Maybe add a comment explaining that these ensure the interface is satisfied? I'm not sure how idiomatic this is.
Yes, this is idiomatic.
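For readers unfamiliar with the idiom, a tiny self-contained example; the `Reader` interface and `bufferedReader` type below are made up, while the PR asserts against `io.StateReadCloser`:

```go
package main

// Reader is a stand-in for an interface such as io.StateReadCloser.
type Reader interface {
	Read() (string, error)
}

type bufferedReader struct{}

func (b *bufferedReader) Read() (string, error) {
	return "", nil
}

// Compile-time check: the build fails if *bufferedReader ever stops
// satisfying Reader. The blank identifier means no value is kept at runtime.
var _ Reader = &bufferedReader{}

func main() {}
```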
exp/ingest/pipeline/main.go
Outdated
// from the same StateReader and write to the same StateWriter.
// Example: you can calculate number of asset holders in a single processor but
// you can also start multiple processors that sum asset holders in a shared
// variable to calculate it faster.
I'm not sure about this example, wouldn't the shared variable need a mutex and so not be any faster than a single-threaded approach?
You're right, the example was a bad one. I changed it to a processor saving data into a DB (which makes sense if you need to do some data conversions first, e.g. strkey-encoding a public key, converting balances, etc.).
	m.wroteEntries++
	m.mutex.Unlock()

	var wg sync.WaitGroup
Do we need the `WaitGroup`? It seems like we know exactly how many times to call `err := <-results` below, and that channel read will block correctly until all results have been read.
Good point, removed.
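A minimal sketch of the suggested pattern (names are illustrative): because the number of workers is known up front, draining the results channel that many times already waits for all of them, so a `WaitGroup` adds nothing:

```go
package main

import "fmt"

func main() {
	const jobs = 5
	results := make(chan error, jobs)

	for i := 0; i < jobs; i++ {
		go func() {
			// ... do some work ...
			results <- nil // each worker reports exactly once
		}()
	}

	// Reading exactly `jobs` results blocks until every worker has finished,
	// so a sync.WaitGroup would be redundant here.
	for i := 0; i < jobs; i++ {
		if err := <-results; err != nil {
			fmt.Println("worker error:", err)
		}
	}
	fmt.Println("all workers done")
}
```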
	m.mutex.Lock()
	defer m.mutex.Unlock()

	m.closeAfter--
This could go below 0. Doesn't look like a bug, but may be confusing when debugging?
It now returns an error when it goes below zero.
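A rough sketch of that guard, assuming a simplified `Close` method; the field names and error text are illustrative, not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// multiCloser is a simplified stand-in for multiWriteCloser's Close logic.
type multiCloser struct {
	mutex      sync.Mutex
	closeAfter int // how many producers still have to call Close
}

func (m *multiCloser) Close() error {
	m.mutex.Lock()
	defer m.mutex.Unlock()

	m.closeAfter--
	if m.closeAfter < 0 {
		return errors.New("Close() called more times than expected")
	}
	if m.closeAfter == 0 {
		// ... close the underlying writers here ...
	}
	return nil
}

func main() {
	m := &multiCloser{closeAfter: 1}
	fmt.Println(m.Close()) // <nil>
	fmt.Println(m.Close()) // error: called more times than expected
}
```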
exp/ingest/pipeline/pipeline.go
Outdated
	for i := 1; i <= jobs; i++ {
		wg.Add(1)
		go func(reader io.StateReadCloser, writer io.StateWriteCloser) {
Do you need to pass these as arguments? It seems like it should be fine to use the variables scoped at the `procesStateNode` level.
Good point, removed. I do this automatically to prevent bugs like Figure 8 in this paper. However, you're right that it's not needed in this case.
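For context, this is the class of bug the paper's Figure 8 describes; a toy example (unrelated to the PR's types) showing the difference between capturing the loop variable and passing it as an argument:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	names := []string{"a", "b", "c"}

	// Risky before Go 1.22: every goroutine captures the same loop variable,
	// so they may all observe its final value.
	for _, name := range names {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println("captured:", name)
		}()
	}
	wg.Wait()

	// Safe on any Go version: the value is passed as an argument, so each
	// goroutine receives its own copy.
	for _, name := range names {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			fmt.Println("argument:", name)
		}(name)
	}
	wg.Wait()
}
```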
	s.values = make(map[string]interface{})
}

func (s *Store) Put(name string, value interface{}) {
How will this interface generalize to a postgres `Store`?
I added a short note to `Store` in the UML diagram I sent yesterday. `Store` is responsible for storing artifacts or sharing data between processors in a single pipeline. For example, let's say you want to calculate the average XLM balance. You can create a `StateProcessor` that calculates this value and then saves it to the `Store` (because there's no other way to pass data down the pipeline than by using `StateWriteCloser`). Then another processor can save this value to a DB.
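A hedged sketch of that flow: `Put` matches the signature shown in the diff above, while `Get` and the surrounding wiring are assumptions added for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// Store mirrors the shape shown in the diff: a mutex-guarded map used to
// share artifacts between processors in a single pipeline run.
type Store struct {
	mutex  sync.Mutex
	values map[string]interface{}
}

func (s *Store) Put(name string, value interface{}) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	if s.values == nil {
		s.values = make(map[string]interface{})
	}
	s.values[name] = value
}

// Get is assumed for illustration; the prototype may expose a different accessor.
func (s *Store) Get(name string) interface{} {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	return s.values[name]
}

func main() {
	store := &Store{}

	// Processor 1: computes an aggregate (e.g. the average XLM balance) and
	// stashes it in the Store instead of pushing it down the pipeline.
	store.Put("averageBalance", 123.45)

	// Processor 2, later in the pipeline: reads the aggregate and would
	// persist it to a database.
	fmt.Println("would save to DB:", store.Get("averageBalance"))
}
```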
	return true
}

func (n *SimpleProcessor) CallCount() int {
I would expect `CallCount` just to read a value rather than increment it.
Changed the name as it was confusing. It's important that this is atomic.
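A small sketch of an atomic increment-and-read counter of the kind being discussed; the method name below is only a guess at the rename, and `sync/atomic` is one way to keep the operation atomic:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// callCounter shows the kind of counter SimpleProcessor keeps: many workers
// may run the same processor, so the count must be updated atomically.
type callCounter struct {
	callCount int32
}

// IncrementAndReturnCallCount makes the increment and the read one atomic
// operation; a name like this also avoids the confusion of a plain
// CallCount() that silently mutates state.
func (c *callCounter) IncrementAndReturnCallCount() int {
	return int(atomic.AddInt32(&c.callCount, 1))
}

func main() {
	var c callCounter
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.IncrementAndReturnCallCount()
		}()
	}
	wg.Wait()
	fmt.Println(atomic.LoadInt32(&c.callCount)) // always 100
}
```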
This PR was based on #1216. To see the pipeline code only, go here.

This is a prototype of the ingestion pipeline for the new ingestion system. It is the result of experimenting with different approaches (like #1154 or ratchet). In this prototype the pipeline is a tree of nodes. Each node reads from a `StateReadCloser` (close #1309) provided by the previous node and writes to a `StateWriteCloser` that is converted to a `StateReadCloser` read by the following node, and so on.

Processors can save aggregated data in a `Store` shared across all processors in the pipeline. This means that processors can see data generated by processors in a different subtree.

In this design the state is never stored fully in memory. In fact, it uses very little memory, because all processors are started at the same time and read data as soon as it's available. Buffers are used to pass data between processors (read below), but in most cases (if there are no delays reading data from `StateReadCloser`s) they will be empty most of the time. However, `StateReadCloser` and `StateWriteCloser` are generic enough that it would be possible to write an implementation that runs the pipeline on a cluster of machines in the future.

Classes and interfaces
(edit: `ingest/filters` below should be `ingest/pipeline` - unfortunately draw.io did not save my changes...)

Most structs and methods have decent godoc, but here's a quick summary:
Pipeline

`Pipeline` represents a processing pipeline. A user can add a tree using `AddStateProcessorTree`.

PipelineNode

The tree can be constructed using the helper `Node` method and the `PipelineNode` struct. See the examples below.

Store

`Store` allows storing artifacts to be used by following nodes, e.g. aggregations.

multiWriteCloser

`multiWriteCloser` works like `io.MultiWriter` with two exceptions, among them the handling of `Close()`-ing streams. It's used when a pipeline node has many children. It allows distributing written entries to all of the workers.
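A condensed, illustrative sketch of the fan-out and deferred-close idea; the interfaces are simplified (plain strings instead of `StateWriteCloser` entries) and the locking the real type needs is omitted:

```go
package main

import "fmt"

// writeCloser stands in for StateWriteCloser; entries are plain strings here.
type writeCloser interface {
	Write(entry string) error
	Close() error
}

// multiWriteCloser duplicates every written entry to all child writers and
// closes the children only after all of its own producers have called Close.
type multiWriteCloser struct {
	writers    []writeCloser
	closeAfter int
}

func (m *multiWriteCloser) Write(entry string) error {
	for _, w := range m.writers {
		if err := w.Write(entry); err != nil {
			return err
		}
	}
	return nil
}

func (m *multiWriteCloser) Close() error {
	m.closeAfter--
	if m.closeAfter > 0 {
		return nil
	}
	for _, w := range m.writers {
		if err := w.Close(); err != nil {
			return err
		}
	}
	return nil
}

// printWriter is a trivial child used only for this demo.
type printWriter struct{ name string }

func (p *printWriter) Write(entry string) error {
	fmt.Println(p.name, "got", entry)
	return nil
}

func (p *printWriter) Close() error {
	fmt.Println(p.name, "closed")
	return nil
}

func main() {
	m := &multiWriteCloser{
		writers:    []writeCloser{&printWriter{"childA"}, &printWriter{"childB"}},
		closeAfter: 1, // one producer in this example
	}
	m.Write("entry1")
	m.Close()
}
```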
bufferedStateReadWriteCloser

`bufferedStateReadWriteCloser` is a buffered struct implementing both `StateReader` and `StateWriteCloser`. It acts like a pipe between pipeline nodes. Consider a simple `A` -> `B` -> `C` pipeline:

- `A` writes to a `StateWriteCloser` (which in reality is a `bufferedStateReadWriteCloser`).
- `B` reads from a `StateReadCloser` (which in reality is the `bufferedStateReadWriteCloser` that `A` wrote to).
- `B` writes to a `StateWriteCloser` (which in reality is a new `bufferedStateReadWriteCloser`).
- `C` reads from the `StateReadCloser` (`bufferedStateReadWriteCloser`) that `B` writes to.

`bufferedStateReadWriteCloser` maintains an internal buffer:

- When the buffer is empty and a node `Read`s from it, it blocks until a `Write` adds data, OR returns `io.EOF` when the stream has been closed.
- When the buffer is full and a node `Write`s to it, it blocks until a `Read` removes data from it.

That way we can reason about memory usage and cap it at the sum of the maximum capacities of all buffers. It uses a channel internally.
StateProcessor (interface)

Defines the method that processors must implement. Check the godoc. A hedged sketch of a trivial processor is shown below, after the next section.

Data flow and concurrency

A processor can be run concurrently. When this happens, the pipeline starts multiple workers running the same `ProcessState(store *Store, readCloser io.StateReadCloser, writeCloser io.StateWriteCloser) (err error)` method, but they all read from the same `StateReadCloser` and write to the same `StateWriteCloser`. What is more, each processor can send data to multiple children.

The following diagram explains how `bufferedStateReadWriteCloser` and `multiWriteCloser` help achieve this: when multiple workers are started, a `multiWriteCloser` that writes to `N` `bufferedStateReadWriteCloser`s is created, where `N` is the number of children of the current processor. The children then read from their `bufferedStateReadWriteCloser`.
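To make the `ProcessState` contract concrete, here is a hedged sketch of a trivial processor written against the signature quoted above; the reader/writer interfaces and the string entry type are simplified assumptions (the real ones work with XDR ledger entries):

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Simplified stand-ins for io.StateReadCloser / io.StateWriteCloser;
// entries are strings here rather than XDR ledger entries.
type StateReadCloser interface {
	Read() (string, error) // returns io.EOF once the stream has been closed
	Close() error
}

type StateWriteCloser interface {
	Write(entry string) error
	Close() error
}

// Store is a placeholder; this sketch does not share any artifacts.
type Store struct{}

// PrefixFilter keeps only entries starting with Prefix. It follows the
// ProcessState(store, readCloser, writeCloser) error contract quoted above.
type PrefixFilter struct{ Prefix string }

func (p *PrefixFilter) ProcessState(store *Store, r StateReadCloser, w StateWriteCloser) error {
	defer w.Close() // downstream readers see io.EOF once we are done
	for {
		entry, err := r.Read()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if strings.HasPrefix(entry, p.Prefix) {
			if err := w.Write(entry); err != nil {
				return err
			}
		}
	}
}

// sliceReader and sliceWriter are trivial in-memory implementations for the demo.
type sliceReader struct{ entries []string }

func (s *sliceReader) Read() (string, error) {
	if len(s.entries) == 0 {
		return "", io.EOF
	}
	entry := s.entries[0]
	s.entries = s.entries[1:]
	return entry, nil
}

func (s *sliceReader) Close() error { return nil }

type sliceWriter struct{ entries []string }

func (s *sliceWriter) Write(entry string) error {
	s.entries = append(s.entries, entry)
	return nil
}

func (s *sliceWriter) Close() error { return nil }

func main() {
	reader := &sliceReader{entries: []string{"GAXX1", "GBXX2", "GAXX3"}}
	writer := &sliceWriter{}
	p := &PrefixFilter{Prefix: "GA"}
	if err := p.ProcessState(&Store{}, reader, writer); err != nil {
		panic(err)
	}
	fmt.Println(writer.entries) // [GAXX1 GAXX3]
}
```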
Demo

EDIT: You can run the "Accounts for Signer" demo now: `go run -v ./exp/tools/accounts-for-signer/`.

The demo runs the following pipeline (the full definition lives in the demo's source; a rough sketch of how such a tree is built follows):
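The demo's actual pipeline definition is not reproduced here; as a rough illustration of how a tree is built with the `Node` helper and `PipelineNode` struct, a self-contained toy version follows (the `Pipe` chaining helper and the processor names are assumptions):

```go
package main

import "fmt"

// Toy stand-ins mirroring the Node/PipelineNode helpers described earlier;
// the real pipeline package also wires processors to readers and writers,
// which is omitted here.
type StateProcessor interface {
	Name() string
}

type PipelineNode struct {
	Processor StateProcessor
	Children  []*PipelineNode
}

func Node(p StateProcessor) *PipelineNode {
	return &PipelineNode{Processor: p}
}

// Pipe attaches children and returns the node so calls can be chained.
func (n *PipelineNode) Pipe(children ...*PipelineNode) *PipelineNode {
	n.Children = append(n.Children, children...)
	return n
}

type namedProcessor struct{ name string }

func (p namedProcessor) Name() string { return p.name }

func printTree(n *PipelineNode, indent string) {
	fmt.Println(indent + n.Processor.Name())
	for _, c := range n.Children {
		printTree(c, indent+"  ")
	}
}

func main() {
	// Hypothetical shape of the accounts-for-signer pipeline: a root processor
	// fanning out to per-prefix counters (processor names are illustrative).
	root := Node(namedProcessor{"RootProcessor"}).Pipe(
		Node(namedProcessor{"CountPrefixProcessor GA"}),
		Node(namedProcessor{"CountPrefixProcessor GB"}),
	)
	printTree(root, "")
}
```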
The pipeline can also be represented as a diagram (see the image attached to the PR description). After starting the demo, it will display updated stats every second.
Notes:

- `•` (dots) represent workers and are displayed only when a processor is run concurrently (currently 20 workers are started).
- `rps` = reads per second; `queued` = number of unread items queued in a buffer.
- You can uncomment `time.Sleep` in `CountPrefixProcessor` to observe what happens when the buffer is getting full.