Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

live joins? #3

Closed
dominictarr opened this issue Apr 27, 2013 · 14 comments
Closed

live joins? #3

dominictarr opened this issue Apr 27, 2013 · 14 comments

Comments

@dominictarr
Copy link

Would it be possible to use level-live-stream on the joins
so that you can get a changes feed of the relations as they are added?

also, is it possible to do open ended joins, perhaps:

db.join([{
  subject: db.v('author1'),
  predicate: 'maintains',
  object: db.v('module'),
},
{
  subject: db.v('module'),
  predicate: 'depends',
  object: db.v('module2')
},
{
  subject: db.v('author2'),
  predicate: 'maintains',
  object: db.v('module2')
}
], function (err, join) {
  console.log(join)
})

which might return:

{author1: 'mcollina', module: 'levelgraph', author2: 'rvagg', module2: 'levelup'}

Is that correct?

@mcollina
Copy link
Collaborator

These are two questions: 1) live stream joins and 2) open ended joins.

  1. Live stream joins are not currently supported. Doing that is very different from plain old joins, and require some fairly new algorithms that I need to study. Moreover, as stuff are implemented at the moment everything will blow up in a huge memory leak. In the end, that's another library.

  2. Yes. If it's not working it's a bug. Have you tried to pour NPM data into levelgraph? If you do, let me know!! :)

@dominictarr
Copy link
Author

I have a module for doing joins here that I use here https://github.com/dominictarr/level-inverted-index/blob/map-reduce/index.js#L99-L106

Yes, I have some levelup stuff for working with npm (which is a great dataset) over here https://github.com/dominictarr/npmd

will try doing this in levelgraph and see what happens.

@mcollina
Copy link
Collaborator

Your module for joins seems more efficient than the current approach, which is quite naive (but it works).
At the moment I am opening a new stream for each of the query condition for each of the possible results.
it is as bad as it sound, but it allow multiple possible solutions to be discovered.
Moreover this implementation limits the subsequent queries only to the 'bounds' variables, i.e. if the first condition finds out that x = 42, then the second condition reuses that result.

Would you like to try an implementation of the join using the your module? I believe it may be possible to get better throughput at the expense of RAM (the big joinhash in your module).
The problem with this approach is that it cannot use the knowledge of the result of the other conditions until the reduce phase. How can you handle binding multiple variables, when your getKey method have to return a string?
Still, it may be very tricky to get it done right, but implementing it will also allow to do live joins.

Ideally these two approach might be complementary, as a proper database can use multiple join strategy. I bet a pluggable system will be nice.

It would also be cool to get some numbers with a real dataset, so we can know which one works better (the NPM one will be super).

@dominictarr
Copy link
Author

Hmm, so I think using a hash-join you will need to have multiple joins.
I'm not sure it applies in the case of graphs, but if you wanted to join on two values, you could just hash a combination of keys.

If the second condition refuses the result,
does this mean my query will not return modules that I have have written, that I also depend on?

Agree, we need more generic join modules. Also, we can use db.getApproximateSize(range) to estimate which sides of the join is smaller, and thus which should be in memory.

Another thing - leveldb is probably pretty smart about caching - so maybe reading from the database is not so bad.

@mcollina
Copy link
Collaborator

Hmm, so I think using a hash-join you will need to have multiple joins.
I'm not sure it applies in the case of graphs, but if you wanted to join on two values, you could just hash a combination of keys.

The problem with hashing is that you can't hash what you do not know. In your query, the first condition is about variables author1 and module. So you can produce an hash on these two, but you must ignore module2 and author2.

If the second condition refuses the result,
does this mean my query will not return modules that I have have written, that I also depend on?

I'm not getting this question, but I'll try answering.
It will returns all the modules you have written that you depend on.
The first condition try binding some variables, but it cannot bind all of them. So it emits a "partial" solution, then the second condition search the store for triples matching the pattern and the already bound variables, and then it emits the matching context.

Another thing - leveldb is probably pretty smart about caching - so maybe reading from the database is not so bad.

It seems so, I think that having a real use case that performs badly can drive the optimization.

@dominictarr
Copy link
Author

Agree about the "real use case that performs badly". I'm gonna change to 0.10, and try sticking npm into levelgraph and running this query. Unfortunately, I have more pressing "real work" but I'll get to this!

@mcollina
Copy link
Collaborator

No hurry!
Adding 0.8 support should be easy, if you need it just ping me :).

You might also want to use #4, I believe it's stable enough.

@dominictarr
Copy link
Author

it's time to move to 10 anyway. would have done this on sunday, but I encounterd a few bugs, in my stuff and fixed them instead.

@mcollina
Copy link
Collaborator

mcollina commented Jul 2, 2013

I have been thinking about this live joining. It is doable, albeit not fast and/or optimizable on the first run.

How about something:

db.join([{
  stream: yourStream // note: this will only work for the very first condition.
  subject: db.v('author1'),
  predicate: 'maintains',
  object: db.v('module'),
},
{
  subject: db.v('module'),
  predicate: 'depends',
  object: db.v('module2')
},
{
  subject: db.v('author2'),
  predicate: 'maintains',
  object: db.v('module2')
}
], function (err, join) {
  console.log(join)
})

It should be easy to implement.
@dominictarr if you need it, I can give it a go.

@mcollina
Copy link
Collaborator

mcollina commented Jul 2, 2013

Another possible syntax:

db.join(yourstream, [{
  subject: db.v('author1'),
  predicate: 'maintains',
  object: db.v('module'),
},
{
  subject: db.v('module'),
  predicate: 'depends',
  object: db.v('module2')
},
{
  subject: db.v('author2'),
  predicate: 'maintains',
  object: db.v('module2')
}
], function (err, join) {
  console.log(join)
})

@mcollina
Copy link
Collaborator

mcollina commented Jul 2, 2013

The only take in this solution is that the live-stream will only be matched by the first condition, while the rest will be based on the internal graph's data.

Another solution might be:

db.join([{
  live: true, // works only on the first condition
  subject: db.v('author1'),
  predicate: 'maintains',
  object: db.v('module'),
},
{
  subject: db.v('module'),
  predicate: 'depends',
  object: db.v('module2')
},
{
  subject: db.v('author2'),
  predicate: 'maintains',
  object: db.v('module2')
}
], function (err, join) {
  console.log(join)
})

This is actually what I like most, it will need to depend on sublevel, but it is ok.

The real question is: should this be forwarding all "internal" matches first, or just the "live" results?

@dominictarr
Copy link
Author

by "internal", you mean it should match the current values in the database?

Most definately - that is extremely useful - I think the most useful is to be able to match the history, the live changes,
and either just the history or live changes.

However, since you are using streams for this, it is pretty straightforward to adapt the idea to both.

@mcollina
Copy link
Collaborator

mcollina commented Jul 2, 2013

By 'internal' I mean the current values of the database. Not sure about how to support the history (if it is not what is currently in the db).

The Join are implemented like a pipeline. An empty solution start, then it is denied or augmented with variable bindings at each step/condition. Implementing a live join will be starting from a live stream of triples matching the first condition. How does it sound?

What about the interface?

@mcollina mcollina mentioned this issue Jul 2, 2013
5 tasks
@mcollina
Copy link
Collaborator

mcollina commented Jul 9, 2013

Closing in favor of #24

@mcollina mcollina closed this as completed Jul 9, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants