Search Predicates: match keywords with multiple properties using Or-step #940

Closed
myitroad opened this issue Feb 28, 2018 · 10 comments

@myitroad

Hi there,

One situation confuses me when I use text predicate search to match multiple properties in JanusGraph.

Specifically, I want to match a keyword against multiple properties. For example, to match a keyword against both the moviename and rdfs:label properties, I wrote the following statements using the Gremlin or-step:

g.V().has('moviename',Text.textContains('英雄'))
g.V().has('rdfs:label',Text.textContains('英雄'))
g.V().where(has('moviename',Text.textContains('英雄')).or().has('rdfs:label',Text.textContains('英雄')))

I expected the 3rd statement to return the union of the results of the 1st and 2nd statements. In practice, the 3rd returns nothing.
The execution results are as follows:

  • Simple text search
gremlin> g.V().has('rdfs:label',Text.textContains('英雄'))
==>v[2240616]
==>v[2289712]
==>v[2424936]
==>v[2416688]
gremlin> g.V().has('moviename',Text.textContains('英雄'))
==>v[2240616]
==>v[2416688]
==>v[2289712]
==>v[2424936]
  • Text search with or-step
gremlin> g.V().where(has('moviename',Text.textContains('英雄')).or().has('rdfs:label',Text.textContains('英雄')))
12:36:28 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
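
For reference, the expectation stated above can be written out with the vertex ids from the two single-property results. This is only a sketch of the expected set semantics in plain Python; the variable names are illustrative and nothing here talks to JanusGraph.

```python
# Vertex ids copied from the two single-property results shown in this issue.
label_hits = [2240616, 2289712, 2424936, 2416688]      # has('rdfs:label', ...)
moviename_hits = [2240616, 2416688, 2289712, 2424936]  # has('moviename', ...)

# or() keeps a vertex if either condition holds, so the expected result is
# the de-duplicated union of the two id lists.
expected_or_hits = sorted(set(label_hits) | set(moviename_hits))
print(expected_or_hits)
```

Both lists contain the same four vertices, so the or-step would be expected to return those four rather than an empty result.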

I wonder whether the or-step is incompatible with text predicate search.
Also, is there an alternative way to reach my goal?

Thanks for your attention!


Supplementary
  • Component and version
    The JanusGraph version is 0.2.0, released on 12 Oct 2017. The storage backend is HBase 1.1.2. The index backend is Elasticsearch 5.5.1.
  • JanusGraph schema
    Mixed index schema built before importing data:
rdfs:labele4c2 | Mixed | JanusGraphEdge | false | ontology-demo | rdfs:label | ENABLED
movienamee6b7  | Mixed | JanusGraphEdge | false | ontology-demo | moviename  | ENABLED

gremlin> g.V().or(
            __.outE('created'),
            __.inE('created').count().is(gt(1))).
              values('name')
==>marko
==>lop
==>josh
==>peter

statement profile

  • Profile of simple text search
gremlin> g.V().has('moviename',Text.textContains('英雄')).profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep([],[moviename.textContains(英雄)])                        4           4           0.645   100.00
    \_condition=(moviename textContains 英雄)
    \_isFitted=true
    \_query=[(moviename textContains 英雄)]:movienamev1de1
    \_index=movienamev1de1
    \_orders=[]
    \_isOrdered=true
    \_index_impl=ontology-demo1
  optimization                                                                                 0.297
                                            >TOTAL                     -           -           0.645        -
  • Profile of text search with or-step
gremlin> g.V().where(has('moviename',Text.textContains('英雄')).or().has('rdfs:label',Text.textContains('英雄'))).profile()
12:41:27 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep(vertex,[])                                           3188        3188          29.421    43.36
    \_condition=()
    \_isFitted=false
    \_query=[]
    \_orders=[]
    \_isOrdered=true
  optimization                                                                                 0.020
  scan                                                                                         0.000
    \_condition=VERTEX
    \_query=[]
    \_fullscan=true
OrStep([[HasStep([moviename.textContains(英雄)]),...                                            38.426    56.64
  HasStep([moviename.textContains(英雄)])                                                       17.141
  HasStep([rdfs:label.textContains(英雄)])                                                      18.127
                                            >TOTAL                     -           -          67.848        -

statement explain

  • Explain of simple text search
gremlin> g.V().has('moviename',Text.textContains('英雄')).explain()
==>Traversal Explanation
=========================================================================================================
Original Traversal                          [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]

ConnectiveStrategy                    [D]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
MatchPredicateStrategy                [O]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
FilterRankingStrategy                 [O]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
InlineFilterStrategy                  [O]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
IncidentToAdjacentStrategy            [O]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
AdjacentToIncidentStrategy            [O]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
RepeatUnrollStrategy                  [O]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
RangeByIsCountStrategy                [O]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
PathRetractionStrategy                [O]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
LazyBarrierStrategy                   [O]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
AdjacentVertexFilterOptimizerStrategy [P]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
JanusGraphLocalQueryOptimizerStrategy [P]   [GraphStep(vertex,[]), HasStep([moviename.textContains(英雄)])]
JanusGraphStepStrategy                [P]   [JanusGraphStep([],[moviename.textContains(英雄)])]
ProfileStrategy                       [F]   [JanusGraphStep([],[moviename.textContains(英雄)])]
StandardVerificationStrategy          [V]   [JanusGraphStep([],[moviename.textContains(英雄)])]

Final Traversal                             [JanusGraphStep([],[moviename.textContains(英雄)])]
  • Explain of text search with or-step
gremlin> g.V().where(has('moviename',Text.textContains('英雄')).or().has('rdfs:label',Text.textContains('英雄'))).explain()
==>Traversal Explanation
=========================================================================================================================================================
Original Traversal                          [GraphStep(vertex,[]), TraversalFilterStep([HasStep([moviename.textContains(英雄)]), OrStep, HasStep([rdfs:labe
                                               l.textContains(英雄)])])]

ConnectiveStrategy                    [D]   [GraphStep(vertex,[]), TraversalFilterStep([OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:l
                                               abel.textContains(英雄)])]])])]
MatchPredicateStrategy                [O]   [GraphStep(vertex,[]), TraversalFilterStep([OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:l
                                               abel.textContains(英雄)])]])])]
FilterRankingStrategy                 [O]   [GraphStep(vertex,[]), TraversalFilterStep([OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:l
                                               abel.textContains(英雄)])]])])]
InlineFilterStrategy                  [O]   [GraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContains(英雄)
                                               ])]])]
IncidentToAdjacentStrategy            [O]   [GraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContains(英雄)
                                               ])]])]
AdjacentToIncidentStrategy            [O]   [GraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContains(英雄)
                                               ])]])]
RepeatUnrollStrategy                  [O]   [GraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContains(英雄)
                                               ])]])]
RangeByIsCountStrategy                [O]   [GraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContains(英雄)
                                               ])]])]
PathRetractionStrategy                [O]   [GraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContains(英雄)
                                               ])]])]
LazyBarrierStrategy                   [O]   [GraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContains(英雄)
                                               ])]])]
AdjacentVertexFilterOptimizerStrategy [P]   [GraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContains(英雄)
                                               ])]])]
JanusGraphLocalQueryOptimizerStrategy [P]   [GraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContains(英雄)
                                               ])]])]
JanusGraphStepStrategy                [P]   [JanusGraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContain
                                               s(英雄)])]])]
ProfileStrategy                       [F]   [JanusGraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContain
                                               s(英雄)])]])]
StandardVerificationStrategy          [V]   [JanusGraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContain
                                               s(英雄)])]])]

Final Traversal                             [JanusGraphStep(vertex,[]), OrStep([[HasStep([moviename.textContains(英雄)])], [HasStep([rdfs:label.textContain
                                               s(英雄)])]])]
@myitroad
Author

Supplement:
“英雄” is a Chinese word; just treat it as the keyword.

@myitroad
Author

I have found an alternative: using a regex match in place of the contains match.

The statements are as follows:

g.V().has('moviename',Text.textContains('英雄'))
g.V().has('rdfs:label',Text.textContains('英雄'))
g.V().or(__.has('moviename',Text.textContainsRegex('.*英雄.*')),__.has('rdfs:label',Text.textContainsRegex('.*英雄.*'))).dedup()

I look forward to more efficient methods.
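
A rough Python model of why the regex form can match where the plain contains form does not. It assumes, as stated later in this thread, that the value is split into tokens on non-alphanumeric characters and that tokens shorter than 2 characters are dropped. It further assumes that the regex must match one whole token and that the value is lower-cased first. The function names `tokenize` and `contains_regex` are hypothetical; this is not JanusGraph's code.

```python
import re
from itertools import groupby


def tokenize(text, min_len=2):
    """Split on non-alphanumeric characters; drop tokens shorter than min_len."""
    runs = ("".join(g) for is_alnum, g in groupby(text, key=str.isalnum) if is_alnum)
    return [t for t in runs if len(t) >= min_len]


def contains_regex(value, pattern):
    """True if the pattern matches at least one whole token of the value."""
    # Assumption: the value is lower-cased before tokenizing.
    return any(re.fullmatch(pattern, tok) for tok in tokenize(value.lower()))


# 'O英雄' has no separator, so under this model it is one token.
print(tokenize("O英雄".lower()))
print(contains_regex("O英雄", ".*英雄.*"))  # wildcard form covers the whole token
print(contains_regex("O英雄", "英雄"))      # bare keyword does not cover the whole token
```

Under these assumptions the leading and trailing `.*` are what let the keyword match inside a longer token.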

@pluradj
Member

pluradj commented Feb 28, 2018

In your schema description, it looks like the mixed indexes were created with Edge.class instead of Vertex.class. This seems incorrect based on the vertex-based queries you are using, so it could explain why your or() query isn't returning a result. If that doesn't solve the problem, some example data that reproduces your result would be helpful. Your queries worked fine in a simple test.

The reason you are seeing the WARN message is being tracked in issue #163.

@pluradj
Member

pluradj commented Feb 28, 2018

This question also sounds remarkably similar to #922, and it seems like there might be something related involved.

@myitroad
Author

myitroad commented Mar 1, 2018

Thank you for your time!
I carefully reviewed all the steps you wrote and identified two things:

First, following section 9.1.2 (Mixed Index) of the JanusGraph docs, the index for the property should be created with Vertex.class. With that, a simple query with a single has-step works fine, such as g.V().has('moviename',Text.textContains('英雄')).
Second, the steps you wrote work fine in my environment. But for some test text, the or-step with Text.textContains still returns nothing. Simply removing the space from the text reproduces the problem.
Details are as follows:

  • Add vertices
gremlin> g.addV().property(moviename, 'O英雄').next()
==>v[8360]
gremlin> g.tx().commit()
==>null
gremlin> g.addV().property(rdfs_label, 'N英雄').next()
==>v[8272]
gremlin> g.tx().commit()
==>null
  • Simple textContains search
gremlin> g.V().has(moviename,Text.textContains('英雄')).valueMap(true)
==>[label:vertex,id:8360,moviename:[O英雄]]
gremlin> g.V().has(rdfs_label,Text.textContains('英雄')).valueMap(true)
==>[label:vertex,rdfs:label:[N英雄],id:8272]

gremlin> g.V().has(moviename,Text.textContains('O')).valueMap(true)
==>[label:vertex,id:8360,moviename:[O英雄]]
gremlin> g.V().has(rdfs_label,Text.textContains('N')).valueMap(true)
==>[label:vertex,rdfs:label:[N英雄],id:8272]
  • Or-step with textContains
gremlin> g.V().where(has(moviename,Text.textContains('英雄')).or().has(rdfs_label,Text.textContains('英雄'))).valueMap(true)
09:25:55 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
gremlin> g.V().or(has(moviename,Text.textContains('英雄')), has(rdfs_label,Text.textContains('英雄'))).valueMap(true)
09:26:09 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes

gremlin> g.V().where(has(moviename,Text.textContains('O')).or().has(rdfs_label,Text.textContains('N'))).valueMap(true)
09:17:29 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
gremlin> g.V().or(has(moviename,Text.textContains('O')), has(rdfs_label,Text.textContains('N'))).valueMap(true)
09:17:35 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
  • Or-step with textContains (Exact match)
gremlin> g.V().where(has(moviename,Text.textContains('O英雄')).or().has(rdfs_label,Text.textContains('N英雄'))).valueMap(true)
09:17:55 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>[label:vertex,rdfs:label:[N英雄],id:8272]
==>[label:vertex,id:8360,moviename:[O英雄]]
gremlin> g.V().or(has(moviename,Text.textContains('O英雄')), has(rdfs_label,Text.textContains('N英雄'))).valueMap(true)
09:18:03 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>[label:vertex,rdfs:label:[N英雄],id:8272]
==>[label:vertex,id:8360,moviename:[O英雄]]

@pluradj
Member

pluradj commented Mar 1, 2018

gremlin> g.V().has(moviename,Text.textContains('英雄')).valueMap(true)
==>[label:vertex,id:8360,moviename:[O英雄]]
gremlin> g.V().has(moviename,Text.textContains('O')).valueMap(true)
==>[label:vertex,id:8360,moviename:[O英雄]]

Actually, this seems like it is not working as intended. textContains is supposed to match on exact words in the tokenized string. It is not supposed to match on a partial contains as shown above. Also note this particular behavior:

JanusGraph’s default tokenization splits the string on non-alphanumeric characters and removes any tokens with less than 2 characters.

Based on that, Text.textContains('O') actually should not return any results.
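
The rule quoted above can be sketched in a few lines of Python. This is a minimal model, not JanusGraph's implementation. It assumes "alphanumeric" means any Unicode letter or digit (so Chinese characters count), that matching is case-insensitive, and that a query which yields no tokens matches nothing. The names `tokenize` and `text_contains` are hypothetical.

```python
from itertools import groupby


def tokenize(text, min_len=2):
    """Split on non-alphanumeric characters; drop tokens shorter than min_len."""
    runs = ("".join(g) for is_alnum, g in groupby(text, key=str.isalnum) if is_alnum)
    return [t for t in runs if len(t) >= min_len]


def text_contains(value, query):
    """True if every token of the query is an exact token of the value."""
    value_tokens = set(tokenize(value.lower()))
    query_tokens = tokenize(query.lower())
    # Assumption: a query with no usable tokens (e.g. 'O') matches nothing.
    return bool(query_tokens) and all(t in value_tokens for t in query_tokens)


for value, query in [("O英雄", "英雄"), ("O英雄", "O"), ("O英雄", "O英雄"), ("O 英雄", "英雄")]:
    print(repr(value), repr(query), text_contains(value, query))
```

In this model 'O英雄' is a single token, so only the full string matches it, while 'O 英雄' (with a space) yields the token '英雄'. That is consistent with the or-step results reported earlier in the thread, where only the exact-match queries returned vertices, but whether the or-step actually evaluates the predicate this way is an assumption.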

@myitroad
Author

myitroad commented Mar 2, 2018

I have learned about JanusGraph's default tokenization, in which tokens shorter than 2 characters are ignored.
Perhaps this problem is related to the string tokenization strategy. I hope to find an alternative way to resolve my problem.
Thank you for your kind reply.

@pluradj
Member

pluradj commented Mar 2, 2018

Similar to what I described on #922, you could use textContainsRegex for partial matches on the tokens, if that's what you are trying to accomplish. If you think #922 duplicates what you are reporting here, please go ahead and close this issue.

@pluradj
Member

pluradj commented Mar 3, 2018

You might need to investigate using Elasticsearch Analysis Plugins to properly tokenize your target language. I don't think the default configuration can handle your character set correctly.
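
To illustrate why the choice of tokenizer matters here, the toy Python model below compares two tokenization strategies on the test string from this thread. It does not call Elasticsearch or any analysis plugin. The per-ideograph split is only an assumption about how an analyzer without language-specific support might treat Chinese text, and the helper names and the CJK range check are hypothetical simplifications.

```python
from itertools import groupby


def is_cjk(ch):
    """Rough check for a CJK unified ideograph (basic block only)."""
    return "\u4e00" <= ch <= "\u9fff"


def whole_run_tokens(text):
    """One token per run of letters/digits, lower-cased (no length filter here)."""
    return ["".join(g).lower() for ok, g in groupby(text, key=str.isalnum) if ok]


def per_ideograph_tokens(text):
    """Like whole_run_tokens, but every CJK ideograph becomes its own token."""
    tokens = []
    for run in whole_run_tokens(text):
        for cjk, group in groupby(run, key=is_cjk):
            chars = list(group)
            tokens.extend(chars if cjk else ["".join(chars)])
    return tokens


def matches(value, query, tokenizer):
    """True if every query token is an exact token of the value."""
    value_tokens = set(tokenizer(value))
    query_tokens = tokenizer(query)
    return bool(query_tokens) and all(t in value_tokens for t in query_tokens)


for tokenizer in (whole_run_tokens, per_ideograph_tokens):
    print(tokenizer.__name__, tokenizer("O英雄"), matches("O英雄", "英雄", tokenizer))
```

The same value and query match under one strategy and not under the other, which is the kind of disagreement a language-aware analyzer is meant to avoid.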

@chupman
Member

chupman commented Feb 7, 2019

To prevent confusion we have recently added a default template for new issues containing the guidelines as to what belongs in issues. Usage, configuration, and general questions should be asked in gitter, stackoverflow, or the janusgraph-users google group. Github issues are for reporting bugs, requesting new features, and tracking the development of JanusGraph. If your issue is still outstanding please consult one of the communities mentioned. If you still feel like your issue belongs here and was closed in error please feel free to reopen it.
