Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Isthmus: Improve the API for Substrait to Sql #276

Open
mbwhite opened this issue Jun 27, 2024 · 2 comments
Open

Isthmus: Improve the API for Substrait to Sql #276

mbwhite opened this issue Jun 27, 2024 · 2 comments

Comments

@mbwhite
Copy link
Contributor

mbwhite commented Jun 27, 2024

Background: this came out of our recent work using Substrait - and found that converting to SQL was bit cumbersome. I've outlined a suggestion below based on the code we've got. The implementation is the first pass and fully expect to get suggestions for improvement. More than happy to submit a PR for this and see it through to completion. Thanks!

SubstraitToSql
This currently works at the Relation level not the Plan Level. As such the final SQL statement doesn't contain the expected final names of the columns. What was a select id as identifier is lost to become a select id as $f1

The proposal here is work at the plan level where this information can be added via modifying the final relational. The information is in the plan and therefore could be used.

This would give symmetry to the APIs - and make them easier to use externally. Having to drop one level in the proto structure is cumbersome.

The downside of the approach is that in some cases calcite is creating an additional project relation. Have attempted to rationalise this via optimisers, but it felt better to leave the optimisation out. And get an SQL statement that was correct.

On the API structure..

Should the API for Isthmus be io.substrait.plan.Plan or io.substrait.proto.Plan. As this is an external API, then it would be useful to have consistency - especially with the return types that can't be overloaded? SqltoSubstrait is currently returning the protobuf versions. Is there a reason that Isthmus does not return the io.substrait.plan.Plan versions - this would seem to be the higher-level API?

Code

Add to the SubtraitToCalcite class

    public RelNode convert(Root plan) {
        List<String> intendedFieldNames = plan.getNames();
        Rel rel = plan.getInput();
        CalciteSchema rootSchema = toSchema(rel);
        RelBuilder relBuilder = createRelBuilder(rootSchema);
        SubstraitRelNodeConverter converter = createSubstraitRelNodeConverter(relBuilder);

        RelNode calciteRelNode = rel.accept(converter);

        // apply final project
        List<RexNode> newAlias = new ArrayList<RexNode>();
        relBuilder = relBuilder.push(calciteRelNode);
        for (var x = 0; x < intendedFieldNames.size(); x++) {
            var alias = relBuilder.alias(relBuilder.field(x), intendedFieldNames.get(x));
            newAlias.add(alias);
        }

        var updatedCalciteNode = relBuilder.project(newAlias, ImmutableList.of(), true).build();

        return updatedCalciteNode;
    }

In SubstraitToSql, this new convert() is then called by an updated toSql

   /** The quoting is needed - as calcite uses $f0 style for field names that need to be created;*/
    public static final SqlDialect DEFAULT_SQL_DIALECT = new SqlDialect(SqlDialect.EMPTY_CONTEXT.withIdentifierQuoteString("\"")) {
        @Override
        public boolean supportsApproxCountDistinct() {
            return true;
        }
    };

    public static List<String> toSql(Plan plan, SqlDialect sqlDialect) {
        try {
            var sqlStrings = new ArrayList<String>();

            SimpleExtension.ExtensionCollection extensions = SimpleExtension.loadDefaults();
            var converter = new SubstraitToCalcite_II(
                    extensions, new JavaTypeFactoryImpl(SubstraitTypeSystem.TYPE_SYSTEM));

            for (var root : plan.getRoots()) {
                var calciteRelNode = converter.convert(root);
                var sqlString = SubstraitToSql.toSql(calciteRelNode, sqlDialect);

                sqlStrings.add(sqlString);
            }

            return sqlStrings;
            }

            return sqlStrings;
        } catch (IOException e) {
            throw new RuntimeException("Unable to convert to SQL", e);
        }
    }

Externally this would look like

            var substrait_proto_a = new SqlToSubstrait().execute(sqlQuery, Arrays.asList(createTables));
            var plan = new ProtoPlanConverter().from(substrait_proto_a);           
            List<String> sqlStrings = SubstraitToSql.toSql(plan);
            sqlStrings.forEach(System.out::println);

Notes:

  • as mentioned what is the intention with return types io.substrait.proto or the high-level io.substrait.plan calls?
  • Error handling can be matched to the preferred handling in the isthmus library
  • Should the style of APIs be match - static vs instance?
@vbarua
Copy link
Member

vbarua commented Jun 27, 2024

Is there a reason that Isthmus does not return the io.substrait.plan.Plan versions - this would seem to be the higher-level API?

It predates my time working with the library, but it's been on my list of thing to address. I don't think there's a good reason for SqlToSubstrait to return protos, I suspect it was implemented early on and never revisited. In my opinions Isthmus should work with the Substrait POJOs (i.e. io.substrait.plan.Plan) and we should excise any references to the protos that we find in it.

As for the API in SubstraitToCalcite, I think the convert method should actually return a org.apache.calcite.rel.RelRoot and not just a org.apache.calcite.rel.RelNode. Something like:

public RelRoot convert(Root plan) {

From the RelRoot docs, the fields field can be used to assign output names for aliasing purposes.

The SubstraitRelVisitor (which converts from Calcite to Substrait) also does the wrong thing here IMO and throws away the RelRoot information:

public static Rel convert(RelRoot root, SimpleExtension.ExtensionCollection extensions) {
return convert(root.rel, extensions, FEATURES_DEFAULT);
}
public static Rel convert(
RelRoot root, SimpleExtension.ExtensionCollection extensions, FeatureBoard features) {
return convert(root.rel, extensions, features);
}

These should return a Substrait Root object.

Should the style of APIs be match - static vs instance?

I think it should be instance based, because I would like to see more configuration for the conversion.

I'm actually not super happy with the SQL conversion API as is, and aside from it being useful as an internal utility for fast testing I'm not sure how useful it is generally. Core Isthmus can handle converting from Calcite into Substrait (via the poorly named SubstraitRelVisitor) or from Substrait to Calcite (using SubstraitToCalcite). The SQL stuff feels tacked on and it's also not specific to Isthmus (or Substrait) because it should (ideally) only depend on Calcite if we layer the APIs correctly.

@mbwhite
Copy link
Contributor Author

mbwhite commented Jun 28, 2024

Thanks @vbarua - admit that I would agree with your views and opinions; things aren't quite lined-up. It will have evolved overtime as the standards and requirements have emerged.

I'll look in more detail at the SubstraitRelVisitor

UPDATED: Add this as an update based on more digging

  • I think that the SubstraitRelVisitor.convert() could be correct - in that the Calcite RelRoot information is being stored at the outer Substrait Plan level, not the node level; so I'm guessing this is why the convert is as it is.
  • When calcite is used initially to convert from SQL, it does include the Calcite RelRoot with the fields etc. It's perfectly reasonable then to recreate this on the way back out. So rather than adding a project relation as in my suggestion above, the RelRoot can come back. i.e.
        Map<Integer, String> fields = new HashMap<>();
        for (var x = 0; x < intendedFieldNames.size(); x++) {
            fields.put(x, intendedFieldNames.get(x));
        }
        var calciteRelRoot = new RelRoot(calciteRelNode,
                calciteRelNode.getRowType(),
                SqlKind.SELECT,
                fields.entrySet(),
                RelCollations.EMPTY,
                ImmutableList.of());
  • This does not have symmetry in both directions. so far so good so if Isthmus purely dealt with the Calcite at it's API then that would be ok.

However.... for the next step getting back to SQL, not so easy. And I think this is more Calcite than Substrait. There's no support in RelToSql to take notice of the RelRoot. Looking in detail at the RelRoot there's a project API to add a project to get back to a RelNode. Can't help but wonder if that API appeared as somebody was trying to do the same thing. (Calcite to SQL). The RelRoot.project(..) is doing effectively what's in my suggestion above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants