
EVF Tutorial Special Columns

Paul Rogers edited this page Jun 13, 2019 · 5 revisions

Handling Special Columns

The log reader handles two special columns:

  • _unmatched_rows: The text of any line that did not match the pattern.
  • _raw: The entire unparsed row.

These columns also introduce some interesting semantics:

  • The special columns do not appear when doing a wildcard (*) query. The user must request them explicitly.
  • If the SELECT clause contains none of the regular fields and does not contain the _raw column, then don't save the matched rows.
  • If the SELECT clause contains the _unmatched_rows column, then save those rows, else skip them.
  • However, if the SELECT clause contains nothing at all (because we're processing a SELECT COUNT(*) query), then do save the (empty) matched rows so that they can be counted.

The semantics here are unique to the log reader, but they do allow us to show how to use EVF to handle such odd cases. Let's start with the basics.

Define the Special Columns

Prior logic had to fit these into the array of columns, which was a bit awkward. We can revise this logic to exploit the EVF. We simply always define our "special" columns, storing their writers directly:

  private static final String RAW_LINE_COL_NAME = "_raw";
  private static final String UNMATCHED_LINE_COL_NAME = "_unmatched_rows";
  ...
  private ScalarWriter rawColWriter;
  private ScalarWriter unmatchedColWriter;
  ...
  private TupleMetadata defineSchema() {
    ...
    SchemaBuilder builder = new SchemaBuilder();
    ...
    builder.addNullable(RAW_LINE_COL_NAME, MinorType.VARCHAR);
    builder.addNullable(UNMATCHED_LINE_COL_NAME, MinorType.VARCHAR);
    TupleMetadata schema = builder.buildSchema();

    // Exclude special columns from wildcard expansion

    schema.metadata(RAW_LINE_COL_NAME).setBooleanProperty(
        ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
    schema.metadata(UNMATCHED_LINE_COL_NAME).setBooleanProperty(
        ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);

    return schema;
  }

Some things to note:

  • We add the special columns to the schema builder after the regex columns. This ensures that the column indexes for the regex columns are the same in both the columns array and the schema.
  • We always define the special columns, relying on EVF projection to materialize them only when needed.
  • Once the schema is built, we retrieve the metadata for each special column and set the EXCLUDE_FROM_WILDCARD property, which tells EVF not to include these columns in a wildcard query.
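
To make the effect of the wildcard exclusion concrete, here is a toy model of wildcard expansion. It is a sketch only: the class and method names are hypothetical, it does not use any Drill classes, and it is not how EVF implements projection. It simply shows the intended outcome, which is that SELECT * skips the flagged columns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model only: hypothetical names, not EVF's implementation.
final class WildcardModel {

  // Expand "SELECT *" over the reader schema, skipping any column
  // flagged as excluded from wildcard expansion.
  static List<String> expandWildcard(List<String> schema, Set<String> excluded) {
    List<String> projected = new ArrayList<>();
    for (String col : schema) {
      if (!excluded.contains(col)) {
        projected.add(col);
      }
    }
    return projected;
  }
}
```

In this model, a schema that ends with _raw and _unmatched_rows expands to only the regular columns. The special columns are materialized only when the query names them explicitly.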

Write to the Special Columns

Next we must revise our draft nextLine() method to write the special columns: _raw for each matched line, and _unmatched_rows for each line that does not match:

    if (lineMatcher.matches()) {
      // Matched line: write _raw along with the regex fields.
      rowWriter.start();
      rawColWriter.setString(line);
      loadVectors(lineMatcher);
      rowWriter.save();
    }
    ...
    // Unmatched line: write only _unmatched_rows.
    rowWriter.start();
    unmatchedColWriter.setString(line);
    rowWriter.save();
    return true;
  }

We leverage the "dummy" feature of writers: if a special column is not projected, writing to it is a no-op.

As an aside, we could have written the following instead:

    writer.scalar(UNMATCHED_LINE_COL_NAME).setString(line);

The above form is handy if you must work with columns by name rather than position, for example if working with JSON. In the log reader, however, we cache the writer for a slight performance gain. Use whichever works best for your plugin.

Checking if a Column is Projected

The above code for _unmatched_rows is not quite right. According to the semantics identified earlier, we want to save the unmatched row only if the user requests it. This is easy to add: the isProjected() method on a writer tells us whether the column has an actual materialized vector (it is projected) or is a dummy, unprojected column:

    if (unmatchedColWriter.isProjected()) {
      rowWriter.start();
      unmatchedColWriter.setString(line);
      rowWriter.save();
    }
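
The difference between relying on the dummy writer and checking isProjected() can be seen in a small stand-alone sketch. The classes below are simplified, hypothetical stand-ins written for this illustration; they are not Drill's ScalarWriter or RowSetLoader. The row counter plays the role of rowWriter.start() and rowWriter.save().

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for illustration; these are NOT Drill's writer classes.
final class ProjectionGuardSketch {

  // Minimal column writer: a projected column stores values, a dummy discards them.
  static final class ColumnSketch {
    private final boolean projected;
    final List<String> values = new ArrayList<>();

    ColumnSketch(boolean projected) {
      this.projected = projected;
    }

    boolean isProjected() {
      return projected;
    }

    void setString(String value) {
      if (projected) {
        values.add(value);
      }
      // Unprojected: silently ignore the value (the "dummy" behavior).
    }
  }

  // Returns the number of rows saved for the given unmatched lines.
  static int writeUnmatched(ColumnSketch unmatchedCol, List<String> lines, boolean guard) {
    int rowCount = 0;
    for (String line : lines) {
      if (guard && !unmatchedCol.isProjected()) {
        continue; // Column not requested: skip the row entirely.
      }
      unmatchedCol.setString(line); // No-op when the column is a dummy.
      rowCount++;                   // Stands in for rowWriter.start()/save().
    }
    return rowCount;
  }
}
```

Without the guard, an unprojected column still produces one (empty) row per unmatched line, because the write is a no-op but the row is saved anyway. With the guard, no rows are produced unless the column was requested.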

Getting Fancy with Projection

The log reader has one more requirement: save matched rows only under the following conditions:

  Fields   _raw   _unmatched_rows   Save Matched Row?
  Yes      N/A    N/A               Yes
  No       Yes    N/A               Yes
  No       No     Yes               No
  No       No     No                Yes

The first three columns indicate whether the query projects at least one regular field, the _raw column, or the _unmatched_rows column. The fourth column tells us if we should save the matched rows. The last row above may be surprising: if nothing is projected at all, we must at least start/save the matched rows so we can count them.

To record our decision, we add a flag, saveMatchedRows. The exact logic is specific to the log reader; we just want to see how we can use EVF features to implement our rules.
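
The four rows of the decision table reduce to a single boolean expression. The following is a minimal sketch, independent of Drill: the class, method, and parameter names are hypothetical and exist only to check the rule against the table.

```java
// Hypothetical helper, not part of Drill: encodes the decision table above.
final class MatchedRowPolicy {

  // anyFieldProjected:  the query asks for at least one regular field.
  // rawProjected:       the query asks for _raw.
  // unmatchedProjected: the query asks for _unmatched_rows.
  static boolean saveMatchedRows(boolean anyFieldProjected,
                                 boolean rawProjected,
                                 boolean unmatchedProjected) {
    // Skip matched rows only when _unmatched_rows is the sole thing requested.
    return anyFieldProjected || rawProjected || !unmatchedProjected;
  }
}
```

The only combination that yields false is the third row of the table: no fields, no _raw, but _unmatched_rows requested.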

Normal Empty Projection Case

If we did not have these complex rules, we could find out if we are in a COUNT(*) case by calling the isProjectionEmpty() method on the SchemaNegotiator passed to the batch reader open() method. When projection is empty, we just need to pass along a row count; we don't need to actually write any values. This is useful to know if we have some shortcut way to get the row count, such as from a file header or footer. See the isProjectionEmpty() Javadoc for details.

Determine When to Save Matched Rows

We can extend our bindColumns() function to use isProjected() to compute saveMatchedRows:

  private void bindColumns(RowSetLoader writer) {
    for (int i = 0; i < capturingGroups; i++) {
      columns[i].bind(writer);
      saveMatchedRows |= columns[i].colWriter.isProjected();
    }
    rawColWriter = writer.scalar(RAW_LINE_COL_NAME);
    saveMatchedRows |= rawColWriter.isProjected();
    unmatchedColWriter = writer.scalar(UNMATCHED_LINE_COL_NAME);

    // If no match-case columns are projected, and the unmatched
    // column is unprojected, then we want to count (matched)
    // rows.

    saveMatchedRows |= !unmatchedColWriter.isProjected();
  }

Conditionally Save Matched Rows

Finally, we can modify the nextLine() function to conditionally save matched lines:

      if (saveMatchedRows) {
        rowWriter.start();
        rawColWriter.setString(line);
        loadVectors(lineMatcher);
        rowWriter.save();
      }

At this point, the special columns should work, and we should save rows per our complex requirements.


Next: Type Conversion
