
EVF Tutorial Scan Framework Creator

Paul Rogers edited this page May 26, 2019 · 7 revisions

Scan Framework Creator

We've created a batch reader for the log plugin. But thus far it has been an "orphan": nothing calls it. Let's fix that.

EVF is a general-purpose framework: it handles many kinds of scans. We must customize it for each specific reader. Rather than doing so by creating subclasses, we instead assemble the pieces we need through composition by providing a framework builder class. This builder class is what allows the Easy framework to operate with both "legacy" and EVF-based readers.

Define the Row Batch Reader Creator

Prior to the EVF, Easy format plugins were based on the original ScanBatch. At the start of execution, the Easy framework calls getRecordReader() in your plugin class to create a record reader for each file split. The Easy framework then passes all the readers to the scan batch.

With EVF, we use the record batch reader we just created. Instead of creating all readers up-front, EVF creates them on-the-fly as needed. It does so by calling a method on a "reader creator" class which you provide. Let's go ahead and create that class, again as a nested class within LogFormatPlugin:

  private static class LogReaderFactory extends FileReaderFactory {

    private final LogFormatPlugin plugin;

    public LogReaderFactory(LogFormatPlugin plugin) {
      this.plugin = plugin;
    }

    @Override
    public ManagedReader<? extends FileSchemaNegotiator> newReader(
        FileSplit split) {
      return new LogBatchReader(split, plugin.getConfig());
    }
  }

This is simple enough: EVF calls the newReader() method when it is ready to read the next file split. (The split names a file and, if we declared the plugin to be block-splittable, it also names a block offset and length.)

We are free to create the log batch reader any way we like: the constructor is up to us. We already trimmed it down earlier, so we simply use that constructor here.
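To make the on-demand call pattern concrete, here is a minimal, self-contained sketch. It does not use Drill's classes: SketchReader, SketchReaderFactory and runScan() are hypothetical stand-ins, and the real EVF scan does far more than this loop. The only point is the ordering: the factory is asked for a reader once per split, and only when the scan is ready for that split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; none of these are Drill classes.
class LazyReaderSketch {

  /** Stand-in for a batch reader bound to a single file split. */
  static class SketchReader {
    final String split;

    SketchReader(String split) {
      this.split = split;
    }
  }

  /** Stand-in for the reader creator: called once per split. */
  interface SketchReaderFactory {
    SketchReader newReader(String split);
  }

  /**
   * Stand-in for the scan. It asks the factory for a reader only when it
   * is ready for the next split, so at most one reader exists at a time.
   */
  static List<String> runScan(List<String> splits, SketchReaderFactory factory) {
    List<String> events = new ArrayList<>();
    for (String split : splits) {
      SketchReader reader = factory.newReader(split);
      events.add("create " + reader.split);
      events.add("read " + reader.split);
      events.add("close " + reader.split);
    }
    return events;
  }

  public static void main(String[] args) {
    // The constructor reference plays the role of the factory here.
    List<String> events = runScan(Arrays.asList("a.log", "b.log"), SketchReader::new);
    for (String event : events) {
      System.out.println(event);
    }
  }
}
```

In this sketch the reader for the second split is not created until the first has been closed, which is the property that distinguishes the EVF approach from handing all readers to the scan up front.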

In an advanced case, we could even create a different reader depending on some interesting condition. For example, the Parquet reader has both a "new" and "old" version with different capabilities.
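As a rough illustration of that idea, the following sketch shows a factory that picks between two reader implementations. All names here (SketchReader, NewStyleReader, OldStyleReader, ChoosingFactory, the preferNew flag) are hypothetical and are not part of Drill; how a real plugin decides between readers is up to that plugin.

```java
// Hypothetical stand-ins for illustration only; none of these are Drill classes.
class ReaderChoiceSketch {

  /** Stand-in for the common reader interface the scan works against. */
  interface SketchReader {
    String describe();
  }

  static class NewStyleReader implements SketchReader {
    private final String split;

    NewStyleReader(String split) {
      this.split = split;
    }

    @Override
    public String describe() {
      return "new reader for " + split;
    }
  }

  static class OldStyleReader implements SketchReader {
    private final String split;

    OldStyleReader(String split) {
      this.split = split;
    }

    @Override
    public String describe() {
      return "old reader for " + split;
    }
  }

  /** The factory hides the choice; the caller only sees SketchReader. */
  static class ChoosingFactory {
    private final boolean preferNew;

    ChoosingFactory(boolean preferNew) {
      this.preferNew = preferNew;
    }

    SketchReader newReader(String split) {
      return preferNew ? new NewStyleReader(split) : new OldStyleReader(split);
    }
  }

  public static void main(String[] args) {
    System.out.println(new ChoosingFactory(true).newReader("a.log").describe());
    System.out.println(new ChoosingFactory(false).newReader("a.log").describe());
  }
}
```

Because the decision lives inside newReader(), the code that drives the scan does not need to know that more than one reader implementation exists.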

This class is not called anywhere, so the plugin should still run using the old reader.

Create the Scan Framework Builder

EVF supports a number of "scan frameworks" and a wide variety of options. We use the "builder" pattern to specify how we want the scan to work: we create a builder, pass it options to configure the framework, then let the Easy scan framework do the actual building for us. Here's how we configure the file scan framework by adding a method to the plugin class:

Note: This material describes a revised code structure that is proposed in a PR. This page will be updated once that code is in the master branch.

  @Override
  protected FileScanBuilder frameworkBuilder(
      FragmentContext context, EasySubScan scan) throws ExecutionSetupException {
    FileScanBuilder builder = new FileScanBuilder();
    builder.setReaderFactory(new LogReaderFactory(this));

    // The default type of regex columns is nullable VarChar,
    // so let's use that as the missing column type.

    builder.setNullType(Types.optional(MinorType.VARCHAR));

    // Pass along the output schema, if any

    builder.typeConverterBuilder().providedSchema(scan.getSchema());
    return builder;
  }

The log reader reads from a file, so we use the FileScanBuilder class. If we wanted, we could also support the columns column, which reads each record into an array as the CSV reader does.

We tell the framework how to create our batch readers by calling setReaderFactory() with an instance of the reader creator we defined earlier.

Next we call setNullType() to define a type to use for missing columns rather than the traditional nullable INT. We observe that the native type of a regex column is nullable VarChar. So, if the user asks for a column that we don't have, we should use that same type so that the column's type remains unchanged when the user later decides to define that column.
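The reasoning about type stability can be checked with a small stand-alone sketch. This is not how EVF resolves projection internally: resolve(), the column names and the type strings are all hypothetical, chosen only to show what happens to a missing column's type under the two possible defaults.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; types are plain strings, not Drill MajorType objects.
class NullTypeSketch {

  /**
   * Gives each projected column the type the reader supplies, or the
   * configured null type when the reader has no such column.
   */
  static Map<String, String> resolve(List<String> projected,
      Map<String, String> readerColumns, String nullType) {
    Map<String, String> resolved = new LinkedHashMap<>();
    for (String name : projected) {
      String type = readerColumns.get(name);
      resolved.put(name, type != null ? type : nullType);
    }
    return resolved;
  }

  public static void main(String[] args) {
    List<String> projected = Arrays.asList("ts", "level");

    // Before: the regex defines only "ts". After: the user adds "level".
    Map<String, String> before = new HashMap<>();
    before.put("ts", "nullable VARCHAR");
    Map<String, String> after = new HashMap<>(before);
    after.put("level", "nullable VARCHAR");

    // With an INT default, "level" changes type once it is defined.
    System.out.println(resolve(projected, before, "nullable INT"));
    // With a VARCHAR default, the schema is the same before and after.
    System.out.println(resolve(projected, before, "nullable VARCHAR"));
    System.out.println(resolve(projected, after, "nullable VARCHAR"));
  }
}
```

With the VarChar default, a query sees the same schema whether or not the column has been defined yet, which is the behavior the setNullType() call is meant to secure.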

After you add this method, the log reader will still use the old version of the reader because we've not told the Easy framework to call the method we just created.

Select the Traditional or Enhanced Scan Framework

We are now ready to switch over to the new, enhanced (EVF-based) reader. To do so, we simply set one option in our plugin configuration:

  private static EasyFormatConfig easyConfig(Configuration fsConf, LogFormatConfig pluginConfig) {
    EasyFormatConfig config = new EasyFormatConfig();
    ...
    config.useEnhancedScan = true;
    return config;
  }

With this change, the EVF-based version is now live.

Conditionally Selecting the Original and EVF-Based Readers

If you are especially cautious, you can leverage the framework builder mechanism to offer both the new and old versions of your reader. Just override the useEnhancedScan() method. By default, the method just returns the option we set above:

  protected boolean useEnhancedScan(OptionManager options) {
    return easyConfig.useEnhancedScan;
  }

But we could instead select the framework based on a system/session option, as was done with the "v2" and "v3" versions of the text (CSV) reader in Drill 1.16:

  @Override
  protected boolean useEnhancedScan(OptionManager options) {
    return options.getBoolean(ExecConstants.ENABLE_V3_TEXT_READER_KEY);
  }
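The relationship between the default method and an override can be sketched without any Drill dependencies. Everything below is a hypothetical stand-in: SketchEasyPlugin, OptionDrivenPlugin, buildScan() and the option key are invented for illustration, and a plain Map takes the place of OptionManager. The fallback to the configured flag when the option is unset is also an assumption of this sketch, not something the Drill code above does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration only; none of these are Drill classes.
class ScanSelectionSketch {

  /** Stand-in for the base plugin: a hook method with a default answer. */
  static class SketchEasyPlugin {
    protected final boolean configFlag;

    SketchEasyPlugin(boolean configFlag) {
      this.configFlag = configFlag;
    }

    /** Default: use the value fixed in the plugin configuration. */
    protected boolean useEnhancedScan(Map<String, Boolean> options) {
      return configFlag;
    }

    /** The framework consults the hook when it builds the scan. */
    final String buildScan(Map<String, Boolean> options) {
      return useEnhancedScan(options) ? "enhanced scan" : "legacy scan";
    }
  }

  /** Override: let a session option decide, else fall back to the config. */
  static class OptionDrivenPlugin extends SketchEasyPlugin {
    static final String OPTION_KEY = "sketch.enable_enhanced_reader";

    OptionDrivenPlugin(boolean configFlag) {
      super(configFlag);
    }

    @Override
    protected boolean useEnhancedScan(Map<String, Boolean> options) {
      Boolean value = options.get(OPTION_KEY);
      return value != null ? value : configFlag;
    }
  }

  public static void main(String[] args) {
    Map<String, Boolean> options = new HashMap<>();
    OptionDrivenPlugin plugin = new OptionDrivenPlugin(true);
    System.out.println(plugin.buildScan(options));

    // A user switches back to the old reader for this session.
    options.put(OptionDrivenPlugin.OPTION_KEY, false);
    System.out.println(plugin.buildScan(options));
  }
}
```

The appeal of this arrangement is that both readers stay in the code base, so a user who hits a problem with the new one can switch back without a rebuild.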

Test

With this method in place, our new version is "live". You should use your unit tests to step through the new code to make sure it works -- and to ensure you understand the EVF, or at least the parts you need.

Next Steps

We've now completed a "bare bones" conversion to the new framework. We'd be fine if we stopped here.

The new framework offers additional features that can further simplify the log format plugin. We'll look at those topics in the next section.


Next: Discover Schema While Reading
