EVF Tutorial Late Schema
The tutorial has thus far focused on the log reader, which is an example of an "early schema" reader: the log reader has sufficient information to declare a schema before reading the file. Many plugins are of this form: CSV, Parquet, JDBC, etc.
However, other readers are truly "schema on read": they discover columns only as they are read. JSON is the classic example of such a "late schema" reader. Here we'll look at how to support schema discovery on the fly.
Almost everything we've discussed stays the same, except that we don't call setTableSchema() on our SchemaNegotiator instance; we just let EVF create a result set loader with no columns:
@Override
public boolean open(FileSchemaNegotiator negotiator) {
  ...
  // No call to setTableSchema()
  loader = negotiator.build();
  ...
  return true;
}
With late schema, we define columns on the fly. The JSON reader (not yet in master) is an example. Here is a highly simplified version, assuming all columns are VarChar:
void readNextColumn(TupleWriter writer) {
  String value = ...; // Get the value
  String name = ...;  // Get the column name

  // Look up the column writer by name
  ScalarWriter colWriter = writer.scalar(name);
  if (colWriter == null) {
    // First time we've seen this column: define it, add it, fetch its writer
    ColumnMetadata colSchema = MetadataUtils.newScalar(name, MinorType.VARCHAR, DataMode.OPTIONAL);
    int colIndex = writer.addColumn(colSchema);
    colWriter = writer.scalar(colIndex);
  }
  colWriter.setString(value);
}
Here we obtain the column writer by name. If the column has not yet been defined, we'll get a null value. In this case, we create a new column by first defining the metadata for the column, then adding the column to the writer, and grabbing the writer for the newly created column by index. Finally, we write the value to the underlying column vector.
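To make the "define columns on the fly" pattern concrete without pulling in Drill itself, here is a minimal, self-contained sketch. Everything in it (LateSchemaSketch, Column, writeRow) is a hypothetical stand-in invented for illustration, not part of EVF. It also assumes that a column discovered midway through a batch is back-filled with nulls for the rows already written, which is why the columns are treated as nullable, in the spirit of DataMode.OPTIONAL above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a row writer; this is NOT Drill's TupleWriter.
class LateSchemaSketch {

    // One nullable, VarChar-like column: a name plus one slot per row.
    static final class Column {
        final String name;
        final List<String> values = new ArrayList<>();
        Column(String name) { this.name = name; }
    }

    // Insertion-ordered, so the schema reflects the order of discovery.
    private final Map<String, Column> columns = new LinkedHashMap<>();
    private int rowCount;

    // Get-or-add: look the column up by name and define it on first sight.
    Column column(String name) {
        Column col = columns.get(name);
        if (col == null) {
            col = new Column(name);
            // Assumption: rows written before this column appeared hold null.
            for (int i = 0; i < rowCount; i++) {
                col.values.add(null);
            }
            columns.put(name, col);
        }
        return col;
    }

    // Write one record given as name/value pairs.
    void writeRow(Map<String, String> record) {
        for (Map.Entry<String, String> field : record.entrySet()) {
            column(field.getKey()).values.add(field.getValue());
        }
        rowCount++;
        // Columns missing from this record get a null for the row.
        for (Column col : columns.values()) {
            if (col.values.size() < rowCount) {
                col.values.add(null);
            }
        }
    }

    List<String> schema() { return new ArrayList<>(columns.keySet()); }

    List<String> values(String name) { return columns.get(name).values; }

    public static void main(String[] args) {
        LateSchemaSketch writer = new LateSchemaSketch();
        writer.writeRow(Map.of("a", "1"));
        writer.writeRow(Map.of("a", "2", "b", "x")); // "b" discovered here
        writer.writeRow(Map.of("b", "y"));
        System.out.println(writer.schema());    // [a, b]
        System.out.println(writer.values("a")); // [1, 2, null]
        System.out.println(writer.values("b")); // [null, x, y]
    }
}
```

The point of the sketch is only the control flow: look up by name, add the column if it is missing, then write. In a real reader the result set loader handles the vectors and the back-filling.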
There is actually a bit of magic going on here. Suppose our column is foo, but our query was SELECT bar FROM ... . When we add the column, we'll get back a dummy column writer since the query does not actually need the value of the foo column. As before, if we care whether the value is projected, we could use the isProjected() method to find out.
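The dummy-writer idea can be illustrated with a small stand-alone sketch. The types below (ProjectionSketch, ColWriter, VectorWriter, DummyWriter) are hypothetical and exist only for this example; they mimic the behavior described above rather than reproduce how EVF implements it. The reader code calls setString() the same way for every column, and the projection list decides whether the value is kept or silently discarded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of projection-aware writers; not Drill's classes.
class ProjectionSketch {

    interface ColWriter {
        void setString(String value);
        boolean isProjected();
    }

    // Writer for a projected column: values are actually stored.
    static final class VectorWriter implements ColWriter {
        final List<String> values = new ArrayList<>();
        @Override public void setString(String value) { values.add(value); }
        @Override public boolean isProjected() { return true; }
    }

    // Writer for an unprojected column: accepts writes and drops them.
    static final class DummyWriter implements ColWriter {
        @Override public void setString(String value) { /* discard */ }
        @Override public boolean isProjected() { return false; }
    }

    private final Set<String> projected;
    private final Map<String, ColWriter> writers = new HashMap<>();

    ProjectionSketch(Set<String> projected) {
        this.projected = projected;
    }

    // Adding a column hands back a real or a dummy writer, per the projection.
    ColWriter addColumn(String name) {
        ColWriter writer = projected.contains(name) ? new VectorWriter() : new DummyWriter();
        writers.put(name, writer);
        return writer;
    }

    // Values retained for a column; empty if the column is unprojected or unknown.
    List<String> storedValues(String name) {
        ColWriter writer = writers.get(name);
        return (writer instanceof VectorWriter) ? ((VectorWriter) writer).values : List.of();
    }

    public static void main(String[] args) {
        // Models SELECT bar FROM ...
        ProjectionSketch loader = new ProjectionSketch(Set.of("bar"));
        ColWriter foo = loader.addColumn("foo");
        ColWriter bar = loader.addColumn("bar");
        foo.setString("ignored"); // safe to call, nothing is stored
        bar.setString("kept");
        System.out.println(foo.isProjected());          // false
        System.out.println(loader.storedValues("bar")); // [kept]
    }
}
```

A reader would typically check isProjected() only when computing the value is expensive; otherwise writing to the dummy writer keeps the reader code free of projection logic.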
Next: Enhanced Error Reporting