-
Notifications
You must be signed in to change notification settings - Fork 981
BH JSON Reader
As previously mentioned, this project upgraded two readers to test out the various frameworks: the CSV reader and the JSON reader. The JSON reader has special challenges because of a set of semi-supported features. As we will see, upgrading JSON to handle supported features was simple. Problems arose, however, in the semi- and un-supported features of JSON. Additional issues arose when realizing that Drill's design for handling JSON has gaping design issues that we've tended to ignore, perhaps because JSON is not used as frequently as we'd expect.
Let's start with requirements.
- Use the Jackson JSON parser to produce a stream of JSON tokens.
- Create a record batch schema based on the type of the first value.
- Support two forms of records: a sequence of top-level objects, or a single array of objects.
- Read JSON scalars as either typed values or text (the so-called "all-text" mode.)
- Support maps and arrays of maps to any depth.
- Support varying types (maybe.)
- Support varying lists (maybe.)
- Suport two-dimensional lists.
- Support null values in place of lists, maps or scalars.
The original code handles all of the above with some caveats:
- Nulls cannot appear before a non-null value. (If they do, Drill guesses some arbitrary type.)
- Varying types is supported with unsupported Union and List vectors.
- List vectors support is very vague. List vectors themselves have many bugs. Other operators seem to not support a List of Unions.
The revised code has a different set of caveats:
- Handles any number of leading nulls, as long as a non-null value appears in the first batch.
- Handles List vectors, but other operators still don't handle them.
- Does not handle varying types (unions).
The discussion that follows will explain this odd turn of events.
Structurally, the revised JSON reader follows the overall structure of the existing reader.
Layer | Existing | Revised |
---|---|---|
Format Plugin | Derived from Easy plugin | Same, using new framework |
Reader | Derived from RecordReader
|
Derived from `Managed Reader |
Semantic Parser | JSONReader |
JSONLoader |
Vector writer | Complex writers | Result set loader |
Token Parser | Jackson parser | Jackson parser |
The json
package contains both the old and new JSON implementations. The old implementation is still used by the Drill function to parse JSON, by the Kafka plugin, and by the MapR-DB JSON format plugin.
The JSONFormatPlugin
is modified as described in the Easy Format Plugin section:
- Uses the new
EasyFormatConfig
to configure the plugin. - Implements
scanBatchCreator()
to create the scan batch mechanism. - Provides the
JsonScanBatchCreator
class which builds up the scan operator from the required components, including the file scan framework and so on. Sets the type of null columns to nullableVarChar
(a better guess than nullableInt
.) - Implements the
JsonReaderCreator
class to create JSON batch readers as needed.
As in the original implementation, the JsonBatchReader
acts as a bridge between the scan operator and the JSON semantic parser:
- Opens the file described by a
FileWork
object. - Retrieves JSON session options and passes them to the JSON semantic parser.
- Sets up a "type negotiator" (see below).
- On each call to
next()
, reads records from the semantic parser until the batch is full. - Releases resources on
close()
Continuing to follow the previous design, the new code defines a JSONLoader
interface, akin to the prior JSONReader
that defines the services to read JSON records. In keeping with a theme for this project, th interface is dead simple:
public interface JsonLoader {
boolean next();
void endBatch();
void close();
}
Here, next()
reads one record.
The bulk of the actual JSON parsing work resides in the new parser
package.
Briefly, the semantic parser creates a parse tree to parse each JSON element. Thus, there is a parser for the root object, another for each field, and more for nested maps. The idea is to use a separate class for each JSON element, then combine them into an overall parse tree for the file. This approach contrasts with the flags-and-states approach of the prior version: it kept track of differences by setting a large variety of flags. That approach, however, is very difficult to maintain, enhance and unit tests, hence the revised approach.
The prior JSON reader has two notable characteristics:
- Much of the JSON structure is represented in the "complex writers".
- The JSON parser uses a large collection of flags to handle the various parse states.
Unlike the older complex writers, the result set loader makes no assumptions about how a reader wishes to structure its data. As indicated by recent JIRA tickets, we want slightly different rules for Parquet (declare types as required) than we want for JSON (all types must be optional.) With the old approach, we must add flags and modes to get the complex writes to behave differently for different readers.
The revised approach separates concerns. The result set loader simply loads columns, and has no opinion about the structure of the data (other than it be valid for Drill.) Semantic rule about data models are pushed into each reader where they belong.
Thus, for the JSON loader, it is the reader that decides data types, handles complex structures and so on. In particular, the "parse nodes" in the new design do a bit more work than in the old JSON reader, but the result set loader does less work than the old complex writers.
The JsonLoaderImpl
class is the core of the semantic processor. It:
- Holds onto the JSON parse options provided by the caller (by the
JSONBatchReader
). - Reads and returns rows.
- Maintains the root of the parse tree for the document.
- Resolves ambiguous types (discussed below).
- Handles errors.
- Provides the input error recovery created by a community member.
The Jackson JSON parser does the low-level parsing, providing a series of JSON tokens.
It is worth noting some details of the Jackson parser. First, it maintains its own internal state machine about the syntax of the JSON document. In this way, it is not a simple "tokenizer", it is a full parser in its own right. This behavior means that it is very, very difficult to get the parser to recover from errors. Once the parser gets into an error state, it does not want to come back out.
Second, it means that the semantic layer parser does not have to worry about syntax errors or details. The Jackson parser handles, say, the colons that separate key/value pairs, handles the difference between quoted and unquoted identifiers, and will catch invalid token sequences. The semantic parser just worries about a stream of valid tokens.
Third, the parser cannot handle look-ahead. Look-ahead, however, is often needed, so we must provide our own implementation in the form of the TokenIterator
class.
In general, look-ahead is difficult because we have to buffer the value of prior tokens. Fortunately, when parsing JSON, the only tokens that are subject to look-ahead are simple tokens such as {
, [
, and so on. So, the token iterator caches only simple tokens, but not token values.
The semantic parser is based on the idea of a parse tree. Each node in the tree parses one JSON element, then passes control to child nodes to parse any contained elements. The idea is classic parser theory, but the implementation requires a bit of explanation. As we noted, the Jackson parser does syntax validation. All the semantic parser has to do is handle the meaning of the tokens (hence the name.)
Each parse node handles one JSON element: a map, a scalar, an array, etc. It understands the tokens expected in that specific point in the parse. Often this means dealing with a scalar or null, or dealing with a nested structure.
Since each node handles just one task, it is easy to reason about the nodes. Also, it is quite easy to assemble them to handle JSON structures of any depth.
Each parse node implements JsonElementParser
:
interface JsonElementParser {
String key();
JsonLoader loader();
JsonElementParser parent();
boolean isAnonymous();
boolean parse();
ColumnMetadata schema();
}
Note: many parse node classes appear as classes nested inside other top-level classes. This was simply for convenience when developing as the classes were subject to rapid change. If desired, the classes can be pulled out to be top-level classes now that the structure has settled down.
The top-level of a JSON document can be of two forms. First is the familiar sequence of objects:
{ a: ..., b: ... }
{ a: ..., b: ... }
The class RootTupleState
parses this form. This class builds on the convention in the result set loader that rows and maps are both just tuples, and can be handled by the same TupleReader
mechanism. Thus, the RootTupleState
derives from ObjectParser
which also parses JSON objects that represent Drill maps.
The other form of JSON is as an array of objects:
[
{ a: ..., b: ... },
{ a: ..., b: ... }
]
The RootArrayParser
class parses this form. The array form has the slight advantage that it is valid JSON (which the first is not.) But, for all practical purposes, the sequence-of-objects form is easier to write and much more common.
Both parsers must detect EOF. The sequence-of-objects detects actual EOF, the top-level-array form just looks for the closing ]
token.
The ObjectParser
class is the most complex part of the parser as it must handle a number of tasks:
- Recognize the list of name/value pairs that define the columns in a row or members of a map.
- If the key has not yet been seen before, use the type of the value to create a value vector to hold that value and a parser to parse the value.
- If the key has been seen before, retrieve the existing parser to parse the value.
The devil, as they say, is in the details. Suppose we see a new key. We must:
- Check if the key is projected. If not, create a
DummyValueParser
that will "free-wheel" over the (possibly complex) tree for the value. - Determine the type of the value by looking ahead one (for scalars and objects) or several (arrays) tokens. For example, if the next token is an integer, we define the field to be of type
nullable BIGINT
. - If the value is
null
or an empty array ([ ]
), then create a special "deferred decision" parser as described later.
Most of the complexity in the object parser class deals with the ambiguities that arise from leading nulls or empty arrays. (More below.)
ScalarParser
is the base class for the four JSON scalar types:
JSON Type | Scalar Parser | Drill Type |
---|---|---|
TRUE, FALSE | BooleanParser | TINYINT |
NUMBER_INT | IntParser | BIGINT |
NUMBER_FLOAT | FloatParser | FLOAT8 |
STRING | StringParser | VARCHAR |
Any (text mode) | TextParser | VARCHAR |
Drill provides an "all-text" mode for JSON in which all scalar types are parsed as text. This mode can handle some type variation (10 and 10.1, say), but cannot handle variation from, say a scalar to an object. The TextParser
class handles this case.
JSON allows arrays of any of the JSON types. All array parsers derive from the ArrayParser
. Array parsing is built up via composition of a parser for the array as a whole, plus a parser for the specific element type.
JSON Element Type | Array Parser Class | Drill Type |
---|---|---|
TRUE, FALSE | ScalarArrayParser + BooleanParser | TINYINT |
NUMBER_INT | ScalarArrayParser + IntParser | BIGINT |
NUMBER_FLOAT | ScalarArrayParser + FloatParser | FLOAT8 |
STRING | ScalarArrayParser + StringParser | VARCHAR |
Any (text mode) | ScalarArrayParser + TextParser | VARCHAR |
OBJECT | ObjectArrayParser + ObjectParser | MAP |
ARRAY | RepeatedListParser + element parser | (repeated) LIST |
JSON also allows hetrogeneous arrays as described in the List Vector section below.
In general, the array parser handles the array as a whole while the element parser parses each value. The array parser can handle null
in place of the array; this simply translates to an empty array.
JSON supports arrays to any depth. Drill has the special (and oddly named) RepeatedListVector
for this case. (The name Repeated
is, I suppose, in contrast to the "non-repeated" (but still repeated) List
vector. You'll get used to it...) The repeated list vector is really a "repeated repeated something vector": it provides another layer of offset vector on top of a repeated type that carries its own offset vector. (By the way, Arrow has done a fine job of cleaning up this mess: arrow has just one LIST
type that can be a list of anything, even other lists.)
So, to handle a multi-dimensional list, we have a RepeatedListParser
for the outer dimensions, then we have an ArrayParser
to parse the innermost dimension. This division of labor, while clunky to describe, turned out to be the cleanest implementation.
All of the above would be really quite simple if JSON were well-behaved. But, JSON in the wild is complex (as we'll discuss below.) A whole set of parsers handles the ambiguities inherent in the JSON model:
- Runs of nulls before seeing a type
- Runs of empty arrays before seeing a type
- Runs of nulls, followed by runs of empty arrays, followed by runs of empty 2D arrays, and so on.
First, let's take a step back and understand the problem. JSON is a universal data model. Drill uses a relational model, but claims to support all of JSON. JSON supports features that make no sense in a relational model. Without some kind of external schema, there is no "right" answer for ambiguous scenarios such as the above. All we can do is provide a series of hacks and workaround, then hope that they are sufficient.
Suppose our parser is presented with the following:
{ a: null }
... 1000 more records ...
{ a: null }
{ a: 10 }
In general, we have to choose a data type on the first record. But, that record lists column a
as null
. null
what? We don't know, and won't know until we've read the 1000+ records until the 10
finally appears.
The existing JSON reader simply guesses a type (nullable INT
?). The revised JSON reader uses a trick. It defers deciding on a type until the first non-null value appears. The NullTypeParser
class handles this case: it accepts null
values, but as soon as it sees a non-null value, it chooses the proper data type, creates the needed vector, and replaces itself with the proper parser node. This works because JSON types are always nullable, and Drill can "back-fill" null values (because NULL
is the default state for nullable values.)
Now, suppose instead of the above, we have:
{ a: null }
... 1000 more records ...
{ a: [] }
... 1000 more records ...
{ a: null }
{ a: [ 10 ] }
That is, as above, we have a run of nulls. We handle those as above. Then, we discover a record with an empty array. (We can also discover this directly without the preceding nulls.) We know we have an array, but what kind of array? Again, we won't know until we see an actual value.
As with the above case, once we see that the value is an (empty) array, we create an instance of the NullArrayParser
class. This class accepts either null
values or empty arrays. Once it sees an element value, it replaces itself with the proper array parser.
Note that we can play this game to any level:
{ a: null }
{ a: [] }
{ a: [[]] }
{ a: [[[]]] }
{ a: [[[ 10 ]]] }
In each case, we use a NullArrayParser
for the innermost empty array, replacing it with either a RepeatedListParser
or the proper array parser as we learn more about the contents.
The above sounds like a complete solution. But, we need to think about one more case. What if we never actually see an actual value? What if the file contains only the following:
{ a: null }
Or
{ a: [ ] }
Or, suppose we read a very large file, and reach the end of the first batch without ever seeing an actual type.
What do we do? We have two choices:
- Ignore the column: pretend it never appeared.
- Pick some type for the column.
The existing JSON reader, and the new version, choose the second: we pick a type for the column. But, what type? Should we use our traditional nullable INT
? Well, as it turns out, JSON will never produce a nullable INT
type: the closest it can come is nullable BIGINT
. So, we know that nullable INT
is always wrong. Can we do better?
Suppose that this is the second file in a scan operator and the first file did have a non-null value for a
. We can use that type to resolve the ambiguity. The TypeNegotiator
mechanism in the new code does exactly this. The problem, of course, is that we can be wrong, especially if a later record in the same file has a different value. Of course, in this case, we'd have end up with a schema change elsewhere in Drill, and Drill does not actually support schema changes very well (nor can it since Drill is relational.)
But, suppose there is no prior file. We can glean hints from the project list. Suppose the projected column is a[0]
. We know a
must be an array, though we don't know the type of the array. Or, suppose the project list is a.b
. We know a
must be a Map, but we don't know if it is a repeated map. These are very weak hints indeed, but still the code tries to use them.
Finally, suppose we have no information at all? We must guess. Currently the code guesses nullable VARCHAR
. This will be right in two cases: 1) either the column does turn out to be a string, or 2) the reader is operating in all-text mode and the column turns out to be a scalar.
Observation: After writing this, I'm realizing that the original code may have gone down the wrong path. It may actually be better to go with the first option discussed above: simply omit the column and don't create a vector until we actually know the type. We still get a schema change, but of the more benign "a new column appears" form.
The key lesson to take away is that there is no right answer. The problem is inherently ambiguous. The only way to resolve this ambiguity is for the user to communicate their intent by providing a schema. The above hacks can reduce the number of schema-related bugs. But, until a schema is provided, the ambiguity is inherent in the design and cannot be resolved with ever-more-clever hacks.
JSON is a universal data model: values can be of any type. The following is perfectly fine JSON:
{ a: 10 }
{ a: [ 10, 20 ] }
{ a: { _type: "int", value: 10 } }
In fact, it is common to see the above as schemas evolve:
- What was a scalar becomes an array as developers realize that they must store multiple values.
- What was a scalar becomes an object so that the file can include metadata, such as the
_type
key above.
Drill, on the other hand, is relational. The relational model cannot handle columns that shift from one type to another. Still, Drill tries to provide such support with the Union vector.
The Union vector is like a "C" union, but one that was implemented as a C "struct". Suppose we have an int and a string:
strict myUnion {
int type;
int int_value;
string str_value;
}
Drill's union vector is similar. It contains:
- A type vector which identifies the type to use for each row (and can mark that row as null.)
- A collection of vectors, with one vector for each type. (For the above example, we'd have
nullable INT
andnullable VARCHAR
vectors.)
This is a clever design, but it is absolutely incompatible with the Relational model in which columns are assumed to have a type. Still, Drill has implemented the feature. Only the (prior) JSON reader supports Union vectors. Some operators support the Union vectors, but most do not. JDBC and ODBC do not support unions (though the JDBC driver can convert the union to a Java object.)
The MapR commercial version of Drill does not support the union type, though the Apache version offers it as an (unsupported) option. QA does not test the Union vector, though some Drill unit tests do run simple tests.
Given all this, what do we do when upgrading JSON? Do we:
- Drop union support (since it is barely supported), or
- Keep union support and make it work (since it is claimed to be supported in open source).
This project tried to do the latter, but ended up with the former.
The column readers and writers (together "accessors") were extended to support the union vector as described in the Column Accessors section. This code is tested and works. (To make it work, we ended up having to fix a number of issues in the Union vector itself.)
The result set loader layer was also enhanced to handle unions. (Unions added quite a bit of complexity. But, since unions are basically maps keyed by type, the work was possible.)
The challenge arose when adding Union vector support to JSON. The old complex writers handle unions internally. If a writer for an INT
, say, is asked to store a VARCHAR
, it will automatically convert itself from an INT
writer on an INT
vector to a Union
writer on a Union
vector (with the prior INT
vector as its first subtype.)
Such logic is possible to add to the result set loader, but it would cause a vast increase in complexity. For performance, the result set loader allows clients to cache column writers. If writers can change type at any time, then caching cannot be done and performance suffers. Or, the column writers could provide an internal level of indirection to allow replacement, but this will also cause a performance impact.
This then raises a very difficult dilemma:
- Do we slow performance for all users of the column writers in order to support a barely-supported Union vector feature?
- Or, do we take a step back and figure out if we are even going in the right direction with Union vector support?
At present, the JSON rework took the second approach. The new code does not support the Union vector. Some issues to consider moving forward.
Above we discussed the value of a schema for resolving ambiguous null values. The schema could also provide a rule for handling heterogeneous values. For example, consider the following:
{ a: 10 }
{ a: 10.1 }
{ a: "-15" }
Drill fails on the sequence 10 (BIGINT), 10.1 (FLOAT8). But, a schema rule that said, "parse all a
as FLOAT8
" would resolve the issue cleanly. Or, suppose we have:
{ a: 10 }
{ a: -10 }
{ a: "FOO-10" }
Perhaps the above are part numbers. In this case, a rule that says, "parse a
as a VARCHAR
" would be much cleaner than a Union vector solution.
All the union vector does is kick the can down the road: it tries to get heterogeneous values into Drill so that the user can use complex SQL to map them into a common type. Having a schema allows that conversion to be done at read time. Doing the work at read time is faster (no need to create, then discard, the union values), more memory efficient (one vector rather than one for each of multiple types), and simpler for the user (specify a schema once per table rather than in every query.)
Of course, views are our answer to this. But, for a view to work, the data must be loaded into Drill vectors. And, since the data is ambiguous to Drill, Drill must support union vectors so that the view can work on the data to normalize it. Thus, a view-based solution requires that union vectors work.
These are all serious design issues. The conclusion was to not invest in the time needed to implement union support until the team has a chance to consider these issues, decide if the do need to be addressed, and if so, how they should be addressed.
As mentioned previously, the Drill unit tests do exercise Union Vector support in JSON. Since this support is not added to the new code, these existing unit tests fail. This then creates yet another dilemma:
- Do we disable the tests? (But, what if users rely on those features?)
- Do we add the very complex code need just to pass the tests? (But, see above for the costs of this approach.)
- Do we create two readers (as in Parquet): a legacy one for union queries and the new one for all other queries?
There are no good answers unless we get to the core of the issues identified above and resolve them.
For now, the new code will not pass tests that require union support.
Drill provides another obscure vector: the "non-repeated" ListVector
. The list vector is intended to mimic JSON lists:
- It provides a batch of lists: one list per data row.
- The list for a row can be null, or can contain values.
- The list can be of a single type (think of it as, say, a
nullable repeated BIGINT
.) - The list can be a
repeated UNION
.
The list vector is intended to support the following:
{ a: [ 10 ] }
{ a: null }
{ a: [ 10, 10.1, "-15" ] }
{ a: [ [10, 20], [30, 40] }
{ a: [ { b: 10 }, { b: 20 }, 30 ] }
Very clever indeed!
As we discussed above, the Union vector in drill is barely supported in Apache Drill, but unsupported in MapR Drill. The List vector is even more vague: it appears to be mostly unsupported (and unknown.) But, experiments show that the existing code will create a List vector for the following case:
{ a: [ ] }
{ a: [ 10 ] }
It seems that the JSON reader creates a list vector to handle an empty array. (As we discussed, the new code solves this issue at the parser level.)
When working with the List vector, it turned out that the code had many, many bugs. This suggests that the vector is not used for anything other than to represent a list of nulls.
Not realizing that the List vector doesn't work, this project proceeded to fix the vector so that it works with the column accessors. The largest source of bugs was in the code that "promotes" a single-type list to union, but other fixes were needed as well. The code works for the column accessors, but failed in the Project record batch when tried for real queries. For the reasons noted above for Union vectors, remaining work was tabled until we resolve the Union vector question.
The column accessors were enhanced to support the List vector, including its odd lifecycle:
- The List vector starts as a list of nulls.
- It then evolves to be a list of some type (
BIGINT
, say). - It then is "promoted" to a list of unions upon presentation of a second type (
VARCHAR
, say).
In this case, a "shim" layer was added to make the list always look like a list of Unions (termed "Variants" in the column accessors), with the type magic happening behind the scenes. The mechanisms needed to accomplish the above gymnastics are complex: probably too complex for the marginal value of the List vector given that Drill does not support the Union vector.
Recommendation: Solve the schema issue mentioned above, then the need for the List vector disappears. Then, rip out the list vector support from the accessors.
Unlike the Union vector, the List vector is supported in the revised JSON reader. Specialized Array classes, the ListParser
and its associated classes handle the rather complex parsing rules: handling nulls, type transitions and so on.
It is because JSON List support works that we were able to run a query far enough to discover the flaws in other operators. (It is not clear how those tests passed previously -- something to investigate.)
The JSON work was test-driven. Most of the JSON-specific unit tests pass, but status of the functional tests is unclear.
The JSON tests appear to have been written early in the project, and in a hurry. They tend to be rather rough, use obscure mechanisms, and sometimes "verify" results simply by printing results to the console.
The existing 'TestJsonReader` test was heavily rewritten to use more modern mechanisms and to verify results. See the classes in the JSON test package for details.
There is one test that illustrates some of the problems mentioned above: it checks for two entirely different result sets that are produced depending on the (random) order that the test query reads to JSON files. The test almost works, but it is certainly not clear that users benefit from this kind of non-deterministic behavior.
The JSON reader was subject to several rounds of the functional ("pre-commit") tests. Insufficient time was available to resolve all issues. The best path forward was to:
- Run the tests
- Identify issues
- Create a unit test that illustrates the problem
- Determine if this is just a "plain bug" or if it reveals the need for more digging and additional features
This is the way we discovered some of the type ambiguity issues, List vector issues and so on.
The discussion above highlighted open issues, especially those related to ambiguities in how Drill handles JSON data structures. If we set aside those issues, the code is solid and has passed most tests as described above.
Some future issues:
- Resolve the Union/List/Schema issues identified above.
- If we decide to keep Union vector support:
- Determine how to do the promotion from simple types to a Union vector.
- Determine why the List vector fails in downstream operators and fix the issues.
- If we decide to go the schema route:
- Rip out the List vector and Union support.
- Rip out the deferred type parser nodes.
- Add a schema metadata layer (the
SchemaNegotiator
points the direction).
- Finish updating the JSON unit tests, and ensure the code passes.
- Finish running the pre-commit tests and fixing any issues.
Then, there is one implementation issue:
- Move JSON options from session options to format plugin options to allow per-file (rather than per-session) values.
That is, JSON has many options such as all-text mode. These are set at the session level today. But, this is more of a per-file work-around, so it would be better placed in the plugin properties so it can be passed as a table-function option only in those queries that need it. (Even better would be for this to be part of per-file metadata, but that is a longer-term fix.)