-
Notifications
You must be signed in to change notification settings - Fork 981
Working with VARCHAR Data in UDFs
As described in the previous section, most Drill types provide a holder that contains all the data you need. But, the variable-width types (VARCHAR
and VARBINARY
) are more complex as they provide direct access to the underlying data stored in direct memory. This page explains how to work effectively with these types.
One thing to keep in mind: direct memory is exactly that: memory outside of the Java heap. Java can provide no protection. If your code writes to the wrong memory location, you can crash the Drillbit process with no logging to help you figure out your error.
Drill stores VARCHAR
as bytes encoded in UTF-8. To properly handle such data, you must write code that is aware of the UTF-8 encoding. Doing so is quite complex, so one choice is to convert the UTF-8 data to Java String
, then convert the result back to UTF-8. This conversion does, however, consume time. So, Drill often takes a shortcut: it works with the single-byte subset of UTF-8 (also known as ASCII.) While this can produce incorrect results with non-ASCII data, it seems to work well for many cases. (Yes, this does seem more like a bug than a feature...)
So, before you begin writing code, consider if your input data will always be ASCII. If so, you can work with the UTF-8 data a buffer of ASCII characters. But, if your data includes non-ASCII characters (or you wish to contribute your code to the Apache Drill project), you must either handled multi-byte characters, or do the conversion to String
.
The VarCharHolder
provides a view into the underlying VarCharVector
. Before we look at the holder, let's understand the vector.
The VarCharVector
is comprised of two parts, as described in the Drill documentation:
- A data vector which is a large buffer that contains all the column values concatenated together.
- An offset vector which provides the start and end position of the value for each row.
So, to work with the VARCHAR
value for a specific row, we need:
- The buffer that holds all the values
- The start position of the current value
- The end position of the current value (from which we can compute the length as
start - end
)
Unsurprisingly, the above is exactly what the VarCharHolder
provides:
public final class VarCharHolder implements ValueHolder{
public int start;
public int end;
public DrillBuf buffer;
The first two fields make sense, but what is a DrillBuf
? Here is where Drill assumes that UDF writers are already familiar with Drill internals. The DrillBuf
is an interface to a direct memory buffer. If you have downloaded Drill sources, and have set up a development environment, you can take a look at the DrillBuf
source code since Drill provides no Javadoc for any of its APIs.
For our purposes, we will need the following methods when working with parameters:
public byte getByte(int index);
public ByteBuf getBytes(int index, byte[] dst);
public ByteBuf getBytes(int index, byte[] dst, int dstIndex, int length);
The first methods allows us to work byte-by-byte. The second allows us to work with the entire value, such as when we wish to convert to strings (the length of the dst
buffer says how many bytes to copy.) The last version is general purpose for when we need slightly more control.
Let's write two functions: one which counts letters using the single-byte subset of UTF-8 (AKA ASCII), the other which uses Unicode. These illustrate the two ways to work with VARCHAR data, and the pitfalls of working only with ASCII.
Here is the ASCII (byte-oriented) version:
public class CountLettersFunctions {
@FunctionTemplate(
name = "countLettersAscii",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public static class CountLettersAsciiFunction implements DrillSimpleFunc {
@Param VarCharHolder input;
@Output IntHolder output;
@Override
public void setup() { }
@Override
public void eval() {
int len = input.end - input.start;
output.value = 0;
for (int i = 0; i < len; i++) {
int c = (input.buffer.getByte(input.start + i) & 0xFF);
if (Character.isAlphabetic(c)) {
output.value++;
}
}
}
}
}
Note that we are using a handy trick: defining more than one function in a single Java file. Drill uses this technique frequently.
Let's try some examples. Let's create letterInput.json
:
{ word: "aA 12 zZ!" }
{ word: "Αθήνα?" }
{ word: "-Mосква-" }
{ word: "海." }
Then run a query and look at the results:
SELECT word, countLettersAscii(word) FROM `cp`.`letterInput.json`
aA 12 zZ!,4
Αθήνα?,5
-Mосква-,7
海.,2
The first line is fine: 4 letters. The Greek is good (second line). But, the Russian (third line) is wrong: 7 characters instead of 6. And the Chinese (last line) reports 2 letters instead of 1. What is wrong?
The UTF-8 encoding of some of the characters is multi-byte and is overlapping ASCII letter range which is all our function can handle. (Another error that could occur, not shown here, is that a character is not counted if it falls outside the ASCII letter range.)
Clearly, in this case, we have to become UTF-8 experts. Or, we can pay the cost to convert UTF-8 to a Java String
. Lets do that next.
To convert an array of UTF-8 bytes to a string, we write:
byte bytes[] = ...
String str = new String(bytes, com.google.common.base.Charsets.UTF_8);
You must provide the character set argument. Otherwise, Java will use the default character set for whatever machine is running Drill, which can produce inconsistent results.
Similarly, to convert from a String
to an array of bytes:
byte bytes[] = String.getBytes(com.google.common.base.Charsets.UTF_8);
Drill provides a number of helper functions to work with VARCHAR
columns, defined in the org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
class. (Remember to use the fully-qualified name when referencing these methods.)
public static String getStringFromVarCharHolder(VarCharHolder varCharHolder);
public static String getStringFromVarCharHolder(NullableVarCharHolder varCharHolder);
public static String toStringFromUTF8(int start, int end, DrillBuf buffer);
Note that the functions only convert from VARCHAR
to String
. We have to roll our own for the reverse conversion.
Here is the Unicode version (in the same Java file) in which we let Java convert from UTF-8 to UTF-16 (used by the Java String
class):
@FunctionTemplate(
name = "countLetters",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public static class CountLettersFunction implements DrillSimpleFunc {
@Param VarCharHolder input;
@Output IntHolder output;
@Override
public void setup() { }
@Override
public void eval() {
String value = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder(input);
output.value = 0;
for (int i = 0; i < value.length(); i++) {
if (Character.isAlphabetic(value.charAt(i))) {
output.value++;
}
}
}
}
Note that in the loop over the characters, we use the length of the String, not the buffer, since, in the UTF-8 multi-byte encoding, the numbers are different.
Let's try the query again:
aA 12 zZ!,4
Αθήνα?,5
-Mосква-,6
海.,1
Bingo! As we can see, we must trade off speed for accuracy. We must pay the cost to convert to Unicode in order to perform operations that depend on the Unicode encoding.
We've also seen how to work with the DrillBuf
provided in the VarCharHolder
class.
Returning a VARCHAR
is a bit more complex. When working with parameters, Drill has already allocated the space for the incoming values, we simply work with that existing buffer. But for the output, we must create a buffer. For obscure reasons, Drill insists we put the return value in direct memory.
We don't write the value directly to the output vector; instead we write it to a temporary buffer which is copied to the output. (Because of this copy, it is not clear that there is much benefit creating a direct memory buffer. Either we should write to the output vector, or use a heap buffer for a temporary value. Sigh... We just have to work with the code the way it exists, not the way we'd like it to be.)
Suppose we want to create a function that returns the set of letter-like characters. We declare the output type as VARCHAR
as follows:
@FunctionTemplate(
name = "extractLetters",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public static class ExtractLettersFunction implements DrillSimpleFunc {
@Param VarCharHolder input;
@Output VarCharHolder output;
For reasons known only to Drill, we must tell Drill twice that we are returning a VARCHAR
: once in the @Output
annotation and a second time using the @Inject
annotation. Then, we must glue the two together in our code. (Since there is only one way to do this correctly, one wonders why Drill doesn't just do it for us...)
public static class ExtractLettersFunction implements DrillSimpleFunc {
...
@Inject DrillBuf outputBuf;
The injected buffer is initialized with a capacity of 256 bytes. If we need more, we must replace it with a larger buffer. The method for doing so is:
public DrillBuf reallocIfNeeded(final int size);
The buffer is reallocated to the new size. (Actually, it is allocated to the new size, rounded up to the next power of 2.) The return value is a new buffer that we use to replace the old one. Because of the way Drill works, both our old and new buffers are allocated until the query exits.
The buffer is injected once, then reused for each function call. The buffer is a wrapper around direct memory. Our code must allocate the required memory (as shown below.) But, since the buffer is reused, once our function runs for a while, we may be able to reuse the previous buffer, if it is long enough.
But, because we reuse the same buffer, and Drill will not release buffers until the query exits, we can end up wasting quite a bit of space if we constantly grow our buffer from 256 to 512, 1024 bytes and so on. There is nothing we can do to fix this; we can just be aware of the issue.
Drill uses Netty as its memory allocator. Because of the way Netty works, code should avoid allocating buffers larger than 16 MB. Such a large size is unlikely for VARCHAR
columns, but possible for VARBINARY
columns if you happen to be working with images, audio or other binary blob types.
We will need the following ByteBuf
methods:
public ByteBuf setByte(int index, int value);
public void setByte(int index, byte b);
public ByteBuf setBytes(int index, byte[] src);
public ByteBuf setBytes(int index, byte[] src, int srcIndex, int length);
The ByteBuf
return values are an artifact of the fact that DrillBuf
extends Netty's ByteBuf
class. Normally we don't use these values.
Let's put all this together to write a function that returns the set of letter-like characters rather than just counting them.
@FunctionTemplate(
name = "extractLetters",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public static class ExtractLettersFunction implements DrillSimpleFunc {
@Param VarCharHolder input;
@Output VarCharHolder output;
@Inject DrillBuf outputBuf;
@Override
public void setup() { }
@Override
public void eval() {
// Convert the input to a string.
String value = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder(input);
// Build up the output string.
StringBuilder result = new StringBuilder();
for (int i = 0; i < value.length(); i++) {
char c = value.charAt(i);
if (Character.isAlphabetic(c)) {
result.append(c);
}
}
// Convert to the output buffer.
byte outBytes[] = result.toString().getBytes();
outputBuf = outputBuf.reallocIfNeeded(outBytes.length);
outputBuf.setBytes(0, outBytes);
output.buffer = outputBuf;
output.start = 0;
output.end = outBytes.length;
}
}
Let's focus on the eval()
method, as the rest has already been covered. eval()
divides into three blocks. The first we've already discussed, the second is just plain Java code. The third is where we must focus. We start by converting our output String to an array of UTF-8 bytes:
byte outBytes[] = result.toString().getBytes();
Next, we reallocate our working buffer to hold our data. The buffer is reused, so we need to reallocate memory only if our needs are larger than the current size:
outputBuf = outputBuf.reallocIfNeeded(outBytes.length);
Note that reallocIfNeeded()
returns a new buffer if reallocation occurs. If we forget to assign it back to our outputBuf
, we'll continue working the the old buffer, and we may cause the query to fail with an IndexOutOfBounds
exception.
With the buffer allocated, we can copy our UTF-8 bytes into the buffer:
outputBuf.setBytes(0, outBytes);
Finally, we tell the output VarCharHolder
where to find its buffer, and where to find the data within the buffer:
output.buffer = outputBuf;
output.start = 0;
output.end = outBytes.length;
Note that, in the generated code, the assignment to output.buffer
is done for us. But, in the interpreted environment, it is not done. So, to be robust, we set the output.buffer
field ourselves.
Let's try the function on the same input data as before: