-
Notifications
You must be signed in to change notification settings - Fork 981
Working with VARCHAR Data in UDFs
As described in the previous section, most Drill types provide a holder that contains all the data you need. But, the variable-width types (VARCHAR
and VARBINARY
) are more complex as they provide direct access to the underlying data stored in direct memory. This page explains how to work effectively with these types.
One thing to keep in mind: direct memory is exactly that: memory outside of the Java heap. Java can provide no protection. If your code writes to the wrong memory location, you can crash the Drillbit process with no logging to help you figure out your error.
Drill stores VARCHAR
as bytes encoded in UTF-8. To properly handle such data, you must write code that is aware of the UTF-8 encoding. Doing so is quite complex, so one choice is to convert the UTF-8 data to Java String
, then convert the result back to UTF-8. This conversion does, however, consume time. So, Drill often takes a shortcut: it works with the single-byte subset of UTF-8 (also known as ASCII.) While this can produce incorrect results with non-ASCII data, it seems to work well for many cases. (Yes, this does seem more like a bug than a feature...)
So, before you begin writing code, consider if your input data will always be ASCII. If so, you can work with the UTF-8 data a buffer of ASCII characters. But, if your data includes non-ASCII characters (or you wish to contribute your code to the Apache Drill project), you must either handled multi-byte characters, or do the conversion to String
.
The VarCharHolder
provides a view into the underlying VarCharVector
. Before we look at the holder, let's understand the vector.
The VarCharVector
is comprised of two parts, as described in the Drill documentation:
- A data vector which is a large buffer that contains all the column values concatenated together.
- An offset vector which provides the start and end position of the value for each row.
So, to work with the VARCHAR
value for a specific row, we need:
- The buffer that holds all the values
- The start position of the current value
- The end position of the current value (from which we can compute the length as
start - end
)
Unsurprisingly, the above is exactly what the VarCharHolder
provides:
public final class VarCharHolder implements ValueHolder{
public int start;
public int end;
public DrillBuf buffer;
The first two fields make sense, but what is a DrillBuf
? Here is where Drill assumes that UDF writers are already familiar with Drill internals. The DrillBuf
is an interface to a direct memory buffer. If you have downloaded Drill sources, and have set up a development environment, you can take a look at the DrillBuf
source code since Drill provides no Javadoc for any of its APIs.
For our purposes, we will need the following methods when working with parameters:
public byte getByte(int index);
public ByteBuf getBytes(int index, byte[] dst);
public ByteBuf getBytes(int index, byte[] dst, int dstIndex, int length);
The first methods allows us to work byte-by-byte. The second allows us to work with the entire value, such as when we wish to convert to strings (the length of the dst
buffer says how many bytes to copy.) The last version is general purpose for when we need slightly more control.
Let's write two functions: one which counts letters using the single-byte subset of UTF-8 (AKA ASCII), the other which uses Unicode. These illustrate the two ways to work with VARCHAR data, and the pitfalls of working only with ASCII.
Here is the ASCII (byte-oriented) version:
public class CountLettersFunctions {
@FunctionTemplate(
name = "countLettersAscii",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public static class CountLettersAsciiFunction implements DrillSimpleFunc {
@Param VarCharHolder input;
@Output IntHolder output;
@Override
public void setup() { }
@Override
public void eval() {
int len = input.end - input.start;
output.value = 0;
for (int i = 0; i < len; i++) {
int c = (input.buffer.getByte(input.start + i) & 0xFF);
if (Character.isAlphabetic(c)) {
output.value++;
}
}
}
}
}
Note that we are using a handy trick: defining more than one function in a single Java file. Drill uses this technique frequently.
Let's try some examples:
SELECT countLettersAscii('aA 12 zZ!') FROM (VALUES (1))
1 row(s):
EXPR$0<INT(REQUIRED)>
4
Perfect. We did include four letters. How about Greek?
SELECT countLettersAscii('Αθήνα') FROM (VALUES (1))
5
Ha! We got the right answer even without that fancy UTF-8 conversion. Now Russian:
SELECT countLettersAscii('Mосква') FROM (VALUES (1))
7
7? But there are only 6 characters in the input! Chinese?
SELECT countLettersAscii('海') FROM (VALUES (1))
2
But there is only one character? What is happening is that the UTF-8 encoding of some of the characters is multi-byte and is overlapping ASCII letter range which is all our function can handle. (Another error might be that a character is not counted if it falls outside the ASCII letter range.)
Clearly, in this case, we have to become UTF-8 experts. Or, we can pay the cost to convert UTF-8 to a Java String
. Lets do that next.
Here is the Unicode version (in the same Java file) in which we let Java convert from UTF-8 to UTF-16 (used by the Java String
class):
@FunctionTemplate(
name = "countLetters",
scope = FunctionScope.SIMPLE,
nulls = NullHandling.NULL_IF_NULL)
public static class CountLettersFunction implements DrillSimpleFunc {
@Param VarCharHolder input;
@Output IntHolder output;
@Override
public void setup() { }
@Override
public void eval() {
int len = input.end - input.start;
byte buf[] = new byte[len];
input.buffer.getBytes(input.start, buf);
String value = new String(buf);
for (int i = 0; i < value.length(); i++) {
if (Character.isAlphabetic(value.charAt(i))) {
output.value++;
}
}
}
}
Note that in the loop over the characters, we use the length of the String, not the buffer, since, in the UTF-8 multi-byte encoding, the numbers are different.
Let's try the queries again:
SELECT countLetters('aA 12 zZ!') FROM (VALUES (1))
4
SELECT countLetters('Αθήνα') FROM (VALUES (1))
5
SELECT countLetters('Mосква') FROM (VALUES (1))
6
SELECT countLetters('海') FROM (VALUES (1))
1
Bingo! As we can see, we must trade off speed for accuracy. We must pay the cost to convert to Unicode in order to perform operations that depend on the Unicode encoding.
We've also seen how to work with the DrillBuf
provided in the VarCharHolder
class.
Returning a VARCHAR
is a bit more complex. When working with parameters, Drill has already allocated the space for the incoming values, we simply work with that existing buffer. But for the output, we must create a buffer. For obscure reasons, Drill insists we put the return value in direct memory.
We don't write the value directly to the output vector; instead we write it to a temporary buffer which is copied to the output. (Because of this copy, it is not clear that there is much benefit creating a direct memory buffer. Either we should write to the output vector, or use a heap buffer for a temporary value. Sigh... We just have to work with the code the way it exists, not the way we'd like it to be.)
Drill uses the
We will need the following ByteBuf
methods:
public ByteBuf setByte(int index, int value);
public void setByte(int index, byte b);
public ByteBuf setBytes(int index, byte[] src, int srcIndex, int length);