Skip to content

Working with VARCHAR Data in UDFs

Paul Rogers edited this page Jan 7, 2018 · 11 revisions

As described in the previous section, most Drill types provide a holder that contains all the data you need. But, the variable-width types (VARCHAR and VARBINARY) are more complex as they provide direct access to the underlying data stored in direct memory. This page explains how to work effectively with these types.

One thing to keep in mind: direct memory is exactly that: memory outside of the Java heap. Java can provide no protection. If your code writes to the wrong memory location, you can crash the Drillbit process with no logging to help you figure out your error.

VARCHAR and UTF-8

Drill stores VARCHAR as bytes encoded in UTF-8. To properly handle such data, you must write code that is aware of the UTF-8 encoding. Doing so is quite complex, so one choice is to convert the UTF-8 data to Java String, then convert the result back to UTF-8. This conversion does, however, consume time. So, Drill often takes a shortcut: it works with the single-byte subset of UTF-8 (also known as ASCII.) While this can produce incorrect results with non-ASCII data, it seems to work well for many cases. (Yes, this does seem more like a bug than a feature...)

So, before you begin writing code, consider if your input data will always be ASCII. If so, you can work with the UTF-8 data a buffer of ASCII characters. But, if your data includes non-ASCII characters (or you wish to contribute your code to the Apache Drill project), you must either handled multi-byte characters, or do the conversion to String.

Structure of a VARCHAR Vector

The VarCharHolder provides a view into the underlying VarCharVector. Before we look at the holder, let's understand the vector.

The VarCharVector is comprised of two parts, as described in the Drill documentation:

  • A data vector which is a large buffer that contains all the column values concatenated together.
  • An offset vector which provides the start and end position of the value for each row.

So, to work with the VARCHAR value for a specific row, we need:

  • The buffer that holds all the values
  • The start position of the current value
  • The end position of the current value (from which we can compute the length as start - end)

VARCHAR Holder

Unsurprisingly, the above is exactly what the VarCharHolder provides:

public final class VarCharHolder implements ValueHolder{
    public int start;
    public int end;
    public DrillBuf buffer;

DrillBuf

The first two fields make sense, but what is a DrillBuf? Here is where Drill assumes that UDF writers are already familiar with Drill internals. The DrillBuf is an interface to a direct memory buffer. If you have downloaded Drill sources, and have set up a development environment, you can take a look at the DrillBuf source code since Drill provides no Javadoc for any of its APIs.

For our purposes, we will need the following methods when working with parameters:

  public byte getByte(int index);
  public ByteBuf getBytes(int index, byte[] dst);
  public ByteBuf getBytes(int index, byte[] dst, int dstIndex, int length);

The first methods allows us to work byte-by-byte. The second allows us to work with the entire value, such as when we wish to convert to strings (the length of the dst buffer says how many bytes to copy.) The last version is general purpose for when we need slightly more control.

Example of a VARCHAR Parameter

Let's write two functions: one which counts letters using the single-byte subset of UTF-8 (AKA ASCII), the other which uses Unicode. These illustrate the two ways to work with VARCHAR data, and the pitfalls of working only with ASCII.

ASCII Version

Here is the ASCII (byte-oriented) version:

public class CountLettersFunctions {

  @FunctionTemplate(
      name = "countLettersAscii",
      scope = FunctionScope.SIMPLE,
      nulls = NullHandling.NULL_IF_NULL)

  public static class CountLettersAsciiFunction implements DrillSimpleFunc {
    @Param VarCharHolder input;
    @Output IntHolder output;

    @Override
    public void setup() { }

    @Override
    public void eval() {
      int len = input.end - input.start;
      output.value = 0;
      for (int i = 0; i < len; i++) {
        int c = (input.buffer.getByte(input.start + i) & 0xFF);
        if (Character.isAlphabetic(c)) {
          output.value++;
        }
      }
    }
  }
}

Note that we are using a handy trick: defining more than one function in a single Java file. Drill uses this technique frequently.

Let's try some examples. Let's create letterInput.json:

{ word: "aA 12 zZ!" }
{ word: "Αθήνα?" }
{ word: "-Mосква-" }
{ word: "海." }

Then run a query and look at the results:

SELECT word, countLettersAscii(word) FROM `cp`.`letterInput.json`
aA 12 zZ!,4
Αθήνα?,5
-Mосква-,7
海.,2

The first line is fine: 4 letters. The Greek is good (second line). But, the Russian (third line) is wrong: 7 characters instead of 6. And the Chinese (last line) reports 2 letters instead of 1. What is wrong?

The UTF-8 encoding of some of the characters is multi-byte and is overlapping ASCII letter range which is all our function can handle. (Another error that could occur, not shown here, is that a character is not counted if it falls outside the ASCII letter range.)

Clearly, in this case, we have to become UTF-8 experts. Or, we can pay the cost to convert UTF-8 to a Java String. Lets do that next.

Converting VARCHAR Data to/from Java Strings

To convert an array of UTF-8 bytes to a string, we write:

  byte bytes[] = ...
  String str = new String(bytes, com.google.common.base.Charsets.UTF_8);

You must provide the character set argument. Otherwise, Java will use the default character set for whatever machine is running Drill, which can produce inconsistent results.

Similarly, to convert from a String to an array of bytes:

   byte bytes[] = String.getBytes(com.google.common.base.Charsets.UTF_8);

String Helper Functions

Drill provides a number of helper functions to work with VARCHAR columns, defined in the org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers class. (Remember to use the fully-qualified name when referencing these methods.)

public static String getStringFromVarCharHolder(VarCharHolder varCharHolder);
public static String getStringFromVarCharHolder(NullableVarCharHolder varCharHolder);
public static String toStringFromUTF8(int start, int end, DrillBuf buffer);

Note that the functions only convert from VARCHAR to String. We have to roll our own for the reverse conversion.

Unicode Version

Here is the Unicode version (in the same Java file) in which we let Java convert from UTF-8 to UTF-16 (used by the Java String class):

  @FunctionTemplate(
      name = "countLetters",
      scope = FunctionScope.SIMPLE,
      nulls = NullHandling.NULL_IF_NULL)

  public static class CountLettersFunction implements DrillSimpleFunc {
    @Param VarCharHolder input;
    @Output IntHolder output;

    @Override
    public void setup() { }

    @Override
    public void eval() {
      String value = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder(input);
      output.value = 0;
      for (int i = 0; i < value.length(); i++) {
        if (Character.isAlphabetic(value.charAt(i))) {
          output.value++;
        }
      }
    }
  }

Note that in the loop over the characters, we use the length of the String, not the buffer, since, in the UTF-8 multi-byte encoding, the numbers are different.

Let's try the query again:

aA 12 zZ!,4
Αθήνα?,5
-Mосква-,6
海.,1

Bingo! As we can see, we must trade off speed for accuracy. We must pay the cost to convert to Unicode in order to perform operations that depend on the Unicode encoding.

We've also seen how to work with the DrillBuf provided in the VarCharHolder class.

Returning a VARCHAR

Returning a VARCHAR is a bit more complex. When working with parameters, Drill has already allocated the space for the incoming values, we simply work with that existing buffer. But for the output, we must create a buffer. For obscure reasons, Drill insists we put the return value in direct memory.

We don't write the value directly to the output vector; instead we write it to a temporary buffer which is copied to the output. (Because of this copy, it is not clear that there is much benefit creating a direct memory buffer. Either we should write to the output vector, or use a heap buffer for a temporary value. Sigh... We just have to work with the code the way it exists, not the way we'd like it to be.)

Declare the Output Type

Suppose we want to create a function that returns the set of letter-like characters. We declare the output type as VARCHAR as follows:

  @FunctionTemplate(
      name = "extractLetters",
      scope = FunctionScope.SIMPLE,
      nulls = NullHandling.NULL_IF_NULL)

  public static class ExtractLettersFunction implements DrillSimpleFunc {
    @Param VarCharHolder input;
    @Output VarCharHolder output;

The @Inject Annotation

For reasons known only to Drill, we must tell Drill twice that we are returning a VARCHAR: once in the @Output annotation and a second time using the @Inject annotation. Then, we must glue the two together in our code. (Since there is only one way to do this correctly, one wonders why Drill doesn't just do it for us...)

  public static class ExtractLettersFunction implements DrillSimpleFunc {
    ...
    @Inject DrillBuf outputBuf;

Allocating the Buffer

The injected buffer is initialized with a capacity of 256 bytes. If we need more, we must replace it with a larger buffer. The method for doing so is:

  public DrillBuf reallocIfNeeded(final int size);

The buffer is reallocated to the new size. (Actually, it is allocated to the new size, rounded up to the next power of 2.) The return value is a new buffer that we use to replace the old one. Because of the way Drill works, both our old and new buffers are allocated until the query exits.

The buffer is injected once, then reused for each function call. The buffer is a wrapper around direct memory. Our code must allocate the required memory (as shown below.) But, since the buffer is reused, once our function runs for a while, we may be able to reuse the previous buffer, if it is long enough.

But, because we reuse the same buffer, and Drill will not release buffers until the query exits, we can end up wasting quite a bit of space if we constantly grow our buffer from 256 to 512, 1024 bytes and so on. There is nothing we can do to fix this; we can just be aware of the issue.

Drill uses Netty as its memory allocator. Because of the way Netty works, code should avoid allocating buffers larger than 16 MB. Such a large size is unlikely for VARCHAR columns, but possible for VARBINARY columns if you happen to be working with images, audio or other binary blob types.

Writing to the Output Buffer

We will need the following ByteBuf methods:

  public ByteBuf setByte(int index, int value);
  public void setByte(int index, byte b);
  public ByteBuf setBytes(int index, byte[] src);
  public ByteBuf setBytes(int index, byte[] src, int srcIndex, int length);

The ByteBuf return values are an artifact of the fact that DrillBuf extends Netty's ByteBuf class. Normally we don't use these values.

Example That Returns a VARCHAR

Let's put all this together to write a function that returns the set of letter-like characters rather than just counting them.

  @FunctionTemplate(
      name = "extractLetters",
      scope = FunctionScope.SIMPLE,
      nulls = NullHandling.NULL_IF_NULL)

  public static class ExtractLettersFunction implements DrillSimpleFunc {
    @Param VarCharHolder input;
    @Output VarCharHolder output;
    @Inject DrillBuf outputBuf;

    @Override
    public void setup() { }

    @Override
    public void eval() {

      // Convert the input to a string.

      String value = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.getStringFromVarCharHolder(input);

      // Build up the output string.

      StringBuilder result = new StringBuilder();
      for (int i = 0; i < value.length(); i++) {
        char c = value.charAt(i);
        if (Character.isAlphabetic(c)) {
          result.append(c);
        }
      }

      // Convert to the output buffer.

      byte outBytes[] = result.toString().getBytes();
      outputBuf = outputBuf.reallocIfNeeded(outBytes.length);
      outputBuf.setBytes(0, outBytes);
      output.buffer = outputBuf;
      output.start = 0;
      output.end = outBytes.length;
    }
  }

Let's focus on the eval() method, as the rest has already been covered. eval() divides into three blocks. The first we've already discussed, the second is just plain Java code. The third is where we must focus. We start by converting our output String to an array of UTF-8 bytes:

      byte outBytes[] = result.toString().getBytes();

Next, we reallocate our working buffer to hold our data. The buffer is reused, so we need to reallocate memory only if our needs are larger than the current size:

      outputBuf = outputBuf.reallocIfNeeded(outBytes.length);

Note that reallocIfNeeded() returns a new buffer if reallocation occurs. If we forget to assign it back to our outputBuf, we'll continue working the the old buffer, and we may cause the query to fail with an IndexOutOfBounds exception.

With the buffer allocated, we can copy our UTF-8 bytes into the buffer:

      outputBuf.setBytes(0, outBytes);

Finally, we tell the output VarCharHolder where to find its buffer, and where to find the data within the buffer:

      output.buffer = outputBuf;
      output.start = 0;
      output.end = outBytes.length;

Note that, in the generated code, the assignment to output.buffer is done for us. But, in the interpreted environment, it is not done. So, to be robust, we set the output.buffer field ourselves.

Example Runs

Let's try the function on the same input data as before:

Clone this wiki locally