Remove Custom Parser #168

Merged: davebenvenuti merged 6 commits into main from davebenvenuti/remove-custom-parser on Jan 14, 2025

Contributor

@davebenvenuti davebenvenuti commented Jan 10, 2025

We've decided to forgo support for our custom parser in favor of working with the binary protobuf files we can get from the protoc executable. This PR removes the custom parser code to simplify things. As a consequence, our entire test suite now shells out to protoc and tests its output. A subset of our tests already did this; now they all do. To make this easier, test/helper.rb defines parse_proto_string and parse_proto_file helper methods that can be used throughout the test suite.
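
For readers unfamiliar with the approach, here is a rough sketch of how a test helper might shell out to protoc. It is not the code in test/helper.rb: the method names protoc_command, compile_proto_file and compile_proto_string are hypothetical, and the sketch returns the raw descriptor bytes rather than a decoded object. It assumes protoc is on the PATH and uses only the -I and -o flags.

```ruby
require "open3"
require "tmpdir"

# Hypothetical helper: builds the protoc argument list. "-o" writes a binary
# FileDescriptorSet, "-I" sets the directory used to resolve the input file.
def protoc_command(proto_path, out_path, include_dir: File.dirname(proto_path))
  ["protoc", "-I", include_dir, "-o", out_path, proto_path]
end

# Hypothetical helper: compiles a .proto file and returns the binary
# descriptor bytes. Raises if protoc exits with a non-zero status.
def compile_proto_file(proto_path)
  Dir.mktmpdir do |dir|
    out_path = File.join(dir, "out.bproto")
    _stdout, stderr, status = Open3.capture3(*protoc_command(proto_path, out_path))
    raise "protoc failed: #{stderr}" unless status.success?

    File.binread(out_path)
  end
end

# Hypothetical helper: writes the source to a temporary file first, then
# reuses compile_proto_file.
def compile_proto_string(source)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "input.proto")
    File.write(path, source)
    compile_proto_file(path)
  end
end
```

Keeping the command construction in its own method makes it possible to check the arguments without protoc installed.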

protoc is also stricter about protobuf import statements than our custom parser was. This means I had to import another well-known type, google.api.FieldBehavior, to get our SigstoreTest passing. I decided to move our well-known types to a new ProtoBoeuf::Google submodule, both to keep things a little better organized and to make import google/* statements work without jumping through quite so many hoops.

I also added a .gitattributes and marked everything in lib/protoboeuf/google/**/*.rb as generated so it doesn't show up in PR diffs by default. If there's a good reason to have these more visible in PRs I'm happy to undo that.

Finally, I added debug as a dev dependency in our Gemfile because I personally find it useful and it's pretty widely used by Ruby developers.

I manually tested the executable with the following simple test proto file:

test.proto3

syntax = "proto3";

message TestMessage {
  bool flag = 1;
}
➜  protoboeuf git:(davebenvenuti/remove-custom-parser) ✗ protoc -o test.bproto test.proto3
➜  protoboeuf git:(davebenvenuti/remove-custom-parser) ✗ bundle exec protoboeuf test.bproto
# encoding: ascii-8bit
# rubocop:disable all
# frozen_string_literal: true

class TestMessage
  def self.decode(buff)
    allocate.decode_from(buff.b, 0, buff.bytesize)
  end

  def self.encode(obj)
    obj._encode("".b)
  end
  # required field readers

  attr_reader :flag

  def flag=(v)
    @flag = v
  end

  def initialize(flag: false)
    @flag = flag
  end

  def to_proto(_options = {})
    self.class.encode(self)
  end

  def decode_from(buff, index, len)
    @flag = false

    return self if index >= len
    ## PULL_UINT64
    tag =
      if (byte0 = buff.getbyte(index)) < 0x80
        index += 1
        byte0
      elsif (byte1 = buff.getbyte(index + 1)) < 0x80
        index += 2
        (byte1 << 7) | (byte0 & 0x7F)
      elsif (byte2 = buff.getbyte(index + 2)) < 0x80
        index += 3
        (byte2 << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
      elsif (byte3 = buff.getbyte(index + 3)) < 0x80
        index += 4
        (byte3 << 21) | ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) |
          (byte0 & 0x7F)
      elsif (byte4 = buff.getbyte(index + 4)) < 0x80
        index += 5
        (byte4 << 28) | ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
          ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
      elsif (byte5 = buff.getbyte(index + 5)) < 0x80
        index += 6
        (byte5 << 35) | ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
          ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
      elsif (byte6 = buff.getbyte(index + 6)) < 0x80
        index += 7
        (byte6 << 42) | ((byte5 & 0x7F) << 35) | ((byte4 & 0x7F) << 28) |
          ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
          ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
      elsif (byte7 = buff.getbyte(index + 7)) < 0x80
        index += 8
        (byte7 << 49) | ((byte6 & 0x7F) << 42) | ((byte5 & 0x7F) << 35) |
          ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
          ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
      elsif (byte8 = buff.getbyte(index + 8)) < 0x80
        index += 9
        (byte8 << 56) | ((byte7 & 0x7F) << 49) | ((byte6 & 0x7F) << 42) |
          ((byte5 & 0x7F) << 35) | ((byte4 & 0x7F) << 28) |
          ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
          ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
      elsif (byte9 = buff.getbyte(index + 9)) < 0x80
        index += 10

        (byte9 << 63) | ((byte8 & 0x7F) << 56) | ((byte7 & 0x7F) << 49) |
          ((byte6 & 0x7F) << 42) | ((byte5 & 0x7F) << 35) |
          ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
          ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
      else
        raise "integer decoding error"
      end

    ## END PULL_UINT64

    found = true
    while true
      # If we have looped around since the last found tag this one is
      # unexpected, so discard it and continue.
      if !found
        wire_type = tag & 0x7

        unknown_bytes = +"".b
        val = tag
        while val != 0
          byte = val & 0x7F

          val >>= 7
          # This drops the top bits,
          # Otherwise, with a signed right shift,
          # we get infinity one bits at the top
          val &= (1 << 57) - 1

          byte |= 0x80 if val != 0
          unknown_bytes << byte
        end

        case wire_type
        when 0
          i = 0
          while true
            newbyte = buff.getbyte(index)
            index += 1
            break if newbyte.nil?
            unknown_bytes << newbyte
            break if newbyte < 0x80
            i += 1
            break if i > 9
          end
        when 1
          unknown_bytes << buff.byteslice(index, 8)
          index += 8
        when 2
          value =
            if (byte0 = buff.getbyte(index)) < 0x80
              index += 1
              byte0
            elsif (byte1 = buff.getbyte(index + 1)) < 0x80
              index += 2
              (byte1 << 7) | (byte0 & 0x7F)
            elsif (byte2 = buff.getbyte(index + 2)) < 0x80
              index += 3
              (byte2 << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
            elsif (byte3 = buff.getbyte(index + 3)) < 0x80
              index += 4
              (byte3 << 21) | ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) |
                (byte0 & 0x7F)
            elsif (byte4 = buff.getbyte(index + 4)) < 0x80
              index += 5
              (byte4 << 28) | ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
                ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
            elsif (byte5 = buff.getbyte(index + 5)) < 0x80
              index += 6
              (byte5 << 35) | ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
                ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
            elsif (byte6 = buff.getbyte(index + 6)) < 0x80
              index += 7
              (byte6 << 42) | ((byte5 & 0x7F) << 35) | ((byte4 & 0x7F) << 28) |
                ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
                ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
            elsif (byte7 = buff.getbyte(index + 7)) < 0x80
              index += 8
              (byte7 << 49) | ((byte6 & 0x7F) << 42) | ((byte5 & 0x7F) << 35) |
                ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
                ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
            elsif (byte8 = buff.getbyte(index + 8)) < 0x80
              index += 9
              (byte8 << 56) | ((byte7 & 0x7F) << 49) | ((byte6 & 0x7F) << 42) |
                ((byte5 & 0x7F) << 35) | ((byte4 & 0x7F) << 28) |
                ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
                ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
            elsif (byte9 = buff.getbyte(index + 9)) < 0x80
              index += 10

              (byte9 << 63) | ((byte8 & 0x7F) << 56) | ((byte7 & 0x7F) << 49) |
                ((byte6 & 0x7F) << 42) | ((byte5 & 0x7F) << 35) |
                ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
                ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
            else
              raise "integer decoding error"
            end

          val = value
          while val != 0
            byte = val & 0x7F

            val >>= 7
            # This drops the top bits,
            # Otherwise, with a signed right shift,
            # we get infinity one bits at the top
            val &= (1 << 57) - 1

            byte |= 0x80 if val != 0
            unknown_bytes << byte
          end

          unknown_bytes << buff.byteslice(index, value)
          index += value
        when 5
          unknown_bytes << buff.byteslice(index, 4)
          index += 4
        else
          raise "unknown wire type #{wire_type}"
        end
        (@_unknown_fields ||= +"".b) << unknown_bytes
        return self if index >= len
        ## PULL_UINT64
        tag =
          if (byte0 = buff.getbyte(index)) < 0x80
            index += 1
            byte0
          elsif (byte1 = buff.getbyte(index + 1)) < 0x80
            index += 2
            (byte1 << 7) | (byte0 & 0x7F)
          elsif (byte2 = buff.getbyte(index + 2)) < 0x80
            index += 3
            (byte2 << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte3 = buff.getbyte(index + 3)) < 0x80
            index += 4
            (byte3 << 21) | ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) |
              (byte0 & 0x7F)
          elsif (byte4 = buff.getbyte(index + 4)) < 0x80
            index += 5
            (byte4 << 28) | ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
              ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte5 = buff.getbyte(index + 5)) < 0x80
            index += 6
            (byte5 << 35) | ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
              ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte6 = buff.getbyte(index + 6)) < 0x80
            index += 7
            (byte6 << 42) | ((byte5 & 0x7F) << 35) | ((byte4 & 0x7F) << 28) |
              ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
              ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte7 = buff.getbyte(index + 7)) < 0x80
            index += 8
            (byte7 << 49) | ((byte6 & 0x7F) << 42) | ((byte5 & 0x7F) << 35) |
              ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
              ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte8 = buff.getbyte(index + 8)) < 0x80
            index += 9
            (byte8 << 56) | ((byte7 & 0x7F) << 49) | ((byte6 & 0x7F) << 42) |
              ((byte5 & 0x7F) << 35) | ((byte4 & 0x7F) << 28) |
              ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
              ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte9 = buff.getbyte(index + 9)) < 0x80
            index += 10

            (byte9 << 63) | ((byte8 & 0x7F) << 56) | ((byte7 & 0x7F) << 49) |
              ((byte6 & 0x7F) << 42) | ((byte5 & 0x7F) << 35) |
              ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
              ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          else
            raise "integer decoding error"
          end

        ## END PULL_UINT64
      end
      found = false

      if tag == 0x8
        found = true
        ## PULL BOOLEAN
        @flag = (buff.getbyte(index) == 1)
        index += 1
        ## END PULL BOOLEAN

        return self if index >= len
        ## PULL_UINT64
        tag =
          if (byte0 = buff.getbyte(index)) < 0x80
            index += 1
            byte0
          elsif (byte1 = buff.getbyte(index + 1)) < 0x80
            index += 2
            (byte1 << 7) | (byte0 & 0x7F)
          elsif (byte2 = buff.getbyte(index + 2)) < 0x80
            index += 3
            (byte2 << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte3 = buff.getbyte(index + 3)) < 0x80
            index += 4
            (byte3 << 21) | ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) |
              (byte0 & 0x7F)
          elsif (byte4 = buff.getbyte(index + 4)) < 0x80
            index += 5
            (byte4 << 28) | ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
              ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte5 = buff.getbyte(index + 5)) < 0x80
            index += 6
            (byte5 << 35) | ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
              ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte6 = buff.getbyte(index + 6)) < 0x80
            index += 7
            (byte6 << 42) | ((byte5 & 0x7F) << 35) | ((byte4 & 0x7F) << 28) |
              ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
              ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte7 = buff.getbyte(index + 7)) < 0x80
            index += 8
            (byte7 << 49) | ((byte6 & 0x7F) << 42) | ((byte5 & 0x7F) << 35) |
              ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
              ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte8 = buff.getbyte(index + 8)) < 0x80
            index += 9
            (byte8 << 56) | ((byte7 & 0x7F) << 49) | ((byte6 & 0x7F) << 42) |
              ((byte5 & 0x7F) << 35) | ((byte4 & 0x7F) << 28) |
              ((byte3 & 0x7F) << 21) | ((byte2 & 0x7F) << 14) |
              ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          elsif (byte9 = buff.getbyte(index + 9)) < 0x80
            index += 10

            (byte9 << 63) | ((byte8 & 0x7F) << 56) | ((byte7 & 0x7F) << 49) |
              ((byte6 & 0x7F) << 42) | ((byte5 & 0x7F) << 35) |
              ((byte4 & 0x7F) << 28) | ((byte3 & 0x7F) << 21) |
              ((byte2 & 0x7F) << 14) | ((byte1 & 0x7F) << 7) | (byte0 & 0x7F)
          else
            raise "integer decoding error"
          end

        ## END PULL_UINT64
      end

      return self if index >= len
    end
  end
  def _encode(buff)
    val = @flag
    if val == true
      buff << 0x08

      buff << 1
    elsif val == false
      # Default value, encode nothing
    else
      raise "bool values should be true or false"
    end
    buff << @_unknown_fields if @_unknown_fields
    buff
  end

  def to_h
    result = {}
    result["flag".to_sym] = @flag
    result
  end
end
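
The repeated PULL_UINT64 blocks in the output above appear to be an unrolled base-128 varint decoder. As a reading aid only, the following sketch shows a loop-based equivalent; pull_varint and tag_for are hypothetical names, not part of the generated code, and the generator presumably unrolls the loop for speed.

```ruby
# Illustrative loop form of the unrolled varint decoding.
# Returns [decoded_value, next_index].
def pull_varint(buff, index)
  value = 0
  shift = 0
  loop do
    byte = buff.getbyte(index)
    # Running out of input, or reading more than 10 bytes, is an error.
    raise "integer decoding error" if byte.nil? || shift > 63

    index += 1
    value |= (byte & 0x7F) << shift
    # A byte without the high bit set ends the varint.
    return [value, index] if byte < 0x80

    shift += 7
  end
end

# A field tag is the field number shifted left by 3, OR-ed with the wire type.
def tag_for(field_number, wire_type)
  (field_number << 3) | wire_type
end

# Field 1 with wire type 0 (varint) gives the 0x08 tag that decode_from
# compares against and that _encode writes for `flag`.
tag, next_index = pull_varint("\x08\x01".b, 0)
```

With this in mind, a TestMessage whose flag is true encodes to the two bytes 0x08 0x01, matching the _encode method above.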

Contributor

@rwstauner rwstauner left a comment

I realize this is still a WIP and maybe you've changed things already, but this caught my eye.

test/helper.rb (outdated, resolved)
Rakefile (outdated, resolved)
Rakefile (outdated, resolved)
@davebenvenuti davebenvenuti force-pushed the davebenvenuti/remove-custom-parser branch 6 times, most recently from 8d5926f to d55d102 Compare January 13, 2025 21:59
@davebenvenuti davebenvenuti force-pushed the davebenvenuti/remove-custom-parser branch from d55d102 to 67480e8 Compare January 13, 2025 22:00
@davebenvenuti davebenvenuti marked this pull request as ready for review January 13, 2025 22:23
Contributor

@rwstauner rwstauner left a comment

Seems good.

README.md (resolved)
}
EOPROTO

gen = ProtoBoeuf::CodeGen.new(unit)
klass = Class.new { class_eval(gen.to_ruby) }

- assert_nil(klass::Foo.new.null)
+ assert_equal(:NULL_VALUE, klass::Foo.new.null)
Contributor

Looks like our parser was basically ignoring this unknown type 🤷

This symbol is consistent with how the other enums work.

@davebenvenuti davebenvenuti force-pushed the davebenvenuti/remove-custom-parser branch from 246d26b to b982b19 Compare January 14, 2025 17:17
@davebenvenuti davebenvenuti force-pushed the davebenvenuti/remove-custom-parser branch from b982b19 to d426705 Compare January 14, 2025 17:18
@@ -0,0 +1,104 @@
// Copyright 2024 Google LLC

I'm a bit confused. Why is this a new file while other proto files were already in the repo? I wouldn't expect removing the parser to generate new types.

Contributor Author

@davebenvenuti davebenvenuti Jan 14, 2025

Our parser wasn't validating import statements. Now that our tests are parsing everything with protoc I had to import this new type to make the SigstoreTest pass because it's referenced in one of the protos:

import "google/api/field_behavior.proto";

I copied this over from here: https://github.com/googleapis/googleapis/blob/master/google/api/field_behavior.proto
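
To illustrate the kind of check protoc performs here, below is a small sketch that lists the imports in a proto source that cannot be found under a set of include directories. It is a simplification under stated assumptions: unresolved_imports and IMPORT_PATTERN are hypothetical names, the regular expression only handles simple one-line import statements, and real protoc resolution may also find files bundled with the compiler.

```ruby
# Matches lines such as: import "google/api/field_behavior.proto";
# (optionally with the `public` or `weak` modifier).
IMPORT_PATTERN = /^\s*import\s+(?:public\s+|weak\s+)?"([^"]+)"\s*;/

# Hypothetical helper: returns the imported paths that do not exist under any
# of the given include directories.
def unresolved_imports(proto_source, include_dirs)
  proto_source.scan(IMPORT_PATTERN).flatten.reject do |import_path|
    include_dirs.any? { |dir| File.file?(File.join(dir, import_path)) }
  end
end
```

A proto that imports google/api/field_behavior.proto would be reported as unresolved until that file is present under one of the include directories, which is the situation the SigstoreTest ran into once protoc was doing the parsing.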

Contributor Author

I found a few Stack Overflow and Google Groups posts about this, and it seems that cloning the googleapis repo or copying the individual file over is the solution: https://groups.google.com/g/grpc-io/c/iNvCzTF1QxQ

Member

I am still not convinced that we should be shipping this file as part of our library. As the Google Groups post mentions, the end-user is responsible for having this file in the "load path" when compiling proto files with protoc.

Just like how the protobuf library does not export the googleapis protos, I don't think we should either.

If the test/conformance process needs it, then we can clone the repo and/or download the file directly to have it included, but I don't think we should have it be a part of lib.

Contributor

Upon reflection, I agree with @paracycle. We should probably figure out what is expected to be built-in and what isn't, and only include what is expected. I think that is only the "well known types", but I haven't done enough research to say for sure.

All rights reserved. They are subject to [this license agreement](https://github.com/protocolbuffers/protobuf/blob/32838e8c2ce88f1c040f5b68c9ac4941fa97fa09/LICENSE).

The `*.proto` files in `lib/protoboeuf/google/api` are from the [googleapis](https://github.com/googleapis/googleapis)
repository and are Copyright 2024 Google LLC. They are subject to the [Apache 2.0 license](https://github.com/Shopify/protoboeuf/blob/main/contrib/LICENSE.Apache2-0.txt).
Contributor Author

This link will be broken until this is merged

@davebenvenuti davebenvenuti merged commit e599e20 into main Jan 14, 2025
7 checks passed
@davebenvenuti davebenvenuti deleted the davebenvenuti/remove-custom-parser branch January 14, 2025 22:33
5 participants