Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 382/ support X-Robots-Tag as a typed http header XRobotsTag #393

Open
wants to merge 34 commits into
base: main
Choose a base branch
from

Conversation

hafihaf123
Copy link
Contributor

@hafihaf123 hafihaf123 commented Jan 12, 2025

Initial Implementation of X-Robots-Tag Header

This pull request introduces an initial implementation of the X-Robots-Tag header for the rama-http crate and closes #382

Summary of Changes:

  1. Header Registration:

    • Added the x-robots-tag header to the static headers list in rama-http-types/src/lib.rs.
  2. Implementation of the Header:

    • Created a XRobotsTag struct that represents the header and implements the rama_http::headers::Header trait.
    • Defined a Rule enum to represent indexing rules such as noindex, nofollow, and max-snippet. Complex rules like max-image-preview and unavailable_after are also included.
    • Designed an Element struct to represent a rule optionally associated with a bot name.
    • Added an iterator for XRobotsTag to iterate over its elements.
  3. File Structure:

    • The implementation resides in a new module at rama-http/src/headers/x_robots_tag/, which includes the following files:
      • rule.rs: Defines the Rule enum and parsing logic.
      • element.rs: Implements the Element struct and its parsing/formatting logic.
      • iterator.rs: Provides an iterator for XRobotsTag.
      • mod.rs: Combines and exposes the module’s functionality.
  4. Encoding and Decoding:

    • Encoding and decoding are implemented using the Header trait, supporting CSV-style comma-separated values.

Questions and Feedback Requested:

  1. Code Structure:

    • Is the location of the new module (x_robots_tag) appropriate?
    • Does the filename structure and organization align with project conventions?
  2. Implementation Design:

    • Are there any improvements or alternative approaches you’d suggest for the Rule and Element structs?
    • Is the use of Vec<Element> for XRobotsTag suitable, or are there optimizations to consider?
  3. Edge Cases and Standards:

    • Are there additional rules, formats, or edge cases that should be covered?

Testing:

  • For now, the module is not tested, I will add the tests after the code structure becomes stable.

I look forward to your feedback and suggestions for improvement. Thank you!

@hafihaf123 hafihaf123 force-pushed the issue-382/x-robots-tag branch from 1a46b37 to f3c6d97 Compare January 12, 2025 22:01
Copy link
Member

@GlenDC GlenDC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a bad start. Do take your time to complete it and add sufficient start. But great start indeed. Added a couple of remarks to help you with some more guidance.

rama-http-types/src/lib.rs Outdated Show resolved Hide resolved
rama-http/src/headers/x_robots_tag/element.rs Outdated Show resolved Hide resolved
rama-http/src/headers/x_robots_tag/element.rs Outdated Show resolved Hide resolved
rama-http/src/headers/x_robots_tag/element.rs Outdated Show resolved Hide resolved
rama-http/src/headers/x_robots_tag/rule.rs Outdated Show resolved Hide resolved
rama-http/src/headers/x_robots_tag/rule.rs Outdated Show resolved Hide resolved
rama-http/src/headers/x_robots_tag/rule.rs Outdated Show resolved Hide resolved
@hafihaf123 hafihaf123 force-pushed the issue-382/x-robots-tag branch from 0c8c3b3 to e66d95b Compare January 17, 2025 22:33
@hafihaf123
Copy link
Contributor Author

I've attempted to implement your suggestions, but I still have a couple of uncertainties:

  1. I've used the OpaqueError type almost everywhere an Error type is required. Is this acceptable, or should I consider a different approach?
  2. I've implemented the FromStr trait for Element, expecting a string slice in the format: "<bot_name:> indexing_rule, indexing_rule_with_value: value <...>". However, I'm unsure if this is the correct interpretation or the best approach.

Please note that this implementation is not yet complete, as the XRobotsTag::decode() function is not functional at this stage.

@hafihaf123 hafihaf123 requested a review from GlenDC January 17, 2025 22:57
@GlenDC
Copy link
Member

GlenDC commented Jan 19, 2025

1. I've used the `OpaqueError` type almost everywhere an `Error` type is required. Is this acceptable, or should I consider a different approach?

Questions like this are answered by thinking about its typical usage. Usually the parsing from the raw http header to the typed header will happen in the background as part of layer or web extractor or the like. So while you could try to return something like a custom error with exposure of the kind of error that happened I don't see it all that being useful here, even if you manually decode. Because even if you know what specific error happened, the result is either way the same, it's an invalid header so either you are okay with it or you are not.

TLDR: opaque Error is fine here. If there is ever a strong reason to expose certain error variants, it can be requested by opening a new issue about it. But that's a discussion for then.

2. I've implemented the `FromStr` trait for `Element`, expecting a string slice in the format: `"<bot_name:> indexing_rule, indexing_rule_with_value: value <...>"`. However, I'm unsure if this is the correct interpretation or the best approach.

The answer is that you'll want to ensure you support this: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Robots-Tag#syntax ; So purely looking at whether it's a valid implementation you would do by checking against that reference syntax. After that the question will be whether it's an optimal enough implementation, but let's first focussing on correctly implementing this.

@GlenDC
Copy link
Member

GlenDC commented Jan 20, 2025

You might want to read up a bit on how other folks have done it, even in other languages. E.g. look at https://github.com/nielsole/xrobotstag/blob/4cd7d8885a3e26fda9dd1a4075663cd3b80617f0/parser.go#L97. There's a lot to like about that implementation, but of course not to be copied as-is, given there is also plenty not to like about it. But what I personally take away from an implementation like this is:

  • I like the idea of just having a single RobotTag, which would look something like:
struct RobotsTag {
     bot_name: Option<HeaderValueString>,
     all: bool,
     no_follow: bool,
     unavailable_after: Option<chrono::DateTime<chrono::Utc>>, 
     // ...
     no_ai: bool,
     // ...
     custom_rules: Option<Vec<CustomRule>>,
}

CustomRule would be a private construct which can be as simple as

struct CustomRule {
      key: HeaderValueString,
      value: Option<HeaderValueString>,
}
  • you are already doing pretty well on parsing, but really there is no need for regexes here, at least not us directly
  • you can make use of the chrono crate and try the formats that are specified in the documentation, there are three possibilities: RFC 822, RFC 850, or ISO 8601. Using chrono makes that trivial.

As such the only things we really need to expose here are the XRobotsTag typed header and the RobotsTag as a single "element". So 2 structs.

Not sure if you are a fan of the builder pattern but I like it a lot for cases like this, as for each rule you can expose this method to RobotsTag

fn new() == Default::default

The default one would be basically an empty tag, not rendered at all. Which is okay as the Headers trait allows you to not encode anything in the encode step, which you would do in case all your tags would be empty.

fn all(&self) -> bool { self.all }
fn with_all(mut self, all: bool) -> Self { self.all = all; self }
fn set_all(&mut self, all: bool) ->  &mut Self { self.all = all; self }

This allows them to be constructed in any way. Feel free to define a private macro_rules! to easily set create this triplet for the bool variants, saves you from typing, and easy enough to do if you combine it with the paste crate already used in other parts of rama.

The custom rules can be constructed without exposing the CustomRule struct. And to get them you can just return an opaque iterator with a tuple &str or w/e.

I think this is pretty elegant and does everything you wanted to do without even having to expose our internal representation of it. Keeps the structure also a lot more flat without really losing much, given the RobotsTag should be able to be represented pretty small. It's in the end just a bunch of booleans and options, with the value parts of the option being on the heap, so pretty great for this purpose.


TLDR: great job on your work so far. Seems you are getting pretty comfortable with the codebase and defintely know your way around rust. I hope my feedback and guidance above helps you out in finishing it. Feel free to push back on something you do not agree with or propose alternatives.

Goal is mostly to keep this thing as simple and minimal as possible, while still allowing people to also do custom stuff, but without exposing too much of our internal API as to give the flexibility to change stuff internally without breaking stuff. Not that I mind making breaking changes where needed, but even better if you can upgrade without needing to break.

@hafihaf123
Copy link
Contributor Author

This allows them to be constructed in any way. Feel free to define a private macro_rules! to easily set create this triplet for the bool variants, saves you from typing, and easy enough to do if you combine it with the paste crate already used in other parts of rama.

I was wondering to maybe add some sort of #[derive(Builder)] functionality to the builder to generate the boilerplate code but I am wondering whether it is even possible as I am quite new to rust macros. From what I've found, I would probably have needed to define it as a procedural macro and kept it in a separate crate. I found rama-macros in the code base, but in its documentation, it states

There are no more macros for Rama.

@GlenDC
Copy link
Member

GlenDC commented Jan 20, 2025

I was wondering to maybe add some sort of #[derive(Builder)] functionality to the builder to generate the boilerplate code but

Out of scope for this PR and overkill for what you need here. A macro_rules has plenty to fill our needs here

If one day you want to learn to write and read proc macros you might enjoy https://www.manning.com/books/write-powerful-rust-macros

Could use some help in my venndb crate for a future release

@hafihaf123
Copy link
Contributor Author

another thing, do you prefer a separate builder struct (example: let robots_tag = RobotsTag::builder().noindex().max_video_preview(5).build()) or building using setters (example: let robots_tag = RobotsTag::new(); robots_tag.set_noindex(true).set_max_video_preview(5) or let robots_tag = RobotsTag::new().with_noindex(true).with_max_video_preview(5))

@GlenDC
Copy link
Member

GlenDC commented Jan 21, 2025

another thing, do you prefer a separate builder struct (example: let robots_tag = RobotsTag::builder().noindex().max_video_preview(5).build()) or building using setters (example: let robots_tag = RobotsTag::new(); robots_tag.set_noindex(true).set_max_video_preview(5) or let robots_tag = RobotsTag::new().with_noindex(true).with_max_video_preview(5))

The reason to introduce a separate builder struct, e.g. RobotsTagBuilder is because it allows you to when done right to at compile time make an impossible state. For example it would allow you if done right to make it that this does not compile:

RobotsTag::builder().build()

As at that point there is not yet anything set. In my proposal above I said as a shortcut you can just ignore empty robotstag and not write those. But if you do want to go this extra mile we can indeed prevent it at compile time, which is better so I would argue.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=82444bb713185c3f09ceeb917cc4fd66 here is an exmaple playground for you to show what i mean with that

Reason why you would need a builder for that is because:

  1. the most important reason: you avoid that you are polluting your otherwise generic-free struct with generics
  2. less important is that it allows you to get around the name colission trick, as we otherwise have 3 methods a for the same struct, which is not allowed in rust

I'm totally okay with you building that out, good proposal of you, so why not.
Even if you do that I would still add setters to the regular struct as well as it allows, e.g. in a proxy setting, to still modify the struct itself. Another option is to be able to turn it back into a builder e.g. impl From and perhaps also add into_builder. That works fine as well.

Btw note also in my example that I add both the "consume" variant (mut self) and the setter variant (&mut self) to the builder. Please do this as well if you make a builder. Because in some scenarios you want to consume while in another one it is not as easy.

in cases where you want to pass it immediately to a function or parent structure it is easier to just consume:

foo(Foo::builder().a(true).b(42).build())

but in cases where you work with conditionals it's more conveniant to have the setter also available, e.g.:

let mut foo = Foo::builder().a(true);
if bar {
    foo.set_b(true);
}

Of course in the case of a bool this seems a bit silly, as you could just assign bar to set_b. But for cases where the property type is a bit more complex, e.g. an enum it becomes less easy so. As such, I usually provide both options.

@GlenDC
Copy link
Member

GlenDC commented Jan 26, 2025

Hope you are doing @hafihaf123 , just checking in here with you, no pressure, take your time, there's no hurry. However in case you are stuck, have some questions or want to talk something through do let me know :) I'm here for you. But also fine if you are ok as-is and continue at your own pace on your own. All good, just want you to know this.

@hafihaf123
Copy link
Contributor Author

Sorry for the delay, lately I didn't have much time to code. Anyways, I have completely reworked the API based on your suggestions. Please let me know where it needs some improvements and what should I focus on next.

@hafihaf123 hafihaf123 requested a review from GlenDC January 27, 2025 17:48
rama-http/Cargo.toml Outdated Show resolved Hide resolved
Copy link
Member

@GlenDC GlenDC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx starts to look close to being closed. Besides my remarks on the code, it seems we are still missing the actual implementation of https://docs.rs/rama/latest/rama/http/headers/trait.Header.html, right?

@hafihaf123 hafihaf123 requested a review from GlenDC January 29, 2025 18:06
@hafihaf123
Copy link
Contributor Author

hafihaf123 commented Jan 29, 2025

Hello,

I would appreciate a review before proceeding with the implementation of the actual encoding and decoding functionalities. Additionally, I would welcome any guidance on the best approach to implement these.

My current idea is to follow a structure similar to the Via header, implementing the FromStr trait along with a custom CSV parsing function to handle bot names effectively. I would love to hear your thoughts on this approach.

Thank you!

Copy link
Member

@GlenDC GlenDC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would appreciate a review before proceeding with the implementation of the actual encoding and decoding functionalities.

Looks pretty good, a couple of small remarks but overall pretty good.

Additionally, I would welcome any guidance on the best approach to implement these.

My current idea is to follow a structure similar to the Via header, implementing the FromStr trait along with a custom CSV parsing function to handle bot names effectively. I would love to hear your thoughts on this approach.

Sounds good, my 2 cents if you wish is as follows:

Mind though that you do want to add some sanity check there to ensure that you only allow starting a new RobotsTag if the previous tag had at least one value set (e.g. cheapest would be by keeping track of a counter or bool or w/e.

Other than that sounds like a sane idea.

Encoding is a lot easier so I guess you do understand that part.

Cargo.toml Outdated Show resolved Hide resolved
@hafihaf123 hafihaf123 force-pushed the issue-382/x-robots-tag branch from f10e6df to b887598 Compare February 1, 2025 22:19
@hafihaf123
Copy link
Contributor Author

I have implemented the initial versions of the encode and decode functionalities. For parsing, I chose an iterator-based approach, where each iteration yields a RobotsTag representing a valid portion of the header for a single bot name. The iterator continues processing until the entire string has been parsed.

The implementation is still somewhat rough and may benefit from refinements in structure, error handling, or efficiency. I would appreciate your feedback on areas where the approach could be improved.

@hafihaf123 hafihaf123 requested a review from GlenDC February 1, 2025 22:29
@hafihaf123 hafihaf123 force-pushed the issue-382/x-robots-tag branch from 54c0675 to 83b118a Compare February 5, 2025 21:27
use headers::Error;
use rama_core::error::OpaqueError;

macro_rules! robots_tag_builder_field {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was wondering whether it would make more sense to add a custom rule for bool fields, as it would make more sense to just be able to set them to true without having to specify (maybe then also add an toggle_off function or something like that). Also, it would prevent building an invalid RobotsTag by setting a field to false (RobotsTag::builder().bot_name(None).no_snippet(false).build()). Please let me know what you think about that.

P. S.
It would still be possible to do something like RobotsTag::builder().bot_name(None).no_follow().toggle_off_no_follow() to achieve the same thing, but I feel like it is more explicit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now, to keep it simple:

  • no_follow() instead of no_follow(true)
  • 🚫 let's not add the toggle_off versions

Also I would perhaps while you're at it not allow bot_name(None) but only use it with a value if you want to set it with bot name. Such that the usage would be

RobotsTag::builder().bot_name(MyBot).no_follow()

if you want to specify it for a bot or

RobotsTag::builder().no_follow()

to specify without a bot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the setters on a constructed RobotsTag? Should they also be changed for bool values or kept as-is?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the setters should be gone from RobotsTag, gonna be easier if that's just read-only

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The builder was relying on the setters to modify the RobotsTag structure inside it, how do I change that? Do I keep the setters but restrict their visibility or do I modify the visibility of the fields or how else could it be done?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can pub(super) the properties of RobotsTag

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support X-Robots-Tag as a typed http header XRobotsTag
2 participants