Error Handling in gRPC

As mentioned in some previous chapters, error handling in gRPC is a somewhat problematic topic. gRPC itself does not prescribe a consistent error-handling model, and there is no broad consensus on how you should handle errors.

Statuses

The basic vehicle for dealing with gRPC errors is the status. Statuses are often used to indicate errors with the service or the gRPC library itself; you can think of them as something similar to HTTP response statuses.

However, gRPC statuses are not part of the schema, you cannot define your own status codes, and you cannot differentiate between system and business errors (that is, whether a failure comes from the library itself or from incorrect use of your gRPC API). This may be a deal breaker for you.

It is possible to attach a string message to a status, which may be used to distinguish some errors:

#![allow(unused)]
fn main() {
use bytes::Bytes;
use tonic::{Status, Code};

let status = Status::with_details(
   Code::InvalidArgument,
   "ID does not exist",
   Bytes::new(),
);
}

There are also constructors for specific codes:

#![allow(unused)]
fn main() {
use tonic::Status;

let status = Status::invalid_argument("ID does not exist");
}

As you can see, there is a third parameter in the first example. This parameter is called details and it can be used to provide additional binary data.

The usage of this field is a bit problematic, as it is not tracked by the schema. Therefore, you do not get the usual forward compatibility, backward compatibility, and consistent format benefits that you would otherwise have. Without negotiating elsewhere, the other side has no way of knowing the format and shape of the details data. Of course, you could ham-fist another protocol buffer in there, though.
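To make "negotiating elsewhere" concrete, here is a minimal sketch in which both sides agree out of band on a tiny format for the details bytes. The format (one tag byte followed by a big-endian u32) and the function names encode_max_length and decode_max_length are assumptions made up for this illustration; they are not part of tonic or gRPC.

```rust
/// Hypothetical out-of-band agreement: the details payload is one tag byte
/// followed by a big-endian u32. Nothing in the schema enforces this.
const TAG_MAX_LENGTH: u8 = 1;

/// Server side: pack a "maximum allowed length" hint into raw detail bytes.
pub fn encode_max_length(max: u32) -> Vec<u8> {
    let mut out = vec![TAG_MAX_LENGTH];
    out.extend_from_slice(&max.to_be_bytes());
    out
}

/// Client side: returns None when the bytes do not follow the agreed format,
/// which is all a client can do without a schema.
pub fn decode_max_length(details: &[u8]) -> Option<u32> {
    match details {
        [TAG_MAX_LENGTH, a, b, c, d] => Some(u32::from_be_bytes([*a, *b, *c, *d])),
        _ => None,
    }
}
```

On the server, the encoded bytes would be what you pass as the third argument of Status::with_details. If either side changes the format without telling the other, decoding simply yields None.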

The status you create is used in the return value of your services. For example, we can take a look at the tonic::server::UnaryService definition:

pub trait UnaryService<R> {
    type Response;
    type Future: Future<Output = Result<Response<Self::Response>, Status>>;

    fn call(&mut self, request: Request<R>) -> Self::Future;
}

or in more concrete terms:

use tonic::{Request, Response, Status};

use hello_world::greeter_server::Greeter;
use hello_world::{HelloReply, HelloRequest};

// Generated from the "helloworld" proto package at build time.
pub mod hello_world {
    tonic::include_proto!("helloworld");
}

#[derive(Default)]
pub struct MyGreeter {}

#[tonic::async_trait]
impl Greeter for MyGreeter {
    // A handler reports failure by returning `Err(Status)`.
    async fn say_hello(
        &self,
        request: Request<HelloRequest>,
    ) -> Result<Response<HelloReply>, Status> {
        println!("Got a request from {:?}", request.remote_addr());

        let reply = HelloReply {
            message: format!("Hello {}!", request.into_inner().name),
        };
        Ok(Response::new(reply))
    }
}

Plain statuses are not great, since it can be quite difficult to show a meaningful error to the user on the frontend of the application.
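The following sketch shows why a bare status is awkward for a frontend. Code and Status here are local stand-ins that only mimic the shape of the tonic types, so that the example compiles on its own; the user-facing strings are invented for illustration.

```rust
/// Local stand-ins for `tonic::Code` and `tonic::Status`, reduced to what the
/// sketch needs so that it compiles without the tonic crate.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Code {
    InvalidArgument,
    Unavailable,
    Internal,
}

pub struct Status {
    pub code: Code,
    pub message: String,
}

/// The frontend has only a coarse code and a developer-facing string to go on.
pub fn user_message(status: &Status) -> String {
    match status.code {
        Code::Unavailable => "The service is temporarily unavailable.".to_string(),
        // Brittle: this breaks as soon as the backend rewords its message.
        Code::InvalidArgument if status.message.contains("ID does not exist") => {
            "We could not find that item.".to_string()
        }
        // Everything else collapses into one generic error.
        _ => "Something went wrong.".to_string(),
    }
}
```

The string match in the middle arm is the weak point: nothing in the schema ties the message text to its meaning.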

Schema:

message UserInfo {}

message LoginUserRequest {
    string username = 1;
    string password = 2;
}

message LoginUserResponse {
    UserInfo userinfo = 1;
}

service AuthenticationService {
    rpc LoginUser(LoginUserRequest) returns (LoginUserResponse);
}

Google's rich error model

This is quite similar to the previous approach, except that we now have a Protocol Buffers message defined for our status:

package google.rpc;

// The `Status` type defines a logical error model that is suitable for
// different programming environments, including REST APIs and RPC APIs.
message Status {
  // A simple error code that can be easily handled by the client. The
  // actual error code is defined by `google.rpc.Code`.
  int32 code = 1;

  // A developer-facing human-readable error message in English. It should
  // both explain the error and offer an actionable resolution to it.
  string message = 2;

  // Additional error information that the client code can use to handle
  // the error, such as retry info or a help link.
  repeated google.protobuf.Any details = 3;
}

These are still quite simple to use on the backend, and they do not interfere with 3rd parties: the status travels in an HTTP/2 trailer (grpc-status-details-bin), so clients that do not support it can safely ignore it. Adding a new error state also does not break backward compatibility.

However, we still have issues with transparency. The API does not show which errors a call can return, and distinguishing between system and business errors is, again, a matter of convention rather than something explicit.

Furthermore, due to the Any type, the details are not part of the schema with this model either. Also, Go utilities can sometimes only work with errors that contain error types predefined by Google, which severely limits the set of errors you can use.
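A minimal sketch of what a client has to do with such a status: scan the details for a type URL it already knows. Any and RichStatus below are hand-written stand-ins for google.protobuf.Any and google.rpc.Status (real code would use generated types), and find_detail is a hypothetical helper.

```rust
/// Stand-in for `google.protobuf.Any`: a type URL plus opaque encoded bytes.
pub struct Any {
    pub type_url: String,
    pub value: Vec<u8>,
}

/// Stand-in for the `google.rpc.Status` message shown above.
pub struct RichStatus {
    pub code: i32,
    pub message: String,
    pub details: Vec<Any>,
}

/// Returns the payload of the first detail with the given type URL, if any.
/// The client must know the URL and the matching message type in advance;
/// the service schema itself does not list them.
pub fn find_detail<'a>(status: &'a RichStatus, type_url: &str) -> Option<&'a [u8]> {
    status
        .details
        .iter()
        .find(|d| d.type_url == type_url)
        .map(|d| d.value.as_slice())
}
```

A client would call it with a URL such as "type.googleapis.com/google.rpc.ErrorInfo" and then decode the returned bytes with the corresponding message type.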

Schema:

message UserInfo {}

message LoginUserRequest {
    string username = 1;
    string password = 2;
}

message LoginUserResponse {
    UserInfo userinfo = 1;
}

service AuthenticationService {
    rpc LoginUser(LoginUserRequest) returns (LoginUserResponse);
}

As you can see, there is still no change to the actual schema. Keep in mind that errors propagated this way are meant to be developer-facing and, according to Google, must be in English.

Numeric errors (errno 2: The electric boogaloo)

Another option is to encode errors numerically in the response and let the client decipher them. Based on the code and, optionally, additional data, the client then presents an error to the user. This is the first model here that is geared towards being user-facing.

A neat way to do this is to essentially replicate the Result type from Rust. Another benefit is that we do not need to deal with translating the error on the backend, as that is handled by the client.

This error handling model can have errors returned as part of every response, or, once again, be put into an HTTP header, if we want to help the ergonomics of 3rd party clients.

However, if a parameter is part of the error state (for example minimum or maximum length of a password), then that information is implicit and duplicated between systems. Furthermore, adding new error states can be backwards incompatible.

In this model, we store the errors in the schema.

message UserInfo {}

message LoginUserRequest {
    string username = 1;
    string password = 2;
}

enum LoginError {
    UNKNOWN_ERROR = 0;
    USERNAME_IS_TOO_SHORT = 1;
    INVALID_PASSWORD = 2;
}

message LoginErrors {
    repeated LoginError errors = 1;
}

message LoginUserResponse {
    oneof result {
        LoginErrors errors = 1;
        UserInfo userinfo = 2;
    }
}

service AuthenticationService {
    rpc LoginUser(LoginUserRequest) returns (LoginUserResponse);
}
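On the client, this schema could be consumed roughly as follows. The sketch uses hand-written Rust types rather than generated ones; from_wire, user_message, messages, and the message texts are assumptions for illustration.

```rust
/// Mirrors the `LoginError` enum from the schema above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum LoginError {
    Unknown,
    UsernameIsTooShort,
    InvalidPassword,
}

impl LoginError {
    /// Decodes the wire value. Codes this client does not know (for example,
    /// ones added by a newer backend) fall back to `Unknown`.
    pub fn from_wire(code: i32) -> Self {
        match code {
            1 => LoginError::UsernameIsTooShort,
            2 => LoginError::InvalidPassword,
            _ => LoginError::Unknown,
        }
    }

    /// Translation to user-facing text happens on the client.
    pub fn user_message(self) -> &'static str {
        match self {
            LoginError::UsernameIsTooShort => "The username is too short.",
            LoginError::InvalidPassword => "The password is incorrect.",
            LoginError::Unknown => "Login failed.",
        }
    }
}

/// The `oneof result` maps naturally onto Rust's `Result`.
/// `UserInfo` is empty in the schema, so `()` stands in for it here.
pub type LoginResult = Result<(), Vec<LoginError>>;

/// Collects the messages the frontend would display for a response.
pub fn messages(result: &LoginResult) -> Vec<&'static str> {
    match result {
        Ok(()) => Vec::new(),
        Err(errors) => errors.iter().map(|e| e.user_message()).collect(),
    }
}
```

Note that "too short" carries no number: the minimum length has to be known to the client separately, which is the duplication described above.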

Explicit error set in every interface

This approach adds a oneof error message that, with adequate types, describes precisely what went wrong, with all the necessary data bundled with it. Think of how you use Rust enums to describe errors with crates like thiserror.

We can also easily distinguish between system and business errors, as business errors are returned via the error message, whereas system errors returned by the gRPC library et al. are returned as a status (as they always are).

In this error model, we are providing the maximum amount of information to the frontend, which then has an easy time figuring out what went wrong, and how, in order to produce a user-facing error.

However, this significantly increases API complexity, and it might be more difficult to use by 3rd parties than the previous approaches. Adding an error state on the backend is backwards incompatible.

It is also quite likely that you will have to compose the error types, much like in Java. For example, on the edge-server, a method may call several backend methods, and so it must be able to return all possible errors from multiple methods.

Of course, these errors are in the schema.

Schema:

message LoginUserRequest {
    string username = 1;
    string password = 2;
    CaptchaChallengeResponse captcha = 3;

    oneof secondfactor {
        string otp_token = 4;
        WebAuthnResponse webauthn = 5;
    }
}

message LoginUserError {
    oneof error {
        google.protobuf.Empty  captcha_validation_failed = 1;

        // validate_login_credentials errors
        google.protobuf.Empty  invalid_login_credentials = 2;
        core.ValueTooLongError password_too_long = 3;

        // validate_2fa errors
        google.protobuf.Empty  invalid_user = 4;
        google.protobuf.Empty  user_does_not_have_otp = 5;
        google.protobuf.Empty  invalid_otp_token = 6;
        google.protobuf.Empty  user_has_no_webauthn_challenge = 7;
        google.protobuf.Empty  user_has_no_fido = 8;
        google.protobuf.Empty  invalid_webauthn_credential_id = 9;
        google.protobuf.Empty  invalid_webauthn_client_data_json = 10;
        google.protobuf.Empty  invalid_webauthn_challenge = 11;
        google.protobuf.Empty  webauthn_auth_failed = 12;
    }
}

message LoginUserErrors {
    repeated LoginUserError errors = 1;
}

message LoginUserResponse {
    oneof result {
        LoginUserErrors               errors = 1;
        clients.profile_mgmt.UserInfo userinfo = 2;
    }
}

service EdgeServerService {
    rpc LoginUser(LoginUserRequest) returns (LoginUserResponse);
}
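The composition mentioned earlier can be sketched in Rust as follows. This is a hand-written analog, not generated code: it nests the backend error types instead of flattening them into one oneof, and the toy validation rules (a limit of 64 and a rejected token "000000") are made up for illustration.

```rust
/// Errors of a hypothetical `validate_login_credentials` backend call.
#[derive(Debug, PartialEq)]
pub enum CredentialsError {
    InvalidLoginCredentials,
    /// Carries its own data, like `core.ValueTooLongError` in the schema.
    PasswordTooLong { max_length: u32 },
}

/// A subset of the errors of a hypothetical `validate_2fa` backend call.
#[derive(Debug, PartialEq)]
pub enum SecondFactorError {
    InvalidOtpToken,
    WebauthnAuthFailed,
}

/// The edge server's error type has to cover every backend error it may see.
#[derive(Debug, PartialEq)]
pub enum LoginUserError {
    Credentials(CredentialsError),
    SecondFactor(SecondFactorError),
}

impl From<CredentialsError> for LoginUserError {
    fn from(e: CredentialsError) -> Self {
        LoginUserError::Credentials(e)
    }
}

impl From<SecondFactorError> for LoginUserError {
    fn from(e: SecondFactorError) -> Self {
        LoginUserError::SecondFactor(e)
    }
}

/// Toy backend call; the limit of 64 is an arbitrary value for illustration.
pub fn validate_login_credentials(password: &str) -> Result<(), CredentialsError> {
    if password.len() > 64 {
        return Err(CredentialsError::PasswordTooLong { max_length: 64 });
    }
    Ok(())
}

/// Toy backend call; "000000" is a made-up token that is always rejected.
pub fn validate_2fa(otp_token: &str) -> Result<(), SecondFactorError> {
    if otp_token == "000000" {
        return Err(SecondFactorError::InvalidOtpToken);
    }
    Ok(())
}

/// `?` converts each backend error into the composed type via `From`.
pub fn login_user(password: &str, otp_token: &str) -> Result<(), LoginUserError> {
    validate_login_credentials(password)?;
    validate_2fa(otp_token)?;
    Ok(())
}
```

Every new backend error forces a change in the composed type as well, which is the backward-compatibility cost noted above.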

Another example, this time using boolean flags instead of a oneof of error variants:

message LoginUserRequest {
    string username = 1;
    string password = 2;
    CaptchaChallengeResponse captcha = 3;
    oneof secondfactor {
        string otp_token = 4;
        WebAuthnResponse webauthn = 5;
    }
}

// At least one of the booleans inside
// this message MUST be true
message LoginUserErrors {
    bool captcha_validation_failed = 1;
    // validate_login_credentials errors
    bool invalid_login_credentials = 2;
    bool password_is_too_long = 3;
    uint32 max_password_length = 4;

    // validate_2fa errors
    bool otp_token_is_invalid = 5;
    // The input WebAuthn challenge probably expired.
    bool webauthn_challenge_is_invalid = 6;
    // Validation of the WebAuthn response failed - possible reasons: ...
    bool webauthn_auth_failed = 7;
}

message LoginUserResponse {
    oneof result {
        LoginUserErrors errors = 1;
        clients.profile_mgmt.UserInfo userinfo = 2;
    }
}

service EdgeServerService {
    rpc LoginUser(LoginUserRequest) returns (LoginUserResponse);
}
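The rule in the schema comment ("at least one of the booleans MUST be true") cannot be expressed in Protocol Buffers, so it has to be checked by hand. A minimal sketch with a hand-written mirror of the message follows; has_any_error is a hypothetical helper.

```rust
/// Mirrors the flag-based `LoginUserErrors` message above.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct LoginUserErrors {
    pub captcha_validation_failed: bool,
    pub invalid_login_credentials: bool,
    pub password_is_too_long: bool,
    pub max_password_length: u32,
    pub otp_token_is_invalid: bool,
    pub webauthn_challenge_is_invalid: bool,
    pub webauthn_auth_failed: bool,
}

impl LoginUserErrors {
    /// Checks the rule stated in the schema comment: at least one flag is set.
    /// `max_password_length` is data, not a flag, so it does not count.
    pub fn has_any_error(&self) -> bool {
        self.captcha_validation_failed
            || self.invalid_login_credentials
            || self.password_is_too_long
            || self.otp_token_is_invalid
            || self.webauthn_challenge_is_invalid
            || self.webauthn_auth_failed
    }
}
```

A server could assert has_any_error() before sending the message, and a client could treat a message that fails the check as malformed.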

Pavel's approach

Error handling philosophy

The primary reason for propagating errors smartly is to inform users about errors on their side that they can do something about (for example, a wrongly formatted entity name), or about errors they need to understand in order to keep using the system (such as the system being temporarily unavailable). Detailed errors that are not caused by the user, or that the user cannot do anything about, do not have to be communicated to the user.

The set of errors that the user can cause should be finite across all forms and APIs, and reasonably small. These errors should be reused across different APIs, so the number of distinct errors should grow asymptotically slower than the number of API calls. For example, the error count may be O(log N), where N is the number of API calls.

For instance, authentication and authorization issues are typically the same across all calls. Same goes for technical requirements such as the correctness of parameters, existence of entity, entities in correct states, and so on.

Often when calling an API, the only thing that matters to the caller is whether the operation was executed as expected or not. The precise reasons for an error, and what to do about it, can differ from case to case.

It is better to keep things simple. Complexity has to be justified, ideally by practical experience.

Description

In practice, in relation to the actual calls, this method is the same as the second one (Google's rich errors). Therefore, we have a standard extended type for error propagation, just like Google suggests. The difference is in usage and organization of calls.

In the case of an error, an API call returns a status carrying an error code that describes the key issue.

The error code is stored in the reason field of ErrorInfo. The code can also be transferred in an HTTP header, optionally. The error codes are not part of the API signature or otherwise strongly-typed.

The error codes are shared between client and server, but only errors that can be interpreted by the end user (i.e. by the frontend) must be shared and maintained. All other errors can always be presented to the end user as an "internal system error".

The meaning of an error code is global and shared among API calls.
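On the frontend, this boils down to a single global lookup with a catch-all. The sketch below assumes that the reason strings match the names of the E enum from the schema example in this chapter; message_for_reason and the message texts are made up for illustration.

```rust
/// Maps the `reason` string of an `ErrorInfo` to user-facing text.
/// Only reasons the end user can act on are listed; anything else, including
/// reasons added to the backend later, is shown as an internal system error.
pub fn message_for_reason(reason: &str) -> &'static str {
    match reason {
        "E_AUTH" => "Please sign in again.",
        "E_VAL_FORMAT" => "One of the values has an invalid format.",
        "E_ENT_EXISTS" => "That name is already taken.",
        "E_ENT_NOT_FOUND" => "The requested item does not exist.",
        _ => "Internal system error.",
    }
}
```

Because the lookup is global, the same function serves every API call, and a backend can introduce new reasons without breaking existing clients.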

For rich user interaction (i.e. helping the user understand errors), a form, an API call, or its individual values can have associated precondition checks. These precondition checks verify the validity and appropriateness of form values, the existence or state of entities in the system (such as password quality, the existence of a user, and so on), or other business requirements. These checks do not have to be (and ideally shouldn't be) tied to specific API calls.

Precondition checks must always be read-only: they must not modify the system. They are reusable across different calls, so that they eventually form a sort of analytical dictionary for verifying requirements. Furthermore, they should be usable not only before a call, but also after a call that fails, without a risk of race conditions. The reasoning is that an API call ending in an error typically has, and should have, the semantics of "no change", i.e. we can think of it as an atomic transaction. Best practice, however, is to verify ahead of time if you want to display rich, explanatory errors and, even better, to lead the user towards correct calls rather than react to errors after the fact.

The return values of precondition checks do not have to be bools; they can carry any explanatory information. They still have to be read-only, though.

The API call stays atomic, and all conditions are verified as part of the call itself. It must be completely independent of whether the user ran any precondition checks, and of which ones. The checks should eliminate the typical reasons for API failure, but everything should work even if they aren't used.
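The relationship between checks and submit calls can be sketched with a toy in-memory backend. Users, the 3-character minimum, and the method bodies are assumptions for illustration; the method and error names follow the schema example in this chapter.

```rust
use std::collections::HashSet;

/// Error codes named after the `E` enum in the schema example of this chapter.
#[derive(Debug, PartialEq)]
pub enum E {
    ValFormat,
    EntExists,
}

/// Toy in-memory backend, a stand-in for the real service.
#[derive(Default)]
pub struct Users {
    names: HashSet<String>,
}

impl Users {
    /// Read-only precondition check (`&self`); the 3-character minimum is an
    /// arbitrary rule for illustration.
    pub fn check_username_is_well_formed(&self, name: &str) -> bool {
        name.len() >= 3 && name.chars().all(|c| c.is_ascii_alphanumeric())
    }

    /// Read-only precondition check (`&self`).
    pub fn check_username_availability(&self, name: &str) -> bool {
        !self.names.contains(name)
    }

    /// The submit call verifies everything itself, whether or not the client
    /// ran the checks first, and changes nothing when it fails.
    pub fn user_signup(&mut self, name: &str) -> Result<(), E> {
        if !self.check_username_is_well_formed(name) {
            return Err(E::ValFormat);
        }
        if !self.check_username_availability(name) {
            return Err(E::EntExists);
        }
        self.names.insert(name.to_string());
        Ok(())
    }
}
```

The checks take &self and the submit call takes &mut self, so the read-only property is visible in the signatures. A client may call the checks before submitting, after a failure, or not at all.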

Benefits

There is no added complexity where no added complexity is necessary. Error propagation is trivial, same as the first variant. The validation is proactive before sending the form - this leads to earlier reporting and often eliminates the problem of reacting to an error.

It does not prevent using a complex error type with description for the cases, where this is justified. For example, complex verification logic for forms. It is straightforward for handling typical errors.

It eliminates propagation of interface changes from the backend to the frontend and vice versa. Adding an error type that has no user interpretation does not require a change on the frontend. Adding an error with a user interpretation creates only a weak dependency: the frontend can, in theory, run without knowing about it.

Detriments

Well, there are multiple calls involved, which can be a detriment in and of itself, and may in rare cases lead to race conditions. This is why it is key that precondition checks remain read-only, so that they cannot introduce data races.

There is also no assurance that the user can process the errors we produce. This does not make the system non-functional or incorrect, though; it merely makes error communication less accommodating, which is easy to detect and fix.

Lastly, standard errors need to be shared between backend and frontend, perhaps through public documentation (covering only the relevant subset of all our errors).

API example in schema

syntax = "proto3";

import 'google/protobuf/empty.proto';
import 'google/protobuf/wrappers.proto';

// The enum type is not used anywhere as part of the interface annotation,
// but is available for producers and consumers for easy lookup / exhaustive checks.
enum E {
  E_UNKNOWN = 0;

  //
  // Authentication
  //
  // e.g. an invalid session on an API that requires authentication
  E_AUTH = 101;
  // Stateful and/or time sensitive methods that cannot be validated by themselves / multiple times
  E_AUTH_OTP = 102;
  E_AUTH_FIDO = 103;
  E_AUTH_CAPTCHA = 104;

  //
  // Values
  //
  // eg.: "invalid email format" when raised by "CheckEmail('foo@bar')"
  E_VAL_FORMAT = 201;
  E_VAL_RANGE = 203;

  //
  // Entities
  //
  // eg.: "username is already taken" when raised by "CheckUsername('foo')"
  E_ENT_EXISTS = 301;
  // eg.: "target account doesn't exist" when raised by "MergeAccountWith('JohnDoe')"
  E_ENT_NOT_FOUND = 302; 
}

message InitAuthMethods {
  // submit method requires password response in a HTTP header "x-ii-password"
  bool requires_password = 1;
  // submit method requires captcha response in a HTTP header "x-ii-captcha"
  bool requires_captcha = 2;
  // submit method requires fido response in a HTTP header "x-ii-fido"
  bool requires_fido = 3;
  // submit method requires otp response in a HTTP header "x-ii-otp"
  bool requires_otp = 4;
};

// Login
message UserLoginRequest {
  string username = 1;
  string password = 2;
}
message UserLoginResponse {
  string token = 1;
}

// Signup
message UserSignupRequest {
  string username = 1;
  string password = 2;
  string email = 3;
};
message UserSignupResponse {};

// IntegrityCheckedAction
message IntegrityCheckedActionInitResponse {
  InitAuthMethods auth = 1;
};
message IntegrityCheckedActionRequest {
  message SplitItem {
    float value = 1;
    string label = 2;
  }

  // Sum of item's "value" attributes has to be 100!
  repeated SplitItem items = 1;
};
message IntegrityCheckedActionResponse {
  message Errors {
    bool has_sum_100 = 1;
    bool has_unique_labels = 2;
    bool odd_values_are_even = 3;
  };

  oneof result {
    google.protobuf.Empty ok = 1;
    Errors err = 2;
  }
};

service ExampleService {
  // Condition checks
  rpc CheckUsernameIsWellFormed(google.protobuf.StringValue) returns (google.protobuf.BoolValue);
  rpc CheckUsernameAvailability(google.protobuf.StringValue) returns (google.protobuf.BoolValue);
  rpc CheckPasswordIsWellFormed(google.protobuf.StringValue) returns (google.protobuf.BoolValue);
  rpc CheckEmailIsWellFormed(google.protobuf.StringValue) returns (google.protobuf.BoolValue);

  // "Submit" methods that include the "Condition" methods above as part of their inner logic
  // and bounce back an RPC error encoded as a "google.rpc.ErrorInfo" message.
  //
  // The expectation is that the client:
  //  - won't even encounter such an error if it pre-checks everything;
  //  - re-checks its preconditions should it encounter some error.
  rpc UserLogin(UserLoginRequest) returns (UserLoginResponse);
  rpc UserSignup(UserSignupRequest) returns (UserSignupResponse);

  // Speculative example of interface whose errors cannot be widened
  // to a set of generic errors and has its error state communicated
  // as a strongly typed return value.
  //
  // It can, however, raise authentication related errors
  // as part of the authentication middleware logic.
  rpc IntegrityCheckedActionInit(google.protobuf.Empty) returns (IntegrityCheckedActionInitResponse);
  rpc IntegrityCheckedAction(IntegrityCheckedActionRequest) returns (IntegrityCheckedActionResponse);
}
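A server-side sketch of the strongly typed case, IntegrityCheckedAction, might look like this. The types are hand-written mirrors of the messages above, only two of the three Errors fields are modelled, and the float tolerance is an assumption.

```rust
use std::collections::HashSet;

/// Mirrors `IntegrityCheckedActionRequest.SplitItem`.
pub struct SplitItem {
    pub value: f32,
    pub label: String,
}

/// Mirrors two fields of `IntegrityCheckedActionResponse.Errors`;
/// `odd_values_are_even` is left out of this sketch.
#[derive(Debug, PartialEq)]
pub struct Errors {
    pub has_sum_100: bool,
    pub has_unique_labels: bool,
}

/// Returns `Ok(())` when every condition holds, otherwise the full report.
/// The tolerance of 0.001 for the float comparison is an assumption.
pub fn integrity_checked_action(items: &[SplitItem]) -> Result<(), Errors> {
    let sum: f32 = items.iter().map(|i| i.value).sum();
    // `insert` returns false for a label that was already seen.
    let mut seen = HashSet::new();
    let unique = items.iter().all(|i| seen.insert(i.label.as_str()));
    let report = Errors {
        has_sum_100: (sum - 100.0).abs() < 0.001,
        has_unique_labels: unique,
    };
    if report.has_sum_100 && report.has_unique_labels {
        Ok(())
    } else {
        Err(report)
    }
}
```

Here the error report is part of the return type, so the client can tell exactly which condition failed without consulting the shared error codes.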