Scalable routing design principles

Scalable design is achieved by introducing a broker class, as follows:

ERM: Event Routing Mechanism

ERM Design Principles:

Events are essentially RESTful API calls. The ERM design principles are listed below:

– All Events contain a RESTful request URI containing the context (REST event resource or event name) and optional query parameters.

– Each event will contain mandatory headers, which are prefixed with the custom “X-” directive. The list of custom headers and their descriptions is given below.

– Each event will contain an optional payload encoded in JSON format.

– All events will be designed using asynchronous semantics, to avoid blocking behavior.

– Events could be subscribed or published. The event router will process incoming events which are published and shall deliver them to all subscribers.

– The event router can be configured to deliver events either in broadcast mode (parallel fork) or in sequential fork mode.

– Event delivery shall be reliable in nature, and all failure scenarios shall be taken care of either by the client which published the event, or by the ERM.

– There should be a mechanism to retry failed messages for event delivery.

Event Message Types –

The event messages will be of the following eight (8) types:

A) API Event Requests:

Event Type – 1

These will be RESTful POST/PUT/GET/DELETE requests with a discrete event name encoded in the servlet context of the HTTP request.

The event type header would denote that the REST/HTTP packet is an event request.

Each event request will also encode a flow-identifier, branch-identifier and a hop-count among other mandatory headers.
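As an illustration, a publisher might build such a request with the standard java.net.http API roughly as follows. This is a minimal sketch: the header names (X-Event-Type, X-Flow-Id, X-Branch-Id, X-Hop-Count), their initial values, and the ERM base URL are assumptions made for the example, since the exact names are not fixed at this point in the document.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class EventRequestSketch {
    // Hypothetical header names; the design only requires the "X-" prefix.
    static final String EVENT_TYPE = "X-Event-Type";
    static final String FLOW_ID = "X-Flow-Id";
    static final String BRANCH_ID = "X-Branch-Id";
    static final String HOP_COUNT = "X-Hop-Count";

    /** Builds (but does not send) a type 1 event request carrying a JSON payload. */
    static HttpRequest build(String ermBaseUrl, String eventName,
                             String flowId, String jsonPayload) {
        // The event name is encoded in the request path (the servlet context).
        return HttpRequest.newBuilder(URI.create(ermBaseUrl + "/" + eventName))
                .header(EVENT_TYPE, "1")   // 1 = API Event Request
                .header(FLOW_ID, flowId)
                .header(BRANCH_ID, "0")    // assumed initial branch value
                .header(HOP_COUNT, "0")    // assumed to start at zero
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonPayload))
                .build();
    }
}
```

The request is only constructed here; sending it asynchronously would be the job of an HTTP client, in line with the asynchronous semantics described earlier.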

B) API Event Syntax Error Responses:

Event Type – 2

Any event request which does not contain the mandatory headers mentioned above, would be deemed to be syntactically incorrect.

Such requests shall be rejected by the event router with a 400 BAD REQUEST packet consisting of JSON metadata which denotes which mandatory parameter(s) are missing.
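The check described above could look roughly like the sketch below. The header names and the "missing_headers" JSON field are hypothetical, chosen only to make the example concrete; the document does not prescribe them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class MandatoryHeaderCheck {
    // Hypothetical mandatory header names (assumption, not from the design).
    static final List<String> MANDATORY =
            List.of("X-Event-Type", "X-Flow-Id", "X-Branch-Id", "X-Hop-Count");

    /** Returns the mandatory headers that are absent or blank, in declaration order. */
    static List<String> missing(Map<String, String> headers) {
        // Note: this lookup is case-sensitive; a real router would normalize header names.
        List<String> missing = new ArrayList<>();
        for (String name : MANDATORY) {
            String value = headers.get(name);
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        }
        return missing;
    }

    /** Returns the JSON body for a 400 BAD REQUEST, or null if the request is valid. */
    static String badRequestBody(Map<String, String> headers) {
        List<String> missing = missing(headers);
        if (missing.isEmpty()) {
            return null;
        }
        return "{\"missing_headers\":[\"" + String.join("\",\"", missing) + "\"]}";
    }
}
```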

C) API Event Delivery Acceptance Responses:

Event Type – 3

When an incoming HTTP event Request is successfully received by a Microservice, it has to implement async semantics and confirm the receipt of the HTTP packet.

This needs to be done by sending a 202 Accepted response back to the sender.

This response is exchanged hop by hop, and ensures that the API event request is reliably routed from the publisher to the ERM and from the ERM to all the subscribers.

D) API Event Delivery Error Responses:

Event Type – 4

In case a subscribing Microservice is not able to process the incoming API event request due to an internal error, it will send back a 500 INTERNAL SERVER ERROR response.

Internal errors could be triggered due to overload conditions or software bugs in the Microservice (checked exception handling).

E) API Event Delivery Timeout Responses:

Event Type – 5

When the ERM delivers an event to a subscribing microservice’s ELB, and there is no response from it (202/400/500 etc), then the HTTP event delivery times out after a configurable timer interval.

Once this guard timer fires, an implicit timeout (408) is assumed by the event router.

This event is generated internally by the ERM, and not sent on the HTTP protocol.

The 408 status for a particular microservice subscriber is included in the ERM Delivery Status Report which is explained below.
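One possible way to model the guard timer is with CompletableFuture (Java 9 or later). In this sketch the subscriber's eventual HTTP status is represented by a future, which is an assumption about how the ERM is implemented; if the future does not complete within the configured interval, the ERM records an implicit 408 without any HTTP message being exchanged.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

final class GuardTimerSketch {
    static final int IMPLICIT_TIMEOUT = 408;

    /**
     * Wraps a pending subscriber response with a guard timer. The returned
     * future yields the subscriber's status if it arrives in time, and the
     * internally generated 408 otherwise.
     */
    static CompletableFuture<Integer> withGuardTimer(
            CompletableFuture<Integer> subscriberResponse, long timeoutMillis) {
        // copy() leaves the caller's future untouched when the timer fires.
        return subscriberResponse
                .copy()
                .completeOnTimeout(IMPLICIT_TIMEOUT, timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```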

F) ERM Delivery Status Report:

Event Type – 6

The ERM maintains a graph data structure which consists of the event name, and its subscribers.

The event is delivered to all the subscribers. As explained in the previous sections, each subscribing Microservice either responds with a 202 (Acceptance) or a 500 (Processing Error), or is assigned an implicit 408 (Timeout Error) by the ERM when no response arrives.

To illustrate the concept of an ERM delivery report, consider the following scenario –

S1, S2 and S3 are three subscribers to a given event. In case an event is delivered successfully to all subscribers (202 received from all of them), the ERM initiates an HTTP POST request with a delivery report payload to the publisher of the event.

The mandatory headers such as flow-id, branch-id etc. shall be the same as were received in the originally published event. This will allow the ELB to send the delivery report to the correct Microservice instance which published the event.

The payload of a delivery report shall contain the status of event delivery and the success/failure code.

The example below illustrates a delivery status report, where the event was successfully accepted by S1 and S2, but rejected by S3:

{
  "delivery_status_report": {
    "S1": "202",
    "S2": "202",
    "S3": "500"
  }
}
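A report payload of this shape could be rendered from the per-subscriber statuses as sketched below. The wrapping under a single "delivery_status_report" key follows the example above; the class and method names are hypothetical, and subscriber names are assumed to need no JSON escaping.

```java
import java.util.Map;
import java.util.stream.Collectors;

final class DeliveryReportSketch {
    /** Renders subscriber -> HTTP status pairs as a delivery status report. */
    static String toJson(Map<String, Integer> statusBySubscriber) {
        String entries = statusBySubscriber.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\": \"" + e.getValue() + "\"")
                .collect(Collectors.joining(", "));
        return "{\"delivery_status_report\": {" + entries + "}}";
    }

    /** True only if every subscriber acknowledged the event with a 202. */
    static boolean allAccepted(Map<String, Integer> statusBySubscriber) {
        return statusBySubscriber.values().stream().allMatch(status -> status == 202);
    }
}
```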

When the events are delivered as a broadcast (explained later in the document) or in a sequential manner – the ERM delivery status report semantics remain the same as explained below –

The ERM application reads from the graph the target Microservices where the event has to be delivered.

This data is stored in a Java POJO, which is part of a map, and inserted into a timed smart cache.

This POJO has the following fields:

Flow-id, event name and other mandatory header fields with values.

Please note that hop-count and branch-id are not persisted, as they are not relevant to delivery status reports.

EventDeliveryStatusMap (ConcurrentHashMap): Map Key is the Microservice name where the event was to be delivered, and value is an Integer type.

Value 1: API Event Delivery Acceptance Received (202)

Value 2: API Event Delivery Error Received (500)

Value 3: API Event Delivery Timeout (408)

Value 4: API Event Syntax Error (400)

This POJO is in turn stored in a global concurrent HashMap which has the flow-id as the key.

This Outer HashMap is then stored in the timed smart cache.

This is required so that the POJO can be retrieved and updated on receipt of 202 / 500 / 400 / Internal Timeout cases based on the flow-id and then the Microservice name.

If the smart cache entry expires while there are still branches for which no 202/500/400 response has been received over HTTP, those branches are implicitly marked as timed out without waiting for the actual HTTP socket timeout.

Hence, the smart cache TTL and the HTTP socket Timeouts have to be defined in a way that Smart Cache TTL is always greater than the socket Timeout.
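The bookkeeping described above can be sketched as follows. A plain ConcurrentHashMap stands in for the timed smart cache, whose implementation is not described here, and the expire method is meant to be called by whatever eviction hook that cache provides. The PENDING value (0) and the class names are assumptions added for the example; values 1 to 4 follow the list above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class DeliveryRecord {
    static final int PENDING = 0;       // assumption: no response received yet
    static final int ACCEPTED = 1;      // 202
    static final int ERROR = 2;         // 500
    static final int TIMEOUT = 3;       // 408
    static final int SYNTAX_ERROR = 4;  // 400

    final String flowId;
    final String eventName;
    // Key: subscriber Microservice name, value: one of the codes above.
    final Map<String, Integer> eventDeliveryStatusMap = new ConcurrentHashMap<>();

    DeliveryRecord(String flowId, String eventName, Iterable<String> subscribers) {
        this.flowId = flowId;
        this.eventName = eventName;
        for (String subscriber : subscribers) {
            eventDeliveryStatusMap.put(subscriber, PENDING);
        }
    }

    static int fromHttpStatus(int httpStatus) {
        switch (httpStatus) {
            case 202: return ACCEPTED;
            case 500: return ERROR;
            case 408: return TIMEOUT;
            case 400: return SYNTAX_ERROR;
            default: throw new IllegalArgumentException("Unexpected status " + httpStatus);
        }
    }
}

final class DeliveryTracker {
    // Outer map keyed by flow-id; stands in for the timed smart cache.
    private final Map<String, DeliveryRecord> byFlowId = new ConcurrentHashMap<>();

    void track(DeliveryRecord record) {
        byFlowId.put(record.flowId, record);
    }

    /** Looks up the record by flow-id, then updates the entry for the subscriber. */
    void onResponse(String flowId, String subscriber, int httpStatus) {
        DeliveryRecord record = byFlowId.get(flowId);
        if (record != null) {
            record.eventDeliveryStatusMap.replace(subscriber, DeliveryRecord.fromHttpStatus(httpStatus));
        }
    }

    /** Called on cache expiry: branches that are still pending become timeouts. */
    DeliveryRecord expire(String flowId) {
        DeliveryRecord record = byFlowId.remove(flowId);
        if (record != null) {
            record.eventDeliveryStatusMap.replaceAll(
                    (name, status) -> status == DeliveryRecord.PENDING ? DeliveryRecord.TIMEOUT : status);
        }
        return record;
    }
}
```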

For failed message deliveries, a retry mechanism is explained later.

G) API Event Processing ACK Response:

Event Type – 7

When the subscribing Microservice receives an event, it has to execute some business logic on the event.

This business logic in some cases may entail an outcome.

H) API Event Processing NACK Response:

Event Type – 8

If any service determines that the incoming event request was “logically” incorrect, it can also generate a NACK event with metadata detailing the reason for the NACK.

Event Subscriptions/Unsubscriptions –

The following semantics will be implemented by the event router for subscription management –

A) Pre-Provisioned Bulk Subscriptions
B) API driven Subscriptions
C) Batch Mode Unsubscriptions
D) API driven Unsubscriptions

These are explained below:

Subscriptions can be pre-provisioned in the configuration file of the ERM; these are known as pre-provisioned subscriptions. Which Microservice subscribes to which event depends on the business logic of that service.
It is also possible to provision additional events and subscribe a Microservice to them manually through the CLI/GUI.

API driven subscriptions are initiated by the Microservice in question. This is done by initiating an HTTP POST request towards the ERM.

Please note that the subscription API call is not an event. It is an out of band management API call.

If the subscription is successful, the ERM sends back a 200 OK response. If the subscription is not successful, the ERM sends back a 500 Internal Server Error response.

In such cases, the Microservice can re-try the subscription for a configurable number of times.

On similar lines, it is also possible to un-subscribe from events. This may take place in batch mode, through the CLI or GUI, or through API calls.

An HTTP DELETE request should be sent by the Microservice if it wishes to un-subscribe from an event in an API driven manner.
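On the ERM side, the four subscription paths above could all feed one in-memory registry, as in the sketch below. The class and method names are hypothetical: preProvision would be driven by the configuration file, while subscribe and unsubscribe would back the POST and DELETE management calls (and the CLI/GUI).

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class SubscriptionRegistrySketch {
    // Key: event name, value: names of the subscribed Microservices.
    private final Map<String, Set<String>> subscribersByEvent = new ConcurrentHashMap<>();

    /** Bulk-loads subscriptions, e.g. from the ERM configuration file. */
    void preProvision(Map<String, List<String>> config) {
        config.forEach((event, services) ->
                services.forEach(service -> subscribe(event, service)));
    }

    /** Returns true if the subscription was newly added. */
    boolean subscribe(String event, String service) {
        return subscribersByEvent
                .computeIfAbsent(event, key -> ConcurrentHashMap.newKeySet())
                .add(service);
    }

    /** Returns true if an existing subscription was removed. */
    boolean unsubscribe(String event, String service) {
        Set<String> services = subscribersByEvent.get(event);
        return services != null && services.remove(service);
    }

    /** Immutable snapshot of the current subscribers of an event. */
    Set<String> subscribers(String event) {
        return Set.copyOf(subscribersByEvent.getOrDefault(event, Set.of()));
    }
}
```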

Publishing Events –

Publishing an event is a straightforward process.

The event is sent through the HTTP POST/GET/PUT/DELETE methods depending upon the nature of an event.

All events are sent to the ERM, and each event needs to have mandatory headers.

JSON payload is permitted in an event request. The metadata definitions are captured for all events in the Event API guide.

Event Routing Semantics –

It is important to understand the core functionality of the event router and the routing semantics.

The ERM maintains a graph model which consists of the following node types:

1. Event Name Node
2. Subscriber Microservice Node

The event name node is always the first node of the graph (at the head of the graph). Hence, in this document it will be called the “hub node”.

The subscriber Microservice nodes can be many – depending upon how many Microservices have subscribed to a given event.

The relationship edges of the graph between the event name node and the subscriber microservice nodes can be of two types –

– Spoke Edge, which connects a subscriber Microservice node to the hub directly

– Sequence Edge, which connects subscriber microservice nodes in sequence to the hub node.
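A minimal in-memory sketch of this graph model is given below. It assumes that each node has at most one outgoing sequence edge, so that sequence edges form a single chain starting at the hub; the class and method names are illustrative only and say nothing about how the ERM actually stores its graph.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class EventGraph {
    enum EdgeType { SPOKE, SEQUENCE }

    static final class Edge {
        final EdgeType type;
        final String target;

        Edge(EdgeType type, String target) {
            this.type = type;
            this.target = target;
        }
    }

    private final String hub; // the event name node
    private final Map<String, List<Edge>> edges = new LinkedHashMap<>();

    EventGraph(String eventName) {
        this.hub = eventName;
    }

    void connect(String from, String to, EdgeType type) {
        edges.computeIfAbsent(from, key -> new ArrayList<>()).add(new Edge(type, to));
    }

    /** Subscribers attached directly to the hub by spoke edges. */
    List<String> spokeSubscribers() {
        List<String> result = new ArrayList<>();
        for (Edge edge : edges.getOrDefault(hub, List.of())) {
            if (edge.type == EdgeType.SPOKE) {
                result.add(edge.target);
            }
        }
        return result;
    }

    /** Subscribers reached by following sequence edges from the hub, in order. */
    List<String> sequenceChain() {
        List<String> result = new ArrayList<>();
        String current = hub;
        while (true) {
            String next = null;
            for (Edge edge : edges.getOrDefault(current, List.of())) {
                if (edge.type == EdgeType.SEQUENCE) {
                    next = edge.target;
                    break;
                }
            }
            // Stop at the end of the chain, or if a cycle is detected.
            if (next == null || result.contains(next)) {
                return result;
            }
            result.add(next);
            current = next;
        }
    }
}
```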

There are four use cases for event routing –

A) Broadcast (Parallel Fork)

It is often required to deliver events to subscriber Microservices in a broadcast fashion.

This is also known as a parallel fork.

Parallel fork delivery is provisioned in the event router as follows:

The event name is the hub node.

All the subscriber microservices are connected to the hub node with a “spoke” edge.
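Parallel fork delivery over the spoke subscribers could be sketched as follows. The deliver function is a hypothetical stand-in for the asynchronous HTTP delivery to one subscriber; it is assumed to always complete with a status code (for example a 408 once a guard timer fires).

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ParallelForkSketch {
    /**
     * Starts delivery to every spoke subscriber without waiting for the others,
     * and completes with subscriber -> status once all branches have answered.
     */
    static CompletableFuture<Map<String, Integer>> broadcast(
            List<String> spokeSubscribers,
            Function<String, CompletableFuture<Integer>> deliver) {
        Map<String, Integer> statuses = new ConcurrentHashMap<>();
        CompletableFuture<?>[] branches = spokeSubscribers.stream()
                .map(subscriber -> deliver.apply(subscriber)
                        .thenAccept(status -> statuses.put(subscriber, status)))
                .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(branches).thenApply(ignored -> statuses);
    }
}
```

The resulting map is the raw material for the delivery status report sent back to the publisher.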

B) Sequential Delivery

There may be a requirement for delivering an event to its subscribers in a pre-defined sequence.

In such cases, unless the first Microservice in the sequence sends back an “API Event Delivery Acceptance Response” (Event Type -3) to the ERM, the event is not delivered to the next microservice in the sequence.

Sequence delivery is provisioned as follows:

The event name is the hub node.

The subscriber microservices are connected in a pre-defined sequence to the hub node.

All edge relationships from the hub node towards the first Microservice node, and between the subscriber Microservice nodes themselves, are of type “sequence”.
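The gating rule for sequential delivery can be sketched as below. For brevity the delivery call is modelled as a blocking function that returns the subscriber's status; this is a simplification for the example, and the deliver parameter is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

final class SequentialDeliverySketch {
    /**
     * Delivers to the subscribers in chain order. A subscriber is only
     * attempted if the previous one answered with a 202; the returned map
     * holds the statuses of the subscribers that were actually attempted.
     */
    static Map<String, Integer> deliverInSequence(
            List<String> chain, ToIntFunction<String> deliver) {
        Map<String, Integer> statuses = new LinkedHashMap<>();
        for (String subscriber : chain) {
            int status = deliver.applyAsInt(subscriber);
            statuses.put(subscriber, status);
            if (status != 202) {
                break; // the remaining subscribers are not attempted
            }
        }
        return statuses;
    }
}
```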

C) Hybrid Delivery

It may be required for an event to be delivered in sequence to some subscriber Microservices, and in parallel fork mode to other Microservices.

Even though this possibility is rare, the ERM design will cater to this requirement as follows:

The event name will be the hub node.

There will be multiple subscriber Microservice nodes attached by “spoke” edges for parallel fork delivery, and others attached by “sequence” edges for sequential delivery.

Hence, this graph instance will be hybrid in nature.

D) Event Delivery Retry

It is possible that certain events are not delivered to the subscriber Microservices irrespective of whether the event was delivered as a parallel fork or sequentially.

For such cases, the following special procedures are applicable –

The publishing Microservice would receive the ERM Delivery Status Report.

This status report would inform the publisher whether or not the event was delivered successfully to each target subscriber Microservice.

Based on the business logic of the publisher, and the call flows, a retry may or may not be needed for the failed event delivery.

In case no retry is needed, the ERM Delivery Status Report is acknowledged by a 200 OK and no further messages flow.

In case a retry is required, the publisher sends a special RETRY HTTP POST request with a context set to /retry towards the ERM.

For the /retry messages, the ERM does not consult the graph structure and attempts immediate delivery to the target Microservice.

This is a synchronous transaction, and a 200 OK, 500, 400, 408 etc. is sent back to the publisher based on whether the retry succeeded or failed.

To avoid an infinite chain of retries, the publisher microservice can retry a failed event at most 5 times, after which permanent failure should be assumed and an alarm should be raised.
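The retry cap on the publisher side might be implemented along these lines. Both sendRetry (a synchronous POST to /retry that returns the resulting status) and raiseAlarm are hypothetical hooks supplied by the publisher; only the limit of 5 attempts comes from the text above.

```java
import java.util.function.IntSupplier;

final class RetryPolicySketch {
    static final int MAX_RETRIES = 5;

    /**
     * Retries a failed delivery until the ERM answers 200 OK or the retry
     * limit is reached. On permanent failure the alarm hook is invoked once.
     */
    static boolean retryUntilDelivered(IntSupplier sendRetry, Runnable raiseAlarm) {
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            if (sendRetry.getAsInt() == 200) {
                return true;
            }
        }
        raiseAlarm.run(); // permanent failure is assumed after the last retry
        return false;
    }
}
```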