Kabootar — Swiggy’s Communication Platform
Swiggy has a wide range of end-users: customers, delivery executives, restaurant vendors and customer care executives. Maintaining a reliable communication link with all of them is essential, and as a product we need to reach each of them from time to time through channels like push notifications, SMS and email.
Hence, there has always been a need for a communication platform. Traditionally, communication at Swiggy was decentralized, with each team embedding its own logic in its microservices. This worked well while teams were small and use-cases were distinct and deterministic, but as we continued to scale rapidly, the demand grew for a centralized platform that could cater to the needs of all teams.
As a platform, we wanted a centralized place to send both our transactional and marketing communications, each with its own priority. Support for fallback and retry was a must. A user should be able to define templates with renderable context in the system and also decide when a communication goes out. Templates become part of a wrapping construct called a campaign, so that we can send multiple communications within the same context. Campaigns also help in tracking the success of a communication in terms of open rates, CTAs etc., and in handling certain specific use cases like filtering or enrichment.

Any communication to be sent out is embedded in a context object called a ‘MESSAGE’. This object is a self-sufficient entity that carries all the information required to send the communication. Embedding everything in a single object with no outside state helps in building a distributed system, because any node can process the message. The message object passes through multiple steps (microservices) in our system: the event system receives an event and generates context around it; the campaign system builds the individual message object and applies campaign context to it; the templating service renders the template and embeds tracking information into it; finally, the router picks it up and sends the communication out through the right channel. The message itself has a state machine which is continuously persisted in our storage, so any message stuck in a non-final state can be picked up again and handed to the right processor. Most of the domains in our system, be it an event, message, campaign or tracker, have a state machine; it is very easy to process a domain that can be represented in terms of state.
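To make the idea concrete, here is a minimal sketch of what such a self-contained message with its own state machine could look like. The class, field and state names below are illustrative assumptions, not Kabootar's actual model.

```java
import java.util.Map;

// Minimal sketch of a self-contained message with its own state machine.
// Field and state names are illustrative, not Kabootar's actual model.
public class Message {

    public enum State { CREATED, ENRICHED, RENDERED, ROUTED, DELIVERED, FAILED }

    private final String messageId;
    private final String campaignId;
    private final String channel;               // PUSH, SMS or EMAIL
    private final Map<String, Object> context;  // everything needed to render and route
    private State state = State.CREATED;

    public Message(String messageId, String campaignId, String channel,
                   Map<String, Object> context) {
        this.messageId = messageId;
        this.campaignId = campaignId;
        this.channel = channel;
        this.context = context;
    }

    // Any processor can advance the message because it carries no outside state;
    // persisting each transition is what lets a stuck (non-final) message be re-picked.
    public void transitionTo(State next) {
        this.state = next;
    }

    public State getState() {
        return state;
    }
}
```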
The platform was built as a business-agnostic communication engine that lets users build templates with a rendering engine and send communications via multiple channels. The system tracks open rates, CTAs (click-through actions) and conversions, but is not aware of other services. This seemed like a good idea, as the communication engine becomes an isolated system with no dependency on anything else. It was designed as a push-based system: other services send an event to the platform, which then takes it forward. The whole process is asynchronous, with users having the option to provide a webhook or a queue (topic) to be notified when a communication is routed, for cases with strong consistency requirements.
Tech Stack
As Swiggy is a Java shop, we chose Java 8 as our development language, with a focus on its functional capabilities. We chose Spring Boot as our microservice framework, or more specifically an abstraction we wrote over Spring Boot (swiggy-commons), as it gives us stronger capabilities for caching, auditing, pipeline building and functional programming in general. We did not want a hard dependency on any transport protocol, so we kept that very flexible. As the system was designed as a collection of microservices, we needed a way to stitch them together. One way was to do it over HTTP or RPCs; the issue is that these are push-based, and a very high throughput can potentially bring down the entire system. We wanted the system to be pull-based and hence went with queues. MySQL is used as the transactional store and Elasticsearch as the NoSQL store for aggregation, reporting and ad-hoc queries.
Core Architecture
As we can see from the diagram, each microservice is divided into three parts. The integration layer receives messages from the outside world to process; it supports both push-based (HTTP) and pull-based (queue) integration. Once messages are received and deserialized into the registered format, they are handed to the core logic layer. After messages are processed and transformed, they are passed to the consumer layer, which sends them out. The benefit of this design is that the core logic layer does not depend on the integration layers, so we can compose the logic layer with multiple integration layers. For example, we can receive a message over RabbitMQ but send data out over Kafka, which gives us very strong integration capability. In addition, we can perform filtering both before messages hit the core logic and after the logic has been applied. We call this construct a pipeline, modelled along the lines of many EIPs (Enterprise Integration Patterns).
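As a rough illustration of how the three layers compose, the sketch below separates the integration, core logic and consumer concerns behind small interfaces; the interface and class names are hypothetical, not the actual swiggy-commons API.

```java
import java.util.function.Function;

// Rough sketch of the three-layer split described above. The core logic is a plain
// Function, so the same logic can be wired to any combination of integration and
// consumer layers (e.g. receive over RabbitMQ, send to Kafka).
public final class Pipeline {

    public interface IntegrationLayer<I> {   // push (HTTP) or pull (queue) input
        I receive();
    }

    public interface ConsumerLayer<O> {      // outbound side, e.g. a Kafka producer
        void send(O output);
    }

    public static <I, O> void run(IntegrationLayer<I> in,
                                  Function<I, O> coreLogic,
                                  ConsumerLayer<O> out) {
        // Pre/post filters could be composed onto coreLogic via Function.compose/andThen.
        out.send(coreLogic.apply(in.receive()));
    }

    private Pipeline() { }
}
```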
System Design
Interactions in our system happen over a RabbitMQ backbone. Our communication platform is built out of 8 microservices, each performing a very specific task. Before any communication can be sent out, a few steps have to be completed: a campaign has to be created and the templates to be sent have to be attached to it. We provide a user interface for these operations. While defining the campaign, the user specifies whether it is event-based or schedule-based. An event-based campaign triggers when a certain event is fired in the system and received by the communication platform; user information has to be part of this event. For a schedule-based campaign, the user simply specifies when the communication should go out. This schedule can be a one-time trigger or a cron. In addition, the user provides a way to retrieve user information. An external aggregation system (Grand Maester) has been built for this very purpose. As we do not want different teams to collect information about users they don't own, we built a central aggregation service where all teams can register their APIs, which are then used to retrieve user info dynamically. This system also helps us filter, slice and dice user sets without each team implementing this independently.
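Purely as an illustration, the sketch below shows the kind of information a campaign definition captures. The real platform configures this through a UI, and every name here is an assumption.

```java
import java.util.List;

// Hypothetical sketch of what a campaign definition captures; all names here are
// assumptions made for illustration.
public class CampaignDefinition {

    public enum Trigger { EVENT_BASED, SCHEDULE_BASED }

    private final String name;
    private final Trigger trigger;
    private final String eventName;        // which event fires an EVENT_BASED campaign
    private final String cronExpression;   // when a SCHEDULE_BASED campaign runs (one-time or cron)
    private final String userInfoApi;      // Grand Maester API used to fetch user information
    private final List<String> templateIds;

    public CampaignDefinition(String name, Trigger trigger, String eventName,
                              String cronExpression, String userInfoApi,
                              List<String> templateIds) {
        this.name = name;
        this.trigger = trigger;
        this.eventName = eventName;
        this.cronExpression = cronExpression;
        this.userInfoApi = userInfoApi;
        this.templateIds = templateIds;
    }
}
```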
Once campaigns are created, they get triggered based on their behavior. The event microservice is responsible for receiving events, classifying them into different categories and enriching them with user information and anything else required. Once all the information is embedded, it sends messages to the campaign service, which applies campaign-specific information to each message and then forwards it to the templating system.
The templating system renders the templates along with tracking information and sends the messages ahead to the router. The router determines the type of communication and pushes it to the right gateway.
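A simplified sketch of that routing step might look like the following; the Router and Gateway types shown are hypothetical.

```java
import java.util.Map;

// Illustrative sketch of the routing step: pick the gateway registered for the
// message's channel and dispatch the already rendered payload through it.
public class Router {

    public interface Gateway {
        void dispatch(String renderedPayload);
    }

    private final Map<String, Gateway> gateways; // keyed by channel: PUSH, SMS, EMAIL

    public Router(Map<String, Gateway> gateways) {
        this.gateways = gateways;
    }

    public void route(String channel, String renderedPayload) {
        Gateway gateway = gateways.get(channel);
        if (gateway == null) {
            throw new IllegalArgumentException("No gateway registered for channel " + channel);
        }
        gateway.dispatch(renderedPayload);
    }
}
```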
Fallback and Retrial
Fallback and retry for a message are important for high reliability and consistency. A message may fail for multiple reasons: the external router may fail for some unknown reason, or rendering may fail because of insufficient information. There are two ways to handle retry and fallback, external or internal to the message object. With external handling, the message object does not contain fallback templates or retry counts; if a message fails, we go to our persistent store, rebuild the state for the message and work out the next step. The problem with this approach is that it can be slow and complicated, and our database needs to scale for ad-hoc queries on append-only tables. We decided to go with internal handling. Let's take an example: suppose we want to send a communication with template T1, retry twice if it fails, and then fall back to template T2 with its own retries. This can be stored in the message as the deque [T1, T1, T1, T2, T2]. When we need to render a template, we pull an element from the deque, render it and send the communication. If something fails, we can process the message object and figure out the next steps; for example, on a rendering error we may not want to retry at all, but go directly to the fallback template.
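A minimal sketch of this deque-based retry and fallback plan, with hypothetical class and method names, could look like this:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Sketch of the in-message retry/fallback plan: T1 plus two retries, then fallback T2.
public class RenderPlan {

    private final Deque<String> attempts;

    public RenderPlan(List<String> plan) {           // e.g. [T1, T1, T1, T2, T2]
        this.attempts = new ArrayDeque<>(plan);
    }

    // Pull the next template to render; null means all retries and fallbacks are exhausted.
    public String next() {
        return attempts.poll();
    }

    // On a rendering error we may want to drop the remaining retries of the failed
    // template and go straight to the fallback.
    public void skipRetriesOf(String failedTemplate) {
        while (failedTemplate.equals(attempts.peek())) {
            attempts.poll();
        }
    }

    public static void main(String[] args) {
        RenderPlan plan = new RenderPlan(Arrays.asList("T1", "T1", "T1", "T2", "T2"));
        System.out.println(plan.next());             // T1 -> first attempt
        plan.skipRetriesOf("T1");                    // rendering error: skip T1 retries
        System.out.println(plan.next());             // T2 -> fallback attempt
    }
}
```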
Scheduling
Scheduling was one of the major problems we wanted to solve. It was required for triggering scheduled communications, time-based events on the FSMs, and reporting. The system had to be distributed and fault tolerant. As these requirements were not specific to the communication platform, this was built as a standalone system called Samay.
We used Quartz as the underlying scheduling technology for storing events. A wrapper system was built over it to allow users to register crons, one-time triggers etc. and to get notified when these events fire. Users can store metadata with the event for processing, and can define a queue, topic or webhook as the notification format. The event itself can be registered in two different modes, Ack and No-Ack. For Ack events, the consumer has to send back an acknowledgment confirming successful receipt of the event; this acknowledgment can be asynchronous as well. In No-Ack mode there is no need for an acknowledgment. The system is built as a ‘jar’ and embedded in all of our microservices. The benefit of this is that if any of our microservices is up, our scheduler engine is up; we don't need extra resources to keep the system running. And since the system just receives an event from Quartz and sends it to the consumers, it does not take up many resources of the parent microservice either.
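As an illustration, registering an hourly trigger directly with Quartz looks roughly like the snippet below. The Samay wrapper adds the queue/topic/webhook notification and Ack semantics on top of this; the job class, job data and queue name used here are hypothetical.

```java
import org.quartz.Job;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.SchedulerException;
import org.quartz.Trigger;

import static org.quartz.CronScheduleBuilder.cronSchedule;
import static org.quartz.JobBuilder.newJob;
import static org.quartz.TriggerBuilder.newTrigger;

public class ScheduleExample {

    // A Quartz job that, in this design, would publish a notification to the
    // registered queue, topic or webhook when the trigger fires.
    public static class NotifyJob implements Job {
        @Override
        public void execute(JobExecutionContext ctx) {
            String callback = ctx.getJobDetail().getJobDataMap().getString("callbackQueue");
            // publish the event metadata to `callback` here (omitted)
        }
    }

    public static void register(Scheduler scheduler) throws SchedulerException {
        JobDetail job = newJob(NotifyJob.class)
                .withIdentity("hourly-report-aggregation")
                .usingJobData("callbackQueue", "reporting.events")   // consumer-provided metadata
                .build();

        Trigger trigger = newTrigger()
                .withSchedule(cronSchedule("0 0 * * * ?"))           // top of every hour
                .build();

        scheduler.scheduleJob(job, trigger);
    }
}
```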
Reporting
The system provides hourly, daily, weekly and monthly reporting at the campaign level. For email and push notifications, the system provides information about dispatch rate, failure rate, delivery rate, open rate and click rate. A cron runs every hour to aggregate this information and store it in a reportable format. User-level information is also available on request as a Map-Reduce job. This gives very detailed tracking for each campaign.
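For illustration, the hourly aggregate for a campaign might boil down to something like the class below; the exact fields and the denominators used for each rate are assumptions, not the platform's actual definitions.

```java
// Hypothetical shape of an hourly per-campaign aggregate; the formulae for each
// rate are assumptions about how the report is computed.
public class CampaignHourlyReport {

    private final long dispatched;
    private final long failed;
    private final long delivered;
    private final long opened;
    private final long clicked;

    public CampaignHourlyReport(long dispatched, long failed, long delivered,
                                long opened, long clicked) {
        this.dispatched = dispatched;
        this.failed = failed;
        this.delivered = delivered;
        this.opened = opened;
        this.clicked = clicked;
    }

    public double failRate()     { return rate(failed, dispatched); }
    public double deliveryRate() { return rate(delivered, dispatched); }
    public double openRate()     { return rate(opened, delivered); }
    public double clickRate()    { return rate(clicked, delivered); }

    private static double rate(long numerator, long denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }
}
```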
Performance
We tested our system on 4 nodes (m4.xlarge) with one instance of each microservice and were able to process 1 million messages in a 15-minute window. With multiple instances of each microservice, this time can be brought down significantly. Most of the processing is done asynchronously, with each core logic system maintaining its own thread pool. All services were deployed on Tomcat 7 containers.
— Post contributed by Siddhant Srivastava, Software Developer at Swiggy