Image source: undraw.co

Architecture and Design Principles Behind Swiggy’s Delivery Partners App

Narendra
Swiggy Bytes — Tech Blog
10 min read, Nov 13, 2019

This article is the third in a series that covers how the Mobile team at Swiggy built the Delivery Partner app. The first article covers the need, the challenges involved, the research conducted and how the team got together to build the new Delivery Partners app. The second article covers identifying the requirements, the tech evaluations and why we chose React Native.

In this article, we walk through the core design principles and design patterns we followed, the high-level design of the app, and how we approached scalability, real-time logs, metrics and stability.

First things first: The Design principles

We wanted to set a high bar for codebase quality, reuse code wherever possible (we are lazy programmers!), take a forward-looking approach that minimises rework, and support faster rollouts, A/B experimentation, plug-play-unplug use cases, unit testing and agile integration of third-party modules. With these goals in mind, we laid out the principles below to stick to while building this app (just as celestial bodies revolve around the Sun, our codebase revolves around these principles).

Core Design Principles (in-depth explanation below)

Keep it Simple and Stupid (KISS)

To stay in constant sync with our servers, our mobile application process needs to run in the background for long durations (on average 4–5 hours per session). It handles data such as the location of our delivery partners (needed to assign trips, track an order in real time, calculate payment based on distance travelled, etc.), order info (state of the order, item info, restaurant or store info, customer info, etc.), and earnings and incentives info. These components also interact with device resources such as GPS, network, Bluetooth and other sensors for location tracking, proximity detection, distance measurement, activity recognition, etc. Performing these long-running, heavy operations on a single thread (the UI/JS thread) is inefficient and would eventually become a bottleneck for UI operations, causing a jittery, laggy experience. So, to keep things simple, we decided to build only the UI components in React Native, and the long-running, background-intensive and data storage components in Native (Android/iOS).

Division of Components based on key criteria

Separation of Concerns

Our application is feature-rich, and serving the data needed to complete each delivery workflow is complex. It involves handling user interactions, business logic, data sync and storage, network transactions, and instrumentation of data such as the user journey, touchpoints and real-time health metrics, to name a few concerns. Sprinkling these concerns all over the codebase would make it a developer’s nightmare to change, clean up or revamp the code related to any one of them. Instead, we separated each concern by creating a module or exposing a framework, so that dependent modules can ignore the implementation details and the code stays extensible.

Concerns involved in the App.

Modular and DRY (Do not repeat yourself)

At Swiggy, we often work on multiple projects in which the components we build can be shared across multiple apps (Consumer, Delivery, Vendor, Daily, etc.). Repeating the same code means repeating the development cycle, QA cycle, maintenance, stability monitoring, etc. To avoid these cycles and fast-track production releases, we build every component, however small, to be agnostic of any specific application. Based on this principle, we decided to build the modules below, a couple of which are shared with other apps.

Inversion of Control

Following the above principles, we built smaller components with clear responsibilities and separation of concerns. But as we build more components in the future, the interaction between these components, dependency resolution and the flow of control will become cumbersome. So, we relied on this principle, which is analogous to the Hollywood principle:

“Don’t call us, we will call you”.

Based on this principle, the flow of control is handled by external components rather than by the caller, which reduces the complexity of managing that control. To decouple dependency resolution from the callers, and to scope components better based on duty state, app state, etc., we integrated Dagger2 for dependency injection. To delegate the flow to other components rather than burdening the caller, we built a few components as frameworks which invoke the caller’s functions at the appropriate time, or which rely on events to trigger the flow. An example of this is the Retry framework, which callers can use to retry a function until it executes successfully, based on attributes like time, strategies, retry attempts, etc. Overall, applying this principle helped us create smaller reusable components while reducing the problems that come with handling too many components.
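To make the "don't call us, we will call you" idea concrete, here is a minimal TypeScript sketch of a retry helper that owns the control flow and calls back into the caller's function. It is only an illustration: the names runWithRetry, RetryOptions and RetryOutcome are hypothetical, it is synchronous for brevity, and it does not reflect the actual API of the Retry framework described above.

```typescript
// Hypothetical names: the article does not show the real Retry framework's API.
type RetryOutcome<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; error: unknown; attempts: number };

interface RetryOptions {
  maxAttempts: number; // upper bound on how many times the task is invoked
  shouldRetry?: (error: unknown) => boolean; // optional veto on further attempts
}

// The helper owns the loop: callers hand over a task and get called back,
// instead of each caller writing its own retry logic.
function runWithRetry<T>(task: (attempt: number) => T, options: RetryOptions): RetryOutcome<T> {
  let lastError: unknown = undefined;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return { ok: true, value: task(attempt), attempts: attempt };
    } catch (error) {
      lastError = error;
      if (options.shouldRetry !== undefined && !options.shouldRetry(error)) {
        return { ok: false, error: error, attempts: attempt };
      }
    }
  }
  return { ok: false, error: lastError, attempts: options.maxAttempts };
}

// Usage: a task that fails twice (simulating a flaky network) and then succeeds.
const syncOutcome = runWithRetry(function (attempt) {
  if (attempt < 3) {
    throw new Error("network unavailable");
  }
  return "order status synced";
}, { maxAttempts: 5 });
```

The caller only describes what to do and how many attempts are acceptable; when and how often the task runs is decided by the helper, which is the inversion the principle refers to.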

High-Level Design

Based on our requirements and the above principles, we laid out the blueprint of the app before writing any code. The diagram below shows the high-level design and the major components involved.

Blueprint of the App

The Design patterns

Unidirectional Data Flow:

Since we decided to build our sync modules and storage components in Native and the UI in React Native (based on the above principles), we also needed to stick to a pattern for how data flows across these components. Taking inspiration from the Flux pattern, and especially from the way Redux works, we designed the data flow blueprint below (server → storage → redux store → UI components).

Blueprint of the data flow across the components

So, the storage components in Native act as the single source of truth (w.r.t. the client), and the redux store in React Native is updated only after the native storage components have been updated from the server. Any user interaction that triggers a data modification (e.g. a status update of an order) is not dispatched directly to the redux store. Instead, it is passed on to the sync module in Native, which makes the request to the server, updates the storage components accordingly, and publishes the updates, based on which the redux store is modified. This helped us avoid inconsistent data across components and keep the data reliable.
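The flow described above can be sketched in a few lines of TypeScript. This is a simplified, synchronous model under stated assumptions: OrderStorage and SyncModule stand in for the native storage and sync components, the server call is a plain function, and every name is hypothetical rather than taken from the actual app.

```typescript
// All names are hypothetical; this only mirrors the flow
// server -> native storage -> redux-style store -> UI.
type OrderStatus = "ASSIGNED" | "PICKED_UP" | "DELIVERED";
interface OrderState { orderId: string; status: OrderStatus }
type StorageListener = (state: OrderState) => void;

// Stand-in for the native storage component: the client's single source of truth.
class OrderStorage {
  private state: OrderState;
  private listeners: StorageListener[] = [];
  constructor(initial: OrderState) { this.state = initial; }
  read(): OrderState { return this.state; }
  subscribe(listener: StorageListener): void { this.listeners.push(listener); }
  // Intended to be called only by the sync module, after the server has confirmed.
  write(next: OrderState): void {
    this.state = next;
    this.listeners.forEach((listener) => listener(next));
  }
}

// Stand-in for the server request; a real one would be asynchronous and could fail.
type ServerCall = (orderId: string, status: OrderStatus) => OrderState;

// Stand-in for the native sync module: server first, then storage.
class SyncModule {
  private storage: OrderStorage;
  private server: ServerCall;
  constructor(storage: OrderStorage, server: ServerCall) {
    this.storage = storage;
    this.server = server;
  }
  requestStatusUpdate(status: OrderStatus): void {
    const confirmed = this.server(this.storage.read().orderId, status);
    this.storage.write(confirmed);
  }
}

// Redux-style reducer on the UI side: state changes only via published actions.
type OrderAction = { type: "ORDER_SYNCED"; payload: OrderState };
function orderReducer(state: OrderState, action: OrderAction): OrderState {
  return action.type === "ORDER_SYNCED" ? action.payload : state;
}

// Wiring: the UI asks the sync module for a change and never edits uiState itself.
const orderStorage = new OrderStorage({ orderId: "order-1", status: "ASSIGNED" });
let uiState: OrderState = orderStorage.read();
orderStorage.subscribe((next) => {
  uiState = orderReducer(uiState, { type: "ORDER_SYNCED", payload: next });
});
const syncModule = new SyncModule(orderStorage, (orderId, status) => ({ orderId: orderId, status: status }));
syncModule.requestStatusUpdate("PICKED_UP");
```

Because the UI state is only ever derived from what storage publishes, the store and storage cannot drift apart in this model, which is the property the pattern is meant to guarantee.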

Pub-Sub Pattern:

For the most part, our application is event-driven and honours a finite state machine for each delivery flow. Whenever an event or state transition occurs, we might need to perform tasks such as playing a sound and vibrating the phone as soon as an order is assigned, showing a confirmation pop-up on the screen, uploading logs for real-time debugging, or alerting the partner when the battery level is critical or the network is poor. Below is an example showcasing the usage of this pattern in one of our core components.

An example showcasing the usage of pub-sub pattern in one of our components

To simplify the reactive mechanisms mentioned above, we use RxJava extensively across all components for publishing events to subscribers, streaming data, asynchronous and background execution of a task, buffering or throttling of events, periodic scheduling of a job, etc.
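As a rough illustration of the pub-sub pattern itself, the TypeScript sketch below uses a small hand-rolled event bus rather than RxJava. The event names and side effects are hypothetical examples modelled on the tasks listed above, not the app's real events.

```typescript
// Hypothetical event names; a hand-rolled bus standing in for RxJava-based publishing.
type DeliveryEvent = "ORDER_ASSIGNED" | "BATTERY_CRITICAL" | "POOR_NETWORK";
type EventHandler = (event: DeliveryEvent) => void;

class EventBus {
  private handlers: Record<DeliveryEvent, EventHandler[]> = {
    ORDER_ASSIGNED: [],
    BATTERY_CRITICAL: [],
    POOR_NETWORK: [],
  };

  // Returns an unsubscribe function so a listener can be unplugged cleanly.
  subscribe(event: DeliveryEvent, handler: EventHandler): () => void {
    this.handlers[event].push(handler);
    return () => {
      this.handlers[event] = this.handlers[event].filter((h) => h !== handler);
    };
  }

  publish(event: DeliveryEvent): void {
    // Iterate over a copy so handlers that unsubscribe during publish do not affect this round.
    this.handlers[event].slice().forEach((handler) => handler(event));
  }
}

// Usage: two independent subscribers react to the same event.
const deliveryBus = new EventBus();
const sideEffects: string[] = [];
deliveryBus.subscribe("ORDER_ASSIGNED", () => { sideEffects.push("play sound"); });
const stopPopup = deliveryBus.subscribe("ORDER_ASSIGNED", () => { sideEffects.push("show pop-up"); });

deliveryBus.publish("ORDER_ASSIGNED"); // both subscribers run
stopPopup();                           // the pop-up subscriber is unplugged
deliveryBus.publish("ORDER_ASSIGNED"); // only the sound subscriber runs
```

The publisher knows nothing about sounds or pop-ups, so new reactions can be plugged in or removed without touching the component that raises the event.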

Scalability and Real-time Aspects

Having decided on the design principles and patterns explained above, we also had to consider the aspects detailed below before taking the application live.

Supporting offline/poor network conditions

Our partners spend most of their day on the road, delivering orders with a smile, so we wanted to use tech to make their workflow on the app buttery-smooth, even in flaky or offline network conditions. To achieve this, we made our critical flows lighter, removed complex flows from the critical path, and built retry flows, lazy sync mechanisms and BCP (business continuity plan) modes.

HLD for supporting offline/poor network scenarios

To make the above design more robust and to reuse the same setup across flows and apps, we built our lazy sync module and retry framework to be application-agnostic. Each takes a configuration consisting of the function to perform and metadata such as the maximum number of attempts (retry limit), the timeout duration, policies (exponential, linear, fibonacci…), conditions and criteria, etc.
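As an illustration of how such a configuration might translate into waiting times, the TypeScript sketch below computes a delay schedule for the three named policies. The field names (baseDelayMs, maxDelayMs) and the exact formula for each policy are assumptions made for this example; the article does not specify how the real framework defines them.

```typescript
// Hypothetical configuration shape; field names and formulas are assumptions.
type BackoffPolicy = "linear" | "exponential" | "fibonacci";

interface RetryConfig {
  maxAttempts: number; // retry limit
  baseDelayMs: number; // unit of delay that each policy scales
  maxDelayMs: number;  // cap so delays never grow without bound
  policy: BackoffPolicy;
}

// n-th Fibonacci number with fib(1) = fib(2) = 1.
function fibonacciNumber(n: number): number {
  let previous = 0;
  let current = 1;
  for (let i = 1; i < n; i++) {
    const next = previous + current;
    previous = current;
    current = next;
  }
  return current;
}

// How each policy scales the base delay for a 1-based attempt number.
const policyMultipliers: Record<BackoffPolicy, (attempt: number) => number> = {
  linear: (attempt) => attempt,
  exponential: (attempt) => Math.pow(2, attempt - 1),
  fibonacci: (attempt) => fibonacciNumber(attempt),
};

// Delay to wait before the given retry attempt, capped at maxDelayMs.
function delayBeforeRetry(config: RetryConfig, attempt: number): number {
  const multiplier = policyMultipliers[config.policy](attempt);
  return Math.min(config.baseDelayMs * multiplier, config.maxDelayMs);
}

// Full schedule of delays for all attempts allowed by the configuration.
function retrySchedule(config: RetryConfig): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= config.maxAttempts; attempt++) {
    delays.push(delayBeforeRetry(config, attempt));
  }
  return delays;
}

// Usage: exponential backoff with a 1 s base and a 10 s cap.
const exponentialSchedule = retrySchedule({ maxAttempts: 6, baseDelayMs: 1000, maxDelayMs: 10000, policy: "exponential" });
// exponentialSchedule is [1000, 2000, 4000, 8000, 10000, 10000]
```

Keeping the policy as data rather than code is what lets the same framework serve different flows and apps: a caller changes the configuration, not the retry logic.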

Real-time Debugging

At Swiggy, we operate at tremendous scale, so an issue affecting one delivery partner impacts N orders/customers, where N can be any number as we continue to grow at a rapid pace (in practice, it may be derived from parameters like the demand-supply ratio, growth rate, time of day, duration of the issue, etc.). As our scale increases, our turnaround time to identify the cause, length and breadth of an issue, and to find a resolution, must decrease to minimise the impact. To debug issues on the mobile application, we need to analyse data such as user interactions, control/data flow, network interactions (API calls made), application state, device information, user and zone level information, etc. This data must also be reported in real time, because every minute counts during an issue. To meet this requirement, we built a custom Logger solution which can be configured in real time to report specific data for a flow. Below is the high-level design of our solution.

HLD of Logger which records and reports real-time/historical logs collected

The above solution helps us collect data whenever we are notified of an alert (by PagerDuty/NewRelic/Firebase Crashlytics), when a fault is reported by the detection mechanisms baked into the app for critical flows, or when an issue is reported by the Operations team from the ground. Based on the logs collected and the real-time metrics (explained below), our team analyses the data quickly to resolve the issue faster.
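To illustrate what "configured in real time" could mean in practice, here is a small TypeScript sketch of a logger whose reporting rules can be swapped at runtime. It is a guess at the general shape, not the design of the actual Logger: the class, the config fields (enabledFlows, minLevel) and the flow names are all hypothetical.

```typescript
// Hypothetical names; the article does not describe the Logger's real interface.
type LogLevel = "debug" | "info" | "error";
interface LogEntry { flow: string; level: LogLevel; message: string; timestamp: number }

interface LoggerConfig {
  enabledFlows: string[]; // flows whose logs should currently be reported
  minLevel: LogLevel;     // least severe level worth reporting
}

const LEVEL_ORDER: Record<LogLevel, number> = { debug: 0, info: 1, error: 2 };

class ConfigurableLogger {
  private buffer: LogEntry[] = [];
  private config: LoggerConfig;
  private upload: (batch: LogEntry[]) => void;

  constructor(config: LoggerConfig, upload: (batch: LogEntry[]) => void) {
    this.config = config;
    this.upload = upload;
  }

  // Called when a new configuration arrives (for example, pushed during an incident).
  updateConfig(config: LoggerConfig): void {
    this.config = config;
  }

  // Keeps an entry only if its flow is enabled and its level is severe enough.
  log(flow: string, level: LogLevel, message: string): void {
    const flowEnabled = this.config.enabledFlows.indexOf(flow) !== -1;
    const severeEnough = LEVEL_ORDER[level] >= LEVEL_ORDER[this.config.minLevel];
    if (flowEnabled && severeEnough) {
      this.buffer.push({ flow: flow, level: level, message: message, timestamp: Date.now() });
    }
  }

  // Hands the buffered entries to the uploader and starts a fresh buffer.
  flush(): void {
    if (this.buffer.length === 0) {
      return;
    }
    this.upload(this.buffer);
    this.buffer = [];
  }
}

// Usage: start by reporting only important pickup logs, then widen the net mid-incident.
const uploadedBatches: LogEntry[][] = [];
const appLogger = new ConfigurableLogger(
  { enabledFlows: ["order_pickup"], minLevel: "info" },
  (batch) => { uploadedBatches.push(batch); }
);
appLogger.log("order_pickup", "debug", "button rendered");  // dropped: below minLevel
appLogger.log("order_pickup", "error", "status update failed"); // kept
appLogger.log("login", "error", "token expired");            // dropped: flow not enabled
appLogger.flush();

appLogger.updateConfig({ enabledFlows: ["order_pickup", "login"], minLevel: "debug" });
appLogger.log("login", "debug", "retrying token refresh");   // kept after the config change
appLogger.flush();
```

Filtering at the source keeps the volume of reported logs small during normal operation while still allowing detailed data for a specific flow to be switched on when an alert fires.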

Real-time Health Monitoring

We operate 24x7 all year round, so it is important to keep a constant eye on the application’s health. Tracking network metrics, fatal/non-fatal errors and operational metrics every minute is crucial for our workflow. Below, we explain in depth how we constantly monitor each of these metrics.

Monitoring Network Metrics:

To track network metrics for all mobile applications across Swiggy, we’ve integrated NewRelic. It tracks info such as network error rates, traffic and API response times, offers real-time as well as historical insights, alerts us when any metric breaches its threshold, and lets us create custom dashboards for a holistic picture of the application over any timeframe. Below is a snapshot of the real-time dashboard we created for our team to keep a constant eye on, and to refer to quickly during an ongoing issue.

Real-time dashboard (above are a couple of snapshots only)

Monitoring Fatal/Non-Fatal Errors:

Firebase Crashlytics helps in tracking fatal and non-fatal errors, and records vital information such as device details and user identity. It also stitches Firebase Analytics data into the same dashboard for each log, making it easier to understand the user journey. It even offers real-time alerting based on velocity configuration through multiple channels (we configured PagerDuty and Slack for the same).

Realtime Monitoring and Debugging of the App

Configurations

For any feature we roll out, we plan to keep a config for turning it on/off and for limiting the feature to a subset of users based on properties like the city, zone, OS version, device, etc. Nothing fits our requirements better than Firebase Remote Config, which gives us the flexibility to modify configs in real time based on attributes such as user properties, device info, etc.

Firebase remote config dashboard
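The targeting idea behind such configs (on/off plus a subset of users) can be sketched as a plain TypeScript function. To be clear, this is not the Firebase Remote Config API: the FeatureRule and UserContext shapes, the field names and the feature itself are hypothetical, and only show how a rule of this kind might be evaluated.

```typescript
// Hypothetical shapes for illustration only; this is not Firebase's API.
interface UserContext { city: string; zone: string; osVersion: number; deviceModel: string }

interface FeatureRule {
  enabled: boolean;          // master on/off switch
  cities?: string[];         // if present, the user must be in one of these cities
  zones?: string[];          // if present, the user must be in one of these zones
  minOsVersion?: number;     // if present, the minimum supported OS version
  blockedDevices?: string[]; // if present, device models excluded from the feature
}

// A feature is on only if the master switch is on and every present condition passes.
function isFeatureEnabled(rule: FeatureRule, user: UserContext): boolean {
  if (!rule.enabled) {
    return false;
  }
  if (rule.cities !== undefined && rule.cities.indexOf(user.city) === -1) {
    return false;
  }
  if (rule.zones !== undefined && rule.zones.indexOf(user.zone) === -1) {
    return false;
  }
  if (rule.minOsVersion !== undefined && user.osVersion < rule.minOsVersion) {
    return false;
  }
  if (rule.blockedDevices !== undefined && rule.blockedDevices.indexOf(user.deviceModel) !== -1) {
    return false;
  }
  return true;
}

// Usage: a feature limited to one city and to newer OS versions.
const newEarningsScreen: FeatureRule = { enabled: true, cities: ["Bengaluru"], minOsVersion: 8 };
const partnerInBengaluru: UserContext = { city: "Bengaluru", zone: "Koramangala", osVersion: 9, deviceModel: "model-a" };
const partnerInMumbai: UserContext = { city: "Mumbai", zone: "Andheri", osVersion: 9, deviceModel: "model-a" };

const enabledInBengaluru = isFeatureEnabled(newEarningsScreen, partnerInBengaluru); // true
const enabledInMumbai = isFeatureEnabled(newEarningsScreen, partnerInMumbai);       // false
```

Treating absent conditions as "no restriction" means a rule with only the master switch set behaves as a simple global on/off flag.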

Stability

As we handle one of the largest delivery fleets across India and process millions of orders daily, it is super important to be stable. So, to mitigate any risk while migrating from the older version to this application, we rolled out the new application to the entire fleet slowly and steadily, making changes iteratively based on the inputs collected and improving stability with each update. Now that we are live for 100% of the fleet, our application is stable for 99.9% of users. We were able to achieve this by sticking to our core principles, understanding the capabilities of React Native, having real-time monitoring in place before GTM (go to market), setting up fallback mechanisms and BCP (business continuity plan) modes for use during an issue, and above all, having a Super-Duper team handling our app.

Stability of our latest version of the App.

Summary

Overall, we talked about how we designed our new delivery application and addressed its major technical challenges. While some of these tech choices, like the split between React Native and Native, may look opinionated, they were made to leverage each framework’s core strengths and offerings. Working around each framework’s pain points helped us get the best of both worlds. This also aligns with our core engineering principles (the selection criteria for any framework/tech should be based on current and near-future business needs and the customer base). I hope you learned a thing or two from this article.

Wait! We are not done yet. Stay tuned for our next article, where we will talk about how we are leveraging nearby and BLE beacons for indoor proximity detection.

I’m Narendra from the Mobile team at Swiggy. If you’re craving innovation, join our team before Thanos revives and snaps his fingers again. We are hiring!
