Synthesize 2023: Breaking the Data Bottleneck
Video description
Managing sensitive data at scale in e-commerce using synthetic data
Transcription
Vijay Seshadri (00:14):
Hi everyone. Good morning, good afternoon, good evening, depending on where you are. My name is Vijay Seshadri. I work for a company called Swiggy, which is an on-demand convenience platform based out of India, and with a very brief introduction I wanted to talk a little bit about some of the challenges that companies of our size and scale face when using data analytics, both for human-driven and machine-driven decision making. So I've been with Swiggy for the last four years, and my role is to have technical oversight over our backend infrastructure. Prior to Swiggy I was with Amazon as a principal engineer, both in AWS security and in the retail business. And before that I was building enterprise security products for over 15 years in the Bay Area. My introduction to big data processing was sometime in 2007 or 2008, when we had to build our first cloud-based security analytics solution. Since then I've been dabbling in the data analytics space, both at Amazon and here at Swiggy, and I'm pretty excited to be part of the Synthesize conference and look forward to sharing my insights and thoughts. So with that, let's get started.
(01:52)
So here's the agenda. I suspect most of you may not be familiar with who we are, so we'll do a very brief introduction to the company. Then I want to lay out our data processing architecture as a foundation, by taking a very specific predictive analytics problem that we have and walking through how the data flows. With this background, we'll talk about the challenges that we face in sensitive data management. I want to say that I don't think these problems are unique to Swiggy. Most companies of our size and scale are going to face very similar challenges, especially when they're invested in data analytics and AI-based decision making; I think these problems are going to be pretty common. We'll then go through an overview of some of the solutions that we have developed here at Swiggy.
(02:57)
I do want to say that, especially when it comes to sensitive data management at scale, I feel we are in the very early stages. There are some interesting, hard problems to be solved in this space, and I'm looking forward to companies like Gretel taking the leadership here and helping companies like us manage our sensitive data in a more secure and transparent manner. So that's the agenda. Let's start with who we are. Swiggy is India's on-demand convenience platform, and what that means is we are a three-way marketplace. We've got restaurant partners and stores on one side, the service providers, who are offering products and services on our platform. We have consumers who are generating the demand, in terms of wanting to discover and place orders for those products. And then we also have an on-demand delivery fleet of over 300,000 delivery executives who are able to pick up the food item or the grocery item from the service provider's location and deliver it to the customer.
(04:23)
And the reason this is on demand is that our average delivery time is 30 minutes, which means that all of the orchestration that happens in terms of fulfilling the order needs to happen in a relatively short span of time. This presents a lot of different challenges, especially as it relates to data processing. So with that background, here is a quick review of our products. The company was founded as a food delivery marketplace, so this is our oldest and one of our largest businesses. In addition, we launched Instamart, which is our grocery delivery service, and we also have a service called Genie, which is pick-up and drop-off of any item within the city. And as you can see, there are a bunch of other categories and products.
(05:22)
The reason I'm flashing this slide is that it has implications both for the types of data that we need to process and for the vendors and partners that we interface with across all of these business lines. So now let's look at what exactly hyper-local delivery is. I'm going to use the example of food delivery here and walk you through the different events that happen during order fulfillment. Typically, when we look at order delivery, we break it down into a few distinct components. The first component is what we call the first mile, which is the time it takes for a delivery partner to go from his or her location to the restaurant. And once the delivery partner reaches the restaurant, they typically have to wait until the food is prepared. As you can see, we are obviously parallelizing this operation, meaning that we relay the order to the restaurant while we find the right delivery partner to handle it, so that the food preparation happens in parallel to the first-mile commute.
(06:51)
Once the food is ready to be picked up, the delivery agent is able to pick it up and move to the customer's location. This is what we call the last mile, which is the time it takes for the delivery partner to go from the restaurant's location to the customer's location. So from a data science and analytics perspective, as you can see, there are a bunch of different predictions involved here. The first is that to be able to do accurate time predictions (this is not just about distance, it's about time), we need to have a very good view of the on-demand supply near that particular restaurant, so that we have a sense, in aggregate, of how long the first mile will take.
(07:47)
We also need to be able to predict the last mile here, and there are a lot of variations in ground conditions that can impact both the last-mile and first-mile events, including weather, special events, and changes to road conditions. Not to mention that in the last mile there's also a component of driver affinity; in other words, how familiar is the driver with the customer's location? So you can think of all of these as different features in the feature space that need to participate in producing the right predictions. Now, the prep time is also an interesting problem, because depending on the type of the restaurant, meaning the cuisine, and depending on the item, the prep time can vary greatly. There are also external factors that we don't directly control, including the load at the restaurant, which could include both online orders and offline orders. So all of those things we need to factor in to be able to predict the prep time for a given item or a set of items.
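To make that decomposition concrete, here is a minimal sketch of how the components he describes might be combined. The component values and the buffer constant are stand-ins; the talk does not describe the actual models or how they are assembled.

```python
from dataclasses import dataclass

@dataclass
class OrderContext:
    first_mile_minutes: float   # predicted travel time: delivery partner -> restaurant
    prep_minutes: float         # predicted food preparation time for the ordered items
    last_mile_minutes: float    # predicted travel time: restaurant -> customer
    buffer_minutes: float = 2.0 # handoff / pickup overhead (illustrative constant)

def estimated_delivery_minutes(ctx: OrderContext) -> float:
    """Combine the component predictions described in the talk.

    Food prep runs in parallel with the first-mile commute, so only the
    slower of the two gates the pickup; the last mile is always additive.
    """
    time_to_pickup = max(ctx.first_mile_minutes, ctx.prep_minutes)
    return time_to_pickup + ctx.buffer_minutes + ctx.last_mile_minutes

# Example: prep is the bottleneck here, so a faster first mile alone would not help.
print(estimated_delivery_minutes(OrderContext(8.0, 14.0, 12.0)))  # 28.0
```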
(09:01)
So the interesting thing here is that this whole prediction conforms to the Goldilocks principle, in the sense that we can't be overly aggressive when we predict these times for the customer. If we are very aggressive and show less than the actual delivery time, it could potentially be a bad order, in the sense that the customer has a negative experience because the order appears delayed. At the same time, if we are overly conservative and show a higher than actual estimated delivery time, that can mean a loss of the perception of speed. So we have to get it just right, and given the variances I talked about, it is indeed a pretty interesting challenge from a predictive analytics standpoint.
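One common way to encode that asymmetry when training such a model is a quantile (pinball) loss, where an over-optimistic estimate is penalized more than a conservative one. This is a generic sketch of that idea, not necessarily what Swiggy actually uses.

```python
def pinball_loss(actual_minutes: float, predicted_minutes: float, q: float = 0.7) -> float:
    """Quantile (pinball) loss.

    With q > 0.5, showing a time that turns out too optimistic (actual > predicted,
    i.e. a late order) is penalized more heavily than being too conservative,
    which only costs perceived speed.
    """
    error = actual_minutes - predicted_minutes
    return q * error if error >= 0 else (q - 1) * error

# Under-promising by 5 minutes hurts less than over-promising by 5 minutes.
print(pinball_loss(35, 40))  # conservative estimate: 1.5
print(pinball_loss(40, 35))  # aggressive estimate:   3.5
```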
(09:58)
So with that context in terms of what the AI problem is, I also want to talk about another scale-related challenge when it comes to showing estimated delivery times to our customers. Typically, when customers open our application, against each restaurant we show the estimated delivery time in minutes. What this means is that, from a given customer location, we need to figure out all the different restaurants that are actually serviceable at that particular time. And as you can imagine, this is not as simple as looking at a radial distance or Haversine distance, because road connectivity obviously matters here; even the type of road connectivity matters. So typically the way we look at this is we organize restaurants into different clusters based on reachability from the customer's location. That then allows us to compute all the different restaurants that are potentially serviceable from that location. To add to this, in dense locations in most of our cities there could be up to a thousand different restaurants that are serviceable from that customer's location at that time.
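A rough sketch of the lookup he describes, under the assumption that restaurants are pre-assigned offline to road-reachability clusters per customer zone. The cluster tables, zone names, and the scoring function are hypothetical placeholders.

```python
# Hypothetical, precomputed offline: for each customer zone, the clusters of
# restaurants that are reachable by road (not just within a radial distance).
REACHABLE_CLUSTERS = {
    "zone_koramangala": ["cluster_a", "cluster_b"],
}
CLUSTER_MEMBERS = {
    "cluster_a": ["rest_101", "rest_102"],
    "cluster_b": ["rest_201"],
}

def serviceable_restaurants(customer_zone: str) -> list[str]:
    """All restaurants whose cluster is road-reachable from the customer's zone."""
    restaurants: list[str] = []
    for cluster in REACHABLE_CLUSTERS.get(customer_zone, []):
        restaurants.extend(CLUSTER_MEMBERS.get(cluster, []))
    return restaurants

def predict_eta_minutes(customer_zone: str, restaurant_id: str) -> float:
    return 30.0  # stand-in for the real model; one prediction per customer-restaurant pair

# In a dense area this list can run to ~1,000 restaurants, and an ETA has to be
# predicted for every customer-restaurant pair before the listing page renders.
etas = {r: predict_eta_minutes("zone_koramangala", r)
        for r in serviceable_restaurants("zone_koramangala")}
print(etas)
```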
(11:26)
Now, what this means is that for every customer-restaurant pair, we need to be able to predict the estimated delivery time. And just to give you a sense of scale, New Year's Eve of last year, December 31st, was one of our highest-traffic days, and at peak we had to do about 2.2 million predictions per second. So in addition to the AI-related challenge I talked about earlier, there's also the need for us to operate at very high scale and be accurate at the same time. Okay, so with that context, let's look at our data processing architecture. There are two distinct components in our data processing architecture. One is realtime event ingestion, typically handling events like clickstream information coming from mobile applications, ad events, et cetera. This goes to a realtime data processing pipeline, and while it gets stored in our data lake, it also feeds data into our streaming analytics solutions as well as time-series-based anomaly detection systems.
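As a simplified illustration of that realtime path, here is a sketch of a single clickstream event fanning out to a data-lake sink, a streaming aggregate, and an anomaly check. The storage path, event shape, and threshold are invented for the example and are not Swiggy's actual pipeline.

```python
import json
from collections import Counter

minute_counts: Counter = Counter()          # toy streaming aggregate (events per minute)

def write_to_data_lake(event: dict) -> None:
    # Stand-in for an append to object storage, e.g. a date-partitioned S3 path.
    path = f"s3://example-data-lake/clickstream/dt={event['ts'][:10]}/"
    print("would append to", path, json.dumps(event))

def update_streaming_metrics(event: dict) -> None:
    minute_counts[event["ts"][:16]] += 1     # bucket by minute for realtime dashboards

def check_anomaly(event: dict, threshold: int = 10_000) -> None:
    # Toy time-series check: flag a minute whose volume exceeds a fixed threshold.
    if minute_counts[event["ts"][:16]] > threshold:
        print("anomaly: traffic spike at", event["ts"][:16])

def ingest(event: dict) -> None:
    """Fan a single realtime event out to the lake, dashboards, and anomaly detection."""
    write_to_data_lake(event)
    update_streaming_metrics(event)
    check_anomaly(event)

ingest({"ts": "2022-12-31T19:05:12", "type": "restaurant_click", "customer_id": "c42"})
```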
(12:51)
In addition to this, we also have realtime dashboards. As you can imagine, a lot of the ground conditions can change in real time, so we do have business teams and city operations teams that are monitoring business performance in real time. So we also need to be able to compute those metrics reliably and show them to our users. Down below we have the CDC, or change data capture, based flows. We have a combination of relational and NoSQL databases, and this requires different data processing pipelines to fetch the data, do the compaction, and store it in the data lake. Now, from the data lake you can see that there are several forks that the data takes. On one side, we compute facts and dimensions from the raw data, which are also stored in the data lake and then synced to our data warehouse, where we have analysts who are using the data for analytics.
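For the CDC path, compaction essentially means collapsing a stream of change records into the latest row state per primary key before it lands in the lake. A minimal sketch of that idea; the record shape here is hypothetical.

```python
def compact(change_events: list[dict]) -> dict:
    """Collapse CDC change events into the latest row image per primary key.

    Events are assumed ordered by commit time; deletes drop the key entirely.
    """
    latest: dict = {}
    for event in change_events:
        key = event["pk"]
        if event["op"] == "delete":
            latest.pop(key, None)
        else:                      # insert or update carries the full row image
            latest[key] = event["row"]
    return latest

events = [
    {"pk": "order_1", "op": "insert", "row": {"status": "PLACED"}},
    {"pk": "order_1", "op": "update", "row": {"status": "DELIVERED"}},
    {"pk": "order_2", "op": "insert", "row": {"status": "PLACED"}},
    {"pk": "order_2", "op": "delete"},
]
print(compact(events))   # {'order_1': {'status': 'DELIVERED'}}
```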
(13:58)
At the same time, we also have feature generation for our data science models that happens from the data lake; those features are computed, and if necessary, the models are retrained by looking at the latest features from the feature store. Now, the reason I'm showing this slide is that, if you look at sensitive data protection, it gives you an idea of how many different paths the data elements take. So imagine there is a sensitive data element sitting in one of our RDBMS slaves. It can take one of many different paths in the data processing architecture, and that presents us with a challenge in terms of figuring out how we classify this data accurately and, if necessary, how we transform it.
(14:49)
Okay, just to add to the variations in data flow we talked about, I also want to emphasize that we are handling about 30 billion events a day in total, and petabytes of data. While there's a lot of emphasis on data sanitization for model training purposes, I also want to call out that we are doing about 30,000 queries a day, mostly coming from the hundreds of analysts that we have. So it's equally important for us to classify and transform data in real time, so that the analysts have the same view of the data that our data science models have. So with that background, we will now go into the data management challenges.
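A quick illustration before the challenges: the kind of on-the-fly classification and masking this implies might look roughly like the sketch below. The column patterns and masking rule are illustrative, not Swiggy's actual policy.

```python
import hashlib
import re

PHONE_RE = re.compile(r"^\+?\d{10,13}$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def mask_value(value: str) -> str:
    """Replace a sensitive value with a stable surrogate (deterministic, so joins still line up)."""
    digest = hashlib.sha256(value.encode()).hexdigest()[:12]
    return f"masked_{digest}"

def transform_row(row: dict) -> dict:
    """Classify each field and mask anything that looks like a phone number or email."""
    out = {}
    for column, value in row.items():
        if isinstance(value, str) and (PHONE_RE.match(value) or EMAIL_RE.match(value)):
            out[column] = mask_value(value)
        else:
            out[column] = value
    return out

print(transform_row({"customer_id": "c42", "contact": "+919876543210", "city": "Bengaluru"}))
```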
(15:47)
The first challenge I want to talk about is data for test environments. In our backend infrastructure we have over 500 microservices, which are developed by hundreds of different developers. At any given point in time there are tens of projects being actively developed and pushed to production, and on average we have about 1,500 deployments a day. Now, in all of these scenarios, the developers or testers need the production state to be replicated in test environments, and this presents us with numerous challenges. One, a slightly obvious one I think, is that the sheer volume of data in production simply can't be replicated over, usually for cost considerations.
(16:36)
But there are also security and privacy implications here. As you probably know, in the recent past we have seen an increasing number of data breaches happening from test environments. And the reason, I think, is that most companies pay a lot of attention to the production infrastructure when it comes to deploying security controls, but somehow test environments may not get the same kind of focus, and these environments are generally assumed not to have sensitive data. This presents us with some unique security challenges. The other thing I want to highlight is that, given that there is so much entropy in the system, where new features are being developed and data models are being changed constantly, whatever transformation solutions we come up with need to be integrated with our CI/CD pipeline. Essentially, the data classification and transformation almost needs to happen in lockstep with the rate of change that is happening on the backend infrastructure.
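One way to keep classification in lockstep with schema churn is a CI step that flags newly added columns whose names look like they might carry PII, so a classification decision happens before the change ships. This is a sketch of the idea, not Swiggy's actual tooling; the patterns and migration format are assumptions.

```python
import re
import sys

# Column-name patterns that should force a data-classification review (illustrative).
PII_NAME_HINTS = re.compile(r"(phone|mobile|email|address|lat|lng|pan|ssn)", re.IGNORECASE)

def added_columns(migration_sql: str) -> list[str]:
    """Very rough parse: pull column names out of ADD COLUMN statements."""
    return re.findall(r"ADD\s+COLUMN\s+(\w+)", migration_sql, flags=re.IGNORECASE)

def check_migration(migration_sql: str) -> int:
    flagged = [c for c in added_columns(migration_sql) if PII_NAME_HINTS.search(c)]
    for column in flagged:
        print(f"classification required before merge: column '{column}' looks sensitive")
    return 1 if flagged else 0   # a non-zero exit code fails the CI job

sql = "ALTER TABLE customers ADD COLUMN alt_phone VARCHAR(16);"
sys.exit(check_migration(sql))
```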
(17:42)
The other challenge we realized is one of data consistency. What I mean here is that if you look at an order state, where a customer from a given city placed an order for a given item from a given restaurant, the data for this order state is typically distributed across multiple different systems. That means that the restaurant data store needs to have information about the restaurant, say R1, from which the order was placed. The customer data store needs to have the right customer information. The catalog needs to have the right item information. So the data has to be consistent across distributed data sources. The reason I'm pointing this out is that typically, when we look at data sanitization problems, we look at data that's sitting in one place, typically in a JSON file or a CSV file.
(18:37)
What we realized is that the problem is actually that the data sources themselves are distributed. So any classification and transformation that we need to apply, or any synthetic data that we need to generate, has to ensure that there is consistency across these distributed data sources. And of course, it also needs to maintain referential integrity, to make sure that we don't have any dangling references: if a given order is pointing to a given restaurant ID, that restaurant ID also needs to exist in the other data source. The last one is around data distribution guarantees, and here I think this is probably a more well-known problem. Let's say we are trying to detect coupon abuse on our platform; the exact distribution of customers to coupons to orders has to statistically match what we have in production. This has also been a challenge for us.
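A toy sketch of what that combination of requirements means in practice: synthetic orders are only allowed to reference customer and restaurant IDs that were also generated, and coupon usage is sampled from an empirical distribution so it roughly matches production. All numbers and names here are invented for illustration.

```python
import random

random.seed(7)

# Synthetic reference entities, generated first so every foreign key has a target.
customers = [f"cust_{i}" for i in range(100)]
restaurants = [f"rest_{i}" for i in range(20)]

# Empirical coupon distribution observed in production (illustrative numbers):
# most orders use no coupon, some use NEWUSER50, fewer use FREEDEL.
coupon_weights = {None: 0.80, "NEWUSER50": 0.15, "FREEDEL": 0.05}

def synthetic_order(order_id: int) -> dict:
    coupon = random.choices(list(coupon_weights), weights=coupon_weights.values())[0]
    return {
        "order_id": f"order_{order_id}",
        "customer_id": random.choice(customers),     # referential integrity: exists above
        "restaurant_id": random.choice(restaurants), # referential integrity: exists above
        "coupon": coupon,
    }

orders = [synthetic_order(i) for i in range(1000)]

# Sanity checks: no dangling references, and the coupon mix tracks the target weights.
assert all(o["customer_id"] in customers and o["restaurant_id"] in restaurants for o in orders)
print(sum(o["coupon"] == "NEWUSER50" for o in orders) / len(orders))
```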
(19:34)
The other challenge for us is data exchanges with partners. We operate a three-way marketplace, so we deal with multiple different vendors across different business lines and different categories like grocery, food, et cetera. This means that, one, we need to know which of those interactions might potentially contain sensitive data, and when they do, we need to be able to handle them appropriately. I just want to give one example that highlights this point. We have a feature where a delivery partner can call a customer in case he or she is not able to reach the customer's location, and this happens through an anonymized interaction where the delivery partner doesn't get to know the customer's phone number. Now, this is orchestrated by an internal telephony platform, which needs to interface with our telephony service provider. In that process, we will have to actually handle and exchange mobile phone numbers. So in all of these situations we are going to have to exchange data with partners; it's the reality of doing business. We need to figure out a secure way of exchanging this data.
(20:40)
The third challenge we have when it comes to handling sensitive data is that, in addition to having to classify items or data elements at scale, we are also noticing that there are lots of variations in what's considered sensitive depending on the geography you're in. So for example, in the US there's the social security number, which is also used as a tax identification number, but in Europe and other geographies the tax identification numbers are different. And here in India we actually have an alphanumeric format, with the alpha characters appearing in specific locations. So we actually had to develop solutions where we are able to identify "sensitive data" that applies to a certain geographic locale.
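A minimal sketch of locale-aware detection, assuming the familiar US SSN layout and the Indian PAN layout of five letters, four digits, and one letter. Treat the exact patterns as illustrative; real detectors need checksums and surrounding context, not just regexes.

```python
import re

# Per-locale patterns for tax identifiers (illustrative, not exhaustive).
LOCALE_PATTERNS = {
    "US": {"tax_id": re.compile(r"^\d{3}-\d{2}-\d{4}$")},   # SSN-style
    "IN": {"tax_id": re.compile(r"^[A-Z]{5}\d{4}[A-Z]$")},  # PAN-style
}

def classify(value: str, locale: str) -> str | None:
    """Return the sensitive-data category the value matches for this locale, if any."""
    for category, pattern in LOCALE_PATTERNS.get(locale, {}).items():
        if pattern.match(value):
            return category
    return None

print(classify("123-45-6789", "US"))   # tax_id
print(classify("ABCDE1234F", "IN"))    # tax_id
print(classify("ABCDE1234F", "US"))    # None: not sensitive under the US rules
```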
(21:28)
The other challenge is around location data. As you can imagine, a product like ours is collecting location information from multiple actors on the platform, and again, there's a lot of variation in how we accurately detect what is actually an address data element. Things like geohashes and lat-longs are relatively easy, but when it comes to descriptive addresses, this can be a big challenge. The same applies for customer name detection as well; personal names, again, vary depending on the geography you're in. The other challenge in handling this data, we realized, is that there is a tremendous amount of data sprawl. As we saw in the earlier slide, the pipeline is literally a sequence of stages where we do data processing, persist data, typically to S3, then run a bunch of triggers based off that data, and then process again.
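For the structured location formats he mentions, simple checks go a long way, while descriptive addresses and personal names need heavier, locale-aware detection. A sketch of the easy half only; the patterns are illustrative.

```python
import re

# Geohashes use a base-32 alphabet that omits the letters a, i, l and o.
GEOHASH_RE = re.compile(r"^[0123456789bcdefghjkmnpqrstuvwxyz]{5,12}$")

def looks_like_latlng(value: str) -> bool:
    """True for 'lat,lng' strings whose parts fall in valid coordinate ranges."""
    parts = value.split(",")
    if len(parts) != 2:
        return False
    try:
        lat, lng = float(parts[0]), float(parts[1])
    except ValueError:
        return False
    return -90 <= lat <= 90 and -180 <= lng <= 180

def looks_like_geohash(value: str) -> bool:
    return bool(GEOHASH_RE.match(value.lower()))

print(looks_like_latlng("12.9352,77.6245"))   # True: structured, easy to catch
print(looks_like_geohash("tdr1vu"))           # True
print(looks_like_latlng("3rd floor, 80 Feet Road, Koramangala"))  # False: free text needs NER-style detection
```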
(22:34)
Coming back to the pipeline: along this path there are a whole bunch of forks, where multiple different data processing elements are reading the same data. So you really end up with a directed acyclic graph of different pipelines, and if you have to identify a choke point where you're going to apply the classification and data transformation, it becomes a pretty big challenge. Lastly, we also realized during this process that there are real use cases where business teams need access to actual sensitive data like phone numbers, email addresses, et cetera. So we do need a mechanism for granting elevated privileges to sensitive data, and for revoking that access.
(23:16)
So the last challenge I want to talk about is regulatory challenges. Most geographies have restrictions on handling PII data. Here in India, we have data localization requirements for payments data, which means that we can only store the payment information within the region. This presents two challenges. One is an obvious one, which is that any data process, like fraud detection, for example, that's tracking relationships between customers, payment instruments, mobile devices, et cetera, now needs to have access to synthetic data which has the right statistical distribution. The slightly more nuanced problem is that there are services outside the regulatory boundary that also need information about payment instruments, and this needs to happen in a synchronous, online manner. Most of the data processing pipelines or solutions we looked at were designed for batch and asynchronous processing use cases.
(24:16)
Now with that, let's quickly jump into some of the solutions that we developed internally. The first foundational service we realized we needed was a tokenization service. As the name suggests, we want to be able to replace a sensitive data element with a token. Going back to our telephony example: at the point of ingestion, which is when the user first registers her cell phone on our platform and we collect the phone number, we convert it into a token, and the rest of the systems, card processing systems, order management systems, et cetera, essentially deal only with the token.
(24:58)
And at the point of egress, which is the very last hop that needs to talk to the telephony provider, we can do the detokenization and replace the token with the real sensitive data element. On top of this foundational service we also built something called DataShip, which allows us to securely exchange data with our vendors and partners. We realized that there are both scheduled and ad hoc demands when it comes to data exchanges. For most of the data that's sitting in our data warehouse, we had to devise a data processing pipeline that first identifies and tags elements as PII; for the PII data elements, the AWS Glue job that you see here then connects to the Kryptonite platform I talked about earlier, our tokenization service, which replaces the data element with a token. This token is then shared with our service provider or partner. And the sink here is typically either an SFTP or an HTTPS endpoint where we need to send the sanitized data.
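A minimal sketch of the tokenize-at-ingress, detokenize-at-egress idea: an in-memory vault that swaps a value for a random token and back. A real tokenization service would sit behind durable, access-controlled storage, which is not shown here.

```python
import secrets

class TokenVault:
    """Toy vault: swaps a sensitive value for an opaque token and back."""

    def __init__(self) -> None:
        self._token_to_value: dict[str, str] = {}
        self._value_to_token: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value in self._value_to_token:          # reuse the token for a repeated value
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]

vault = TokenVault()

# Ingress: the phone number is tokenized as soon as it is collected.
token = vault.tokenize("+919876543210")

# Internal systems (order management, analytics, partner exports) see only the token.
order = {"order_id": "order_1", "customer_phone": token}

# Egress: only the last hop that talks to the telephony provider resolves it back.
print(vault.detokenize(order["customer_phone"]))
```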
(26:15)
Next I want to talk about handling sensitive data. As we started the process of analyzing all the different data elements on our platform, the first problem we realized was that even categorizing which data processing flows actually handled sensitive data was a pretty big challenge. For companies like us that went through very rapid scaling, it's very difficult to put the right controls and best practices in place before the data sprawl happens. So you're looking at a system that literally has, say, thousands of different tables and materialized views, and hundreds of workflows that are processing all of this data. We realized that one key thing we needed was lineage tracking, which means that if you have a sensitive data element, we need to be able to trace that element through the data processing pipelines, to make sure we can identify which workflows and which systems might have the sensitive data.
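A sketch of that lineage idea: if a source dataset is tagged as sensitive, propagate the tag along the dataset dependency graph so every downstream workflow that can see the data is identified. The graph and dataset names here are hypothetical.

```python
from collections import deque

# dataset -> datasets derived from it (a tiny, made-up slice of the pipeline DAG)
DOWNSTREAM = {
    "rdbms.customers":           ["lake.customers_raw"],
    "lake.customers_raw":        ["warehouse.dim_customer", "features.customer_profile"],
    "warehouse.dim_customer":    ["warehouse.orders_enriched"],
    "features.customer_profile": [],
    "warehouse.orders_enriched": [],
}

def propagate_sensitivity(source: str) -> set[str]:
    """Breadth-first walk: every dataset reachable from a sensitive source is tainted."""
    tainted, queue = {source}, deque([source])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in tainted:
                tainted.add(child)
                queue.append(child)
    return tainted

print(sorted(propagate_sensitivity("rdbms.customers")))
```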
(27:24)
We also took the approach of doing data interception at the source database, to move the problem as far upstream as possible, where we are able to do the data classification and masking. This then allowed us to ensure that whatever data flowed into the data lake, and all of the data processing pipelines reading data from the data lake, were dealing with masked data only. The last solution I want to talk about is a capability that we built called ephemeral environments, and this is in response to the problem of test environments needing access to production data. So we created the capability to spin up a disposable test environment where, on demand, a tester or developer is able to provision all the infrastructure components, meaning the databases, the caching products, the message queues, et cetera, and all of the different microservices that participate in a given test scenario. We then combine this with a declarative test framework, which uses a Gherkin-based DSL to auto-generate data sets.
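He walks through a concrete example of this next; as a rough sketch of how such a declarative, Gherkin-style step could drive data generation across the participating subsystems, here is a plain-Python approximation. The step wording, store names, and seeded values are all hypothetical, and this is not the actual DSL.

```python
import re

# Toy in-memory stand-ins for the subsystems that need consistent seed data.
order_store, customer_store, restaurant_store = {}, {}, {}

STEP_REGISTRY = []

def step(pattern: str):
    """Register a handler for a Gherkin-style 'when' step."""
    def decorator(fn):
        STEP_REGISTRY.append((re.compile(pattern), fn))
        return fn
    return decorator

@step(r"an order has been placed for customer (\w+) at restaurant (\w+)")
def seed_placed_order(customer_id: str, restaurant_id: str) -> None:
    # Seed every participating data store consistently instead of copying production data.
    customer_store[customer_id] = {"name": "synthetic customer", "phone": "tok_abc123"}
    restaurant_store[restaurant_id] = {"name": "synthetic restaurant"}
    order_store["order_1"] = {"customer_id": customer_id,
                              "restaurant_id": restaurant_id,
                              "status": "PLACED"}

def run_step(text: str) -> None:
    for pattern, fn in STEP_REGISTRY:
        match = pattern.fullmatch(text)
        if match:
            return fn(*match.groups())
    raise ValueError(f"no step matches: {text}")

run_step("an order has been placed for customer cust_42 at restaurant rest_7")
print(order_store, customer_store["cust_42"], restaurant_store["rest_7"])
```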
(28:47)
So to give you an example, if a developer or tester is testing a certain stage of our order processing flow, for example after the order is placed, the test creator can simply say, "when an order has been placed with these properties." This then allows us to create the right data sets in the different subsystems, for example in the order management system, the customer data store, and the restaurant data store. And we have these templates that can be reused across multiple different scenarios. This has greatly reduced the need for us to copy data from production onto test environments. Okay, thanks a lot for listening. I hope you found the information useful, and if there are any questions, I'm ready to take them now.