# Dataset Joinability at Scale with HyperLogLog

## Video description

John Myers, CTO at Gretel.ai presents a talk on our streaming implementation of Google's K-HyperLogLog (KHLL) research paper at RedisConf 2020.

## Read the blog post

## Transcription

John Myers (00:07):

Hi everyone. My name is John Myers. I'm the CTO and co-founder of a seed-stage startup called Gretel. And today, I have the privilege to talk to you guys about some of the analysis we're doing using Redis to determine joinability between data sets using HyperLogLog. So to start, I'll go over a quick overview of what Gretel is actually building, some of our customer challenges and requirements, talk about some of the existing research already in this field, and then walk through what the basics are of containment and some experiments we ran, and then finally talk about our implementation of how we are determining joinability between data sets at scale.

John Myers (00:53):

So at a high level, Gretel is a service that is empowering developers to quickly build derivative data sets through privacy analysis, data transformations, and generative modeling that allows people to generate synthetic data from existing data. So the whole purpose is that this derivative data set becomes safe to share without actually infringing on individual privacy. For the purposes of this discussion, we're going to focus more on the privacy analysis side, where we can use joinability between data sets to determine if a data set is actually safe to share, to make sure that information can't be re-identified back to its original source through a second party, third party joins.

John Myers (01:37):

So some of our customer challenges summarized is, "Can you help us determine what overlaps we have between our siloed data?" So one of the biggest challenges we see in very large organizations or between B2B scenarios is that we have data silos where you have a lot of overlapping data, but you can't really see what the other team has without giving access to the data. So it becomes this all-or-one situation, or maybe I should say all-or-none. So you either get access to the data or you get no access to the data, and it's really hard to determine how you can mutate or transform the data to make it possible to share.

John Myers (02:20):

So some of the use cases we have is customers want to know what analytics are in a realm of possible across data sets. So this would be looking to determine if two data sets can be joined so that net new insights can be generated between that data. Additionally, they want to verify if there's any re-identification risk in a data set before it's shared or released. So this would be making sure that if you were going to release a data set and it turns out that some of the columns or elements can be joined with a publicly available data set or another data set that another company or organization has, you want to mitigate that risk before you even release it.

John Myers (02:59):

So what we're talking about is joinability and containment, and I often use the two terms interchangeably, but really joinability in my mind is a yes or no question. Can data be joined? Yes or no? Containment is the actual measurement of how much of the data can be joined together. So really containment is the percentage overlap between data set A and data set B and vice versa. One of the big use cases we have is being able to do transitive join analysis between data sets that are more than one degree separated. So in this case, if I can show that data set A could be joined with data set B and data set B can be joined with data set C, this could reduce to showing that A and C are joinable when they don't actually have any overlapping data to begin with.

John Myers (03:51):

Some of the challenges here is that we need to operate on streaming data. So a lot of our customers are working with event driven architectures. They're dealing with transactions. They're dealing with unbounded data sets. And finally, they want to be able to determine joinability without accessing the raw data. So how can I determine that we have the same cards in our hands without actually having to show each other our hands?

John Myers (04:18):

So we'll talk a little bit about some existing research in the field. So Google was kind of the basis for this, and they have an excellent paper that discusses two major approaches to privacy analysis. And the paper is based around this notion of a KHyperLogLog, which is a new data structure that they proposed for doing two things - one, for determining joinability between data sets, and then second, to do re-identification analysis. So when we talk about re-identification analysis, we talk about taking, let's say, a sensitive field in a data set that could be the primary key. It could be a unique identifier. It could be a specific piece of PII to a customer, and then determining what combinations of other fields exist that are just as unique.

John Myers (05:14):

The proof of concept code for both the joinability/containment and the re-identification analysis is in a GitHub repo under Google, and they use BigQuery to demonstrate this capability. Additionally, Neustar has a blog about using Redis to determine intersections between HyperLogLogs, which are just determining set intersections at scale. This is a fundamental piece of the containment analysis. Unfortunately, the blog is no longer available. When I go there, it seems to have like a WordPress error. On the other hand, I was able to find a reference to that blog, which gives a good high level summary of what they discovered.

John Myers (05:59):

So what is a HyperLogLog? HyperLogLog is a data structure in Redis that provides estimated cardinalities. And for our purposes, we are going to focus on the joinability side of things. So we are going to focus on determining joinability between two data sets purely based off of the cardinalities of those sets. So Redis HyperLogLog can count the number of unique elements in a set. It's a probabilistic data structure. So it's not as precise as doing a pure Redis set operation. However, it has a huge efficiency in memory usage. So it uses a constant amount of memory, approximately 12 kilobytes and it is extremely quick at adding elements to a set and computing the estimated cardinality of a set. There is a standard error of 0.81% because of its probabilistic nature. I won't go too much into the underlying implementation, but there is an excellent blog out there from a Redis developer and it outlines the exact architecture and how it works. And finally, Redis works really well for event-driven and streaming scenarios because we can accumulate data into these HyperLogLogs without having to do massive table scans.

John Myers (07:27):

So next, we'll get into containment basics. And so before I move on from here, I will point out that all of the code snippets you see and the experiment that I run here, the code is all available on GitHub. So anyone can pull it down and modify it and mess with it themselves and tailor it to the use case or whatever you'd like to do with it. So what is containment? Containment is an asymmetric calculation of what percent of values overlap between sets A and sets B. And when I say asymmetric, it's because it's two way. So you can actually calculate the containment of set A inside of set B and set B in set A. This is a little different than other techniques like at your card index, where that gives you a one value for any pairwise tuple of sets. So we'll write this as "con(A, B)" and "con(B, A)" for our formula, and we'll let the cardinality of a field be defined as card of F.

John Myers (08:41):

So the containment of A inside of B is defined as the cardinality of the intersection between the two sets divided by the cardinality of A. Now if we were saying the containment of B inside of A, our numerator would be the same, but our denominator would be the cardinality of B. This yields a score between 0 and 1 that tells you what percentage of A is actually inside of B. And to determine the intersection, we can actually use the inclusion-exclusion principle, which relies on the union of the two data sets. So in this case, we take the absolute value of the cardinality of A plus the cardinality of B, and then we subtract the cardinality of the union. As we'll see later, we were able to easily compute the cardinality of the union using our built-in Redis data structures.

John Myers (09:40):

And then in the GitHub repo, we'll see that there is a sample code. It's an Etsy directory, and it's called basics.py, I think, and this just runs a very simple simulation. So what we do is we use UUIDs as our set items, and then we generate 500 identical UUIDs that will be between sets A and sets B. Set A contains 1,000 items and set B contains 1,500 items. And there are 500 distinct items that overlap between the two of them. So when we run this calculation and we were able to add each item to the HyperLogLog, which is named "f_a" and "f_b" respectively, we can run inclusion-exclusion principle, determine the containability score. And on the bottom left here, we see that the containment of A and B is approximately 50%. Right? So that is saying 50% or, in this case, 49% of the UUIDs inside of set A are found inside of set B, which gives us very close to what our actual truth is. Now on the flip side, we look at the containment of B inside of A. Now because of B has more items, it has 1,500 and only 500 are shared, we see that there's approximately a 33% overlap. So very simple way to implement, very quick and very efficient on memory.

John Myers (11:15):

So one of the two experiment about the standard error that I mentioned before. So in this situation, we do 100,000 iterations of creating two sets that have zero overlap. And the reason why we do this is we want to see what this actual difference is. So for field A, we generate a random 10-digit string, and for field B, we generate a random 3-digit string. The estimated cardinality of A comes in just under 100,000, and the estimated cardinality of B comes in at 997. Now, obviously because we have a limit on 3-digit string, there's only so many combinations of the 3-digit string that we could have, which lowers our cardinality for set B no matter how many combinations of that we create. So really, we're comparing a set of a hundred thousand items to a hundred items, which is approximately a hundred times larger than set B.

John Myers (12:27):

So in this case, when I ran the simulation one time, my estimated intersection came in at 159, which gives a containment score of 16%, which we know the actual answer is 0%. So what we learned from this is that the order of magnitude between the two set cardinalities could really let that standard error rate throw a false positive. So that's just something we have to keep in mind for our implementation. In the Neustar blog, they suggested using 20 as kind of the multiplier. So if your cardinality between sets A and B is 20x, meaning one of the sets is 20 times or more larger than the other one in terms of unique items, then you should be very cautious about trusting the results.

John Myers (13:23):

So I wanted to kind of dive a little deeper and give more of a real-world scenario here. So in this situation, there is a script in the repo that generates a bunch of dummy credit card data. And I chose this scenario because I think it's a scenario that lots of people are familiar with. A lot of people have been subject to credit card fraud, and it's a scenario where there are a lot of companies in the security space that are in the business of determining what is the order of magnitude of damage that happens when there's a credit card leak. Right? So these companies are able to kind of procure the leaks of data from the dark web or wherever. And then they're able to kind of inform their customers on what types of things are being leaked, so banks and other financial institutions can take action.

John Myers (14:14):

So I kind of wanted to replay that scenario and say like, "Hey, how could we, in theory, demonstrate this without actually having to share real numbers with each other?" Another solution to do this would be, well, we can just hash everything, right? We could say, let's just hash every credit card that the bank has and hash every credit card that is leaked and then compare them, which works, but you're still exhausting a lot of space to do that computationally. Right? It's still a one-to-one scenario. So how can we do this in a singular footprint? So what we do is we generate two million records and we say they belong to the bank. Very simple schema. We have the type of credit card. We make a fake name for the credit card, the credit card number itself, an expiration date, and a CVV code.

John Myers (15:04):

And then we generate some cards that belong to the thief. So in this situation, the thief steals 50,000 records, and they have a much smaller footprint in headers where they only have the credit card type, the credit card number and the CVV. When I used this data set, I generated it with a Python library called Figure. And so what you see is the output from baker, where it actually tells you the name of the card and the number of digits. And I left that as is because what we really care about are the credit card numbers.

John Myers (15:40):

So the question we want to answer is how vulnerable is the bank relative to the entire stash of stolen credit cards? So before we go on, I'll tell you that the dramatic irony of the situation is the generator created 25,000 records that overlap between the two. So 25,000 of the thief's records belong to the bank. On one note here is I got a little aggressive and this is actually a 40x difference in set size. So 40 times 50,000 is 2 million. And so this is more than double outside of the recommendation from Neustar, and I did this on purpose so we can see how things react regardless. And then before I go on, inside of the GitHub repo, there's another script that will load this dummy data into Redis using the redis-cli bulk-pipe technique. And then there is a containment script that will actually go and compute all the containments and kick out a lightweight data structure.

John Myers (16:54):

So how I implemented this inside of the scenario is once we load everything inside of Redis, each column gets its own HyperLogLog. So we have eight total HyperLogLogs that get created in this scenario. There are five headers for the bank, three headers for the thief. That gives us our eight, and we will compute containment across 2-tuples. So we'll take every combination of header and then create a containment score for that. This becomes an 8 choose 2 problem, so that gives us 28 total combinations of headers between the bank and the thief. Now I am actually comparing fields in the same table. You don't necessarily have to do that. It totally depends on your use case. I will say that for a lot of streaming scenarios, I have found that you don't necessarily have a clean table schema, and then a lot of times when you're streaming data, you'll have a data structure that streaming and then you'll have actual overlap between values in the data set, whether it's a duplicate value or it's a value from one field that is contained in another field, but with some other type of metadata around it.

John Myers (18:17):

Okay. So getting into our scenario results, the highest score is a 100% containment between the thief's CVV numbers and the bank's CVV numbers. So 4,300 and change from the thief and 11,000 and change from the bank. This isn't surprising because we're going to exhaust the entire key space for the possible CVVs pretty quickly. So it's reasonable to say that this is extremely accurate, and we also don't really necessarily care. So this is the meat that we're looking for right here. So we have 48% of the thief's credit card numbers, which comes in at an estimate of 50,275, where the real number was 50,000, are contained inside of the bank's credit card numbers, which comes in at an estimate of 1,991,000 and change. So this is the target that we're looking for. We know that the real answer is 50%. We commented 48% given the difference in the cardinality sizes here. Again, the cardinality sizes are about 40 times bigger for the bank than it is the thief. This is still pretty good.

John Myers (19:29):

As we move down the list, again, we see the CVV score pop back up. So the bank's CVVs are 39% contained inside of the thief's, which again, you would expect that because you would expect that a subset of the bank CVVs could be arbitrarily generated by the thief. And then finally, we got a containment score here showing that 19% of the thief's CVVs are contained in the bank's credit card numbers. So this is that wildly different cardinality size kicking in the gear here. So now we're looking at a cardinality of 4,000 compared with about 2 million. So this is where you're going to see a false positive come into play, but on the other hand, it gives you a really low containment score and this is way more than 40 times the size of the other set when we're comparing 2 million to 4,000. So you can reasonably say, this is probably a false positive and we won't even pay attention to it.

John Myers (20:38):

So some implementation considerations for real world based off of this experiment. So field context, super important when you're comparing data. And what I mean by field context is, can you derive some type of information about the field other than the value itself? So pre-processing and normalization of the field to extract any type of type or entity would be really useful here. And one way that we do that is we use a different varying techniques of natural language processing, regular expressions pattern matching, and custom extractions to identify contexts about a field before we make a determination to build a HyperLogLog for it. So can we determine that a field is containing phone numbers or addresses or zip codes or credit cards, names, other types of technical metadata? If we can extract it and label it, it gives us some context we can grab on to before we start doing comparisons for joinability.

John Myers (21:43):

And this is really just to have some notion of fields containing similar things. So we want to make sure in general that we're comparing credit cards to credit cards or locations to locations, PII to PII, et cetera. On the location front, I will say it's an extremely interesting opportunity here to do geocoding, reverse geocoding, to really canonicalize locations together. So if you can move locations into a common format, something like latitude and longitude, and then compare them for joinability, that has a huge opportunity there. Right? So addresses can be converted down to lat longs with a certain level of precision. And then if you already have lat longs coming in, you could normalize those to a certain degree of precision, maybe three or four decimal places, and then you can do containability there.

John Myers (22:34):

And then finally, you want to minimize comparing every possible field. So again, you have an N choose M problem if you choose to compare every field with itself regardless of the source table or source data source. So that scales very quickly. It becomes exponential. So if you can minimize which field you want to compare with each other, you're actually going to be better off. And while the PFCOUNT command is extremely fast, when you do PFCOUNT with multiple fields to actually get the union between two fields, you incur a little bit of a penalty because Redis has to actually create a temporary merged HyperLogLog. So that can be exhaustive over time.

John Myers (23:16):

And then finally, as we saw before, you want to be cautious about fields with much different cardinalities, And as I mentioned before, Neustar recommend a 20x. We were able to demonstrate that it still works pretty well at 40x. I think your mileage may vary depending on your exact use cases and the types of data you're working with. One option here is that you don't necessarily have to throw the score away, but you could consider flagging field 2-tuples with some type of flag giving a warning that, "Hey, this score is computed from two sets that have a 50x difference in cardinality size. So you might want to take it with a grain of salt."

John Myers (24:05):

And then generally, we're looking for containment scores above or equal to 40%, and I'm saying that anecdotally that's based off of the work that we've done and the type of data we're processing. It's really important to kind of set your expectations early, and our expectation is that we're looking for massive overlaps in data sets. We're not looking for needles in the haystack. So if you have two ginormous data sets and you want to just see if a handful of records of one exists in the other, this is not the solution for you. And finally, HyperLogLogs are exportable as byte strings. So under the hood, you can look at a HyperLogLog in Redis because at the end of the day, you can actually set and get it, and you can actually export them as byte strings, which is super useful and you'll see why.

John Myers (24:55):

So next, we'll get into some high level implementation details of how we're enabling this. So first, we extract all possible context for a field, and we do that based off of the contained values and we also can try to exploit the header names. So we've done a lot of work in building ways to model header names to give context as to what the actual value is. And so we've collected data from millions and millions and millions of database tables that has been made available by Google and we can kind of build a corpus of header substrings and fasttext vectors to do analysis, to try to say, "Hey, based off of the header name and the values down that header, we can reasonably say that this is an address or reasonably say that this is a first and last name." And then we attach that context as metadata to the field.

John Myers (25:52):

Normalization and encoding, you can get a lot more coverage this way. So if you're able to post-process an extraction, for instance, let's say we're looking at phone numbers, then you could actually strip out a lot of the formatting metadata in a phone number. Right? You don't need the parentheses and the dashes and the periods. You can just extract out the 10-digit number, the 11, 12-digit number, depending on what localization you're looking at, and then use that as your value going into the HyperLogLog data structure. And then finally, the high level abstract data structure that we are tracking for joinability becomes a 3-tuple, and I'll kind of walk through how we compute that abstract 3-tuple here.

John Myers (26:34):

So what we do, we have this notion of a project. The project is a stream, just like a repo in GitHub. And so what we do is we maintain a HyperLogLog per project, per field, per context, and this is formatted by the field name and the context itself. So here's an example of a key name for Redis and adding a credit card number to it. So our field name is prefixed with "cont" for containment and "hll," and that is just kind of a prefix you might use to kind of differentiate the keys from other keys you have in the database. And then here, we're combining the project name with the name of the field, which in this case is number. And then the context about the field, we've determined that as a credit card, and here, we're adding a fake value to this supposed HyperLogLog.

John Myers (27:31):

Finally, one thing we can do is then we can actually use a sorted set to track the pairs of field names and context to something like a last seen epoch, and this becomes useful when we want to go compute the containment signatures or the values is that we can determine which values to use based off of which ones have changed over time. So in this case, we can do a ZADD to a sorted set data structure where, again, it's prefixed here just to kind of namespace it. And then we can have a sorted set per project. And then the values in the sorted set are the names of the field and the context pairs with something like a last seen epoch. And so this makes it pretty straight forward to iterate over the sorted set to generate the actual HyperLogLog key names we'll use to make our computations. So our ideal final 3-tuple here is that we get a field name, some type of context, and then the actual HyperLogLog bytes.

John Myers (28:42):

So when we move into what we can do with this is we can actually generate a serialized signature. So in this case, I've generated two signatures, and when you run the sample code on GitHub, it does this for you and writes them in, excuse me, write to them out as a JSON file. And so in this case, since the field names and the context are the same, the field and context fields match up between the thief and the bank. And then what I've done is simply take the HyperLogLog byte string and write it to a base64-encoded string before we write it to disk. And if you take this and you reverse the process and you put it back into Redis and run the same computation will actually kick out the same exact score that we saw from earlier. So we'll see that it's approximately 48% overlap.

John Myers (29:33):

So again, these byte strings increase in size of course when you base64-encode them. They'll be a little higher than the 12 kilobytes that they are inside of Redis, but they'll still be generally consistent. It doesn't matter how many values you've added to the HyperLogLog. You always get that consistent byte string. And the most important thing is these signatures don't contain any of the personal information. So we essentially have a way to now determine if two data sets have a high degree of overlap in data without ever having to compare the raw data with itself.

John Myers (30:18):

So let's get into kind of how we execute this architecturally. So what we have is an inbound API that's very similar to an Elasticsearch API where you can post JSON records. And then what we do is we drop those records into a queue, and then we have a series of services that we kind of abstractly call our context extractors. This does a lot of our natural language processing, our regular expression processing, week or month custom patterns provided by customers that allow them to build patterns or regexes or extractions to tag types of data based off of what's in the raw content. And then we can label a field with all the different types of contexts we've seen in that field. This routes down into what we call our service queue. And then we have a bunch of downstream services that do anonymization and transformation and synthetic data generation, but for our purposes, we're more worried about how we generate these HyperLogLog containment signatures.

John Myers (31:22):

So when we do the context extraction, what we're doing is actually, in addition to tagging the fields with all the different contexts we've seen, we will actually generate the Redis write commands that we need to insert this data into our HyperLogLogs. Oops. So what we do is we actually buffer these up inside of Redis pipelines, and then we can take that command stack, and then we serialize those commands and we write them onto a queue. And then we have a series of microservices that just pull from that queue and actually do the Redis writes. Now, the reason why we do this is because the most expensive part of our entire pipeline is this context extraction. So that could scale to have many, many, many workers, and we don't want each one of those workers creating independent connections to Redis.

John Myers (32:19):

So what we do is we buffer up all the commands in a queue, and then we can control exactly how many Redis writers pull out of that queue and write to Redis. This is a little bit of a flipped use case in the sense that this is way more write-intensive than it is read-intensive. So it's important that we can buffer those writes. Obviously, we can mitigate this further by using Redis clustering, where we're kind of spreading out our keys across multiple shards, but for right now, we're actually still just using a single master writer endpoint, and just we're able to buffer the writes well enough so that we still have good performance.

John Myers (32:58):

So then we can have a routine that comes along every so often and pulls out all the HyperLogLog structures that have recently changed. We can generate those signatures and then we can keep them into a signature store. And so then what you're able to do is search that signature store for projects that have similar contexts to the ones that you're collecting, and it becomes very trivial to take any two signatures and then load them into an ephemeral or very volatile version of instance of Redis running, so you can compute the score and then flush out the keys or whatever. Right? So it's kind of just like a TrueCache version of Redis just to compute the scores very quickly and send them back out.

John Myers (33:47):

So in summary, we have found that HyperLogLogs have made containment scoring pretty painless and have satisfied a lot of our requirements to be able to help customers do analysis across siloed data without ever actually having to show the raw data to the two interested parties. One of the big things that we realized, again, is that having some context around the two HyperLogLogs we were comparing is really important. Fortunately, we had a requirement to do labeling and content extraction in our data separately. So we were able to leverage that to add additional context to our entire process of doing joinability. Normalization/Encoding is super helpful. So if you're able to post-process the data so it's in a standard format before you put it into a HyperLogLog, that's going to give you dividends down the road, for sure.

John Myers (34:50):

And of course, this is ideal for event-driven architectures. Since we don't have to do retroactive table scans to build these models because we're dealing with streaming data, we're able to run a very typical queuing architecture to buffer up these writes into Redis and then have periodic read out to build these containment signatures. Well, that's it for me. I hope this has been useful. For those of you watching, for those of you who are going to tune in or have tuned in already, I really appreciate it. I hope everyone's staying safe and feel free to reach out to me or my team. Check out our work on GitHub. We'd love to hear any feedback, any use cases around this. And Redis kicks ass and I continue to love to use it.