Build a synthetic data pipeline using Gretel and Apache Airflow
Video description
During this webinar, we’ll build an ETL pipeline that generates synthetic data from a PostgreSQL database using Gretel’s Synthetic Data APIs and Apache Airflow.
Transcription
Danielle Beringer:
So excited to have you here for our first Gretel workshop of the fall season. My name is Danielle Beringer, and I am the director for partnerships at Gretel. I'm really excited to present today with Drew Newberry from our engineering team. And we'd like to set the stage by talking about privacy engineering as a service. Gretel was founded on open source. And today we'd like to show you the basics of privacy engineering, and put it in a construct around traditional ETL that you may be familiar with. We've chosen Airflow and Gretel together, because we think it's a great way for you to start your early understanding of what our capabilities are, how we can change the way you work with data, and also how we can start to move towards the use of privacy data actions in everyday life.
Danielle Beringer:
Our agenda for today will be to build a synthetic data pipeline using Gretel and Apache Airflow. Some of the modern data actions around privacy engineering that we want to share with you are things like the use of classification, transformation and synthetic data throughout your data lifecycle. The placement of privacy tools in the hands of developers is also one of the reasons that we chose Airflow for today's workshop. We want to give you secure options to work with your data on premises or in the cloud, and we'll show you how our capability can do just that.
Danielle Beringer:
We also want to ensure that you're moving towards adoption of ongoing quality scoring and privacy protection. This allows you to confirm the data utility for the datasets that you're using. Not every data set may be fit for use in your machine learning models, or for data sharing internally in your company or outside with others. And ultimately, we want to see everyone share data safely for innovation and data value.
Danielle Beringer:
One of the first things I'd like to cover is [inaudible 00:01:57] visibility. Like I mentioned in the opening, Gretel was founded on open source; we want to democratize data access, making data safe and private. Data engineers, data scientists, data architects and developers can all benefit from our three core capabilities: synthetics, transformation, and classification and labeling. Today our workshop will focus on synthetics, but we encourage you to come back and follow our blog and our upcoming workshops on our other two capabilities.
Danielle Beringer:
The use of privacy controls lets you collect and share data with others. Synthetics presents a unique opportunity to do things that previously you may not have been able to do due to regulations, risk, security protocols, or long timelines for obtaining data with sensitive information. Today we'll talk about generating unlimited data using our synthetic APIs, how we're AI based and open source, and how differential privacy is enabled, particularly around our most recent launch of privacy filters, which give you four different options for applying exactly the configuration that suits your needs.
Danielle Beringer:
Regarding how Gretel runs, we offer two different ways to run our product. You can run your workers in our Gretel Cloud with the features that you see on the left. We want you to be able to scale to the workloads necessary for the job at hand. Team collaboration is extremely important in development cycles, whether that's machine learning, software engineering, data engineering, or even just ad hoc discovery. The ability to collaborate with your team and share projects as well as data sources is essential. We understand and recognize that governance and privacy controls in each organization can be very, very rigid. So we offer our customers the ability to run on premises, in their own virtual private cloud, or in whatever configuration is necessary to meet those protocols; you have the flexibility to do that and manage workers from the console locally, without data ever going into our cloud.
Danielle Beringer:
The Gretel workflow is illustrated here. We have our transform, synthesize and labeling capabilities. In the middle of the picture, where you see our Gretel logo, is what we call the control plane: the controls for our Gretel Cloud that orchestrate the requests you make through the console, the command line interface or the SDK. On the right are the various ways you can run, as I previously mentioned: either cloud workers or local workers. We know that this diversity of solutions will allow you to use our APIs in the way that is most suited to your development practices and the controls in your organization. I wanted to note that the local worker will only need to connect to Gretel Cloud to transmit timing, logging and error telemetry. The data does not leave your environment when running synthetics locally.
Danielle Beringer:
The outcome of synthetics, generating datasets that have the same statistical distributions and correlations as your original data, requires us to provide some guarantees. We do that through our synthetic quality score and privacy protection. When you generate synthetic datasets, you need to know if they're fit for use, and we want to make sure you know whether the quality of that data is something that can be used and shared appropriately, depending on your use case. The supplemental data protection from our privacy filters is essential if you're going to be using your data for training machine learning models. We have four mechanisms that we won't be discussing in detail today but that are very important to understand in our product offering: overfitting prevention, similarity filters, outlier filters, and differential privacy. We offer sample reports, configuration templates and documentation that we'll talk about at the end of the presentation for you to dig deeper on this subject.
Danielle Beringer:
Before we move into the meat of our discussion today and the workshop that Drew is going to facilitate, I wanted to pause and put a new mindset out here for this workshop. Everyone's familiar with extract, transform and load. What Gretel is putting forth is really a new mindset around extract, synthesize and load. We think that the advancement beyond traditional ETL to ESL is the next horizon in data accessibility. My personal experience in looking for ESL opportunities was in loading sandboxes, trying to find ways to do POCs with limited data, and attempting to put together analytics and roadmaps without a full data set at the right level of data grain. Doing prototyping and testing with no data, the incorrect data or unbalanced data was miserable. Extract, synthesize and load, and the way that Gretel can deliver classification, transformation and synthetics in unison, really is an amazing new way to enable your organization to accelerate with data and dramatically shorten data accessibility time.
Danielle Beringer:
Before I pass it over to Drew, I want to make one note on why we chose Airflow and Gretel for our very first workshop. The reason is that we feel the personas that interact with data, data scientists, data engineers, developers, data architects, and knowledge workers, all work in different tools, and we want to meet you where your tools are. Airflow has a very well adopted community; many people are comfortable with the way in which Airflow can accelerate data pipelines in the open source community. That contribution is fundamental to our mission. So we're excited to share with you today how we see extract, synthesize and load in an example using Airflow. We hope that you'll continue to join us for our new workshops and blogs, where we will be exploring other data platforms, tooling and data scientist interaction models, as well as things that people are using on a day to day basis to go through their software development lifecycle and really leverage the power of data.
Danielle Beringer:
I'd like to introduce you to Drew, who's going to go into the workshop portion of our session today. And I think you'll be really excited to see what he has to show you. Drew.
Drew Newberry:
Great, thanks, Danielle. Let me share my screen here and we'll get started. Alright, so hopefully everybody can see this GitHub repo. If you can't, just let us know.
Drew Newberry:
So thanks again for joining. Again, I'm Drew. I'm a software engineer at Gretel. For the next 20 minutes or so we're going to expand on some of these topics that Danielle spoke about. We're going to build a data pipeline that generates synthetic data from a Postgres database using Gretel and Apache Airflow. We're going to cover some Airflow basics, and talk about how to write a data pipeline in Python using a relatively new feature of Airflow called the TaskFlow API. We're also going to talk a bit about feature engineering. Then we're going to dig into the extract, synthesize and load (ESL) pattern that Danielle was referencing earlier. And lastly, we're going to cover how to use Gretel from both our Web Console as well as our Python SDKs to train and build synthetic data models.
Drew Newberry:
So first, in order to frame and talk about these topics, we're going to walk through a pipeline that tackles a fairly common feature engineering type task. Here on my screen, this diagram in the README gives us a high level overview of what we're going to build. Each box on the bottom represents a step or a task in an Airflow pipeline. Each box at the top represents an external service that we call out to from the pipeline. So we're going to review this top down, left to right.
Drew Newberry:
So starting in the top left corner here, we have a Postgres database. This database is meant to represent a production database or maybe a data warehouse in your own infrastructure. We're going to seed the database with a data set from Kaggle that contains Airbnb user booking data, and we'll go into more detail about what that schema looks like later on. Using this database as the source, we're going to write a SQL query that extracts a set of features from the database and then uploads that to a remote S3 location. We'll then use that extracted set of features as training input into our synthetic model. Using our Python SDK, we're going to configure a model and upload the raw data set to our cloud, where the model is then going to be trained and run to generate synthetic records.
Drew Newberry:
So after the model's been trained, we're going to download that set of synthetic records to our Airflow worker, and then upload them to an S3 bucket here on the top right. This S3 bucket just represents some destination that's accessible by downstream data consumers. In your own infrastructure, this might be a data warehouse or a data lake, but for the purposes of our demo this is just going to be an S3 bucket.
Drew Newberry:
So now that we've seen a bird's eye view of things, we're going to jump over to some code. I have a VM here that has our Gretel pipeline demo code checked out. In this repo, we have a set of Docker containers and services that are going to bootstrap an Airflow server and a Postgres database. We'll use these services to run our synthetic pipeline end to end. So the very first step in this demo is to seed our database with the Airbnb booking data that we were talking about before. We have a Makefile here that just gives us some shortcuts that are helpful in bootstrapping the database. So running [inaudible 00:11:33], hope you can see my terminal down here. This is going to download the data set from Kaggle and then insert it into our Postgres database. This is going to take a little while to run; there's a couple of gigs worth of data here. So while that's running, let's jump over and take a look at the schema.
Drew Newberry:
Alright, so there are two tables that are interesting in this data set: a users table and a sessions table. The users table just identifies a single user; we have an arbitrary ID and some other identifying information that includes gender, age, and how they signed up for the service. The sessions table captures user activity by user ID. We have a couple of different types of actions that detail how a user was interacting with the site: did they create a comment, make a new booking, or acquire a reservation. And that's tied back to the users table with this user ID variable, which is a foreign key reference to the ID field up at the top.
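(For reference, here is a rough sketch of the two tables Drew is describing, based on the public Kaggle Airbnb bookings dataset; the column lists are abridged and the exact DDL in the demo repo may differ.)

```python
# Rough sketch of the two tables described above, based on the public Kaggle
# "Airbnb New User Bookings" dataset. Column lists are abridged; the exact DDL
# in the demo repo may differ.
USERS_DDL = """
CREATE TABLE users (
    id            TEXT PRIMARY KEY,  -- arbitrary user id
    gender        TEXT,
    age           INTEGER,
    signup_method TEXT               -- how the user signed up for the service
);
"""

SESSIONS_DDL = """
CREATE TABLE sessions (
    user_id       TEXT REFERENCES users (id),  -- foreign key back to users.id
    action        TEXT,                        -- what the user did on the site
    action_detail TEXT,
    device_type   TEXT,
    secs_elapsed  NUMERIC
);
"""
```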
Drew Newberry:
Looks like it's still running down here. As we dig into the pipeline, we'll also talk more about how we use these two tables to build up the set of features that we're going to synthesize. So this should finish in a second; while it's finishing, let's jump over. The next step is to actually start our Airflow server. Oh, great, it looks like that finished. But the first thing we have to do before we start our Airflow server is configure some secrets for the external connections that we were talking about earlier.
Drew Newberry:
So we have this env sample file here, which defines all of the secrets used to connect to our external services. We have three services: our booking database, which is what we just seeded with the source data, our Gretel API, and our S3 bucket. The Gretel connection string just takes the form of a URI. This first component here, api.gretel.cloud, references our set of REST endpoints, and then we have to fill in the API key and project ID. So in order to populate those, we're going to jump over to our Gretel Console, where we can create a project and then grab our API key.
Drew Newberry:
So switching to the console, this is the first screen you're presented with after logging into Gretel; this is your dashboard. The first thing we want to do is hit new project. We're going to go over to settings and name our project Airflow Webinar. [inaudible 00:14:00] the project. We'll grab the name of the project here and save. We'll drop back over to our environment variables, and we're going to replace the project ID with the project name. Now we need the API key. So heading back over to our console, we've got this Connect dropdown in the top right. We're going to click that; it gives us some details about the project, but it also gives us an opportunity to grab our API key. So hitting copy here copies our API key to the clipboard, and then we can paste it right here.
Drew Newberry:
The very last connection string we have is for our S3 bucket. This takes an access key, a secret key, our bucket name and our region name. For these last two connections, I've already populated our VM with the right secrets, but if you're following along at home or running this on your laptop, you'll want to update these yourself.
Drew Newberry:
So now that we've configured all of our secrets, we're going to start the Airflow server. Running [inaudible 00:15:03] start here in the terminal is going to boot up Redis, Postgres, and a bunch of other support services that are required to run Airflow.
Drew Newberry:
[inaudible 00:15:20] take a few more seconds here. And then as soon as this loads, we'll jump over into the Airflow UI and we'll talk about how that all works.
Drew Newberry:
Just a few more seconds here, I think. Let's see if that worked. Perfect. Okay. So you log in, and the first screen that you're presented with is a set of DAGs. A DAG is the core unit of work in Airflow. In computer science, DAG stands for Directed Acyclic Graph: a set of vertices connected by directed edges, with no cycles. In Airflow, a DAG is just a data pipeline. So clicking into one of our pipelines, this Airbnb bookings pipeline, this is what we're going to build today. You can see a DAG is made up of any number of steps, or tasks in Airflow terminology. Each task that you see down here represents a single discrete step in the pipeline. Each task has a set of inputs and a set of outputs, and these are used to define an ordered set of dependencies.
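(To make the TaskFlow terminology concrete, here is a minimal, self-contained sketch of a TaskFlow-style DAG with three chained tasks. The task names and return values are placeholders, not the webinar repo's actual pipeline code.)

```python
# Minimal TaskFlow sketch: three tasks whose return values define the dependency order.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval=None, start_date=datetime(2021, 9, 1), catchup=False)
def example_esl_pipeline():
    @task()
    def extract() -> str:
        # Pretend we extracted features and stored them somewhere; return a reference.
        return "raw_features.csv"

    @task()
    def synthesize(source_key: str) -> str:
        # Pretend we trained a model and generated synthetic records.
        return f"synthetic_{source_key}"

    @task()
    def load(synthetic_key: str) -> None:
        # Pretend we shipped the synthetic output to its destination.
        print(f"uploading {synthetic_key}")

    # Chaining the calls is what builds the DAG: extract -> synthesize -> load.
    load(synthesize(extract()))


example_esl_pipeline_dag = example_esl_pipeline()
```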
Drew Newberry:
So in the case of our pipeline, we have three steps here. We've got our extract step, extract features. We have our synthesize step, generate synthetic features. And we've got our load step, upload synthetic features. So this is referencing back to that ESL pipeline pattern that Danielle was describing. [inaudible 00:16:48] what the pipeline looks like from the UI, let's jump back into some code and we'll see how it's defined from Python. So looking at our repo, we have a folder here called DAGs. This Airbnb user bookings.py file, this defines the pipeline that we saw in the UI. So what we're going to do is, we're just going to kind of read through it top down, and I'll point out some interesting lines of code.
Drew Newberry:
So the first three lines that we want to point out here, starting on line 15: Gretel, Postgres, and S3. Each of these variables represents a reference to one of our external services, and you can see they instantiate a hook. A hook is an Airflow concept used for managing the lifecycle of an external service. If you'll remember, we defined all of these connection secrets in our previous step. The hooks read in those connection secrets and then expose easy to use APIs for interacting with the underlying service.
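(A sketch of that hook setup might look like the following. The connection IDs, the GretelHook import path, and its constructor signature are assumptions; the real values live in the webinar repo and the env file configured earlier.)

```python
# Hook setup, roughly as Drew describes. Connection IDs and the GretelHook
# import path are assumptions.
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.postgres.hooks.postgres import PostgresHook

from hooks.gretel import GretelHook  # custom hook shipped in the repo's plugins folder

gretel = GretelHook("gretel_default")                    # Gretel API key + project
postgres = PostgresHook(postgres_conn_id="booking_db")   # seeded bookings database
s3 = S3Hook(aws_conn_id="s3_default")                    # intermediate / destination bucket
```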
Drew Newberry:
So the Postgres hook and the S3 hook are shipped with the official distribution of Airflow. And then our Gretel hook, we built this specifically for the webinar to make it just a little bit easier to interact with our services. We won't go into too much detail about how the Gretel hook works, but if you're interested, it's in this repo under plugins, hooks, Gretel. And you can see here we've got this GretelHook class. So if you already have Airflow in your infrastructure and you want to use Gretel, you can just grab this file and bring it into your code base in order to interact with our APIs.
Drew Newberry:
So jumping back into our pipeline. So now we'll get into the meat of how that pipeline is defined. This first outer method here, Gretel synthetics, Airbnb bookings, this represents the DAG. So if we head back over to our UI, you can see Gretel synthetics, Airbnb bookings, this is how we actually define that pipeline. We then have three inner methods inside of here, extract features, generate synthetic features, and upload synthetic features. These names should all sound familiar, these are the three steps in our pipeline.
Drew Newberry:
So this first step, extract features, is going to read in a SQL file and then execute that SQL query against our Postgres database. The SQL file here, we've got it stored in the SQL directory; this is our feature extraction query. We're going to create a table called session features by user, and this is just a roll-up type table that summarizes various metrics from our original booking database. These roll-ups might be used as input to predictive models, or maybe to report on various business metrics or KPIs. So the output here will have a user ID, age, gender, some specific details about the user, and then some aggregated metrics at the end as well.
Drew Newberry:
So heading back to the pipeline, we're running that query on our database, we're streaming those results back to disk, and then we're uploading them to S3. This s3.load_file call is from that hook we configured above, and it's going to upload the raw set of features to an intermediate S3 location. The task then returns the key to those features so they can be picked up in the next downstream pipeline step.
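(Putting those pieces together, a hedged sketch of the extract step could look like this, reusing the postgres and s3 hooks from the earlier sketch. The simplified roll-up query, the column names, the output key, and the bucket name are assumptions rather than the repo's exact code.)

```python
import tempfile

from airflow.decorators import task

# Simplified version of the roll-up described above: per-user details plus a few
# aggregated session metrics. The real SQL file in the repo is more involved and
# these column names are assumptions based on the Kaggle schema.
FEATURE_QUERY = """
SELECT
    u.id AS user_id,
    u.age,
    u.gender,
    u.signup_method,
    COUNT(DISTINCT s.action) AS unique_actions,
    COUNT(*)                 AS total_actions,
    SUM(s.secs_elapsed)      AS total_secs_elapsed
FROM users u
JOIN sessions s ON s.user_id = u.id
GROUP BY u.id, u.age, u.gender, u.signup_method
"""


@task()
def extract_features(sql_query: str = FEATURE_QUERY) -> str:
    key = "booking_features.csv"  # placeholder intermediate key
    with tempfile.NamedTemporaryFile(mode="r+", suffix=".csv") as tmp_csv:
        # Stream the query results straight from Postgres into a local CSV file.
        postgres.copy_expert(
            f"COPY ({sql_query}) TO STDOUT WITH CSV HEADER", tmp_csv.name
        )
        # Push the raw feature set to an intermediate S3 location and return the
        # key so the downstream synthesize step can find it.
        s3.load_file(filename=tmp_csv.name, key=key, bucket_name="example-bucket")
    return key
```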
Drew Newberry:
So the middle step we have here is generate synthetic features. It takes our input data source, which is the key passed from the previous step, and then this function is going to actually generate the synthetic data. The first thing we do is get a reference to our project. So gretel here, that's the Gretel hook we configured above, and get_project, that's going to get the project that we configured in our environment variable. So [inaudible 00:20:29] Airflow pipeline. With the project, the next step is to actually configure our model.
Drew Newberry:
So there are two parameters that the model takes, model config and data source. So model config, we're just going to use the default set of synthetic configurations that we provide. Depending on what your data looks like the shape or the characteristics, there might be a different config that leads to a better synthetic model. But for our data set that we're building here, the default config does pretty well. And if anyone's interested in learning more about these configs, just ask us at the end and we can point you to some documentation. And then the second parameter here, data source, this just takes the data source from the previous step, and it uploads it to our cloud.
Drew Newberry:
So the next step, now that we've configured the model, is to actually submit it. Calling model.submit_cloud is going to send the config and the data source up to our cloud, we'll provision a GPU worker, and then we'll actually train the model and generate a set of synthetic records from it. So this next line, line 78, poll model. What this is going to do is just block the function. So while the model is training and while the records are generating, this task is just going to wait until that's done. We then double check that the model was successful, and then we're going to return a link to the set of synthetic features. So this artifact link here: we've trained the model, we've run the model, and we want to return a link, which is just a signed HTTP URL, to download that set of features in our next step.
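(As a sketch, the synthesize task can follow the Gretel Python SDK pattern Drew is describing, reusing the gretel and s3 hooks from earlier. The get_project helper, the download step, and the bucket name are assumptions; the config name and artifact key follow Gretel client conventions, but treat the exact strings as assumptions as well.)

```python
from airflow.decorators import task
from gretel_client.helpers import poll


@task()
def generate_synthetic_features(data_source: str) -> str:
    # Look up the Gretel project configured in our environment via the hook.
    project = gretel.get_project()

    # Pull the extracted features down from the intermediate S3 location so the
    # SDK can upload them to Gretel Cloud as the training data source.
    local_features = s3.download_file(data_source, bucket_name="example-bucket")

    # Configure the model: the default synthetics template plus our training data.
    model = project.create_model_obj(
        model_config="synthetics/default",
        data_source=local_features,
    )

    # Send the config and data up to Gretel Cloud; a GPU worker trains the model
    # and generates synthetic records. poll() blocks until the job finishes.
    model.submit_cloud()
    poll(model)

    # Return a signed URL to the generated records for the downstream load step.
    return model.get_artifact_link("data_preview")
```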
Drew Newberry:
So we're on to our last step in the pipeline here, upload synthetic features. This takes a data set as an input. Remember, we've got that signed HTTP link to our synthetic set of features, so that's going to be passed into this function. We're going to download those features to our Airflow worker, and then we're going to re-upload them to the remote S3 destination here. So you've seen now how we've composed or built up each step in the pipeline; the next step is to wire them together. So how do we organize them into that DAG that we were referencing before?
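(And a sketch of that final load step, again reusing the s3 hook; the output key and bucket name are placeholders.)

```python
import tempfile
import urllib.request

from airflow.decorators import task


@task()
def upload_synthetic_features(data_set: str) -> None:
    # data_set is the signed HTTP link returned by the previous task.
    with tempfile.NamedTemporaryFile(mode="r+", suffix=".csv") as synth_csv:
        # Pull the synthetic records down to the Airflow worker...
        urllib.request.urlretrieve(data_set, synth_csv.name)
        # ...then push them to the destination bucket for downstream consumers.
        s3.load_file(
            filename=synth_csv.name,
            key="booking_synthetic_features.csv",
            bucket_name="example-bucket",  # placeholder destination bucket
            replace=True,
        )
```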
Drew Newberry:
So these next three lines here are how we actually build the DAG and wire everything together. The first thing we do is call extract features, and we pass in the SQL file that we were talking about before. The output is going to be the key to the raw set of features. We pass that key into generate synthetic features, which is going to generate our model, and then we upload those features using that signed artifact link. So if you were to follow the path down of all these method calls, you're going to wind up with a DAG or a map that looks really similar to what we see on the screen. Our first step is extract features and then we follow that all the way down. So now that we've seen the code, let's run it. On the top right of the screen, there's this play button; we're going to say Trigger DAG, and this is going to start our pipeline.
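(The wiring Drew just described, expressed against the simplified task sketches above; the real DAG passes the SQL file path into the extract step.)

```python
# Inside the @dag-decorated function, chaining the calls is what produces the
# three-step graph shown in the Airflow UI.
feature_key = extract_features()                           # extract -> S3 key
synthetic_link = generate_synthetic_features(feature_key)  # synthesize -> signed URL
upload_synthetic_features(synthetic_link)                  # load -> destination bucket
```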
Drew Newberry:
So the first thing we're going to do is extract features; let's jump into the logs here and take a look. You can see these logs are a little bit hard to read, so I'll try to highlight the interesting tidbits and map them back to the code we just walked through. The first thing we do is run the SQL file against the database. Let's refresh to get the latest results here. You can see right here, we got the results, and now we're uploading those results to our intermediate S3 location under this booking features CSV file. So now that we have those results, let's jump to the next step in our pipeline, generate synthetic features. This is where we're going to configure the model and then train it up in Gretel's cloud. You can see we've started configuring the model. This line right here maps back to our model configuration line, where we pass in our default synthetic config as well as our training input.
Drew Newberry:
So after we've configured the model, we're going to submit it. You can see here we've submitted the model to our cloud, and then you'll see we get this big JSON response back from our API. You can see here the synthetic params. These are the parameters that we provided with that default synthetic config; if you need to override those or have a different type of data set, these are all levers that you can tune or pull to get better results from your data.
Drew Newberry:
So scrolling down just a little bit, you can see we've created the model, a worker has picked up that model. So we've provisioned the GPU for you. We've loaded that training set into the worker. And let's refresh to see how we progressed. Awesome. Scroll back up to where we were. So we've loaded the data set and now we're training the model. So this is going to go on for just a little while here. But what we're going to do, actually before we do that, so now that it's in our cloud, let's jump back over to the console and we can see the model training. I have a couple models that I pre-trained for this webinar. But this active model right here, this is the model that we just launched from Airflow. So we've uploaded the data set, we validated the configuration. So this was our default synthetic params. We started a worker, and now we're training the model. So now we're all the way up to 44 epochs.
Drew Newberry:
So I won't have everybody sit here and watch while we train the model, though it is kind of fun to watch each one of these epochs progress. What we're going to do is head back to the Airflow pipeline. And let's take a look at a previous run already completed. So we're going to look back in the logs for our synthetic features task. You can see here, this should look familiar. We got the response back from our API, we started training the model. So training, training, training, [inaudible 00:26:30] it here. So we got up to 99 epochs. So we've trained the model. And now we're going to start generating records from that model. So we're looking for 5000 records, which is the same number of records that we extracted from our feature query. So it's generating those records. And here at the end, got 5000 records back. So now that we have that set of synthetic records, we want to generate a link so that we can download that artifact into our worker.
Drew Newberry:
So right here, towards the bottom, you can see the return value, this link here. This is a signed, authenticated artifact link that we can use to download the features into the worker. So going back here, we're on to our last step now. We've got a link to download our synthetic features, and we're now going to upload those to the remote S3 destination. So taking a look at the logs for load, you can see here we got a link to that synthetic data set; this is that signed HTTP link. We're going to download it to the worker and then upload it back to the remote S3 bucket. So you can see here, "uploaded data set to", and we've got this booking synthetic features CSV file. So now that we've seen the pipeline end to end, let's actually go look at what the inputs and the outputs of the pipeline were.
Drew Newberry:
So heading over to our S3 bucket. Alright, we've got the original data set here. And we've got our synthetic features here. So let's download both of these. And then we'll take a look.
Drew Newberry:
Let's close this and set our editor up here. Alright, so on the left hand side we have our synthetic features, and on the right hand side we have the original data set. Just looking at the columns here, we have user ID, gender, so these are all the results from the feature query we wrote. We've got some of our aggregated metrics, such as the number of unique actions taken or the total number of actions taken. If you compare the two, the data sets look really similar, but the one on the left here, the synthetic data set, is not the same as the one on the right. They share the same statistical properties, but no individual records from the original are actually shown here. So now that we've seen what this pipeline looks like end to end, and we've seen the inputs and the outputs, let's quickly head back over to our map here and recap what we talked about.
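(If you want to reproduce this side-by-side comparison without an editor, a quick sketch with pandas works too; the file names below are placeholders for whatever you downloaded from the bucket.)

```python
import pandas as pd

# File names are placeholders for the two CSVs pulled from the S3 bucket.
original = pd.read_csv("booking_features.csv")
synthetic = pd.read_csv("booking_synthetic_features.csv")

# Same columns and a similar row count are expected, but no row-level overlap.
print(original.columns.equals(synthetic.columns), len(original), len(synthetic))

# Compare summary statistics of the numeric columns to see how closely the
# synthetic data tracks the original distributions.
print(original.describe().round(2))
print(synthetic.describe().round(2))
```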
Drew Newberry:
So we started with a raw data set in PostgreSQL, we extracted a set of features from that data set, and we used those features as input to train our synthetic model using Gretel Cloud. We then generated a set of synthetic records from that model and uploaded those back to a remote destination, where they can be accessed by any downstream model or any sort of consumer. So by using synthetic data instead of the raw data source, we reduced the risk of compromising any individual's privacy, but we still have a data set that's representative of the original. So that's all I have for you for the demo here. If you're interested in running this on your own, we have a repo, Gretel Airflow pipelines. You can check it out and run it yourself; just by following this README, you should be able to replicate what we just walked through. So with that, I'm going to stop sharing and hand it over to Danielle for Q&A.
Danielle Beringer:
Awesome. Thanks, Drew. So we had a question in the Q&A; anyone is welcome to submit a question. The first one was whether we're going to offer a recording of this session, and yes, we will be offering the full recording. We're really excited to make this something that people can come back to and that you can share with your teams. We're also going to be discussing the resources after the Q&A. So I'm going to pause for a second, and if any of the participants want to drop a question in the Q&A or in a direct chat to Drew, I will be happy to answer it.
Danielle Beringer:
While we're waiting, I did tee up a couple of topics that we commonly hear about from customers and some of our beta advocates, the developers that are actively using our classification, our transformation and our synthetics. We received a question around test management tools, things that generate test data for lower environments like UAT, user acceptance testing, or for unit testing, and what the difference between synthetic data and that type of data might be. I think there are four main things we would highlight where synthetic data has an edge. The first is our smart seeding technology, where you get to choose the columns or features that you would like to have synthetics generated for. So if you have a data set that has 10 columns, and four of those columns hold sensitive internal company data, whether that's PII or something else, you can choose which columns you want to synthesize. You can also synthesize to any volume, so if you need a certain data set size, you may be looking to feed early development cycles for your teams doing software development or mobile application development or really anything.
Danielle Beringer:
The other item I would say is really important is correlation. We maintain the correlations, the relationships, and the referential integrity in your data. You generally can't do that with a lot of high volume testing tools, because really they're just obfuscating the columns. We also want to make it very clear that our synthetics are irreversible. Even if you were to remove columns with an anonymization or obfuscation tool, without differential privacy those methods could still allow someone to reverse engineer and actually deduce the identity of some of the original data elements. The differential privacy that we use ensures that doesn't happen, and that's a big distinguishing factor in using synthetic data over test management type tools.
Danielle Beringer:
And then obviously, the volume and the choice. Instrumentation by API for privacy allows you to put calls to our APIs anywhere. It could be with your data architecture group, or with a group that's doing POCs or maybe doing demos with your own customers. You often have a need for data, and you certainly can generate that manually, but I don't think it comes with some of the privacy guarantees that we offer within our tooling. Let me see, we've got another question.
Danielle Beringer:
So we have a question around data and access to the control plane, the control plane being what's running in the Gretel Cloud that basically allocates the calls for the APIs. Let me just read the question here. The question is, could you firewall off the data ingestion endpoints from the control plane endpoints? And this is really around restricted data sets. And yes, they are completely separate. We can definitely share some follow up information; it's available in our documentation, in our GitHub and in other places that talk about the control plane, which handles the allocation of metrics and metadata, versus the data itself. They are completely separate. The question really was around security and the firewall, but they are completely separate.
Drew Newberry:
I would also add that we don't actually have to ingest the data ourselves. If you've got your own data set somewhere that's restricted, you can use our on premise workers, and as long as that data set is addressable and available to those workers wherever you launch them, you don't even have to go through our services for it.
Danielle Beringer:
And there's also a common question we get: what are some of the use cases for synthetics? We've seen machine learning teams and data scientists that need to train models, often early in a project where they may not know the full data set and all of the feature engineering is not complete. That's a great way to use synthetics to stub in some data, like if they're doing a POC, or maybe they want to do some gaming to identify machine learning model pathways or a roadmap to a more complex product; that's very common. Obviously, we discussed seeding in lower environments, so people that may have a non prod or a stage environment. Perhaps there are users within an ecosystem, within a company, that are not authorized to have actual production data; synthetic data could be used as a substitute. It could also be used to balance a biased dataset. So some healthcare companies and financial organizations that want to simulate the fringe cases, or places where they do not have full representation from an entire demographic, can use synthetics to fill in the blanks and give them a richer data set to reach insights faster as well.
Danielle Beringer:
We've got another question. Let me... [inaudible 00:35:52] on premise, if there is a way to protect yourself from sending data to your cloud via firewall. So Michael, on your question around the firewall, I'd love to follow up with you offline. Like I said, there's the control aspect, the REST API being called, versus the data itself, which, like Drew mentioned, is completely contained in a Docker container. You could run that physically on premise, like in a data center, in a virtual private cloud, or in whatever way you would be running it, and it can be completely independent. And let me see if we've got any more.
Danielle Beringer:
How would we extend this pipeline to work with multiple tables? So Alex, one of our founders, posted that. I think that... Where did you get that question, actually? So, normalized data versus denormalized data. There are two different... Well, there are three different aspects that we would hope to address for people that want to use synthetic data in a complex ecosystem where they have a complex schema. You have the transactional data itself, and then you often have the ETL process between your transactional data and your operational data stores, data warehouses, etc., that are used for analytics, machine learning models and data science. One of the things we would hope to maintain is not only the correlation within a single data set, but also, if you denormalize the data, how do we encourage or help our customers to get that back into their ecosystem, to re-ingest it. So you have basically an ETL flow, you have an ESL flow like we talked about today, and then there would be some additional steps to get it back into the ecosystem so it could be used by a broader audience; we call that operationalizing the synthetics. I think we're actively working on that.
Danielle Beringer:
And part of being open source, with our fantastic community, is really looking at our customers and their partners and how they're using the synthetics, to determine the best way for us to offer support to get that data into a consumable state, especially if you have multiple tables. There are obviously performance implications, but I think that's something we actively want to do: maintain the lineage of the data. And that ultimately would be reflected, I think, in our synthetic quality. So we have a... I wanted to share my screen and show one of the... One second here.
Danielle Beringer:
Sorry, to make this a little bit easier here. This Gretel synthetic report has something that I alluded to in my earlier comments around the data sharing use cases. To Alex's question or point, if we have customers with complex data relationships between tables, who have done feature engineering that requires correlation across multiple tables, it comes down to: not only is the data, once it's synthesized, appropriate for the use at hand, whether that's analytics or machine learning models, but also, is the privacy protection level intact, and what shape is it in for data sharing?
Danielle Beringer:
So I think this whole area is something really interesting that we'll continue to pursue and look at, based on customer feedback and based on how we see people operationalize it: going from casual use of synthetics and just starting to get comfortable with privacy engineering as a discipline, which, again, means ensuring that privacy considerations are designed in at the very beginning. So during data curation and early logical and physical data modeling and data architecture work, during the time when data scientists might be doing feature engineering and really identifying what's best for their teams, and for analytics teams that may need curated data that is raw, staged and aggregated, we can enhance each of those steps with synthetics. I think that's one of the biggest benefits of Gretel synthetics: you can use the APIs wherever you see fit within your software development lifecycle. And that also applies to the classification that we anticipate would happen up front.
Danielle Beringer:
If your data isn't classified and labeled, you may not be able to make the correct choices for what you may want to synthesize. Then the transformation itself, or anonymization, is key, and that is reversible. So the opportunity to have synthetics, which are not reversible, alongside transformations, which are reversible, gives you that variety, which ultimately, I think, would be a huge benefit.
Danielle Beringer:
So I hope you'll join us for some classification and transformation webinars that we plan to do. I think that would be a great thing for us to cover: how do you use the three capabilities as complementary to one another. As one example, I think Drew did a great job today covering synthetics and the data pipeline. I just wanted to see, Drew, did you have anything else you wanted to share from your experience in building this workshop, around working with the Airflow community and some of the things people are challenged with around privacy?
Drew Newberry:
Yeah, I think one of the great things about working with Airflow is just the breadth and depth of the community. For any sort of data service that you have in your existing infrastructure, there's probably an Airflow plugin for it, so it makes integrating any number of different services into a single unified pipeline really, really easy. I would say their APIs are really easy to work with, and that made it really easy for us to integrate our APIs in. And what's great, once you have this base set of primitives configured, is that it makes it really easy to compose them into more advanced pipelines. So, Michael, for the example that you're asking about, how do I do transformation as a pre-processing step to my synthetic data generation job? It's really easy to do that with our APIs, because that's just a new config that you need to add to the pipeline. You add that as a pre-processing step, and then you can just continue with the pipeline as it's already configured. So it's really easy to plug and play different components with Airflow.
Danielle Beringer:
And looking back at Alex's question, I think I may have misread it. The answer I gave was very specific to maintaining correlations across tables, but his question really was, how do we extend this particular pipeline? So let's say we take this Airflow pipeline and you want to enhance it with additional ETL to work with multiple tables. Maybe, Drew, you could speak a little bit about that.
Drew Newberry:
Yeah, that's a great point to elaborate on. If you look at the beginning of the pipeline, as we strung everything together, the very first input to our extraction is a SQL file. So what's neat is the only thing you have to do to add more tables is create a new pipeline. You can use the same code and the same patterns; you write a new SQL file or a new SQL query, whether that's for features or just pulling data out of an individual table, and then you just rerun that pipeline for whatever other table or query you have. So the pattern stays the same. It's rinse and repeat, and you can add it for however many sets of features or tables you want to generate synthetic data for.
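(One hedged way to express that rinse-and-repeat pattern in code is to generate one DAG per SQL file, reusing the task functions sketched earlier. The directory path, DAG naming, and the globals() registration trick are assumptions, not code from the webinar repo.)

```python
from datetime import datetime
from pathlib import Path

from airflow.decorators import dag

SQL_DIR = Path("/opt/airflow/dags/sql")  # placeholder location for the query files

for sql_file in SQL_DIR.glob("*.sql"):
    dag_id = f"gretel_synthetics_{sql_file.stem}"
    sql_text = sql_file.read_text()

    @dag(dag_id=dag_id, schedule_interval=None, start_date=datetime(2021, 9, 1), catchup=False)
    def synthesize_table(query: str = sql_text):
        # Same three steps as before, parameterized by the SQL for this table.
        features = extract_features(query)
        synthetic = generate_synthetic_features(features)
        upload_synthetic_features(synthetic)

    # Register each generated DAG under its own module-level name so Airflow finds it.
    globals()[dag_id] = synthesize_table()
```

In practice you would also want to parameterize the intermediate and destination S3 keys so that different tables don't overwrite each other's output.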
Danielle Beringer:
Awesome. Okay. Well, I'd like to pull up our resources page. We've got a few minutes left, and we certainly can take more questions. I'm going to share my screen again. One of the things that was such a nice surprise for me when I joined Gretel is that the documentation, and the ability to work quickly and get the answers you're seeking, is really amazing. So we have documentation for the console, the CLI and the SDK. We've got links within this presentation that we'll be sharing with all of you. Obviously, if you like us, give us a star on GitHub. Our open source is there, and the transparency and the community that we provide to make privacy engineering part of everyone's development and data lifecycles is really important. So we really hope you'll dig in there.
Danielle Beringer:
We'd love for you to run the code from this workshop. Drew worked really hard on the material. I think he did a fantastic job and we've made that accessible to you. We've got some great quick start videos. So if you want small snippets of how to start doing synthetics, transformation or classification, those are also available to you. And then our blog, the blog posts are very focused on addressing some different personas. Some of our blog posts are really, really suited for data scientists. Some are for data engineers, some are for network engineers that might be interested in doing things like Fluent Bit and looking at how we optimize our own infrastructure in the cloud. I think that could be really something that gives that full holistic picture of the type of organization that Gretel is and what we hope to offer.
Danielle Beringer:
We've highlighted a couple of recent blog posts that we think might be of interest to you: one on our privacy filters, which just debuted a few weeks ago, and then some articles on reducing AI bias with synthetic data and on automated detection of PII, which can be streaming or batch. Many people are really focused on privacy as a result of GDPR and CCPA, but we like to think that there's a lot more beyond that; it's the proactive activities. Privacy engineering can give people a way to be highly proactive and get ahead of what often is just fixing after the fact, redacting data or removing data from tables. This allows you to get ahead of it, and I think that article in particular would be really of interest to you.
Danielle Beringer:
And of course, we'd love to have you join us in our Slack community. We've got a lot of developers, advocates and community members who are asking us hard questions about the problems that they're trying to solve. We are constantly improving our product and are really excited to be heading towards general availability. We hope that the time you invested with us today to learn about synthetics, with an example of a data pipeline using Airflow, was worthwhile. We also will be adding other sessions on things like Snowflake, BigQuery and other tools that I mentioned earlier; again, meeting developers, knowledge workers and engineers where they are, in the tools that they're using, and showing them how easy it is to call Gretel by API, with the comfort of knowing that privacy engineering is something that will become part of the development lifecycle.
Danielle Beringer:
With that, I want to thank you all for your time today. I hope you enjoyed it and learned a few things. We'll be following up with an email with some links: a link to the deck that we presented today and to all of the code. We hope to see you soon in our upcoming workshop series. Thank you so much for being here, and have a great day.