Introducing Gretel Navigator (Early Preview)
Video description
An early preview of Gretel Navigator (formerly our Tabular LLM)
Read the blog post
Transcription
Speaker 1 (00:05):
Hey, I'm Alex Watson, co-founder and Chief Product Officer at Gretel. Today we are thrilled to share an early preview of something our team has been working hard on: our novel tabular foundation model that can generate powerfully useful synthetic tabular data. If you've struggled with building realistic data for pre-production testing, had to deal with dirty or missing data for data science, or have needed to generate more labeled examples for machine learning, let me show you what our model can do. Let's jump right in from the Gretel console here. I'm going to do the most basic example, something that any developer has done at some point in their life: we are going to create a mock dataset. We're going to tell the model that we're looking for certain columns to exist. We're looking for an ID field, first name, last name, email, gender, and some information about the company that these users work at.
(00:51)
So we want them to come from Foo, a company called Foo, and we have some specific information about how we want their email address to be formatted: it should use the first character of their first name plus their last name at foo.io. Finally, we're asking the model to regionalize this dataset for the French nationality. Let's click generate. Here we see the model starting to populate the table with results. We can see first names and last names coming in; looking at them quickly, they look good. They actually look like French names, which is awesome. We see the email addresses corresponding to the format that we asked for, so the first letter of the first name plus the last name, and we see these users coming from cities that are located in France. Let's move on to a more complicated example. Now I'm going to cut and paste in an example where I am prompting the model to create a CPG e-commerce dataset.
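If you prefer to script this step rather than type into the console, the prompt from the demo might be assembled roughly as below. This is a minimal sketch only: the generate_table helper is a hypothetical stand-in for however you send prompts to Gretel Navigator, not the actual Gretel SDK call.

```python
# Minimal sketch of the mock-dataset prompt from the demo.
# NOTE: `generate_table` is a hypothetical placeholder, not the real Gretel SDK API.

MOCK_USERS_PROMPT = """\
Create a mock dataset with these columns: id, first_name, last_name, email, gender, company.
Every user works at a company called Foo.
Format each email as the first letter of the first name plus the last name at foo.io
(for example, jdupont@foo.io).
Regionalize the dataset for France: French first names, last names, and cities.
"""

def generate_table(prompt: str, num_records: int = 10):
    """Hypothetical stand-in for however you send a prompt to Gretel Navigator."""
    raise NotImplementedError("Replace with your actual console/SDK/API call.")
```

In practice you would hand MOCK_USERS_PROMPT to whatever generate call your Gretel setup exposes and inspect the preview table that comes back.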
(01:38)
I'm not telling it exactly what fields I'm looking for. I'm relying on the model's ability to understand what a consumer packaged goods e-commerce dataset would look like, and then giving it some guidance. I'm hoping to see order IDs, customer names and emails, product names, and even product reviews. Let's go ahead and jump in. We'll click generate. We see the model starting to build out a relatively complex dataset from this simple prompt. Very quickly we see order IDs and customer names, and when I look at these customer names and emails, once again it looks like they're correlated. So this isn't just fake or mock data that we're putting in here. We're relying on a model that has read millions of datasets, and, as we were saying earlier, over half a trillion tokens, to know what an e-commerce dataset should look like and to fill in a dataset that matches our prompt.
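As a rough illustration of that step, the prompt might look like the string below, and a quick pandas check could verify the name/email correlation on whatever table comes back. The column names (customer_name, customer_email) are assumptions about what the model returns, not a documented schema.

```python
import pandas as pd

# Sketch of the open-ended CPG e-commerce prompt described in the demo.
CPG_PROMPT = """\
Create a consumer packaged goods (CPG) e-commerce dataset.
Include order IDs, customer names and emails, product names, quantities,
prices, and a short customer review for each order.
"""

def emails_match_names(df: pd.DataFrame) -> pd.Series:
    """Check whether each email's local part contains the customer's last name.

    Assumes `customer_name` and `customer_email` columns exist in the generated table.
    """
    last_names = df["customer_name"].str.split().str[-1].str.lower()
    local_parts = df["customer_email"].str.split("@").str[0].str.lower()
    return pd.Series(
        [ln in lp for ln, lp in zip(last_names, local_parts)],
        index=df.index,
    )
```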
(02:26)
Here we see product quantities. We see product descriptions and prices, and we even see a customer review written for each record. So we have our dataset, but next we might want to customize it, whether you're starting with one generated from scratch or you're uploading your own. Perhaps you want to augment the dataset by adding an additional column or additional fields. Here we see a customer review, and one of the first questions I had was: is it positive or negative? Let's go ahead and add a customer rating, where we're essentially asking the model to read everything else inside the dataset and fill in contextually relevant fields. We'll go into the dataset editor here, and I'm going to add a request for a customer
(03:05)
rating, and I would like this to look like an Amazon rating, so I'm going to ask for it in stars format. Now we'll click generate, and what's happening is that both our original prompt and the synthetic dataset that we've created are being sent to the model. The model is reading all of the data that already exists and using that to automatically populate the customer rating column. Here we see a set of ratings being filled in, all between one and five stars, and an initial look suggests the model is doing a really good job. We see a customer review, "Great quality and taste. My family loves it," getting a four. We see other examples here; "I was expecting more depth in the flavor" gets a rating of two. So we can see the model has essentially learned from everything it has seen before, and it's able to generate contextually relevant results. Say we want to take this a step further, building on that use case we mentioned earlier of creating a demo dataset for your application to show to users. One of the things that's really important is having something that's going to resonate with the users you're showing it to. This dataset here is obviously customized for the United States, western America, if you look at the cities and states it's coming from. So let's go and customize, or try to regionalize, this dataset to work for an entirely different part of the world.
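Before moving on to regionalization, here is a rough sketch of the augmentation step just described: the original prompt and the already-generated table are sent back together with an edit instruction, and a quick check confirms the new ratings land between one and five stars. The edit_table helper and the customer_rating column name are illustrative assumptions, not the actual Gretel API.

```python
import pandas as pd

# Edit instruction added in the dataset editor (sketch of the demo step).
ADD_RATING_INSTRUCTION = (
    "Add a customer_rating column, formatted like an Amazon star rating "
    "(an integer from 1 to 5), consistent with the text of each customer review."
)

def edit_table(instruction: str, dataset: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: send the instruction plus the existing table back to the model."""
    raise NotImplementedError("Replace with your actual Navigator edit/augment call.")

def ratings_look_sane(df: pd.DataFrame) -> bool:
    """Check that every generated rating falls between one and five stars."""
    return df["customer_rating"].between(1, 5).all()
```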
(04:22)
I'm typing in now: help me regionalize this dataset for the Japanese language. We'll go ahead and click generate.
(04:40)
So once again, the model is taking the entire dataset that we've already created and building upon it, so we're not starting from scratch; we can keep building on this iteratively until we get to the point where we're really happy with our dataset. We see the same format that we had before, including customer ratings and reviews, but it has been translated into the Japanese language. So, really exciting: here we see customer names in Kanji format, we see cities and prefectures that are from Japan, and we see the customer review has once again been translated to Japanese while matching the same star rating that we had before. So very exciting. When you're ready and you're happy with this, you have the ability to generate an entire dataset, much bigger than the preview we're looking at here. One of the really special things about the Gretel service is the ability to scale this to almost any size that you need. So here we'll ask for 10,000 records. That sounds wonderful. We'll click generate, and this will automatically be sent to the Gretel service for generation. You can either view the results inside the Gretel console or wait to get an email notification that your dataset is ready.
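A sketch of those two steps in code might look like the following: one edit instruction for the Japanese regionalization, and one batch request for the full 10,000-record dataset. Both helper functions are hypothetical placeholders for the console/SDK workflow, not documented Gretel calls.

```python
import pandas as pd

# Regionalization instruction typed into the editor (sketch).
REGIONALIZE_INSTRUCTION = (
    "Help me regionalize this dataset for the Japanese language: translate names, "
    "cities, prefectures, and customer reviews while keeping the same columns, "
    "star ratings, and overall structure."
)

def edit_table(instruction: str, dataset: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: send an edit instruction plus the existing table to the model."""
    raise NotImplementedError

def generate_full_dataset(prompt: str, num_records: int = 10_000) -> str:
    """Hypothetical stand-in: submit a batch generation job to the Gretel service."""
    raise NotImplementedError
```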
(05:44)
Let's build on our example of augmenting data, but this time, rather than starting with a dataset we've created from within the Gretel Tabular FM, let's go ahead and upload our own dataset and show the model's ability to improve datasets for machine learning and data science use cases. I'm going to upload a fitness watch dataset,
(06:00)
very popular for machine learning training. One of the ways you might use a dataset like this is to predict, for example, how many steps a user is going to take given a period or an hour of the day. One of the things that machine learning algorithms really struggle with is missing data. Here we can see examples of records where the steps, some of the kilometers, and some of the active energy burned are missing from the dataset. So we are going to prompt our model now to see if it can help us fill in those gaps. As a data scientist, when you're working with missing data, your options are either to fill in that data yourself, interpolating based on other results, or, unfortunately, often to drop the rows, because the downstream model you're using for a regression or classification task doesn't support missing fields. In this case, we're going to ask the model to do all of this work for us: help me impute missing values in the dataset below.
(06:55)
I'll click generate, and both our prompt and the dataset are sent to the Gretel service. We see the model starting, in real time, to fill in the missing values with contextually relevant results, and we'll look for correlations in the data between the synthetically generated values that go into the missing fields and the other data we can see within the dataset. For example, here we're looking at a user's heart rate, respiratory rate, steps, and distance. All of these fields are highly correlated. We can see, for example, that when a user took 1,589 steps, the model estimated that they traveled 1.27 kilometers. In another example, we'll look for something similar: with only 950 steps in the existing dataset, we can see that the user only traveled 0.76 kilometers. So we're not seeing fake data; once again, we're seeing contextually relevant data based both on the dataset that you uploaded and on the half a trillion tokens the model has been trained on. Once we're happy with our results here, you also have the ability to deploy this and activate it via the SDK.
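A rough sketch of that imputation workflow, with a sanity check inspired by the steps-versus-distance correlation mentioned above, might look like this. The column names and the impute_table helper are illustrative assumptions, not the actual Gretel SDK.

```python
import pandas as pd

IMPUTE_PROMPT = "Help me impute missing values in the dataset below."

def impute_table(prompt: str, dataset: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: send the prompt and the uploaded table to the model."""
    raise NotImplementedError

def km_per_step_is_consistent(df: pd.DataFrame, tolerance: float = 0.5) -> bool:
    """Check that imputed distances stay roughly proportional to step counts.

    In the demo, 1,589 steps mapped to ~1.27 km and 950 steps to ~0.76 km,
    i.e. roughly 0.0008 km per step, so the ratio should not vary wildly.
    Assumes `steps` and `distance_km` columns in the returned table.
    """
    ratio = df["distance_km"] / df["steps"]
    return (ratio.std() / ratio.mean()) < tolerance
```

You could load the fitness watch CSV with pandas, print df.isna().sum() to see where the gaps are, send the table along with IMPUTE_PROMPT, and then run the ratio check on whatever comes back.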
(08:00)
For our next use case, let's dive into an example where a developer needs to create a synthetic dataset based on only an API specification. Imagine you're waiting for access to an API and you just can't wait to get started, or you're trying to create additional data for a lower, pre-production testing environment. We'll start with an example here, creating Apple HealthKit format data. In this prompt we're asking the model to create synthetic HealthKit activity records matching the activity format below. We're asking for specific fields like step count, distance walking/running, and running stride length, but we're also relying on the model, really depending on the model, to dive in and read the API specification below. Here you can see the type of each variable that should be returned, as well as a narrative description of what the variable should look like. One cool thing I'll show you is that this prompt is actually built directly from the Apple Developer API docs. I can go over to the Apple HealthKit activity description; essentially all I did is cut and paste the examples there directly into a prompt for our model and give it a little guidance on what we're looking for in the output. Let's click generate and see what it does.
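A sketch of that prompt, assembled in Python, might look like the following. The field names mirror the HealthKit identifiers mentioned in the demo, and the pasted specification text is left as a placeholder rather than reproduced here.

```python
# Sketch of a prompt built by pasting the Apple Developer docs into Navigator.
# The spec text itself is omitted; paste the relevant field descriptions where indicated.

HEALTHKIT_PROMPT = """\
Create synthetic Apple HealthKit activity records matching the specification below.
Include fields such as stepCount, distanceWalkingRunning, runningSpeed,
runningStrideLength, and runningPower, with realistic, mutually consistent values.

API specification (pasted from the Apple Developer docs):
<paste the HealthKit activity type and field descriptions here>
"""
```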
(09:11)
So we can see the model was able to extract the fields we were looking for and start to populate them with values. We see things like distance walking/running in kilometer format, corresponding once again to the running speed. Another exciting thing we can look for in this dataset: at times when we have higher running speeds, a really high running speed here, let's see, here's an example with 12.3, we'd expect the running stride length to increase, and we'd also expect the running power to increase. And we see running power going all the way into the three hundreds here at a running speed of 12.5 kilometers per hour. So it's very exciting to see data being created from scratch by the model that is contextually relevant for the use case. We don't always have developer docs to work with.
(09:56)
Sometimes we're interacting with a database. Once again, we'll show an example of the flexibility of this model in dealing with different types of input data, and we are going to move over to an example using structured query language. SQL is the most popular way to interact with a database, and often, when you're waiting on access to a database, or even mocking up a database for your application for the first time, how do you get that initial data to fill it in? So here we'll do an example. We'll take a relatively complex SQL query and paste it into our prompt. Here we are asking for specific things: we see a SELECT statement, which means we're asking for certain fields from the model; we see it being requested from a patients table; and we're joining it with another table on demographics and with another table on financials.
(10:38)
So you could imagine this type of query happening in a data warehouse against an Epic health record system, for example, and we're giving it some specific context for the results coming back. We want the birthdays to be after 1980; we're looking for the healthcare expenses to be quite large, over $55,000; and we are, fortunately, looking for all of these patients to still be alive, so the death date should be set to null. We'll go ahead and click generate. There is no database behind the scenes here; essentially, the model is extrapolating what the user is asking for and bringing back a contextually relevant set of results. Here we see IDs and birth dates being populated, and we see these birth dates matching what we asked for. Each one of these birth dates corresponds to 1980 or later, so we know that part is going well.
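The query pasted into the prompt isn't shown verbatim in the transcript, but based on the description it would look roughly like the sketch below; the exact table and column names here are assumptions for illustration only.

```python
# Illustrative SQL query matching the description in the demo; table and column
# names (patients, demographics, financials, etc.) are assumptions, not the original query.
SQL_PROMPT = """\
Generate a realistic result set for the following query:

SELECT p.id, p.birthdate, d.gender, d.city, d.state, f.healthcare_expenses
FROM patients p
JOIN demographics d ON d.patient_id = p.id
JOIN financials   f ON f.patient_id = p.id
WHERE p.birthdate > '1980-01-01'
  AND f.healthcare_expenses > 55000
  AND p.deathdate IS NULL;
"""
```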
(11:22)
We can also see, further down, the healthcare expenses we asked for: all of these healthcare expenses are over $70,000, well above the $55,000 we requested, so we're meeting our requirement here. We're including patient demographic information, and all the patients are still alive. So, essentially, we've gotten back a dataset that quickly helps us prototype what a complex query to a database would look like. As you can see, Gretel's upcoming tabular foundation model is incredibly powerful. From zero-shot prompting and SQL generation to data augmentation and gap filling for machine learning, we can help you quickly build and scale complex datasets. You'll no longer have to use fake data or hand-code records. You'll be able to craft your dataset in minutes using our model playground and then use Gretel Workflows to automatically generate data whenever and wherever you need it. Everything we showed today is still under active development, but we are opening up our wait list to anyone who's interested in trying out our Tabular FM model playground, which will be launching later this year. There is still some work to be done, but we would love feedback from the community today so we can build a great product experience. Sign up for more information at gretel.ai.