Innovating With FastText and Table Headers
Today we’re going to walk you through some examples of using FastText embeddings of field headers to improve Natural Language Processing (NLP) for structured data (e.g. tables, JSON, CSV). NLP can be tough in a structured environment because field values are often single words or numbers, and field headers are often abbreviated, misspelled, or unstandardized. You don’t have the luxury of full natural language sentences as context, or the many powerful techniques that now exist for exploiting that context.
As a data scientist, being able to quickly analyze and understand new datasets is of vital importance. At Gretel, we build tools that help developers and data scientists build, anonymize, and share data safely. Enabling safe data sharing starts with labeling sensitive data, and personally identifiable information (PII) has a way of creeping into unexpected places. Gretel uses machine learning models to help you find where potential PII exists.
This is where FastText embeddings of field headers come in. At Gretel, we have built a FastText model based on a large collection of schema examples to supplement our entity recognition service and help find sensitive data. Today, however, we’re going to explore alternative use cases for this model. We invite you to experiment with both the model and the code, in the hope of inspiring even more ideas.
So just what is the data behind this model? Google researchers created an attribute correlation statistics database, ACSDb, containing co-occurrence statistics of schema elements, by mining billions of HTML tables extracted from a general-purpose web crawl. We combined their data with a hefty slice of similarly obtained data from the Web Data Commons project to create a data set of roughly 12 million schema examples. Each line in our data set is a list of the field headers occurring in a particular table schema. Thus, when we build our FastText model, each field header embedding becomes a representation of the other field headers it typically co-occurs with. When two field headers show strong similarity (i.e. they consistently occur in similar contexts), the implication is that they’re interchangeable: one is a synonym, an abbreviation, or some other variant of the other.
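To make the data format concrete, here is a toy sketch of counting header co-occurrences across schemas. The three schemas are hypothetical and this is not the ACSDb pipeline or our training code; it only illustrates the idea of "one list of headers per table".

```python
from collections import Counter
from itertools import combinations

# Hypothetical schema lines: each inner list holds the headers of one table.
schemas = [
    ["name", "phone", "address"],
    ["name", "telephone", "address"],
    ["name", "phone", "email"],
]

def cooccurrence_counts(schemas):
    """Count how often each unordered pair of headers appears in the same schema."""
    counts = Counter()
    for headers in schemas:
        # Deduplicate and sort so ("a", "b") and ("b", "a") share one key.
        for pair in combinations(sorted(set(headers)), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts(schemas)
print(counts[("name", "phone")])       # headers that share two schemas
print(counts[("phone", "telephone")])  # headers that never share a schema
```

Note that "phone" and "telephone" never appear in the same toy schema, yet both appear alongside "name" and "address". Shared neighbors of this kind are what pull two header embeddings close together.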
Gensim is the king of taking state-of-the-art NLP, implementing it in a supremely efficient manner, and then making easy-to-use APIs available to everyone. Using Gensim’s version of FastText was an easy choice. Our model was built using the Gensim API, and our freely available Gretel Tools package uses the Gensim API to implement the functionality we’ll demonstrate below.
On to the fun part!
First, you must set up your dependencies. To install Gretel Tools, view the README.md file on GitHub. The code examples demonstrated below live in this notebook.
Next, load our pre-trained FastText header model. By instantiating a HeaderAnalyzer object, the module automatically downloads our FastText model from S3.
Now let’s look at some scenarios for using the model. One clear use case: you’re designing a new table schema and trying to decide which field headers to use. You envision joining your table with other sources in the future and would like to use common, popular headers. First up is a telephone number. You’re wondering whether to literally call it “telephone number” (which is a bit long), or “telephone”, or “phone#”. Using the Gensim API, you can ask the model “What are other common variations of ‘telephone’ used in schema definitions?”
In the above code example, we asked for the 30 most similar headers and then sorted them by frequency. If you play with the parameters, you’ll quickly realize there are literally hundreds of different ways to phrase this in a table context. The most frequent of the similar headers appears to be simply “phone”, thus the decision is made.
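The "sort by frequency" step can be sketched on its own. In Gensim, the candidate list would typically come from `KeyedVectors.most_similar(word, topn=30)`; where the frequency counts come from is an assumption here, and the function name `rank_by_frequency` and every number below are placeholders, not output from the Gretel model.

```python
def rank_by_frequency(similar, counts):
    """Sort (header, similarity) pairs by header frequency, most frequent first."""
    return sorted(similar, key=lambda pair: counts.get(pair[0], 0), reverse=True)

# Placeholder values for illustration only; not real model output.
similar = [("telephone_number", 0.91), ("phone", 0.84), ("tel", 0.80)]
counts = {"telephone_number": 1200, "phone": 54000, "tel": 8000}

for header, score in rank_by_frequency(similar, counts):
    print(header, counts[header], score)
```

Sorting by frequency rather than by similarity is the design choice that matters here: the closest vector is not necessarily the header most people use.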
Perhaps you misspelled “telephone” as “telepone”. Not a problem! FastText embeds not only whole words but also substrings of words (character n-grams). Thus you can ask questions about practically any word, regardless of whether it was in the training data, because its vector can be assembled from the n-grams it shares with known words.
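To see why a misspelling still lands near the right word, here is a small sketch of FastText-style character n-gram extraction. The 3 to 6 character range matches FastText’s usual defaults, but that this range was used for our model is an assumption, and `char_ngrams` is an illustrative helper rather than part of any library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word, with < and > marking its boundaries."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

correct = char_ngrams("telephone")
typo = char_ngrams("telepone")
shared = correct & typo

print(sorted(shared))
print(f"{len(shared)} of {len(typo)} n-grams in the misspelling also occur in the correct spelling")
```

The misspelled header shares n-grams such as `<tel` and `one>` with the correct spelling, so a vector built from its n-grams ends up close to the vector for “telephone”.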
In another scenario, say you’re trying to automatically assess the joinability of two tables. You have already determined that the fields you’re comparing are of the same datatype and have overlapping values. A FastText comparison of their headers could be the final bit of evidence needed.
Let’s say one table uses the header “gender” and the other uses “sex”. You can check their similarity with the following code:
That’s a decent similarity score. Another example is one table uses the header “street” and the other uses “address”:
Not bad. One more example: say one table uses the header “player” and the other uses “athlete”:
Definitely strong evidence that the columns contain joinable fields.
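The header comparisons above come down to cosine similarity between two embedding vectors, which is what Gensim’s `KeyedVectors.similarity(w1, w2)` computes. A minimal pure-Python sketch follows; the three-dimensional vectors are made up for illustration and are not taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for real header embeddings.
gender = [0.9, 0.1, 0.3]
sex = [0.8, 0.2, 0.4]
price = [-0.2, 0.9, -0.1]

print(cosine_similarity(gender, sex))    # vectors pointing in a similar direction
print(cosine_similarity(gender, price))  # vectors pointing in different directions
```

In a joinability check, a score like this would be one signal among several, alongside the datatype and value-overlap tests already mentioned.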
In yet another scenario, say a company wants to enforce a standardization policy to ensure the joinability of information across a large number of internal data sources. When designing a new table, searching through a lengthy standards document can be an arduous task, especially when the header you had in mind shares few characters with the company standard (as in the examples listed above). A quick FastText comparison between your header and the list of standards would easily narrow in on the right choice. Here’s an example of checking the header “born” against an example (tiny example!) of company standards:
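A minimal sketch of that lookup, assuming you have some function that scores two headers. With Gensim, that function could be `KeyedVectors.similarity`; here a toy lookup over made-up two-dimensional vectors stands in for it. The standards list, the vectors, and the names `closest_standards` and `toy_similarity` are all hypothetical.

```python
import math

def closest_standards(header, standards, similarity, top_n=3):
    """Score a header against every standard and return the top_n best matches."""
    scored = [(std, similarity(header, std)) for std in standards]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up vectors for illustration; a real run would use the FastText model instead.
toy_vectors = {
    "born": [0.9, 0.1],
    "birth_date": [0.8, 0.2],
    "full_name": [0.1, 0.9],
    "postal_code": [-0.5, 0.5],
}

def toy_similarity(a, b):
    """Cosine similarity between two headers' toy vectors."""
    u, v = toy_vectors[a], toy_vectors[b]
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

standards = ["birth_date", "full_name", "postal_code"]
for std, score in closest_standards("born", standards, toy_similarity):
    print(f"{std}: {score:.2f}")
```

Passing the similarity function in as an argument keeps the ranking logic independent of the model, so the same helper works with any embedding source.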
Conclusion
It’s fascinating (and also sometimes frustrating) to see the multitude of different ways that developers and data scientists will label the same kinds of data.
Today, we have looked at how FastText word embeddings can help us quickly understand new datasets, and build more consistent labels for our own data.
We’d love to hear about your use cases. Feel free to reach out to us for a more in-depth discussion in the comments, on Twitter, or at hi@gretel.ai. Follow us to keep up on the latest trends on building with your data!