My buddy recently wrote a blog post about creating text embeddings for anime descriptions. If you’re new to machine learning like me though, your first question is probably wtf is a text embedding?
The absolute simplest explanation I can give is “Turning words into numbers so that we can do math stuff on them”, but that’s pretty unsatisfying. There are a lot of ways that we could assign numbers to words, and most of them aren’t very useful. As an example let’s consider the following set of four words:
apple, banana, sun, fire
How can we turn these into numbers? One very obvious idea is to just order them alphabetically and use their indexes:
apple - 1
banana - 2
fire - 3
sun - 4
Now let’s say that we want to mathematically judge the similarity of two words. One formula we could come up with is to take the difference between the numerical versions of the words. For example, the similarity between apple and banana would be 2 - 1 = 1.
Let’s define this a bit more precisely and say that the similarity between two words A and B is the absolute value of the difference in their corresponding numbers:

similarity(A, B) = |number(A) - number(B)|

The absolute value takes care of any ordering issues, so that the difference between A and B is the same as the difference between B and A.
With this formula, we can calculate a similarity score between any two of the words. The result for banana and fire is 3 - 2 = 1, which is the same as our score for apple and banana. Clearly what we’ve come up with so far doesn’t work very well, since I think most people would judge a banana to be much more similar to an apple than to a fire.
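Here’s a quick sketch of this (bad) similarity formula in Python. The names like `index` and `similarity` are my own, not anything standard:

```python
# Assign each word its 1-based alphabetical index.
words = sorted(["apple", "banana", "sun", "fire"])
index = {word: i + 1 for i, word in enumerate(words)}

def similarity(a, b):
    """Similarity as the absolute difference of the words' indexes."""
    return abs(index[a] - index[b])

print(similarity("apple", "banana"))  # 1
print(similarity("banana", "fire"))   # 1 -- same score, which is the problem
```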
Let’s word better.
You, the reader, are clearly a master of words if you’ve made it this far, and surely we can describe these words better than just their alphabetical order. Let’s play twenty three questions:
Is it a fruit?
Can it be yellow?
Is it hot?
We’ll answer these questions in order for each of our words:
apple - yes, no, no
banana - yes, yes, no
fire - no, yes, yes
sun - no, yes, yes
Now, since all we’ve done is use more words, let’s convert this into a numerical representation. No = 0, yes = 1, and we’ll put the answers in arrays:
apple - [1, 0, 0]
banana - [1, 1, 0]
fire - [0, 1, 1]
sun - [0, 1, 1]
Depending on how much math you suffered through in your education, you might remember that we already have a convenient way to tell the similarity of two vectors (another word for an array): the dot product.
If you don’t remember this (or never learned it), don’t worry, it’s easy. Like a lot of things in math, the technical notation is much scarier than the reality:

To take the dot product of vectors A and B, multiply their elements one index at a time, and then sum up all of those results:

A · B = A[1]×B[1] + A[2]×B[2] + ... + A[n]×B[n]
Let’s do the example of apple and banana again:

apple · banana = (1×1) + (0×1) + (0×0) = 1
Banana and fire:

banana · fire = (1×0) + (1×1) + (0×1) = 1
It seems we haven’t really done a better job yet: we’re still getting the same similarity score for a banana compared to both an apple and a fire.
Before we give up, let’s take a look at fire and sun:

fire · sun = (0×0) + (1×1) + (1×1) = 2
A score of two between fire and sun! This seems promising, since the sun is literally a ball of fire.
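All of the dot products above can be checked with a few lines of Python (the `vectors` dict and `dot` helper are just my own sketch):

```python
# Each word as a vector of yes/no answers: [fruit?, yellow?, hot?]
vectors = {
    "apple":  [1, 0, 0],
    "banana": [1, 1, 0],
    "fire":   [0, 1, 1],
    "sun":    [0, 1, 1],
}

def dot(a, b):
    """Dot product: multiply elements one index at a time, then sum."""
    return sum(x * y for x, y in zip(a, b))

print(dot(vectors["apple"], vectors["banana"]))  # 1
print(dot(vectors["banana"], vectors["fire"]))   # 1
print(dot(vectors["fire"], vectors["sun"]))      # 2
```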
Next steps
How can we take this promising strategy and refine it? One idea would be to ask more questions about the words, playing a full twenty questions. Another would be to give more subtlety to our answers instead of just yes/no. For example, what if we were to add the word tomato to our list? Is it a fruit? Technically yes, but maybe we decide to give it a value of 0.5 instead of 1.
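A fractional answer slots right into the same dot product machinery. The 0.5 for tomato comes from above, but the rest of tomato’s answers here are my own guesses:

```python
# Fractional answers: tomato is only "technically" a fruit, so it gets 0.5.
# Tomato's other answers (yellow? hot?) are assumptions for illustration.
vectors = {
    "apple":  [1.0, 0.0, 0.0],
    "tomato": [0.5, 0.0, 0.0],
    "fire":   [0.0, 1.0, 1.0],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(dot(vectors["tomato"], vectors["apple"]))  # 0.5 -- somewhat fruit-like
print(dot(vectors["tomato"], vectors["fire"]))   # 0.0 -- not fire-like at all
```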
Regardless of our final strategy, what we are doing is creating text embeddings for our list of words. The vectors we created are numerical representations that attempt to produce mathematical (so that a dumb computer can do it) similarity results that closely align with our human intuition of the similarity between words.
When successful, we can start to do some pretty cool things with the results. One of the simplest is to build a synonym generator. Given a word, we can look at all of the others to see which is the most mathematically similar, and potentially find a synonym that is more specific to what you want to describe, or just sounds fancier.
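A bare-bones version of that synonym generator is just "find the other word with the highest dot product." This is a sketch under our toy vectors, not a real synonym engine:

```python
def most_similar(word, vectors):
    """Return the other word whose vector has the highest dot product with `word`'s."""
    target = vectors[word]
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: sum(x * y for x, y in zip(vectors[w], target)))

vectors = {
    "apple":  [1, 0, 0],
    "banana": [1, 1, 0],
    "fire":   [0, 1, 1],
    "sun":    [0, 1, 1],
}
print(most_similar("fire", vectors))  # sun
```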
But why?
Good question, no one actually cares about synonyms. AI is probably writing your essays anyways. With all this AI though you’ve probably had more time to watch anime, and you need some new recs. In the next post I’ll describe how you can make your own anime recommendation engine using text embeddings! Subscribe and don’t miss it!