Wednesday, July 24, 2024

Text Embeddings: Comprehensive

As human beings, we can read and understand texts (at least some of them). Computers, by contrast, “think in numbers”, so they can’t automatically grasp the meaning of words and sentences. If we want computers to understand natural language, we need to convert this information into a format that computers can work with: vectors of numbers.

People learned how to convert texts into machine-understandable format many years ago (one of the first versions was ASCII). Such an approach helps render and transfer texts but doesn’t encode the meaning of the words. At that time, the standard search technique was a keyword search when you were just looking for all the documents that contained specific words or N-grams.

Then, decades later, embeddings emerged. We can calculate embeddings for words, sentences, and even images. Embeddings are also vectors of numbers, but they can capture meaning. So, you can use them to do a semantic search and even work with documents in different languages.

In this article, I would like to dive deeper into the embedding topic and discuss all the details:

what preceded the embeddings and how they evolved,
how to calculate embeddings using OpenAI tools,
how to define whether sentences are close to each other,
how to visualise embeddings,
and, the most exciting part, how you could use embeddings in practice.

Let’s move on and learn about the evolution of embeddings.

Evolution of Embeddings

We will start our journey with a brief tour into the history of text representations.

Bag of Words

The most basic approach to converting texts into vectors is a bag of words. Let’s look at one of the famous quotes of Richard P. Feynman: “We are lucky to live in an age in which we are still making discoveries”. We will use it to illustrate the bag of words approach.

The first step to get a bag of words vector is to split the text into words (tokens) and then reduce words to their base forms. For example, “running” will transform into “run”. This process is called stemming. We can use the NLTK Python package for it.

from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

text = 'We are lucky to live in an age in which we are still making discoveries'

# tokenization - splitting text into words
words = word_tokenize(text)
print(words)
# ['We', 'are', 'lucky', 'to', 'live', 'in', 'an', 'age', 'in', 'which',
#  'we', 'are', 'still', 'making', 'discoveries']

stemmer = SnowballStemmer(language = "english")
stemmed_words = list(map(lambda x: stemmer.stem(x), words))
print(stemmed_words)
# ['we', 'are', 'lucki', 'to', 'live', 'in', 'an', 'age', 'in', 'which', 
#  'we', 'are', 'still', 'make', 'discoveri']

Now, we have a list of base forms of all our words. The next step is to calculate their frequencies to create a vector.

import collections
bag_of_words = collections.Counter(stemmed_words)
print(bag_of_words)
# {'we': 2, 'are': 2, 'in': 2, 'lucki': 1, 'to': 1, 'live': 1, 
# 'an': 1, 'age': 1, 'which': 1, 'still': 1, 'make': 1, 'discoveri': 1}

Actually, if we wanted to convert our text into a vector, we would have to take into account not only the words we have in the text but the whole vocabulary. Let’s assume we also have “i”, “you” and ”study” in our vocabulary and let’s create a vector from Feynman’s quote.
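Here is a minimal sketch of that step. It assumes a toy vocabulary made of the stems from the quote plus the three extra words; the stem `studi` for “study” is my assumption about what the Snowball stemmer returns. Each position in the vector corresponds to one vocabulary word, and the value is how many times that word occurs in the text.

```python
from collections import Counter

# Stemmed tokens of the quote, copied from the stemming output above
stemmed_words = ['we', 'are', 'lucki', 'to', 'live', 'in', 'an', 'age', 'in',
                 'which', 'we', 'are', 'still', 'make', 'discoveri']

# Toy vocabulary (assumption): stems from the quote plus "i", "you" and "studi"
vocabulary = sorted(set(stemmed_words) | {'i', 'you', 'studi'})

counts = Counter(stemmed_words)
# One slot per vocabulary word; words missing from the text get 0
bow_vector = [counts[word] for word in vocabulary]

print(vocabulary)
# ['age', 'an', 'are', 'discoveri', 'i', 'in', 'live', 'lucki', 'make',
#  'still', 'studi', 'to', 'we', 'which', 'you']
print(bow_vector)
# [1, 1, 2, 1, 0, 2, 1, 1, 1, 1, 0, 1, 2, 1, 0]
```

With a realistic vocabulary, almost all slots would be zero, which is the sparsity issue discussed later in this section.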

This approach is quite basic, and it doesn’t take into account the semantic meaning of the words, so the sentences “the girl is studying data science” and “the young woman is learning AI and ML” won’t be close to each other.

TF-IDF

A slightly improved version of the bag of words approach is TF-IDF (Term Frequency-Inverse Document Frequency). It’s the product of two metrics.

Term Frequency shows the frequency of the word in the document. The most common way to calculate it is to divide the raw count of the term in this document (like in the bag of words) by the total number of terms (words) in the document. However, there are many other approaches like just raw count, boolean “frequencies”, and different approaches to normalisation. You can learn more about different approaches on Wikipedia.

Inverse Document Frequency denotes how much information the word provides. For example, the words “a” or “that” don’t give you any additional information about the document’s topic. In contrast, words like “ChatGPT” or “bioinformatics” can help you define the domain (but not for this sentence). It’s calculated as the logarithm of the ratio of the total number of documents to those containing the word. The closer IDF is to 0 — the more common the word is and the less information it provides.

So, in the end, we will get vectors where common words (like “I” or “you”) will have low weights, while rare words that occur in the document multiple times will have higher weights. This strategy will give a bit better results, but it still can’t capture semantic meaning.
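To make the two definitions concrete, here is a small from-scratch sketch. The three tiny “documents” are made up for illustration, and the code uses the plain variants described above (relative term frequency and an unsmoothed logarithmic IDF). Libraries often apply smoothing and normalisation, so their numbers can differ.

```python
import math
from collections import Counter

# Hypothetical corpus of three already-stemmed documents
documents = [
    ['we', 'are', 'lucki', 'to', 'live', 'in', 'an', 'age'],
    ['we', 'studi', 'data', 'scienc'],
    ['we', 'live', 'in', 'a', 'data', 'age'],
]

def tf(term, doc):
    # Raw count of the term divided by the number of terms in the document
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    # log(total documents / documents containing the term);
    # assumes the term occurs in at least one document
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf('we', documents[0], documents))
# 0.0 -> "we" is in every document, so it carries no information
print(round(tf_idf('lucki', documents[0], documents), 4))
# 0.1373 -> "lucki" appears in only one document
```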

The other challenge with this approach is that it produces pretty sparse vectors. The length of the vectors is equal to the size of the vocabulary. There are about 470K unique words in English (source), so we will have huge vectors. Since a sentence won’t have more than 50 unique words, 99.99% of the values in vectors will be 0, not encoding any info. Looking at this, scientists started to think about dense vector representation.

Word2Vec

One of the most famous approaches to dense representation is word2vec, proposed by Google in 2013 in the paper “Efficient Estimation of Word Representations in Vector Space” by Mikolov et al.

There are two different word2vec approaches mentioned in the paper: Continuous Bag of Words (when we predict the word based on the surrounding words) and Skip-gram (the opposite task — when we predict context based on the word).

Figure from the paper by Mikolov et al. 2013 | source

The high-level idea of dense vector representation is to train two models: encoder and decoder. For example, in the case of skip-gram, we might pass the word “christmas” to the encoder. Then, the encoder will produce a vector that we pass to the decoder expecting to get the words “merry”, “to”, and “you”.
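The training data for skip-gram is simply a list of (word, context word) pairs taken from a sliding window over the text. The sketch below only illustrates how such pairs could be generated; the helper `skipgram_pairs` and the window size are my own illustrative choices, not code from the paper.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token and its neighbours."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = ['merry', 'christmas', 'to', 'you']
pairs = skipgram_pairs(tokens, window=2)

# Context words the model should learn to predict for "christmas"
christmas_context = [context for center, context in pairs if center == 'christmas']
print(christmas_context)
# ['merry', 'to', 'you']
```

Continuous Bag of Words uses the same windows but flips the direction: the context words are the input and the centre word is the target.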

This model started to take into account the meaning of the words since it’s trained on the context of the words. However, it ignores morphology (information we can get from the word parts, for example, that “-less” means the lack of something). This drawback was addressed later by looking at subword skip-grams in fastText.

Also, word2vec was capable of working only with words, but we would like to encode whole sentences. So, let’s move on to the next evolutionary step with transformers.

Transformers and Sentence Embeddings

The next evolution was related to the transformers approach introduced in the “Attention Is All You Need” paper by Vaswani et al. Transformers were able to produce information-rich dense vectors and became the dominant technology for modern language models.

I won’t cover the details of the transformers’ architecture since it’s not so relevant to our topic and would take a lot of time. If you’re interested in learning more, there are a lot of materials about transformers, for example, “Transformers, Explained” or “The Illustrated Transformer”.

Transformers allow you to use the same “core” model and fine-tune it for different use cases without retraining the core model (which takes a lot of time and is quite costly). It led to the rise of pre-trained models. One of the first popular models was BERT (Bidirectional Encoder Representations from Transformers) by Google AI.

Internally, BERT still operates on a token level similar to word2vec, but we still want to get sentence embeddings. So, the naive approach could be to take an average of all tokens’ vectors. Unfortunately, this approach doesn’t show good performance.

This problem was solved in 2019 when Sentence-BERT was released. It outperformed all previous approaches to semantic textual similarity tasks and allowed the calculation of sentence embeddings.

It’s a huge topic so we won’t be able to cover it all in this article. So, if you’re really interested, you can learn more about the sentence embeddings in this article.

We’ve briefly covered the evolution of embeddings and got a high-level understanding of the theory. Now, it’s time to move on to practice and learn how to calculate embeddings using OpenAI tools.

Calculating embeddings

In this article, we will be using OpenAI embeddings. We will try a new model text-embedding-3-small that was released just recently. The new model shows better performance compared to text-embedding-ada-002:

The average score on a widely used multi-language retrieval (MIRACL) benchmark has risen from 31.4% to 44.0%.
The average performance on a frequently used benchmark for English tasks (MTEB) has also improved, rising from 61.0% to 62.3%.

OpenAI also released a new larger model text-embedding-3-large. Now, it’s their best performing embedding model.

As a data source, we will be working with a small sample of Stack Exchange Data Dump, an anonymised dump of all user-contributed content on the Stack Exchange network. I’ve selected a bunch of topics that look interesting to me and sampled 100 questions from each of them. Topics range from Generative AI to coffee or bicycles so that we will see quite a wide variety of topics.

First, we need to calculate embeddings for all our Stack Exchange questions. It’s worth doing it once and storing results locally (in a file or vector storage). We can generate embeddings using the OpenAI Python package.

from openai import OpenAI
client = OpenAI()

def get_embedding(text, model="text-embedding-3-small"):
   text = text.replace("\n", " ")
   return client.embeddings.create(input = [text], model=model)\
       .data[0].embedding

get_embedding("We are lucky to live in an age in which we are still making discoveries.")

As a result, we got a 1536-dimension vector of float numbers. We can now repeat it for all our data and start analysing the values.

The primary question you might have is how close the sentences are to each other by meaning. To uncover answers, let’s discuss the concept of distance between vectors.

Distance between vectors

Embeddings are actually vectors. So, if we want to understand how close two sentences are to each other, we can calculate the distance between vectors. A smaller distance would be equivalent to a closer semantic meaning.

Different metrics can be used to measure the distance between two vectors:

Euclidean distance (L2),
Manhattan distance (L1),
Dot product,
Cosine similarity.

Let’s discuss them. As a simple example, we will be using two 2D vectors.

vector1 = [1, 4]
vector2 = [2, 2]

Euclidean distance (L2)

The most standard way to define distance between two points (or vectors) is Euclidean distance or L2 norm. This metric is the most commonly used in day-to-day life, for example, when we are talking about the distance between 2 towns.

Here’s a visual representation and formula for L2 distance.
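For reference, the standard definition of the L2 distance between two n-dimensional vectors, worked through for our example vectors (1, 4) and (2, 2), can be written as:

```latex
d_{L2}(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
\qquad
d_{L2}\big((1, 4), (2, 2)\big) = \sqrt{(1 - 2)^2 + (4 - 2)^2} = \sqrt{5} \approx 2.2361
```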

We can calculate this metric using vanilla Python or leveraging the numpy function.

import numpy as np

sum(list(map(lambda x, y: (x - y) ** 2, vector1, vector2))) ** 0.5
# 2.2361

np.linalg.norm((np.array(vector1) - np.array(vector2)), ord = 2)
# 2.2361

Manhattan distance (L1)

The other commonly used distance is the L1 norm or Manhattan distance. This distance was named after the island of Manhattan (New York). This island has a grid layout of streets, so the length of the shortest route between two points in Manhattan is the L1 distance, since you need to follow the grid.

We can also implement it from scratch or use the numpy function.

sum(list(map(lambda x, y: abs(x - y), vector1, vector2)))
# 3

np.linalg.norm((np.array(vector1) - np.array(vector2)), ord = 1)
# 3.0

Dot product

Another way to look at the distance between vectors is to calculate a dot or scalar product: the sum of the products of the corresponding coordinates. It’s easy to implement.

sum(list(map(lambda x, y: x*y, vector1, vector2)))
# 10

np.dot(vector1, vector2)
# 10

This metric is a bit tricky to interpret. On the one hand, it shows you whether vectors are pointing in one direction. On the other hand, the results highly depend on the magnitudes of the vectors. For example, let’s calculate the dot products between two pairs of vectors:

(1, 1) vs (1, 1)
(1, 1) vs (10, 10).

In both cases, vectors are collinear, but the dot product is ten times bigger in the second case: 2 vs 20.

Cosine similarity

Quite often, cosine similarity is used. Cosine similarity is a dot product normalised by vectors’ magnitudes (or norms).
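Written out with the standard definition and our example vectors (1, 4) and (2, 2), the calculation looks like this:

```latex
\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
= \frac{1 \cdot 2 + 4 \cdot 2}{\sqrt{1^2 + 4^2} \, \sqrt{2^2 + 2^2}}
= \frac{10}{\sqrt{17} \, \sqrt{8}} \approx 0.8575
```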

We can either calculate everything ourselves (as previously) or use the function from sklearn.

dot_product = sum(list(map(lambda x, y: x*y, vector1, vector2)))
norm_vector1 = sum(list(map(lambda x: x ** 2, vector1))) ** 0.5
norm_vector2 = sum(list(map(lambda x: x ** 2, vector2))) ** 0.5

dot_product/norm_vector1/norm_vector2

# 0.8575

from sklearn.metrics.pairwise import cosine_similarity

cosine_similarity(
  np.array(vector1).reshape(1, -1), 
  np.array(vector2).reshape(1, -1))[0][0]

# 0.8575

The function cosine_similarity expects 2D arrays. That’s why we need to reshape the numpy arrays.

Let’s talk a bit about the physical meaning of this metric. Cosine similarity is equal to the cosine of the angle between two vectors. The closer the vectors are in direction, the higher the metric value.

We can even calculate the exact angle between our vectors in degrees. We get results around 30 degrees, and it looks pretty reasonable.

import math
math.degrees(math.acos(0.8575))

# 30.96

What metric to use?

We’ve discussed different ways to calculate the distance between two vectors, and you might start thinking about which one to use.

You can use any distance to compare the embeddings you have. For example, I calculated the average distances between the different clusters. Both L2 distance and cosine similarity show us similar pictures:

Objects within a cluster are closer to each other than to other clusters. It’s a bit tricky to interpret our results since for L2 distance, closer means lower distance, while for cosine similarity — the metric is higher for closer objects. Don’t get confused.
We can spot that some topics are really close to each other, for example, “politics” and “economics” or “ai” and “datascience”.

However, for NLP tasks, the best practice is usually to use cosine similarity. Some reasons behind it:

Cosine similarity is between -1 and 1, while L1 and L2 are unbounded, so it’s easier to interpret.
From the practical perspective, it’s more effective to calculate dot products than square roots for Euclidean distance.
Cosine similarity is less affected by the curse of dimensionality (we will talk about it in a second).

OpenAI embeddings are already normalised to unit length, so dot product and cosine similarity are equal in this case.
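A quick way to convince yourself is to normalise our two example vectors and compare the metrics. This is a plain-Python sketch with my own helper names. It also checks a related identity: for unit vectors, the squared L2 distance equals 2 − 2 · cosine similarity, so ranking by L2 distance and ranking by cosine similarity give the same order.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalise(v):
    # Scale the vector so that its L2 norm becomes 1
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

unit1 = normalise([1, 4])
unit2 = normalise([2, 2])

# For unit vectors the dot product already is the cosine similarity
print(round(dot(unit1, unit2), 4))                  # 0.8575
print(round(cosine_similarity([1, 4], [2, 2]), 4))  # 0.8575

# Squared L2 distance between unit vectors equals 2 - 2 * cosine similarity
squared_l2 = sum((x - y) ** 2 for x, y in zip(unit1, unit2))
print(math.isclose(squared_l2, 2 - 2 * dot(unit1, unit2)))  # True
```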

You might spot in the results above that the difference between inter- and intra-cluster distances is not so big. The root cause is the high dimensionality of our vectors. This effect is called “the curse of dimensionality”: the higher the dimension, the narrower the distribution of distances between vectors. You can learn more details about it in this article.

I would like to briefly show you how it works so that you get some intuition. I calculated a distribution of OpenAI embedding values and generated sets of 300 vectors with different dimensionalities. Then, I calculated the distances between all the vectors and drew a histogram. You can easily see that the increase in vector dimensionality makes the distribution narrower.
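If you want to reproduce the effect without any embeddings at hand, the sketch below is a rough stand-in for that experiment. It assumes standard normal coordinates instead of the real distribution of embedding values, uses only 60 vectors to keep pure Python fast, and summarises each histogram by one number: the standard deviation of the pairwise distances divided by their mean.

```python
import math
import random
import statistics

def pairwise_distances(vectors):
    # Euclidean distance for every unordered pair of vectors
    return [
        math.dist(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]

def relative_spread(dim, n_vectors=60, seed=42):
    # Random vectors with standard normal coordinates (an assumption,
    # not the real distribution of OpenAI embedding values)
    rng = random.Random(seed)
    vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_vectors)]
    distances = pairwise_distances(vectors)
    return statistics.stdev(distances) / statistics.mean(distances)

spreads = {dim: relative_spread(dim) for dim in (2, 32, 512)}
for dim, spread in spreads.items():
    print(f"dim={dim:>4}: std/mean of distances = {spread:.3f}")
```

The ratio should shrink as the dimensionality grows, which means that all points end up at almost the same distance from each other relative to the typical distance.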

We’ve learned how to measure the similarities between the embeddings. With that, we’ve finished the theoretical part and are moving on to the more practical part (visualisations and practical applications). Let’s start with visualisations since it’s always better to see your data first.

Visualising embeddings

The best way to understand the data is to visualise it. Unfortunately, embeddings have 1536 dimensions, so it’s pretty challenging to look at the data. However, there’s a way: we could use dimensionality reduction techniques to project vectors in two-dimensional space.

PCA

The most basic dimensionality reduction technique is PCA (Principal Component Analysis). Let’s try to use it.

First, we need to convert our embeddings into a 2D numpy array to pass it to sklearn.

import numpy as np
embeddings_array = np.array(df.embedding.values.tolist())
print(embeddings_array.shape)
# (1400, 1536)

Then, we need to initialise a PCA model with n_components = 2 (because we want to create a 2D visualisation), train the model on the whole data and predict new values.

from sklearn.decomposition import PCA

pca_model = PCA(n_components = 2)
pca_model.fit(embeddings_array)

pca_embeddings_values = pca_model.transform(embeddings_array)
print(pca_embeddings_values.shape)
# (1400, 2)

As a result, we got a matrix with just two features for each question, so we could easily visualise it on a scatter plot.

import plotly
import plotly.express as px

fig = px.scatter(
    x = pca_embeddings_values[:,0], 
    y = pca_embeddings_values[:,1],
    color = df.topic.values,
    hover_name = df.full_text.values,
    title = 'PCA embeddings', width = 800, height = 600,
    color_discrete_sequence = plotly.colors.qualitative.Alphabet_r
)

fig.update_layout(
    xaxis_title = 'first component', 
    yaxis_title = 'second component')
fig.show()

We can see that questions from each topic are pretty close to each other, which is good. However, all the clusters are mixed, so there’s room for improvement.

t-SNE

PCA is a linear algorithm, while most of the relations are non-linear in real life. So, we may not be able to separate the clusters because of non-linearity. Let’s try to use a non-linear algorithm t-SNE and see whether it will be able to show better results.

The code is almost identical. I just used the t-SNE model instead of PCA.

from sklearn.manifold import TSNE
tsne_model = TSNE(n_components=2, random_state=42)
tsne_embeddings_values = tsne_model.fit_transform(embeddings_array)

fig = px.scatter(
    x = tsne_embeddings_values[:,0], 
    y = tsne_embeddings_values[:,1],
    color = df.topic.values,
    hover_name = df.full_text.values,
    title = 't-SNE embeddings', width = 800, height = 600,
    color_discrete_sequence = plotly.colors.qualitative.Alphabet_r
)

fig.update_layout(
    xaxis_title = 'first component', 
    yaxis_title = 'second component')
fig.show()

The t-SNE result looks way better. Most of the clusters are separated except “genai”, “datascience” and “ai”. However, it’s pretty expected — I doubt I could separate these topics myself.

Looking at this visualisation, we see that embeddings are pretty good at encoding semantic meaning.

Also, you can make a projection to three-dimensional space and visualise it. I’m not sure whether it would be practical, but it can be insightful and engaging to play with the data in 3D.

tsne_model_3d = TSNE(n_components=3, random_state=42)
tsne_3d_embeddings_values = tsne_model_3d.fit_transform(embeddings_array)

fig = px.scatter_3d(
    x = tsne_3d_embeddings_values[:,0], 
    y = tsne_3d_embeddings_values[:,1],
    z = tsne_3d_embeddings_values[:,2],
    color = df.topic.values,
    hover_name = df.full_text.values,
    title = 't-SNE embeddings', width = 800, height = 600,
    color_discrete_sequence = plotly.colors.qualitative.Alphabet_r,
    opacity = 0.7
)
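# in 3D charts, axis titles are configured through the scene
fig.update_layout(scene = dict(
    xaxis_title = 'first component',
    yaxis_title = 'second component',
    zaxis_title = 'third component'))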
fig.show()

Barcodes

Another way to understand the embeddings is to visualise a couple of them as bar codes and see the correlations. I picked three examples of embeddings: two are closest to each other, and the other is the farthest example in our dataset.

embedding1 = df.loc[1].embedding
embedding2 = df.loc[616].embedding
embedding3 = df.loc[749].embedding

import seaborn as sns
import matplotlib.pyplot as plt
embed_len_thr = 1536

sns.heatmap(np.array(embedding1[:embed_len_thr]).reshape(-1, embed_len_thr),
    cmap = "Greys", center = 0, square = False, 
    xticklabels = False, cbar = False)
plt.gcf().set_size_inches(15,1)
plt.yticks([0.5], labels = ['AI'])
plt.show()

sns.heatmap(np.array(embedding3[:embed_len_thr]).reshape(-1, embed_len_thr),
    cmap = "Greys", center = 0, square = False, 
    xticklabels = False, cbar = False)
plt.gcf().set_size_inches(15,1)
plt.yticks([0.5], labels = ['AI'])
plt.show()

sns.heatmap(np.array(embedding2[:embed_len_thr]).reshape(-1, embed_len_thr),
    cmap = "Greys", center = 0, square = False, 
    xticklabels = False, cbar = False)
plt.gcf().set_size_inches(15,1)
plt.yticks([0.5], labels = ['Bioinformatics'])
plt.show()

It’s not easy to see whether vectors are close to each other in our case because of high dimensionality. However, I still like this visualisation. It might be helpful in some cases, so I am sharing this idea with you.

We’ve learned how to visualise embeddings and have no doubts left about their ability to grasp the meaning of the text. Now, it’s time to move on to the most interesting and fascinating part and discuss how you can leverage embeddings in practice.

Practical applications

Of course, embeddings’ primary goal is not to encode texts as vectors of numbers or visualise them just for the sake of it. We can benefit a lot from our ability to capture the texts’ meanings. Let’s go through a bunch of more practical examples.

Clustering

Let’s start with clustering. Clustering is an unsupervised learning technique that allows you to split your data into groups without any initial labels. Clustering can help you understand the internal structural patterns in your data.

We will use one of the most basic clustering algorithms — K-means. For the K-means algorithm, we need to specify the number of clusters. We can define the optimal number of clusters using silhouette scores.
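In case the silhouette score is new to you: for every point, it compares the mean distance to the other points in its own cluster (a) with the mean distance to the points of the nearest other cluster (b), and returns (b − a) / max(a, b). The score for a clustering is the average over all points. Below is a from-scratch sketch on four made-up 2D points; it uses Euclidean distance for simplicity and assumes there are at least two clusters.

```python
import math

def mean_distance(point, others):
    return sum(math.dist(point, other) for other in others) / len(others)

def silhouette(points, labels):
    # Group points by cluster label
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for idx, (point, label) in enumerate(zip(points, labels)):
        own = [p for j, (p, l) in enumerate(zip(points, labels))
               if l == label and j != idx]
        if not own:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        a = mean_distance(point, own)  # cohesion: own cluster
        b = min(mean_distance(point, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette(points, [0, 1, 0, 1])   # distant points grouped together
print(round(good, 2), round(bad, 2))
```

The sensible grouping scores close to 1, while the grouping that pairs distant points gets a negative score.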

Let’s try k (number of clusters) between 2 and 50. For each k, we will train a model and calculate silhouette scores. The higher the silhouette score, the better the clustering.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import tqdm

silhouette_scores = []
for k in tqdm.tqdm(range(2, 51)):
    kmeans = KMeans(n_clusters=k, 
                    random_state=42, 
                    n_init = 'auto').fit(embeddings_array)
    kmeans_labels = kmeans.labels_
    silhouette_scores.append(
        {
            'k': k,
            'silhouette_score': silhouette_score(embeddings_array, 
                kmeans_labels, metric = 'cosine')
        }
    )

import pandas as pd

fig = px.line(pd.DataFrame(silhouette_scores).set_index('k'),
       title = '<b>Silhouette scores for K-means clustering</b>',
       labels = {'value': 'silhouette score'}, 
       color_discrete_sequence = plotly.colors.qualitative.Alphabet)
fig.update_layout(showlegend = False)
fig.show()

In our case, the silhouette score reaches a maximum when k = 11. So, let’s use this number of clusters for our final model.

Let’s visualise the clusters using t-SNE for dimensionality reduction as we already did before.

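# refit K-means with the chosen k: after the loop above, kmeans_labels
# still holds the labels of the last model tried (k = 50)
kmeans = KMeans(n_clusters=11, 
                random_state=42, 
                n_init = 'auto').fit(embeddings_array)
kmeans_labels = kmeans.labels_

tsne_model = TSNE(n_components=2, random_state=42)
tsne_embeddings_values = tsne_model.fit_transform(embeddings_array)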

fig = px.scatter(
    x = tsne_embeddings_values[:,0], 
    y = tsne_embeddings_values[:,1],
    color = list(map(lambda x: 'cluster %s' % x, kmeans_labels)),
    hover_name = df.full_text.values,
    title = 't-SNE embeddings for clustering', width = 800, height = 600,
    color_discrete_sequence = plotly.colors.qualitative.Alphabet_r
)
fig.update_layout(
    xaxis_title = 'first component', 
    yaxis_title = 'second component')
fig.show()

Visually, we can see that the algorithm was able to define clusters quite well: most of them are clearly separated.

We have the actual topic labels, so we can even assess how good the clustering is. Let’s look at the topics’ mixture for each cluster.

df['cluster'] = list(map(lambda x: 'cluster %s' % x, kmeans_labels))
cluster_stats_df = df.reset_index().pivot_table(
    index = 'cluster', values = 'id', 
    aggfunc = 'count', columns = 'topic').fillna(0).applymap(int)

cluster_stats_df = cluster_stats_df.apply(
  lambda x: 100*x/cluster_stats_df.sum(axis = 1))

fig = px.imshow(
    cluster_stats_df.values, 
    x = cluster_stats_df.columns,
    y = cluster_stats_df.index,
    text_auto = '.2f', aspect = "auto",
    labels=dict(x="fact topic", y="cluster", color="share, %"), 
    color_continuous_scale='pubugn',
    title = '<b>Share of topics in each cluster</b>', height = 550)

fig.show()

In most cases, clusterisation worked perfectly. For example, cluster 5 contains almost only questions about bicycles, while cluster 6 is about coffee. However, it wasn’t able to distinguish close topics:

“ai”, “genai” and “datascience” are all in one cluster,
the same story with “economics” and “politics”.

We used only embeddings as the features in this example, but if you have any additional information (for example, age, gender or country of the user who asked the question), you can include it in the model, too.

Classification

We can use embeddings for classification or regression tasks. For example, you can do it to predict customer reviews’ sentiment (classification) or NPS score (regression).

Since classification and regression are supervised learning, you will need to have labels. Luckily, we know the topics for our questions and can fit a model to predict them.

I will use a Random Forest Classifier. If you need a quick refresher about Random Forests, you can find it here. To assess the classification model’s performance correctly, we will split our dataset into train and test sets (80% vs 20%). Then, we can train our model on a train set and measure the quality on a test set (questions that the model hasn’t seen before).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
class_model = RandomForestClassifier(max_depth = 10)

# defining features and target
X = embeddings_array
y = df.topic

# splitting data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state = 42, test_size=0.2, stratify=y
)

# fit & predict 
class_model.fit(X_train, y_train)
y_pred = class_model.predict(X_test)

To estimate the model’s performance, let’s calculate a confusion matrix. In an ideal situation, all non-diagonal elements should be 0.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

fig = px.imshow(
  cm, x = class_model.classes_,
  y = class_model.classes_, text_auto='d', 
  aspect="auto", 
  labels=dict(
      x="predicted label", y="true label", 
      color="cases"), 
  color_continuous_scale='pubugn',
  title = '<b>Confusion matrix</b>', height = 550)

fig.show()

We can see similar results to clusterisation: some topics are easy to classify, and accuracy is 100%, for example, “bicycles” or “travel”, while some others are difficult to distinguish (especially “ai”).

However, we achieved 91.8% overall accuracy, which is quite good.
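Overall accuracy can be read straight from a confusion matrix: it is the sum of the diagonal (correct predictions) divided by the sum of all cells. The small matrix below is made up purely for illustration and is not the one from our model.

```python
def accuracy_from_confusion_matrix(cm):
    # Diagonal cells are the cases where predicted label == true label
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical 3-class confusion matrix (rows: true labels, columns: predictions)
toy_cm = [
    [18, 2, 0],
    [3, 17, 0],
    [0, 0, 20],
]

print(round(accuracy_from_confusion_matrix(toy_cm), 4))
# 0.9167 -> 55 correct predictions out of 60
```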

Finding anomalies

We can also use embeddings to find anomalies in our data. For example, on the t-SNE graph, we saw that some questions are pretty far from their clusters, for instance, for the “travel” topic. Let’s look at this theme and try to find anomalies. We will use the Isolation Forest algorithm for it.

from sklearn.ensemble import IsolationForest

# copy the slice so that adding a column below doesn't trigger SettingWithCopyWarning
topic_df = df[df.topic == 'travel'].copy()
topic_embeddings_array = np.array(topic_df.embedding.values.tolist())

clf = IsolationForest(contamination = 0.03, random_state = 42) 
topic_df['is_anomaly'] = clf.fit_predict(topic_embeddings_array)

topic_df[topic_df.is_anomaly == -1][['full_text']]

So, here we are. We’ve found the most uncommon question for the travel topic (source).

Is it safe to drink the water from the fountains found all over 
the older parts of Rome?

When I visited Rome and walked around the older sections, I saw many 
different types of fountains that were constantly running with water. 
Some went into the ground, some collected in basins, etc.

Is the water coming out of these fountains potable? Safe for visitors 
to drink from? Any etiquette regarding their use that a visitor 
should know about?

Since it talks about water, the embedding of this question is close to the coffee topic, where people also discuss the water used to brew coffee. So, the embedding representation is quite reasonable.

We could find it on our t-SNE visualisation and see that it’s actually close to the coffee cluster.

RAG — Retrieval Augmented Generation

With the recently increased popularity of LLMs, embeddings have been broadly used in RAG use cases.

We need Retrieval Augmented Generation when we have a lot of documents (for example, all the questions from Stack Exchange), and we can’t pass them all to an LLM because

LLMs have limits on the context size (right now, it’s 128K for GPT-4 Turbo).
We pay for tokens, so it’s more expensive to pass all the information all the time.
LLMs show worse performance with a bigger context. You can check Needle In A Haystack — Pressure Testing LLMs to learn more details.

To be able to work with an extensive knowledge base, we can leverage the RAG approach:

Compute embeddings for all the documents and store them in vector storage.
When we get a user request, we can calculate its embedding and retrieve relevant documents from the storage for this request.
Pass only relevant documents to LLM to get a final answer.
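The retrieval part of these steps can be sketched in a few lines. Everything here is a toy assumption: the three-dimensional “embeddings” are hand-made stand-ins for real model output, a plain list plays the role of the vector storage, and `retrieve` and `build_prompt` are hypothetical helpers rather than part of any library.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Step 1: documents with precomputed embeddings (hand-made toy vectors)
vector_store = [
    {"text": "How do I descale an espresso machine?", "embedding": [0.9, 0.1, 0.0]},
    {"text": "Which tyre pressure suits a road bike?", "embedding": [0.0, 0.9, 0.2]},
    {"text": "What grind size works for a French press?", "embedding": [0.8, 0.0, 0.3]},
]

def retrieve(query_embedding, store, top_k=2):
    # Step 2: rank stored documents by similarity to the query embedding
    ranked = sorted(
        store,
        key=lambda doc: cosine_similarity(query_embedding, doc["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(question, documents):
    # Step 3: pass only the retrieved documents to the LLM as context
    context = "\n".join(f"- {doc['text']}" for doc in documents)
    return f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"

question = "How should I look after my coffee equipment?"
query_embedding = [0.85, 0.05, 0.1]  # stand-in for the embedding of the question

relevant_docs = retrieve(query_embedding, vector_store)
prompt = build_prompt(question, relevant_docs)
print(prompt)
```

In a real setup, the query embedding would come from the same embedding model as the documents, and the sorted scan would be replaced by a vector database lookup.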

To learn more about RAG, don’t hesitate to read my article with much more details here.

Summary

In this article, we’ve discussed text embeddings in great detail. Hopefully, now you have a complete and deep understanding of this topic. Here’s a quick recap of our journey:

Firstly, we went through the evolution of approaches to work with texts.
Then, we discussed how to understand whether texts have similar meanings to each other.
After that, we saw different approaches to text embedding visualisation.
Finally, we tried to use embeddings as features in different practical tasks such as clustering, classification, anomaly detection and RAG.

Thank you very much for reading this article. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

In this article, I used a dataset from Stack Exchange Data Dump, which is available under the Creative Commons license.

This article was inspired by the following courses:

“Understanding and Applying Text Embeddings” by DeepLearning.AI in collaboration with Google Cloud,
“Vector Databases: From Embeddings to Applications” by DeepLearning.AI in collaboration with Weaviate.

Friday, June 21, 2024

Integrating Azure Databricks and Microsoft Fabric

This article plunges into a hot topic often brought up during customer interactions — the combination and integration of Azure Databricks and Microsoft Fabric. Both services are top tier in their respective fields. Azure Databricks excels in scaling data engineering, data science, and machine learning workloads. Similarly, Microsoft Fabric shines with its simplicity and self-service features for a wide array of data usage. The burning question that usually arises is: how do we integrate these two powerhouses?

Currently, there are five options to consider. Keep in mind that this article may evolve as new features are introduced.

Enhance a Databricks-enabled architecture by adding a reporting and analysis layer.
Complement a Databricks-enabled architecture by incorporating a OneLake gold layer.
Make Databricks write all data to OneLake. Though not recommended, it’s worth discussing.
Extend Databricks with a V-ORDERED enabled consumption layer.
Enhance Databricks and Microsoft Fabric’s data processing efficiency by adding extra components. This is more of a personal touch.

The options available today will be thoroughly examined in the following sections. I will provide nuances, weigh the pros and cons, and refer to relevant documentation. But before that, it’s beneficial to understand why organizations choose to utilize both of these powerful tools.

Why the combination?

Organizations choose to combine Azure Databricks and Microsoft Fabric because of the complementary capabilities this combination provides.

Azure Databricks, a comprehensive data processing, analytics, and data science platform, is favored by organizations of various sizes. With its long-standing reputation and successful adoption across numerous organizations, it has secured its place as a trusted platform. Founded by the creators of Spark, Databricks primarily caters to engineers, offering them a platform to manage Spark workloads, write notebooks, and handle complex tasks on a larger scale.

The appeal of Microsoft Fabric lies in its simplicity. Launched in 2023, it evolved from Power BI, which makes for an easy transition for existing Power BI users. With its user-friendly interface, unified self-service features, and seamless integration with Microsoft 365, it appeals particularly to business users. Microsoft Fabric is designed to democratize data usage and lower entry barriers, making it an accessible platform for all.

In essence, the combination of Azure Databricks and Microsoft Fabric offers a comprehensive solution that caters to both technical and business needs, making it a popular choice among many organizations.

Now we know why organizations often opt for a combination. Let’s go back to the burning question: how do we integrate these two services?

Enhance a Databricks-enabled architecture by adding a reporting and analysis layer

The first design consideration involves enhancing the typical Azure Databricks Medallion Lakehouse architecture, which leverages services like Azure Data Lake Storage (ADLS) Gen2, Azure Data Factory, and Azure Databricks. In this setup, Databricks manages all aspects of data ingestion, processing, validation, and enrichment. Power BI typically takes care of the remaining tasks, including reporting and delivering analytical insights.

Expanding the Databricks-focused architecture to include Microsoft Fabric is a popular strategy to enhance self-service functions and improve the experience for business users. Think of it as keeping Databricks in place and giving Power BI a makeover: equipping it with a fresh suite of features and capabilities for a more engaging and efficient experience.

Microsoft has recently introduced a new feature called ‘shortcuts’ for Microsoft Fabric. This feature serves as a lightweight data virtualization engine that reads data from various sources, eliminating the need for data duplication and enabling direct data usage. For example, when using Power BI, you can access the required data instantly without having to copy or import it into Power BI.

Relating back to the Databricks-focused design we talked about earlier, we can use the ADLS Gen2 shortcut feature, given that Databricks writes all its data to ADLS. However, there are several important considerations to keep in mind:

Shortcuts necessitate a Fabric Lakehouse. If you don’t already have one, be sure to create one.
Shortcuts to tables can only be used to access data in Delta Lake format.
Use shortcuts on external tables whenever possible, rather than Databricks managed tables. I’ll come back to this point later when discussing the next design consideration.
Each shortcut can only reference a single Delta folder. Therefore, if you need to access data from multiple Delta folders, you’ll need to create individual shortcuts for each folder.
Don’t manipulate files directly in these table directories. Instead, use a read-only approach for reading Delta files from ADLS. So, in this approach ADLS acts as an intermediate store. You aren’t reading tables directly from Databricks.
Shortcuts in your Lakehouse can be created manually via the Fabric UI. Alternatively, you can provision all shortcuts programmatically using the REST API. Here’s the link to a tutorial and Notebook script.
When data is read directly from ADLS, the data access policies from Unity Catalog’s security model are not applied.
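To make the programmatic route a bit more concrete, here is a minimal sketch of how the shortcut requests could be prepared. It only builds the URL and JSON body for each request and does not send anything. The endpoint path and the `adlsGen2` field names reflect my understanding of the Fabric "Create Shortcut" REST endpoint and should be checked against the current documentation; the helper names, IDs, and folder paths are hypothetical placeholders.

```python
import json

# Assumption: base URL of the Fabric REST API.
FABRIC_API = "https://api.fabric.microsoft.com/v1"


def build_shortcut_request(workspace_id, lakehouse_id, name, storage_account,
                           delta_folder, connection_id):
    """Return (url, body) for creating one ADLS Gen2 shortcut under Tables/."""
    url = f"{FABRIC_API}/workspaces/{workspace_id}/items/{lakehouse_id}/shortcuts"
    body = {
        "path": "Tables",  # place the shortcut in the Lakehouse Tables section
        "name": name,      # name of the shortcut as it appears in the Lakehouse
        "target": {
            "adlsGen2": {
                "location": f"https://{storage_account}.dfs.core.windows.net",
                "subpath": delta_folder,        # container plus path to the Delta folder
                "connectionId": connection_id,  # existing Fabric connection to the account
            }
        },
    }
    return url, body


def build_all_shortcut_requests(workspace_id, lakehouse_id, storage_account,
                                connection_id, delta_folders):
    """Build one request per Delta folder, since a shortcut references a single folder."""
    return [
        build_shortcut_request(workspace_id, lakehouse_id, name,
                               storage_account, folder, connection_id)
        for name, folder in delta_folders.items()
    ]


if __name__ == "__main__":
    # Hypothetical placeholders: replace with real IDs and paths.
    pending = build_all_shortcut_requests(
        workspace_id="<workspace-guid>",
        lakehouse_id="<lakehouse-guid>",
        storage_account="<storage-account>",
        connection_id="<connection-guid>",
        delta_folders={"customers": "/gold/customers", "orders": "/gold/orders"},
    )
    for url, body in pending:
        print("POST", url)
        print(json.dumps(body, indent=2))
```

Each pair would then be sent as an authenticated POST request. Keeping the request construction separate from the sending makes it easy to review what will be created before anything changes in the Lakehouse.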

For integrating Databricks and Microsoft Fabric, there are exciting developments underway! These features were announced during the Microsoft Build 2024 Conference. Soon, you’ll be able to integrate Azure Databricks Unity Catalog with Fabric. Using the Fabric portal, you’ll have the ability to create and configure a new Azure Databricks Unity Catalog item. Following this step, all tables managed in the Unity Catalog can be promptly upgraded to shortcuts. This forthcoming integration will dramatically streamline the unification of Azure Databricks data in Fabric, enabling smooth operation across all Fabric workloads. The demonstration of this new feature can be found here: https://www.youtube.com/watch?v=BYob0cGW0Nk&t=4434s

The expanded Databricks-centric architecture, which now includes Microsoft Fabric for data usage, is commonly observed among customers who are exceptionally satisfied with Databricks. These customers have already invested a significant amount of time and resources in establishing a Lakehouse using Databricks and plan to continue leveraging it. Microsoft Fabric recognizes the strength and versatility of the Lakehouse approach using the Delta format. It allows organizations to augment their existing Databricks-centric setup with an additional layer designed specifically for data consumption.

Complement a Databricks-enabled architecture by incorporating a OneLake gold layer

The second design modifies the initial design pattern by incorporating a OneLake gold layer into the architecture. This is feasible because the Azure Blob Filesystem (ABFS) driver in Azure Databricks supports both ADLS and OneLake. You can see an illustration of this approach below and find Notebook examples on the MS Learn pages here.
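As a rough sketch of what writing to OneLake through the ABFS driver could look like, the helper below builds the `abfss://` path of a Lakehouse table and writes a Spark DataFrame to it in Delta format. The path layout follows the pattern shown in the MS Learn examples as I understand it; the function names and the workspace, lakehouse, and table names are hypothetical, and `df` is assumed to be a Spark DataFrame inside a Databricks notebook.

```python
# Assumption: OneLake is reachable through this ABFS-compatible endpoint.
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_table_path(workspace, lakehouse, table):
    """Build the abfss:// path of a table inside a Fabric Lakehouse."""
    return f"abfss://{workspace}@{ONELAKE_HOST}/{lakehouse}.Lakehouse/Tables/{table}"


def write_gold_table(df, workspace, lakehouse, table):
    """Write a Spark DataFrame as a Delta table into the OneLake gold layer.

    `df` is expected to be a pyspark.sql.DataFrame; only its standard
    DataFrameWriter chain (format, mode, save) is used here.
    """
    path = onelake_table_path(workspace, lakehouse, table)
    df.write.format("delta").mode("overwrite").save(path)
    return path


if __name__ == "__main__":
    # Hypothetical names, printed only to show the resulting path.
    print(onelake_table_path("analytics_ws", "gold_lakehouse", "sales_summary"))
```

Workspace and lakehouse names containing spaces or special characters may not work in such a path; using the corresponding GUIDs is the safer option.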

Within this architecture, the overall workflow and data processing steps (ingestion, processing, validation, and enrichment) remain largely unchanged. Everything is managed within Azure Databricks. The key difference is that data for consumption is now closer to Microsoft Fabric because Databricks writes its data to a Gold layer, which is stored in OneLake. You might wonder, is this a best practice and why would this be beneficial?

Note that this style of integration is not officially supported by Databricks. This has implications for data management, which I will delve into next. For more information, please refer to the Databricks documentation.

Databricks distinguishes between two types of tables: managed tables and external tables. Managed tables are created by default and are managed by Unity Catalog, which also handles their lifecycle and file layout. It is not recommended to manipulate files directly in these tables using external tools. In contrast, external tables store data outside of the managed storage location specified for the metastore, catalog, or schema.

So, based on the guidance provided in the documentation, all tables created by writing directly to OneLake using this approach should be treated as external tables. This is because the data is managed outside the scope of the metastore. As a result, the management of these tables should be done elsewhere, such as within Fabric. The motivation for this approach might be the following:

First, storing data physically in OneLake leads to improved performance within Microsoft Fabric. This is due to the fact that OneLake tables are optimized for performance, particularly for queries involving joins and aggregations. In contrast, if you’re reading data from ADLS Gen2 via shortcuts, you might encounter slower performance for queries that involve these operations.

Second, managing data in OneLake is useful for applying security measures within Microsoft Fabric. For example, OneLake tables can be secured using role-based access control (RBAC), simplifying the process of managing data access. However, if you were to use ADLS Gen2, you would need to handle the permissions for the ADLS Gen2 storage account, which could be a more complex task.

Third, OneLake tables can be governed by policies, which makes it easier to ensure that the data is used in a compliant manner, for instance when sharing tables (externally) with domains that reside elsewhere.

Besides merely reading data, you might want to consider generating new data within Microsoft Fabric. If this is part of your plan, an upcoming feature could be of great interest. Soon, Fabric users will be able to access data items, like lakehouses, via the Unity Catalog in Azure Databricks. Even though the data will remain in OneLake, you’ll have the ability to access and view its lineage and other metadata directly in Azure Databricks. This enhancement will facilitate reading data back from Fabric to Databricks. For instance, if you’re planning to leverage AI using Azure Databricks’ Mosaic AI, you’ll be able to do so by reading back from Microsoft Fabric. The technology for this is likely Lakehouse Federation. More information can be seen in this part of the video: https://youtu.be/BYob0cGW0Nk?t=4125

In conclusion, the strategy of handling all integration and data processing within Databricks, and having a consumption layer managed in Fabric, offers organizations the convenience of leveraging the best features from each application area. This approach ensures optimal performance and security in data handling.

Make Databricks write all data to OneLake (Not Recommended)

Given our experience integrating Databricks with OneLake, we know that OneLake supports the same APIs as ADLS Gen2. With this in mind, let’s consider a hypothetical design possibility: storing all Medallion layers in OneLake. Could you make this work? Let’s find out.

The incentive for this approach could stem from greenfield deployments. The goal here is to leverage Databricks’ native features to efficiently scale data engineering tasks, while advocating design simplicity and self-service for data usage and consumption across all layers using Microsoft Fabric.

Regrettably, this design is not adequate for efficient data management. This configuration may result in administrative overhead due to an increasing number of workspaces, as each layer of the workspace requires its own Lakehouse entity in Microsoft Fabric. This proliferation could give rise to additional challenges such as governance, metadata management, and collaboration overhead when sharing data. Additionally, Databricks does not support this approach when using managed tables. Hence, while this architecture may appear attractive in theory, I strongly discourage its use as a best practice.

Extend Databricks with a V-Order-enabled consumption layer

The next design consideration revolves around putting more weight on using Microsoft Fabric and utilizing the V-Order feature. This feature is a write-time optimization for the Parquet file format, enabling fast data reads under Microsoft Fabric compute engines such as Power BI.

Both Databricks and Microsoft have chosen to adopt Delta Lake, an open-source table format that stores its data in columnar Parquet files. However, Microsoft has incorporated an added layer of V-Order optimization, which offers up to 50% more compression. V-Order is fully compliant with the open-source Parquet format; all Parquet engines can read it like regular Parquet files.

Please note, you can apply V-ordering to tables that lack it by utilizing Fabric’s maintenance feature.
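For illustration, the sketch below generates the Spark SQL statements that, to my knowledge, apply V-Order to an existing table in a Fabric notebook: an `OPTIMIZE ... VORDER` command and the `delta.parquet.vorder.enabled` table property. Treat both the syntax and the property name as assumptions to verify against the current Fabric documentation; the helper functions and the table name are hypothetical.

```python
import re

# Accept simple identifiers such as "sales" or "gold.sales" only.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def _checked(table):
    """Reject anything that is not a plain (optionally qualified) table name."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return table


def optimize_with_vorder(table):
    """Statement that rewrites the table's files with V-Order applied (assumed syntax)."""
    return f"OPTIMIZE {_checked(table)} VORDER"


def enable_vorder_property(table):
    """Statement that enables V-Order for future writes to the table (assumed property)."""
    return (
        f"ALTER TABLE {_checked(table)} "
        "SET TBLPROPERTIES ('delta.parquet.vorder.enabled' = 'true')"
    )


if __name__ == "__main__":
    # In a Fabric notebook these strings would be passed to spark.sql(...).
    for statement in (optimize_with_vorder("gold.sales"),
                      enable_vorder_property("gold.sales")):
        print(statement)
```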

V-Order provides significant advantages for Microsoft Fabric, especially to components like Power BI and SQL endpoints. For instance, it allows Power BI to connect directly to live data using Direct Lake mode while maintaining high performance during data queries. Since there’s no import process, changes in the data source are instantly reflected in Power BI, eliminating the need to wait for a refresh.

It’s crucial to note that the use of V-Order optimized tables is currently exclusive to Microsoft Fabric. Databricks has not yet incorporated this feature. Therefore, until that happens, you’ll need to utilize a service within Microsoft Fabric for leveraging V-Order optimized tables.

One could argue that keeping Databricks for the processing step between the Silver and Gold stages remains a valid choice when V-Order optimization is not necessary. While this may seem to repeat the earlier designs, it is a viable option that allows for continued data processing with Databricks.

Another reason why organizations opt for this design is transactional consistency across multiple tables. Maintaining such consistency, especially in Gold, is crucial. Today, Spark only supports transactions on individual tables. Thus, if there are any data inconsistencies across tables, they need to be resolved through compensatory measures. With multi-table transactions, by contrast, you can commit inserts to multiple tables, or to none of them if an error arises. If you’re changing details about a purchase order that affects three tables, you can group those changes into a single transaction. That means when those tables are queried, they either all have the changes or none of them do. This integrity concern highlights the importance of an environment that can manage complex transactions across numerous tables. Microsoft Fabric Warehouse is the only platform capable of supporting this atop Delta Lake. You can learn more about this here.
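The all-or-nothing behaviour of the purchase order example can be sketched in a few lines. The snippet below uses Python's built-in `sqlite3` module purely as a stand-in to show the semantics; it is not Fabric Warehouse code, and the table and column names are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with three related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);
        CREATE TABLE order_lines (order_id INTEGER, qty INTEGER);
        CREATE TABLE order_audit (order_id INTEGER, note TEXT);
        INSERT INTO orders VALUES (1, 'open');
        INSERT INTO order_lines VALUES (1, 2);
    """)
    return conn


def amend_purchase_order(conn, order_id, new_qty, fail=False):
    """Change three tables in one transaction; return True if it was committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE orders SET status = 'amended' WHERE id = ?",
                         (order_id,))
            conn.execute("UPDATE order_lines SET qty = ? WHERE order_id = ?",
                         (new_qty, order_id))
            if fail:
                # Simulate an error after two of the three tables were touched.
                raise RuntimeError("simulated failure")
            conn.execute("INSERT INTO order_audit VALUES (?, ?)",
                         (order_id, f"qty set to {new_qty}"))
        return True
    except RuntimeError:
        return False


if __name__ == "__main__":
    conn = make_db()
    print(amend_purchase_order(conn, 1, 5, fail=True))   # rolled back
    print(conn.execute("SELECT status FROM orders").fetchone())
    print(amend_purchase_order(conn, 1, 5))              # committed
    print(conn.execute("SELECT status FROM orders").fetchone())
```

When the simulated failure occurs, the updates to the first two tables are rolled back together, so a reader never sees a half-applied purchase order.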

In the updated architecture, depicted in the image above, Synapse Engineering now serves as the processing engine from Silver to Gold. This approach guarantees that all tables are V-Order optimized. In addition, Synapse Warehouse has been added for use cases that require transactional capabilities. However, these architectural changes mean that data engineers will need to navigate several distinct data processing services. Therefore, it’s crucial to provide clear guidance to all teams. For example, you could establish principles for Bronze and Silver for utilizing Databricks’ native features such as ingestion tracking with AutoLoader and validations with Delta Live Tables for data quality. And then for Gold, you focus on building consumption-specific integration logic solely with Microsoft Fabric.

Enhance Databricks and Microsoft Fabric’s data processing efficiency by adding extra components

In our previous discussion, we addressed the challenge faced by engineers who have to navigate different data processing services. This issue can be resolved by adopting a metadata-driven approach and a templating framework, such as DBT, for data processing. In the updated architecture illustrated below, I’ve augmented both Databricks and Microsoft Fabric with additional components. Let’s delve into these changes.

On the Databricks side, I’ve added a metadata-driven framework (metadata store), Great Expectations, and the Data Build Tool (DBT). The metadata-driven framework can significantly reduce the amount of code you need to write and maintain. Instead of creating multiple notebooks, this approach enables a universal pipeline for ingesting and validating all data with another open-source framework called Great Expectations. This approach is achieved by reading from the metadata store and dynamically invoking different scripts. If you’re interested in learning more about this approach, I recommend reading another blogpost on this subject.
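To illustrate the idea of a single pipeline driven by a metadata store, here is a deliberately small sketch in plain Python. The metadata entries, check names, and loader function are hypothetical, and the simple check functions merely stand in for what a framework such as Great Expectations would do in a real setup.

```python
from typing import Callable

# Hypothetical metadata store: one entry per source instead of one notebook per source.
METADATA_STORE = [
    {"source": "sales_orders", "key": "order_id", "checks": ["not_empty", "no_null_keys"]},
    {"source": "customers", "key": "customer_id", "checks": ["not_empty"]},
]

# Registry of validation rules, looked up by name at run time.
CHECKS: dict[str, Callable[[list, dict], bool]] = {
    "not_empty": lambda rows, cfg: len(rows) > 0,
    "no_null_keys": lambda rows, cfg: all(r.get(cfg["key"]) is not None for r in rows),
}


def run_pipeline(metadata, load):
    """Ingest and validate every source described in the metadata.

    `load` is any callable that takes a metadata entry and returns a list of
    row dictionaries. Returns a mapping of source name to failed check names.
    """
    results = {}
    for cfg in metadata:
        rows = load(cfg)
        results[cfg["source"]] = [
            name for name in cfg["checks"] if not CHECKS[name](rows, cfg)
        ]
    return results


if __name__ == "__main__":
    # Hypothetical in-memory data standing in for real ingestion.
    sample = {
        "sales_orders": [{"order_id": 1}, {"order_id": None}],
        "customers": [{"customer_id": 10}],
    }
    print(run_pipeline(METADATA_STORE, lambda cfg: sample[cfg["source"]]))
```

Adding a new source then means adding a metadata entry rather than writing another notebook, which is where the reduction in code comes from.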

Next, let’s discuss DBT. This open-source command-line tool, also known as Data Build Tool, is written in Python. Its strength lies in providing a universal interface for defining transformations using templates, with a syntax like SQL’s SELECT statements. Databricks is supported through the dbt-databricks package. For more information on using DBT and Databricks, I suggest reading another blogpost on this subject.

On the Microsoft Fabric side, DBT can also play a significant role. We have the option to use either dbt-fabric for Synapse Warehousing or dbt-fabricspark for Synapse Spark within Microsoft Fabric. The benefit of this templating approach is that you leverage both services while developers only need to familiarize themselves with a single front-end for all data transformation use cases. This methodology streamlines the process and increases efficiency.

Conclusion

The integration of Azure Databricks and Microsoft Fabric presents a myriad of benefits and possibilities for organizations. The flexibility and scalability of Azure Databricks, combined with the simplicity and user-friendly features of Microsoft Fabric, can significantly enhance data usage and management across all layers. There are several architectural design choices available, from enhancing a Databricks-centric architecture with a Microsoft Fabric layer to incorporating a OneLake gold layer into the architecture for better performance and security.

Furthermore, the introduction of V-Order optimization in Microsoft Fabric and the use of additional components can significantly streamline and enhance data processing efficiency. However, such combinations and integrations require careful consideration, as they might involve navigating multiple services and balancing flexibility, data security, and isolation.

In conclusion, the integration of Azure Databricks and Microsoft Fabric, coupled with the exciting advancements announced at the Microsoft Build 2024 Conference, signifies a promising frontier for big data processing workloads.

Saturday, May 4, 2024

Improve link

https://learn.microsoft.com/en-us/training/modules/create-manage-power-bi-assets/

https://learn.microsoft.com/en-us/training/modules/enforce-power-bi-model-security/

https://learn.microsoft.com/en-us/training/modules/understand-scalability-power-bi/

https://info.microsoft.com/ww-thankyou-getting-started-with-microsoft-fabric-in-60-minutes-video.html?LCID=EN-IN&ocid=eml_pg433604_gdc_comm_mw

Subrat's Technical Blog