Unveiling the Power of Word Embeddings with Gensim

In the realm of Natural Language Processing (NLP), word embeddings have emerged as a game-changer. Unlike traditional approaches that treat words as atomic features, word embeddings represent each word as a dense, low-dimensional vector that captures its meaning and usage. One pioneering model in this domain is Word2Vec, developed by Tomas Mikolov and his team at Google. In this blog post, we'll delve into the world of word embeddings using the original Word2Vec approach, implemented with the Gensim library.

Training Word Embeddings

Training word embeddings with Gensim is a breeze. All you need is a corpus of sentences in the language of interest. For our exploration, we’ll use 5,000,000 sentences from Dutch Wikipedia. Let’s jump into the code:

import os
import gensim

class SentenceCorpus(object):
    """Stream a tokenized corpus from disk, one sentence per line."""

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        with open(self.filename, "r", encoding="utf-8") as i:
            for line in i:
                # Each line is a pre-tokenized sentence; split on whitespace.
                tokens = line.strip().split()
                yield tokens

WIKI_FILE = os.path.join("../data", "nlwiki_20170620_tok_small.txt")
sentences = SentenceCorpus(WIKI_FILE)

# Keep words that occur at least 100 times, use a context window of 5 and
# 100-dimensional vectors. Note: in Gensim 4+ `size` is called `vector_size`.
model = gensim.models.Word2Vec(sentences, min_count=100, window=5, size=100)
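
Since training on five million sentences takes a while, it's convenient to save the trained model to disk and reload it later instead of retraining (the filename below is just an example):

# Persist the trained model and load it back without retraining.
model.save("word2vec_nl.model")
model = gensim.models.Word2Vec.load("word2vec_nl.model")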

Using Word Embeddings

Now that we have our embeddings trained, let’s explore their capabilities. We can access the embeddings using the wv attribute of the model. For instance:

# Retrieve the embedding for the word "koning" (king)
king_embedding = model.wv["koning"]
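
Each embedding is a NumPy array with as many dimensions as the size we chose during training, and model.wv also lets us check whether a word made it into the vocabulary at all:

print(king_embedding.shape)   # (100,) with the settings above
print("koning" in model.wv)   # True if "koning" passed the min_count threshold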

We can also measure the similarity between two words:

similarity_king_queen = model.wv.similarity("koning", "koningin")  # Expected: high
similarity_king_coffee = model.wv.similarity("koning", "koffie")   # Expected: low

Furthermore, finding words most similar to a target word is straightforward:

similar_words_to_king = model.wv.similar_by_word("koning", topn=10)

The model even allows us to explore analogies, such as koning - man + vrouw (king - man + woman), which we would expect to land near koningin (queen):

analogy_result = model.wv.most_similar(positive=['vrouw', 'koning'], negative=["man"], topn=10)
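
Both similar_by_word and most_similar return a list of (word, cosine similarity) tuples, so the results are easy to print or post-process:

for word, score in analogy_result:
    print(f"{word}\t{score:.3f}")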

Visualizing Embeddings

Word embeddings live in a high-dimensional space, which makes them hard to visualize directly. We use t-distributed Stochastic Neighbor Embedding (t-SNE) to map them down to two dimensions:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Take the 200 nearest neighbours of a target word and collect their vectors.
target_word = "belgië"
selected_words = [w[0] for w in model.wv.most_similar(positive=[target_word], topn=200)]
embeddings = np.array([model.wv[w] for w in selected_words])

# Map the 100-dimensional vectors to 2D with t-SNE, using cosine distance.
mapped_embeddings = TSNE(n_components=2, metric='cosine', init='pca').fit_transform(embeddings)

# Plot the 2D embeddings
plt.scatter(mapped_embeddings[:, 0], mapped_embeddings[:, 1])

# Annotate each point with its word
for i, txt in enumerate(selected_words):
    plt.annotate(txt, (mapped_embeddings[i, 0], mapped_embeddings[i, 1]))

plt.show()

Exploring Hyperparameters

Choosing the right hyperparameters is crucial. We evaluate the impact of embedding size and context window:

sizes = [100, 200, 300]
windows = [2, 5, 10]

# df is assumed to be a pandas DataFrame with the sizes as columns and the
# windows as its index; evaluate() and word2pos come from our evaluation
# setup (see the sketch below).
for size in sizes:
    for window in windows:
        # Note: in Gensim 4+, `size` is called `vector_size`.
        model = gensim.models.Word2Vec(sentences, min_count=100, window=window, size=size)
        acc = evaluate(model, word2pos)
        df[size][window] = acc
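
The evaluate function and the word2pos dictionary aren't shown above. As a rough illustration, here is a minimal sketch that measures how often a word's nearest neighbours share its part-of-speech tag, assuming word2pos maps words to POS tags (this is just one possible metric, not necessarily the exact one used here):

def evaluate(model, word2pos, topn=10):
    """Fraction of nearest neighbours that share the query word's POS tag."""
    correct, total = 0, 0
    for word, pos in word2pos.items():
        if word not in model.wv:
            continue
        for neighbour, _ in model.wv.most_similar(word, topn=topn):
            if neighbour in word2pos:
                total += 1
                correct += word2pos[neighbour] == pos
    return correct / total if total else 0.0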

The results suggest that smaller contexts tend to work better, and 200-dimensional embeddings strike a balance.

Clustering Embeddings

Clustered embeddings can be valuable for tasks like Named Entity Recognition. We use agglomerative clustering and save the clusters to a file:

from collections import defaultdict

from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

# model.wv.vocab works in Gensim < 4; in Gensim 4+ use model.wv.key_to_index.
vocab = list(model.wv.vocab)
vectors = [model.wv[w] for w in vocab]
vectors_norm = normalize(vectors)

clusterer = AgglomerativeClustering(n_clusters=500)
clusters = clusterer.fit_predict(vectors_norm)

# Group the words by their cluster id
cluster_dictionary = defaultdict(list)
for word, cluster in zip(vocab, clusters):
    cluster_dictionary[cluster].append(word)

# Save clusters to a file
with open("data/clusters_nl.tsv", "w") as o:
    for c in cluster_dictionary:
        for w in cluster_dictionary[c]:
            o.write(f"{w}\t{c}\n")

Conclusion

Word embeddings open up exciting possibilities in NLP, allowing us to model word meanings and discover semantic relationships. Gensim's Word2Vec implementation empowers us to navigate this landscape effortlessly. From training embeddings to visualizing them and exploring their hyperparameters, word embeddings offer a rich playground for language exploration.

In future experiments, we’ll leverage these embeddings for Named Entity Recognition and other advanced NLP tasks. Stay tuned for more insights into the fascinating world of word embeddings!
