Pyjama Embeddings: Asking the Big Questions

Fun with Clojure: Asking the Big Questions

Clojure might be known for its Lisp-y elegance and its ability to wrangle data like a pro, but today, we’re using it for something far more profound: uncovering the deepest truths of the universe. Why does the sky turn red? Why do fireworks explode? Why did the sun even bother to rise? Thanks to some Clojure magic, we’ve got the answers—straight from a highly sophisticated knowledge base (i.e., a delightfully whimsical source_of_truth.txt).

The Setup: A Source of Truth (Sort Of)

Full listing on GitHub

Our starting point is a file containing the real explanations for natural phenomena, such as:

The sky is blue because the smurfs are blue.
The moon glows at night because it swallowed a flashlight.
Fireworks explode because they get too excited.

Clearly, top-tier science. But how do we get Clojure to serve up these brilliant insights on demand? Let’s dive into the code.

We’re using a strategy called vector embeddings to compare the meaning of a question to our carefully curated facts. Instead of relying on simple keyword matching, we embed our sentences in a numerical space where similar meanings are close together. Enter pyjama.embeddings, which will help us match questions to their best-matching explanations.
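
To make "close together" concrete, here is a tiny sketch with hand-made three-dimensional vectors standing in for real embeddings. The numbers and the names toy-embeddings, cosine-similarity, and closest-sentence are invented for this illustration; a real embedding model produces vectors with hundreds of dimensions.

```clojure
;; Toy vectors invented for illustration; no model produced them.
(def toy-embeddings
  {"The sky is blue because the smurfs are blue."    [0.9 0.1 0.0]
   "Fireworks explode because they get too excited." [0.1 0.9 0.2]})

(defn dot [a b] (reduce + (map * a b)))

(defn norm [a] (Math/sqrt (dot a a)))

(defn cosine-similarity
  "1.0 when the vectors point the same way, 0.0 when they are orthogonal."
  [a b]
  (/ (dot a b) (* (norm a) (norm b))))

(defn closest-sentence
  "Returns the sentence whose vector is most similar to query-vector."
  [embeddings query-vector]
  (key (apply max-key #(cosine-similarity query-vector (val %)) embeddings)))

;; A pretend question vector that points mostly along the first axis.
(closest-sentence toy-embeddings [0.8 0.2 0.1])
;; => "The sky is blue because the smurfs are blue."
```

No keyword from the question is involved here: the match is decided purely by which stored vector points in the most similar direction.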

(def url (or (System/getenv "OLLAMA_URL")
             "http://localhost:11432"))

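Here, we check whether an OLLAMA_URL environment variable is set; otherwise, we fall back to an embedding service on localhost. This means our program can run locally or in a cloud environment with minimal fuss. Note that a stock Ollama install listens on port 11434, so adjust the fallback if your server is not on 11432.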

Next, we define our embedding model:

(def embedding-model
  "granite-embedding")

This model will be responsible for converting our sentences into vector representations that can be compared mathematically.

Then we load our so-called “source of truth” into a list:

(def source-of-truth
  (pyjama.utils/load-lines-of-file "test/morning/source_of_truth.txt"))

Boom! Now Clojure knows all the secrets of the universe.
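
The body of pyjama.utils/load-lines-of-file is not shown in this post. Assuming it simply returns one string per line, a stand-in built on Clojure's standard library might look like the sketch below. The name read-nonblank-lines and the choice to skip blank lines are assumptions made for this example.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn read-nonblank-lines
  "Hypothetical helper: returns the non-blank lines of a file,
   trimmed, as a vector of strings."
  [path]
  (with-open [reader (io/reader path)]
    ;; mapv forces the lazy line-seq before the reader is closed.
    (->> (line-seq reader)
         (remove str/blank?)
         (mapv str/trim))))

;; Usage, assuming the file exists relative to the working directory:
;; (read-nonblank-lines "test/morning/source_of_truth.txt")
```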

The Test Configuration

To keep things organized, we create a test-config map that holds important settings:

(def test-config {:url             url
                  :chunk-size      30
                  :documents       source-of-truth
                  :embedding-model embedding-model})

Here, :chunk-size 30 most likely caps the size of each piece of text that gets embedded, and :documents holds our delightful facts.

The Core Function: Finding the Right Answer

When a user asks a question, we want to find the best-matching explanation from our source of truth. The function strategy-and-question does exactly that:

(defn strategy-and-question [config question strategy]
  (let [documents (pyjama.embeddings/generate-vectorz-documents config)

        config (assoc config
                 :question question
                 :documents documents
                 :strategy strategy
                 :top-n 1)]

    (pyjama.embeddings/enhanced-context config)))

Here’s what’s happening step by step:

  1. Convert the source-of-truth sentences into vector representations using generate-vectorz-documents.
  2. Update the config to include the user’s question, the vectorized documents, and the chosen strategy (:manhattan distance in this case).
  3. Find the closest match with enhanced-context, which retrieves the most semantically similar sentence.
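
The three steps above can be sketched without the library. This is not how pyjama.embeddings is implemented: fake-embed is a hypothetical stand-in that counts letters instead of calling a model, and retrieve, manhattan-distance, and the facts vector are names chosen only for this illustration.

```clojure
(require '[clojure.string :as str])

(defn fake-embed
  "Hypothetical stand-in for an embedding model:
   a 26-slot vector of letter counts (a..z)."
  [text]
  (let [counts (frequencies (re-seq #"[a-z]" (str/lower-case text)))]
    (mapv #(get counts (str (char (+ (int \a) %))) 0) (range 26))))

(defn manhattan-distance
  "Sum of absolute differences along each dimension."
  [a b]
  (reduce + (map #(Math/abs (double (- %1 %2))) a b)))

(defn retrieve
  "Step 1: embed every document. Step 2: embed the question.
   Step 3: return the top-n documents closest to the question."
  [documents question top-n]
  (let [doc-vectors     (map (fn [doc] {:text doc :vector (fake-embed doc)}) documents)
        question-vector (fake-embed question)]
    (->> doc-vectors
         (sort-by #(manhattan-distance question-vector (:vector %)))
         (take top-n)
         (mapv :text))))

(def facts
  ["Fireworks explode because they get too excited."
   "The moon glows at night because it swallowed a flashlight."])

(retrieve facts "Why did the fireworks explode" 1)
;; => ["Fireworks explode because they get too excited."]
```

Letter counts carry no meaning, so this toy only rewards shared spelling. Swapping fake-embed for a real embedding model is what turns the same pipeline into semantic search.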

Testing It Out

To ensure our cosmic wisdom retrieval system works, we run some tests:

(deftest strategies-test-three
  (println
    "\n"
    (strategy-and-question test-config "Why is the sky red" :manhattan)
    "\n"
    (strategy-and-question test-config "Why did the fireworks explode" :manhattan)
    "\n"
    (strategy-and-question test-config "Why did the sun rise" :manhattan)))

The Result? Pure Brilliance!

Running this test gives us:

The sky is red in the evening because the grand smurf is too.
Fireworks explode because they get too excited.
The sun rises because it forgot to set an alarm.

Mission accomplished. We’ve successfully built an AI-powered oracle of whimsical truths!

Of course, different questions call for different ways of finding the most relevant answer. That’s why our code supports multiple strategies for comparing embeddings:

(def strategies [:cosine :euclidean :dot :manhattan :minkowski :jaccard :pearson])
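
Most of these names refer to standard vector measures. As a reference point, here is how they could be written in plain Clojure. This is a sketch of the textbook definitions, not pyjama's implementation, and every function name below is chosen for this example. How the library applies Jaccard to real-valued vectors is not shown in this post, so the set-based version here is only the generic definition.

```clojure
(require '[clojure.set :as set])

(defn dot-product [a b] (reduce + (map * a b)))

(defn magnitude [a] (Math/sqrt (dot-product a a)))

(defn cosine-similarity [a b]
  (/ (dot-product a b) (* (magnitude a) (magnitude b))))

(defn minkowski-distance
  "Generalized distance: p = 1 gives Manhattan, p = 2 gives Euclidean."
  [p a b]
  (Math/pow (reduce + (map #(Math/pow (Math/abs (double (- %1 %2))) p) a b))
            (/ 1.0 p)))

(defn manhattan-distance [a b] (minkowski-distance 1 a b))
(defn euclidean-distance [a b] (minkowski-distance 2 a b))

(defn mean [a] (/ (reduce + a) (double (count a))))

(defn pearson-correlation
  "Cosine similarity of the two vectors after subtracting each one's mean."
  [a b]
  (let [center (fn [v] (let [m (mean v)] (map #(- % m) v)))]
    (cosine-similarity (center a) (center b))))

(defn jaccard-similarity
  "Shared elements divided by all distinct elements, treating inputs as sets."
  [a b]
  (let [sa (set a) sb (set b)]
    (/ (count (set/intersection sa sb))
       (double (count (set/union sa sb))))))

(manhattan-distance [1 2 3] [2 4 6]) ;; => 6.0
(jaccard-similarity [1 2 3] [2 4 6]) ;; => 0.2
```

Note that similarities (cosine, dot, Pearson, Jaccard) grow as vectors get closer, while distances (Euclidean, Manhattan, Minkowski) shrink, so any ranking code has to sort in the right direction for each.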

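Each of these strategies has its own approach to measuring similarity. Cosine similarity checks the angle between two vectors, while Euclidean distance measures straight-line distance. The dot product multiplies matching components and sums them, Manhattan distance adds up the absolute differences along each dimension, and Minkowski distance generalizes both Manhattan and Euclidean through an exponent. Jaccard similarity compares shared elements, and Pearson correlation measures how strongly two sets of values vary together. To see them all in action, we run: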

(deftest strategies-and-one-question
  (pyjama.io.print/print-table
    [:strategy :question :document]
    (generate-results strategies ["Why did the sun rise"])))

This prints out a lovely table comparing answers across all strategies. Because let’s be honest: sometimes you need a second opinion on why the sun even bothered to show up today! The sample run below asked about fluffy clouds instead:

| strategy   | question                   | document                                              |
|------------|----------------------------|-------------------------------------------------------|
| :cosine    | Why are the clouds fluffy? | Clouds are fluffy because the sky loves cotton candy. |
| :euclidean | Why are the clouds fluffy? | Clouds are fluffy because the sky loves cotton candy. |
| :dot       | Why are the clouds fluffy? | Clouds are fluffy because the sky loves cotton candy. |
| :manhattan | Why are the clouds fluffy? | Clouds are fluffy because the sky loves cotton candy. |
| :minkowski | Why are the clouds fluffy? | The wind howls because it’s trying to sing opera.     |
| :jaccard   | Why are the clouds fluffy? | The sky is blue because the smurfs are blue.          |
| :pearson   | Why are the clouds fluffy? | Clouds are fluffy because the sky loves cotton candy. |
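
The generate-results helper used by the test is not shown in this post. One plausible shape, offered as a guess rather than the actual implementation, builds one row per strategy and question. The names generate-results-sketch and stub-answer are hypothetical, and the answer function is passed in so the sketch runs without an embedding server. The table is printed with clojure.pprint/print-table from Clojure's standard library.

```clojure
(require '[clojure.pprint :as pprint])

(defn generate-results-sketch
  "Hypothetical: one row per strategy/question pair.
   answer-fn takes a question and a strategy and returns the best document."
  [answer-fn strategies questions]
  (for [strategy strategies
        question questions]
    {:strategy strategy
     :question question
     :document (answer-fn question strategy)}))

;; Stub answer function standing in for a real embedding lookup.
(defn stub-answer [question strategy]
  (str "answer to '" question "' via " (name strategy)))

(def rows
  (generate-results-sketch stub-answer
                           [:cosine :manhattan]
                           ["Why did the sun rise"]))

;; Prints one table row per map, with the given keys as columns.
(pprint/print-table [:strategy :question :document] rows)
```

With the real library available, the stub could be replaced by something like (fn [question strategy] (strategy-and-question test-config question strategy)), reusing the function defined earlier in the post.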

Wrapping Up

This little Clojure experiment is a perfect example of how vector embeddings can be used for semantic search—even if our application is more playful than practical. Instead of rigid keyword matching, we can find answers based on meaning, which makes our search system much more flexible and intelligent.

Could this technique be used for serious applications like FAQ bots, customer support, or scientific research? Absolutely. But for now, we’re just happy knowing that ice floats because it’s trying to get closer to the sun. 🌞