Module 2 Lab: Embeddings and semantic similarity

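Build a simple semantic search over a small set of course documents, using word-count vectors and cosine similarity as a stand-in for learned embeddings.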

Lab Context

This lab uses synthetic support messages, retrieval snippets, intent labels, and factuality checks as a safe proxy for the course setting. It is not a substitute for institutional data, but it lets you practice the reasoning, metrics, and documentation pattern before working with real records.

Lab Tasks

  1. Run the baseline analysis.

  2. Identify the decision the metric supports.

  3. Change one threshold, score weight, or input assumption.

  4. Compare the result before and after your change.

  5. Record one deployment risk that the synthetic data cannot reveal.

import numpy as np

documents = [
    "refund policy requires receipt and request within thirty days",
    "technical support escalates login failures after three attempts",
    "privacy requests must be routed to data governance",
    "enterprise onboarding includes security review and admin setup",
]
queries = ["customer asks for refund timing", "user cannot log in", "delete my data"]
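# Shared vocabulary: every distinct lowercase word in the documents and queries.
vocab = sorted(set(" ".join(documents + queries).lower().split()))

def vec(text):
    """Return the word-count vector of `text` over `vocab`."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

# One row per document.
doc_matrix = np.vstack([vec(d) for d in documents])
for q in queries:
    qv = vec(q)
    # Cosine similarity between the query and every document. The max() guards
    # against division by zero when the query vector is all zeros.
    scores = doc_matrix @ qv / (np.linalg.norm(doc_matrix, axis=1) * max(np.linalg.norm(qv), 1e-9))
    # argmax returns the first index on ties, so a query with no word overlap
    # (all scores 0) still "matches" document 0.
    best = int(np.argmax(scores))
    print(q, "->", documents[best], "score=", round(float(scores[best]), 3))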

Output:

customer asks for refund timing -> refund policy requires receipt and request within thirty days score= 0.149
user cannot log in -> refund policy requires receipt and request within thirty days score= 0.0
delete my data -> privacy requests must be routed to data governance score= 0.204

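
The second query scores 0.0 against every document because "log" and "in" never appear as separate words in the documents ("login" does), and argmax over an all-zero score vector returns index 0. One possible change for Task 3 is a minimum-score threshold, so that a query with no usable overlap is reported as unmatched instead of being routed to the refund policy. The sketch below is an assumption, not part of the baseline: the `search` function, the `min_score` parameter, and its default of 0.1 are hypothetical, and the cosine score is re-implemented with the standard library so the block runs on its own.

```python
import math
from collections import Counter

documents = [
    "refund policy requires receipt and request within thirty days",
    "technical support escalates login failures after three attempts",
    "privacy requests must be routed to data governance",
    "enterprise onboarding includes security review and admin setup",
]

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, min_score=0.1):
    """Return (document, score), or (None, score) when the best score is below min_score.

    min_score is a hypothetical threshold chosen for illustration only.
    """
    qv = Counter(query.lower().split())
    scores = [cosine(qv, Counter(d.lower().split())) for d in documents]
    best = max(range(len(documents)), key=scores.__getitem__)
    if scores[best] < min_score:
        return None, scores[best]
    return documents[best], scores[best]

for q in ["customer asks for refund timing", "user cannot log in", "delete my data"]:
    doc, score = search(q)
    print(q, "->", doc if doc is not None else "NO MATCH", "score=", round(score, 3))
```

With a threshold in place, the decision the metric supports becomes "route or escalate" rather than "pick the closest document". The value that separates the two is something to tune on real queries, not on three synthetic ones.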
reflection = {
    "what_changed": "",
    "metric_before": "",
    "metric_after": "",
    "interpretation": "",
    "synthetic_data_limit": "",
    "next_real_world_evidence_needed": "",
}
reflection
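
As one way to work through Tasks 3 and 4 and fill in part of the template, the sketch below changes a single input assumption, the tokenization, and compares the best score for the query that failed in the baseline. It is illustrative only: `word_tokens`, `trigram_tokens`, and `best_match` are hypothetical helpers, character trigrams are a swapped-in technique rather than the lab's method, and the cosine score is re-implemented with the standard library so the block is self-contained.

```python
import math
from collections import Counter

documents = [
    "refund policy requires receipt and request within thirty days",
    "technical support escalates login failures after three attempts",
    "privacy requests must be routed to data governance",
    "enterprise onboarding includes security review and admin setup",
]
query = "user cannot log in"

def word_tokens(text):
    """Baseline assumption: one token per whitespace-separated word."""
    return text.lower().split()

def trigram_tokens(text):
    """Changed assumption: character trigrams within each word.

    Words shorter than three characters produce no tokens.
    """
    return [w[i:i + 3] for w in text.lower().split() for i in range(len(w) - 2)]

def best_match(query, tokenize):
    """Return (document, cosine score) for the best match, or (None, 0.0) if nothing overlaps."""
    qv = Counter(tokenize(query))
    best_doc, best_score = None, 0.0
    for d in documents:
        dv = Counter(tokenize(d))
        dot = sum(qv[t] * dv[t] for t in qv)
        norm = math.sqrt(sum(v * v for v in qv.values())) * math.sqrt(sum(v * v for v in dv.values()))
        score = dot / norm if norm else 0.0
        if score > best_score:
            best_doc, best_score = d, score
    return best_doc, best_score

before = best_match(query, word_tokens)
after = best_match(query, trigram_tokens)

# Fill in only the fields that can be computed; the rest need your own judgment.
reflection = {
    "what_changed": "tokenization: whole words -> character trigrams",
    "metric_before": f"{query!r}: best score {before[1]:.3f}, match {before[0]!r}",
    "metric_after": f"{query!r}: best score {after[1]:.3f}, match {after[0]!r}",
    "interpretation": "",
    "synthetic_data_limit": "",
    "next_real_world_evidence_needed": "",
}
print(reflection)
```

With trigrams, "log" in the query overlaps with "login" in the technical support document, so the query now has a nonzero best match. Whether that trade is worth the extra spurious overlaps trigrams can introduce is the kind of question the `interpretation` field is meant to capture.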