In this post we will explore a novel approach to Retrieval-Augmented Generation (RAG) called HyPE (Hypothetical Prompt Embeddings), which I recently came across in a preprint paper. This technique tries to address one of the fundamental challenges in RAG systems: the semantic mismatch between user queries and document content. If you’ve ever built a RAG system, you’ve probably felt the frustration when your carefully crafted vector search returns seemingly irrelevant results. At least for me, it was always tremendously annoying when a question like “What is quantum entanglement?” wouldn’t reliably match a document section that clearly explains quantum entanglement.
The Question-Answer Alignment Problem
Traditional RAG systems working with user questions face a fundamental challenge: user queries (direct questions or user input rewritten by the preprocessing model) are typically phrased as questions (“What is quantum entanglement?”), while documents are written in expository form (“Quantum entanglement is a phenomenon…”). Add to this chunking, which may not always be optimal, and we end up with a semantic gap that vector similarity search struggles to bridge effectively.
Most RAG implementations try to solve this at query time - either by reformulating the user’s question or by generating hypothetical answers (like in HyDE - note the “D” instead of “P”). However, these approaches add latency and complexity to each query. You’re essentially asking the LLM to do extra work every single time someone searches for something.
Hypothetical Prompt Embeddings (HyPE) takes a different approach by addressing the question-answer alignment problem during the indexing phase rather than at query time. The core insight is elegant: instead of embedding document chunks directly, we generate hypothetical questions that each chunk would answer, then embed those questions. The entire vector store becomes a collection of de facto Q-and-A pairs, where each document chunk is associated with multiple hypothetical questions.
This transforms retrieval from a typical “question-to-document” matching into “question-to-question” matching, creating better semantic alignment. It’s a simple but powerful idea that - at least according to the paper - can dramatically improve retrieval quality.
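To make this concrete: a chunk that begins “Quantum entanglement is a phenomenon…” might be indexed under hypothetical questions like “What is quantum entanglement?” or “How does quantum entanglement work?”, so the problematic query from the introduction now has a near-verbatim match in the index.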
How HyPE Works
The HyPE pattern consists of two main phases:
Indexing Phase (One-time setup):
- Chunk documents into manageable pieces
- For each chunk, use an LLM to generate 3-5 hypothetical questions that the chunk would answer
- Embed these questions (not the original chunk text)
- Store the question embeddings alongside the original chunk content
Retrieval Phase (Runtime):
- Embed the user question directly (no extra LLM call needed!)
- Perform vector search to find similar hypothetical questions
- Extract the original chunks from matching question records
- Generate the final response using the retrieved chunks
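Before wiring this up with real services, here is a minimal, self-contained sketch of the two phases. Note that generate_questions and embed_text are toy stand-ins (names I made up for illustration) for the LLM and embedding calls that the full implementation below performs via Semantic Kernel:

import hashlib
import numpy as np

# Toy stand-ins (hypothetical names) for the LLM and embedding services.
def generate_questions(chunk: str) -> list[str]:
    # A real implementation would ask an LLM for 3-5 questions per chunk.
    return [f"What is {chunk.split(' is ')[0]}?"]

def embed_text(text: str) -> np.ndarray:
    # Deterministic fake embedding so identical questions map to identical
    # vectors; a real implementation would call an embedding model.
    seed = int.from_bytes(hashlib.md5(text.lower().encode()).digest()[:4], "big")
    return np.random.default_rng(seed).normal(size=8)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunks = [
    "quantum entanglement is a phenomenon where particles share state...",
    "quantum key distribution is a method for secure communication...",
]

# Indexing phase: embed the hypothetical questions, not the chunks themselves.
index = [(embed_text(q), chunk) for chunk in chunks for q in generate_questions(chunk)]

# Retrieval phase: embed the query directly (no LLM call) and match
# question-to-question, then hand back the original chunk.
query_vec = embed_text("What is quantum entanglement?")
best_chunk = max(index, key=lambda pair: cosine(query_vec, pair[0]))[1]
print(best_chunk)  # -> the entanglement chunk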
Sample implementation with Semantic Kernel
Let’s implement this pattern using Semantic Kernel. We’ll start by setting up the necessary dependencies and data structures.
Note: For this example, we will use Python, but the same principles apply to C# and Java (the other languages supported by Semantic Kernel).
First, let’s install the required packages:
pip install semantic-kernel[azure] python-dotenv numpy
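The sample reads its Azure OpenAI configuration from a .env file via python-dotenv. Based on the environment variables used in main() further down, a matching .env would look roughly like this (all values are placeholders):

AZURE_OPENAI_ENDPOINT=https://<your-resource>.openai.azure.com/
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT_NAME=gpt-4o-mini
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=text-embedding-ada-002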
Now, let’s define the core data structures for our HyPE implementation:
import asyncio
import os
import sys
import numpy as np
from typing import List, Dict, Any, Optional
from dataclasses import dataclass
import json
from dotenv import load_dotenv
from semantic_kernel.agents import ChatCompletionAgent
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion, AzureTextEmbedding
from semantic_kernel.kernel import Kernel
from semantic_kernel.functions import kernel_function, KernelArguments
from typing import Annotated
load_dotenv()
@dataclass
class Document:
id: str
content: str
metadata: Dict[str, Any]
    embedding: Optional[np.ndarray] = None
@dataclass
class HypotheticalQuestion:
question: str
question_embedding: np.ndarray
original_chunk_id: str
original_chunk_content: str
original_chunk_metadata: Dict[str, Any]
The HyPE Vector Store
The core of our implementation is the HyPEVectorStore class, which handles both the indexing and retrieval phases:
class HyPEVectorStore:
def __init__(self):
self.hypothetical_questions: List[HypotheticalQuestion] = []
self.embeddings_service = None
self.llm_service = None
def set_services(self, embeddings_service, llm_service):
self.embeddings_service = embeddings_service
self.llm_service = llm_service
This will be quite simplistic - but intentionally so - we want to focus on the HyPE pattern itself rather than on building a robust, production-ready application. In real-world use, you’d likely integrate this with a proper vector database or storage solution.
The most critical component here is the hypothetical question generation. This is where we use Semantic Kernel to create a function that generates questions for each document chunk. As is usually the case with LLM-bound tasks, prompt engineering is the delicate part - we want questions that sound natural but are specific enough to retrieve the exact content:
async def generate_hypothetical_questions(self, chunk_content: str, chunk_metadata: Dict[str, Any]) -> List[str]:
try:
kernel = Kernel()
kernel.add_service(self.llm_service)
question_generator = kernel.add_function(
plugin_name="HyPE",
function_name="GenerateQuestions",
prompt="""You are an expert at generating hypothetical questions from document content for information retrieval.
Analyze the following text and generate 3-5 essential questions that, when answered, would capture the main points and core meaning of the content.
These questions should:
- Be specific and detailed enough to retrieve this exact content
- Cover different aspects of the information presented
- Be phrased as natural questions a user might ask
- Focus on key facts, numbers, goals, and important details
- Be the type of questions that would lead someone to search for this specific information
Text to analyze:
{{$chunk_content}}
Generate only the questions, one per line, without numbering or bullet points:""",
template_format="semantic-kernel"
)
kernel_arguments = KernelArguments(chunk_content=chunk_content)
response = await kernel.invoke(
plugin_name="HyPE",
function_name="GenerateQuestions",
arguments=kernel_arguments
)
# Process and clean the generated questions
if not response or not str(response).strip():
print(f"โ ๏ธ Empty response from LLM for chunk {chunk_metadata.get('project', 'Unknown')}")
return []
questions_text = str(response).strip()
        questions = [q.strip() for q in questions_text.split('\n') if q.strip() and not q.strip().startswith(('1.', '2.', '3.', '4.', '5.', '-', '*', '•'))]
# Clean up formatting and ensure quality
cleaned_questions = []
for q in questions:
q = q.strip()
if q:
# Remove common prefixes like "1. ", "- ", etc.
                for prefix in ['1. ', '2. ', '3. ', '4. ', '5. ', '- ', '* ', '• ']:
if q.startswith(prefix):
q = q[len(prefix):].strip()
# Only add if it's a "reasonable" question
if len(q) > 10 and '?' in q:
cleaned_questions.append(q)
project = chunk_metadata.get('project', 'Unknown')
section = chunk_metadata.get('section', 'Unknown')
print(f"๐ก Generated {len(cleaned_questions)} hypothetical questions for chunk: {project}/{section}")
for i, q in enumerate(cleaned_questions[:3], 1):
print(f" {i}. {q[:80]}{'...' if len(q) > 80 else ''}")
return cleaned_questions[:5] # Limit to 5 questions max
except Exception as e:
print(f"โ ๏ธ Error generating hypothetical questions: {e}")
return []
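As a quick sanity check, here is how a single chunk flows through this function once the store has been handed real services (a usage sketch; embeddings_service and chat_completion_service are created in main() below, and the call must run inside an async context):

# Usage sketch, assuming the services from main() below already exist.
store = HyPEVectorStore()
store.set_services(embeddings_service, chat_completion_service)

questions = await store.generate_hypothetical_questions(
    "Project Mousetrap is a quantum key distribution research initiative...",
    {"project": "Mousetrap", "section": "Overview"},
)
# e.g. ["What is the primary goal of Project Mousetrap?", ...]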
The indexing process takes document chunks and creates hypothetical question embeddings for each one:
async def add_documents_with_hype(self, documents: List[Document]):
print(f"\n๐ฌ Starting HyPE indexing for {len(documents)} document chunks...")
for i, doc in enumerate(documents):
print(f"\n๐ Processing chunk {i+1}/{len(documents)}: {doc.metadata.get('project', 'Unknown')}/{doc.metadata.get('section', 'Unknown')}")
# Generate hypothetical questions for this chunk
questions = await self.generate_hypothetical_questions(doc.content, doc.metadata)
if not questions:
print(f"โ ๏ธ No questions generated for chunk {doc.id}, skipping...")
continue
# Create embeddings for each question
print(f"๐ฎ Embedding {len(questions)} hypothetical questions...")
question_texts = questions
embeddings = await self.embeddings_service.generate_embeddings(question_texts)
# Store each question with its embedding and link to original content
for question, embedding in zip(questions, embeddings):
hyp_question = HypotheticalQuestion(
question=question,
question_embedding=np.array(embedding),
original_chunk_id=doc.id,
original_chunk_content=doc.content,
original_chunk_metadata=doc.metadata
)
self.hypothetical_questions.append(hyp_question)
print(f"\nโ
HyPE indexing complete! Created {len(self.hypothetical_questions)} hypothetical question embeddings")
Next, we need to implement the retrieval phase. This is where we search for hypothetical questions similar to the user query, ranking them by cosine similarity between the query embedding and each stored question embedding: sim(q, h) = (q · h) / (||q|| ||h||).
def search_by_question_similarity(self, query_embedding: np.ndarray, top_k: int = 3) -> List[HypotheticalQuestion]:
if not self.hypothetical_questions:
return []
similarities = []
for hyp_q in self.hypothetical_questions:
similarity = np.dot(query_embedding, hyp_q.question_embedding) / (
np.linalg.norm(query_embedding) * np.linalg.norm(hyp_q.question_embedding)
)
similarities.append((similarity, hyp_q))
# sort by similarity and return top results
similarities.sort(key=lambda x: x[0], reverse=True)
return [hyp_q for _, hyp_q in similarities[:top_k]]
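The linear scan above is fine for a demo. For larger indexes you would likely stack all question embeddings into a single matrix once and let NumPy do the ranking; here is a minimal sketch of that variant (my own addition, not part of the original sample):

import numpy as np

def search_vectorized(query_embedding: np.ndarray,
                      question_matrix: np.ndarray,  # shape (n_questions, dim)
                      top_k: int = 3) -> np.ndarray:
    # Normalize rows and the query once; cosine similarity then reduces to
    # a single matrix-vector product.
    q = query_embedding / np.linalg.norm(query_embedding)
    m = question_matrix / np.linalg.norm(question_matrix, axis=1, keepdims=True)
    similarities = m @ q
    return np.argsort(similarities)[::-1][:top_k]  # indices of the best matches

# Toy usage with random vectors; row 3 should rank first for its own query.
matrix = np.random.default_rng(0).normal(size=(10, 8))
print(search_vectorized(matrix[3], matrix, top_k=3))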
Since indexing can be time-consuming for large document sets, we’ll add persistence functionality to save and load the HyPE index:
def save_hype_index(hype_store: HyPEVectorStore, file_path: str):
"""Save the HyPE index to disk for persistence."""
os.makedirs(os.path.dirname(file_path), exist_ok=True)
index_data = []
for hyp_q in hype_store.hypothetical_questions:
index_data.append({
'question': hyp_q.question,
'question_embedding': hyp_q.question_embedding.tolist(),
'original_chunk_id': hyp_q.original_chunk_id,
'original_chunk_content': hyp_q.original_chunk_content,
'original_chunk_metadata': hyp_q.original_chunk_metadata
})
with open(file_path, 'w', encoding='utf-8') as f:
json.dump(index_data, f, indent=2)
print(f"๐พ Saved HyPE index with {len(index_data)} questions to {file_path}")
def load_hype_index(file_path: str) -> HyPEVectorStore:
"""Load a previously saved HyPE index from disk."""
if not os.path.exists(file_path):
return None
try:
with open(file_path, 'r', encoding='utf-8') as f:
index_data = json.load(f)
hype_store = HyPEVectorStore()
for item in index_data:
hyp_q = HypotheticalQuestion(
question=item['question'],
question_embedding=np.array(item['question_embedding']),
original_chunk_id=item['original_chunk_id'],
original_chunk_content=item['original_chunk_content'],
original_chunk_metadata=item['original_chunk_metadata']
)
hype_store.hypothetical_questions.append(hyp_q)
print(f"๐ฅ Loaded HyPE index with {len(hype_store.hypothetical_questions)} questions from {file_path}")
return hype_store
except Exception as e:
print(f"โ ๏ธ Error loading HyPE index: {e}")
return None
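JSON keeps the demo dependency-free and human-readable, but round-tripping embeddings through nested lists gets bulky as the index grows. One alternative (a sketch, not part of the original sample) is to persist the vectors in NumPy’s binary format and keep only the questions and metadata in JSON:

import numpy as np

# Sketch: store all question embeddings as one binary matrix; row i
# corresponds to the i-th question in the JSON metadata file.
matrix = np.stack([hq.question_embedding for hq in hype_store.hypothetical_questions])
np.save("hype_embeddings.npy", matrix)

# Loading is a single call; rows are reattached to questions by index.
matrix = np.load("hype_embeddings.npy")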
Sample Data and Document Loading
For this demonstration, we’ll use a synthetic dataset containing information about fictional quantum research projects. The data is structured as a Markdown file where each project is separated by # Project headers, and project sections are delineated by ## subheaders.
Here’s an example of the expected data format:
# Project Mousetrap (Budget: $2.5M)
## Overview
Project Mousetrap is a quantum key distribution research initiative focused on developing secure communication protocols for quantum networks.
## Technical Approach
The project utilizes advanced photonic quantum systems to implement BB84 and similar quantum cryptography protocols...
## Goals
- Develop practical quantum key distribution systems
- Achieve 99.9% security guarantees
- Enable secure quantum communications over 50km distances
# Project Falcon (Budget: $4.2M)
## Overview
Project Falcon focuses on quantum error correction research for fault-tolerant quantum computing systems...
The document loading and chunking function processes this format:
def load_and_chunk_projects_data(file_path: str) -> List[Document]:
"""Load and chunk the projects markdown file into document segments."""
with open(file_path, 'r', encoding='utf-8') as file:
content = file.read()
# Split by project headers
projects = content.split('# Project')[1:] # Skip the first empty element
documents = []
for i, project in enumerate(projects):
project_content = f"# Project{project}" # Restore the header
# Extract project name from the first line
first_line = project_content.split('\n')[0]
project_name = first_line.replace('# Project ', '').split(' (')[0]
# Split each project into sections by ## headers
sections = project_content.split('\n## ')
for j, section in enumerate(sections):
if j == 0:
# First section includes the main project header
section_content = section
section_name = "Overview"
else:
# Restore the ## header for subsequent sections
section_content = f"## {section}"
section_name = section.split('\n')[0]
# Create a document for each section
doc_id = f"project_{i}_section_{j}"
doc = Document(
id=doc_id,
content=section_content.strip(),
metadata={
'project': project_name,
'section': section_name,
'project_index': i,
'section_index': j
}
)
documents.append(doc)
return documents
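Run against the sample format above, the chunker yields one Document per section, with project and section names captured in metadata (a usage sketch). One quirk worth noting: the block holding the # Project header (j == 0) is also labeled "Overview", which is why the demo output below shows two Mousetrap/Overview chunks:

# Usage sketch: chunk the sample file and inspect the results.
docs = load_and_chunk_projects_data("projects.md")
for d in docs[:3]:
    print(d.id, "->", d.metadata["project"], "/", d.metadata["section"])
# e.g. project_0_section_0 -> Mousetrap / Overview
#      project_0_section_1 -> Mousetrap / Overview
#      project_0_section_2 -> Mousetrap / Technical Approach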
Now we can create a Semantic Kernel plugin that exposes the HyPE functionality to our AI agent:
class QuantumProjectsHyPEPlugin:
"""HyPE-enhanced plugin for searching quantum project information"""
def __init__(self, hype_store: HyPEVectorStore, embeddings_service):
self.hype_store = hype_store
self.embeddings_service = embeddings_service
@kernel_function(description="Search quantum project information using HyPE to answer questions about quantum research projects.")
async def search_quantum_projects_hype(
self,
query: Annotated[str, "The search query about quantum projects"]
) -> Annotated[str, "Returns relevant information about quantum projects"]:
"""Search for quantum project information using HyPE question-to-question matching."""
print(f"\n๐ HyPE Search: {query}")
# embed the user query
query_embeddings = await self.embeddings_service.generate_embeddings([query])
query_embedding = np.array(query_embeddings[0])
# find similar hypothetical questions
similar_questions = self.hype_store.search_by_question_similarity(query_embedding, top_k=3)
if not similar_questions:
return "No relevant quantum project information found."
# extract original chunks from matching questions
context = []
seen_chunks = set() # Avoid duplicate chunks
print("๐ฏ Question-to-Question Matches Found:")
for i, hyp_q in enumerate(similar_questions):
chunk_id = hyp_q.original_chunk_id
if chunk_id not in seen_chunks:
seen_chunks.add(chunk_id)
project = hyp_q.original_chunk_metadata.get('project', 'Unknown Project')
section = hyp_q.original_chunk_metadata.get('section', 'Unknown Section')
print(f" {i+1}. Matched Question: \"{hyp_q.question[:100]}{'...' if len(hyp_q.question) > 100 else ''}\"")
print(f" From: {project} โ {section}")
context.append(f"From {project} - {section}:\n{hyp_q.original_chunk_content}")
result = "\n\n---\n\n".join(context)
print(f"๐ Retrieved {len(context)} unique document chunks via HyPE")
return result
Putting It All Together
Let’s create a complete example that demonstrates the HyPE pattern in action. We’ll load our quantum projects data, index it with HyPE, and create an agent that can answer user queries with improved retrieval accuracy.
async def main():
print("๐ HyPE-Enhanced Quantum Projects RAG Demo")
print("=" * 60)
# Initialize services
chat_completion_service = AzureChatCompletion(
deployment_name=os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini"),
endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
)
embeddings_service = AzureTextEmbedding(
deployment_name=os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME", "text-embedding-ada-002"),
endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
)
# Initialize HyPE vector store
hype_store = HyPEVectorStore()
hype_store.set_services(embeddings_service, chat_completion_service)
# Load and process documents (implementation depends on your data)
data_path = os.path.join(os.path.dirname(__file__), "..", "shared-data", "projects.md")
documents = load_and_chunk_projects_data(data_path)
print(f"๐ Created {len(documents)} document chunks")
# HyPE indexing phase with persistence
index_file_path = os.path.join(os.path.dirname(__file__), "data", "hype_index.json")
if os.path.exists(index_file_path):
existing_hype_store = load_hype_index(index_file_path)
if existing_hype_store and existing_hype_store.hypothetical_questions:
hype_store = existing_hype_store
hype_store.set_services(embeddings_service, chat_completion_service)
print("โ
Loaded existing HyPE index")
else:
print("๐จ Creating new HyPE index...")
await hype_store.add_documents_with_hype(documents)
save_hype_index(hype_store, index_file_path)
else:
print("๐จ Creating new HyPE index...")
await hype_store.add_documents_with_hype(documents)
save_hype_index(hype_store, index_file_path)
# Create the HyPE-enhanced plugin
quantum_plugin = QuantumProjectsHyPEPlugin(hype_store, embeddings_service)
# Create AI agent with the plugin
agent = ChatCompletionAgent(
service=chat_completion_service,
name="QuantumProjectsHyPEAgent",
instructions="""You are a helpful assistant specializing in quantum research projects.
You have access to information about various quantum computing projects through HyPE (Hypothetical Prompt Embeddings).
Use the search_quantum_projects_hype function to find relevant information from the quantum projects database.
This system uses advanced question-to-question matching for improved retrieval accuracy.
Always base your answers on the retrieved information and cite which projects the information comes from.
If you can't find relevant information, say so clearly.""",
plugins=[quantum_plugin],
)
# Test the system
test_queries = [
"What is Project Mousetrap and what are its goals?",
"Which quantum projects have the largest budgets?",
"Tell me about quantum key distribution research",
]
thread = None
for i, query in enumerate(test_queries, 1):
print(f"\n{'='*60}")
print(f"๐ฌ Test Query {i}/{len(test_queries)}: {query}")
print(f"{'='*60}")
response = await agent.get_response(messages=query, thread=thread)
if response:
print(f"\n๐ค Agent Response:\n{response}")
thread = response.thread
if __name__ == "__main__":
asyncio.run(main())
When you run this implementation, you’ll notice several key advantages of the HyPE pattern:
- Question-to-question matching typically yields more relevant results than question-to-document matching
- No LLM calls needed during retrieval - only embedding generation
- Both user queries and indexed content are in question form
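The trade-off is a one-time indexing cost: one LLM call per chunk, and a vector count multiplied by the number of generated questions - in the demo run below, 45 chunks turn into 225 question embeddings.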
Here’s what the actual system output looks like when running the complete implementation:
🚀 HyPE-Enhanced Quantum Projects RAG Demo
============================================================
✅ Created chat completion service
✅ Created embeddings service
📂 Loading quantum projects data from ../shared-data/projects.md
📄 Created 45 document chunks
🧠 Starting HyPE indexing phase...
============================================================
🔨 Creating new HyPE index...
🔬 Starting HyPE indexing for 45 document chunks...
📄 Processing chunk 1/45: Mousetrap/Overview
💡 Generated 5 hypothetical questions for chunk: Mousetrap/Overview
1. What are the main goals and objectives of Project Mousetrap in the context of io...
2. How does ion-trapped quantum computing technology work, and what are its key adv...
3. What specific breakthroughs or innovations has Project Mousetrap achieved in the...
🔮 Embedding 5 hypothetical questions...
📄 Processing chunk 2/45: Mousetrap/Overview
💡 Generated 5 hypothetical questions for chunk: Mousetrap/Overview
1. What is the primary goal of Project Mousetrap in the field of quantum computing?
2. How do ion traps contribute to the performance of qubits in Project Mousetrap?
3. What advantages do ion traps offer in terms of coherence times and error rates f...
🔮 Embedding 5 hypothetical questions...
... omitted for brevity ...
============================================================
✅ HyPE indexing complete! Created 225 hypothetical question embeddings
💾 Saved HyPE index with 225 questions to ~/dev/azure-ai-samples/semantic-kernel/chatcompletions-agent-hype-rag/data/hype_index.json
============================================================
✅ HyPE indexing complete!
✅ Created HyPE-Enhanced Quantum Projects Agent
💬 Test Query: Which quantum projects have the largest budgets?
🔍 HyPE Search: largest budget quantum projects
🎯 Question-to-Question Matches Found:
1. Matched Question: "What specific types of quantum sensor hardware are mentioned as part of the budget allocation?"
From: Horizon → Budget
2. Matched Question: "How much of the Project Avalanche budget is dedicated to research and development of quantum algorithms?"
From: Avalanche → Budget
3. Matched Question: "How is the $40 million budget for Project Skyhook distributed among quantum AI development, hardware..."
From: Skyhook → Budget
📊 Retrieved 3 unique document chunks via HyPE
🤖 Agent Response:
The following quantum projects have the largest budgets:
1. **Project Skyhook** - Budget of **$40 million** for 2024
2. **Project Horizon** - Budget of **$25 million** for 2024
3. **Project Avalanche** - Budget of **$20 million** for 2024
These projects highlight significant investments aimed at advancing quantum technology and its applications.
Based on this, we can say that the query for “largest budgets” correctly triggered budget-related hypothetical questions from multiple projects. This demonstrates the effectiveness of the HyPE pattern in aligning user queries with relevant content - the question-to-question matching allowed the agent to retrieve the most relevant chunks without needing to reformulate the user query or generate hypothetical answers at query time. Retrieved information came from three different projects (Skyhook, Horizon, Avalanche) with their respective budget amounts, and the agent provided a concise summary of the results.
Conclusion
The HyPE pattern represents a cool and innovative approach to solving the question-answer alignment problem in RAG systems. By shifting the computational overhead from query time to indexing time, it offers both lower query latency and better retrieval accuracy.
If you’re interested in exploring the complete implementation, you can find the demo code (as always) on GitHub. The HyPE pattern is based on the research paper by Domen Vake, Jernej Vičič, and Aleksandar Tošič, available at SSRN.
And that’s it! We’ve successfully implemented a HyPE-enhanced RAG system that can provide more accurate and contextually relevant answers! Try it out with your own data and see how it performs.