1. Overview
This project proposes the development and evaluation of a Retrieval-Augmented Generation (RAG) pipeline designed to structure and explore source material—conversational archives, Buddhist study transcripts, and guided reflections—through JSON descriptors of dynamic mental states and appearances. The system will be grounded in Mahāmudrā-informed prompt engineering and meaning-space reasoning. The objective is to produce a proof-of-principle demonstrator over a 12-month period using a modest hardware budget (~£5,000).
2. Background and Motivation
Contemporary AI systems increasingly rely on Retrieval-Augmented Generation architectures (Lewis et al., 2020) to enhance groundedness and relevance. However, most pipelines retrieve on surface-level topical or keyword similarity and lack epistemic sensitivity to the dynamics of mental appearances. This project draws on insights from contemplative science (Varela et al., 1991; Wallace, 2006) and prompt-based interpretability frameworks (Reynolds & McDonell, 2021) to develop an LLM-guided pipeline that captures subtle phase changes in appearance and awareness.
The methodology is influenced by Mahāmudrā traditions that describe mind in terms of flowing appearances, tensions, insights, and symbolic resonance. By embedding such descriptors into JSON fields and enabling meaning-space vector queries, we aim to allow semantic retrieval of relevant mental dynamics, not just topical content.
3. System Architecture and Pipeline
3.1 Data Preparation
- Source types: ChatGPT conversation exports, Mahāmudrā study transcripts, guided meditations, pujas
- Formats: .txt (lightly cleaned), .pdf (converted), .docx
- Preprocessing: Remove headers/footers, keep the text in a speaker-free format; no OCR required (a loader sketch follows this list)
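As an illustration of the preprocessing step, the sketch below normalises the three formats into plain text and drops lines that repeat often, a crude header/footer heuristic. The `pypdf` and `python-docx` dependencies and the repetition threshold are assumptions, not commitments.

```python
from pathlib import Path

from pypdf import PdfReader  # assumed PDF dependency
from docx import Document    # assumed .docx dependency (python-docx)


def load_text(path: Path) -> str:
    """Normalise a source file (.txt, .pdf, .docx) into plain text."""
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="replace")
    if suffix == ".pdf":
        reader = PdfReader(str(path))
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        doc = Document(str(path))
        return "\n".join(p.text for p in doc.paragraphs)
    raise ValueError(f"Unsupported format: {suffix}")


def strip_furniture(text: str) -> str:
    """Drop short lines that repeat often -- a crude header/footer heuristic.

    Blank lines are kept so paragraph boundaries survive for segmentation.
    """
    lines = text.splitlines()
    counts: dict[str, int] = {}
    for ln in lines:
        key = ln.strip()
        if key:
            counts[key] = counts.get(key, 0) + 1
    kept = [
        ln for ln in lines
        if not (ln.strip() and counts[ln.strip()] > 3 and len(ln.strip()) < 60)
    ]
    return "\n".join(kept)
```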
3.2 Classification via LLM
- Sampling: Take 3-position slices (beginning, middle, end)
- Prompted classification: Run the slices through a local Mistral model with a classification prompt
- Labels: e.g., study commentary, guided meditation, live dialogue, puja, unstructured (see the sketch below)
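A sketch of the sampling and classification step. Here `complete` is an assumed callable that sends a prompt to the local Mistral model (for example via llama.cpp or Ollama) and returns its text reply; the slice width is arbitrary.

```python
LABELS = ["study commentary", "guided meditation", "live dialogue", "puja", "unstructured"]


def three_position_slices(text: str, width: int = 1500) -> list[str]:
    """Take slices from the beginning, middle, and end of a document."""
    mid = max(0, len(text) // 2 - width // 2)
    return [text[:width], text[mid:mid + width], text[-width:]]


def classify(text: str, complete) -> str:
    """Classify a document by showing the LLM three representative slices.

    `complete` is an assumed prompt-to-reply callable wrapping the local
    Mistral model; any completion interface will do.
    """
    slices = "\n---\n".join(three_position_slices(text))
    prompt = (
        "You are classifying a source text for a contemplative archive.\n"
        f"Choose exactly one label from: {', '.join(LABELS)}.\n"
        "Here are three excerpts (beginning, middle, end):\n"
        f"{slices}\n"
        "Label:"
    )
    answer = complete(prompt).strip().lower()
    # Fall back to the catch-all label if the model replies off-list.
    return answer if answer in LABELS else "unstructured"
```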
3.3 Segmentation
- Logic: Sentence/paragraph aware
- Prompt: Designed to detect shifts in state, mode, or experiential tone
- Goal: Divide long texts into semantically coherent segments for descriptor tagging (see the sketch below)
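Segmentation can stay sentence/paragraph aware by only ever cutting on paragraph boundaries and leaving shift detection to the prompt. A minimal sketch of the grouping half follows; the prompted detection of shifts in state, mode, or tone is omitted, and the size budget is an assumption.

```python
def paragraphs(text: str) -> list[str]:
    """Split on blank lines so segment boundaries respect paragraphs."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def segment(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily group whole paragraphs into segments under a size budget.

    A prompted pass (not shown) can then refine these boundaries where it
    detects shifts in state, mode, or experiential tone.
    """
    segments: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs(text):
        if current and size + len(para) > max_chars:
            segments.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        segments.append("\n\n".join(current))
    return segments
```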
3.4 Double-Pass JSON Description
- First Pass: Generates fields such as mental_state, appearance, transition_vector, symbolic_element, comment, and tension
- Second Pass: Adds interpretive refinement using the first JSON as input context
- Validator: Ensures schema compliance and structural completeness (a schema sketch follows this list)
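The validator can be as simple as a JSON Schema check over the six first-pass fields. A minimal sketch using the `jsonschema` package follows; treating every field as a string is an assumption about the final schema.

```python
import json

from jsonschema import validate  # assumed validation dependency

DESCRIPTOR_SCHEMA = {
    "type": "object",
    "properties": {
        "mental_state":      {"type": "string"},
        "appearance":        {"type": "string"},
        "transition_vector": {"type": "string"},
        "symbolic_element":  {"type": "string"},
        "comment":           {"type": "string"},
        "tension":           {"type": "string"},
    },
    "required": [
        "mental_state", "appearance", "transition_vector",
        "symbolic_element", "comment", "tension",
    ],
    "additionalProperties": False,
}


def validate_descriptor(raw: str) -> dict:
    """Parse an LLM reply and enforce schema compliance.

    Raises on malformed JSON or missing fields so the pipeline can
    re-prompt rather than store an incomplete record.
    """
    descriptor = json.loads(raw)
    validate(instance=descriptor, schema=DESCRIPTOR_SCHEMA)
    return descriptor
```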
3.5 Prompt Engineering Methodology
All prompts are crafted from a Mahāmudrā perspective on appearances, insight, and unfolding meaning. Classifier and descriptor prompts will evolve through experimentation and resonance testing. We avoid procedural logic wherever possible, preferring to let the LLM's interpretive work generate the descriptors.
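For illustration only, a first-pass descriptor prompt in this spirit might read as follows; the wording is hypothetical and is expected to change under the resonance testing described above.

```python
# Hypothetical starting point for the first-pass descriptor prompt;
# {segment} is filled in per segment via str.format().
DESCRIPTOR_PROMPT = """\
Read the following segment as a flow of appearances in awareness.
Rather than summarising its topic, describe what arises, how it moves,
and where tension or insight emerges. Reply with JSON containing exactly
these fields: mental_state, appearance, transition_vector,
symbolic_element, comment, tension.

Segment:
{segment}
"""
```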
3.6 Embedding and RAG Deployment
- Embedding: ChromaDB, with local vector storage under /mnt/RAG_VECTORS
- Query UI: Flask-based frontend with local Mistral and optional OpenAI GPT-3.5 queries
- Retrieval: Semantic meaning-space retrieval via vector similarity (see the sketch below)
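A minimal ingestion-and-query sketch against ChromaDB's persistent client, using the /mnt/RAG_VECTORS path above. The collection name is hypothetical, and folding the descriptor fields into the embedded document, so that similarity operates in meaning-space rather than on topic alone, is an assumption about collection design.

```python
import chromadb

# Persistent local vector store, as specified above.
client = chromadb.PersistentClient(path="/mnt/RAG_VECTORS")
collection = client.get_or_create_collection("mahamudra_segments")


def index_segment(seg_id: str, segment: str, descriptor: dict) -> None:
    """Embed the descriptor fields alongside the raw segment so retrieval
    can match mental dynamics, not just topical content."""
    document = segment + "\n" + " | ".join(
        f"{k}: {v}" for k, v in descriptor.items()
    )
    collection.add(ids=[seg_id], documents=[document], metadatas=[descriptor])


def query(text: str, k: int = 5) -> list[str]:
    """Semantic retrieval via vector similarity over Chroma's default embedder."""
    result = collection.query(query_texts=[text], n_results=k)
    return result["documents"][0]
```

A Flask route can then wrap `query()` directly for the frontend, handing the retrieved segments to the local Mistral model or GPT-3.5 for answer synthesis.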
4. Hardware Budget
The proposed system will run on a single research workstation:
- Base: Dell Precision T5810 (existing)
- Upgrade: RTX 5080 or 4080-class GPU (~£1200)
- RAM: Upgrade to 128 GB (~£400)
- Storage: 2 TB NVMe SSD + 4 TB HDD (~£400)
- Other: Power supply, cooling, UPS (~£500)
- Contingency + accessories: £500
Itemised upgrades total ~£3,000, leaving headroom within the overall ~£5,000 hardware budget.
This will support local LLMs (Mistral 7B, Mixtral), embedding generation, and fast vector search.
5. Deliverables
- Fully working RAG pipeline with segmentation, JSON descriptor generation, and query layer
- Prompt libraries for classification and Mahāmudrā-based interpretation
- Accessible semantic query interface (local or VPN access)
- Evaluation of semantic coherence and retrieval quality
- Companion technical report and reflective discussion
6. Timeline (12 months)
| Month | Milestone |
|---|---|
| 1–2 | Dataset assembly, cleaning, and classification prompt tuning |
| 3–4 | Segmenter refinement and baseline descriptor pass (Mistral) |
| 5–6 | Double-pass pipeline validation; prompt adjustment via resonance tests |
| 7–8 | Embedding into ChromaDB and interface build |
| 9–10 | Retrieval testing, live query trials, and schema iteration |
| 11 | Final audit and evaluation report |
| 12 | Paper submission and publication preparation |
7. Relevance and Impact
This proof-of-principle study contributes to:
- RAG interpretability in non-propositional domains
- Low-cost, high-agency research tools for self-directed inquiry
- Technical–contemplative integration of meaning models
- Groundwork for AI mentors aligned with contemplative epistemology
It aligns with emerging interest in epistemic and ethical uses of AI (Andersen et al., 2023; Tsai et al., 2023).
References
- Lewis, P., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. NeurIPS.
- Reynolds, L., & McDonell, K. (2021). Prompt programming for large language models: Beyond the few-shot paradigm. arXiv:2102.07350.
- Varela, F. J., Thompson, E., & Rosch, E. (1991). The Embodied Mind: Cognitive Science and Human Experience. MIT Press.
- Wallace, B. A. (2006). The Attention Revolution: Unlocking the Power of the Focused Mind. Wisdom Publications.
- Andersen, J., Chang, K., & Toner, H. (2023). Epistemic competence for trustworthy AI. AI & Society.
- Tsai, J. Y., et al. (2023). Designing AI to align with epistemic values. Philosophy & Technology.