Deep Learning · Evaluation · Machine Learning · LLM · Fine-tuning · Python

PocketGuide

Domain-adapted language model for structured travel guidance. Built with evaluation-first methodology, synthetic instruction tuning, and GGUF quantization for offline inference.

TL;DR

  • Evaluation-first LLM adaptation with fixed 20-prompt benchmark suite and objective metrics (parse success, schema compliance, uncertainty markers)
  • Synthetic instruction pipeline using teacher-student generation with multi-stage quality gating and cost-controlled OpenRouter backend
  • LoRA fine-tuning on Llama-2-7B with 5 documented iterations improving parse success from 80% to 100%
  • Structured output contracts (JSON envelope + typed payloads) validated at inference time for reliable schema compliance
  • GGUF quantization via llama.cpp for offline deployment on consumer hardware

What I Built

The system addresses a specific ML challenge: adapting a general-purpose 7B model to produce reliable, schema-compliant JSON for travel planning—under constraints that matter in practice.

Evaluation Infrastructure

I defined output contracts and benchmarks before training. A fixed 20-prompt suite measures parse success, schema compliance, and uncertainty marker presence. All training iterations are evaluated against these contracts with timestamped runs and artifact archiving for reproducibility.
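The three benchmark metrics can be sketched as simple per-output checks averaged over the suite. This is a minimal illustration, not the project's actual scoring code; the field names (`uncertainty_notes`, `verification_steps`, etc.) follow the envelope schema described below, and the all-or-nothing envelope check is a simplifying assumption.

```python
import json

# Assumed envelope fields, per the v0 schema described in this document.
REQUIRED_ENVELOPE_FIELDS = {"summary", "assumptions", "uncertainty_notes", "verification_steps"}
UNCERTAINTY_MARKERS = ("uncertainty_notes", "assumptions")

def score_output(raw: str) -> dict:
    """Score one model completion on the three benchmark metrics."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"parse": 0, "envelope": 0, "uncertainty": 0}
    if not isinstance(obj, dict):
        return {"parse": 0, "envelope": 0, "uncertainty": 0}
    envelope_ok = REQUIRED_ENVELOPE_FIELDS.issubset(obj)
    uncertainty_ok = any(obj.get(k) for k in UNCERTAINTY_MARKERS)
    return {"parse": 1, "envelope": int(envelope_ok), "uncertainty": int(uncertainty_ok)}

def aggregate(outputs: list[str]) -> dict:
    """Average each metric over the fixed prompt suite."""
    scores = [score_output(o) for o in outputs]
    n = len(scores) or 1
    return {k: sum(s[k] for s in scores) / n for k in ("parse", "envelope", "uncertainty")}
```

Averaging binary checks like this is what makes the reported 0.80 → 1.00 progressions directly comparable across timestamped runs.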

Synthetic Data Pipeline

Built a three-stage teacher-student pipeline for training data generation:

  • Prompt planning with spec-driven diversity across categories, regions, and difficulty levels
  • Draft generation using OpenRouter with rate limiting (15 RPM) and multi-model fallback
  • Quality gating with critique-based filtering and acceptance thresholds

The pipeline produces balanced datasets with full provenance tracking (config snapshots, prompt hashes, token counts).
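The rate-limiting and fallback behavior of the draft-generation stage can be sketched as follows. This is an illustrative skeleton, not the pipeline's actual code: `call_model` stands in for the real OpenRouter request, and the class and parameter names are hypothetical.

```python
import time

class RateLimitedGenerator:
    """Throttle requests to a fixed RPM budget and fall back through an
    ordered model list when a call fails."""

    def __init__(self, models, call_model, rpm=15):
        self.models = models            # ordered fallback chain
        self.call_model = call_model    # fn(model, prompt) -> str, may raise
        self.min_interval = 60.0 / rpm  # seconds between requests (15 RPM -> 4s)
        self._last_call = 0.0

    def _throttle(self):
        wait = self.min_interval - (time.monotonic() - self._last_call)
        if wait > 0:
            time.sleep(wait)
        self._last_call = time.monotonic()

    def generate(self, prompt: str) -> str:
        for model in self.models:
            self._throttle()
            try:
                return self.call_model(model, prompt)
            except Exception:
                continue  # teacher unavailable: try the next model
        raise RuntimeError(f"all models failed for prompt: {prompt!r}")
```

Keeping the throttle and fallback in one place means the prompt-planning and quality-gating stages never need to know which teacher actually produced a draft.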

Output Contracts

Implemented JSON schema validation with two layers:

  • Envelope schema (v0) enforcing consistent structure: summary, assumptions, uncertainty notes, verification steps
  • Payload schemas (v1) for domain-specific outputs: itinerary, checklist, decision_tree, procedure

Validation occurs at inference time with strict and lenient parsing modes.

Parameter-Efficient Fine-Tuning

LoRA adapters on Llama-2-7B with config-driven training, gradient checkpointing, and checkpoint management. Five training iterations (v1–v5) targeted specific failure modes identified through evaluation, with each intervention measured and documented.
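The adapter setup looks roughly like the following `peft` configuration. This is a hedged sketch: the rank, alpha, dropout, and target modules shown here are illustrative defaults, not the project's actual hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
base.gradient_checkpointing_enable()       # trade recompute for memory

lora = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,                         # scaling factor (illustrative)
    target_modules=["q_proj", "v_proj"],   # attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # typically <1% of the 7B weights
```

Because only the adapter weights train, each of the five iterations produces a small checkpoint that is cheap to archive alongside its evaluation run.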

Local Inference Runtime

GGUF quantization enables offline inference on consumer hardware. A registry-driven model selection system maps logical names to quantized artifacts. The runtime supports both Hugging Face + PEFT for evaluation and llama.cpp for production deployment.
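The registry lookup can be sketched as a plain mapping from logical names to artifact paths; the names and filenames below are hypothetical, and the resolved path would be handed to the llama.cpp runtime (e.g. `llama_cpp.Llama(model_path=...)` via the llama-cpp-python bindings).

```python
from pathlib import Path

# Hypothetical registry: logical model names -> quantized GGUF artifacts.
MODEL_REGISTRY = {
    "pocketguide-v5-q4": "models/pocketguide-v5.Q4_K_M.gguf",
    "pocketguide-v5-q8": "models/pocketguide-v5.Q8_0.gguf",
}

def resolve_model(name: str, root: str = ".") -> Path:
    """Map a logical model name to its on-disk quantized artifact."""
    try:
        return Path(root) / MODEL_REGISTRY[name]
    except KeyError:
        raise KeyError(f"unknown model {name!r}; known: {sorted(MODEL_REGISTRY)}")
```

Indirection through logical names lets evaluation configs and the inference runtime refer to "the v5 Q4 model" without hard-coding file layouts.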


Why It Matters

ML/Research Discipline

Evaluation-first development ensures all improvements are measured objectively. Fixed benchmarks, deterministic seeding, and versioned schemas provide reproducible baselines. Structured outputs with uncertainty markers demonstrate domain adaptation beyond generic chat capabilities.

ML Engineering

The system demonstrates production ML practices: synthetic data generation at scale, parameter-efficient fine-tuning with experiment tracking, quantization for deployment constraints, and iterative improvement based on measured failure modes. All pipeline stages are config-driven with artifact provenance.

Software Engineering

Clean architecture with modular separation: evaluation, data generation, training, inference. Comprehensive test coverage across all pipeline stages. Makefile provides stable commands for consistent developer experience. Type hints and docstrings throughout.


Architecture

The system follows a contracts-first, evaluation-driven design:

Output Schemas (Envelope + Payloads)
                 ↓
[Benchmark Suite: 20 prompts → Fixed Eval]
                 ↓
[Prompt Planning] → [Teacher Generation (OpenRouter)]
        ↓                       ↓
[Draft Generation]       [Quality Gating]
        ↓                       ↓
[Dataset v1-v5] ←───────────────┘
                 ↓
[LoRA Training: Llama-2-7B + PEFT]
                 ↓
[Checkpoint] → [Evaluation Run]
      ↓               ↓
  [Metrics]   [Failure Analysis]
      ↓               ↓
[Iteration v+1] ←─────┘
                 ↓
[GGUF Quantization (llama.cpp)]
                 ↓
[Local Inference Runtime]

Results

Five training iterations demonstrated measurable improvement on the held-out evaluation suite:

| Metric              | v1   | v3   | v5   | Target |
|---------------------|------|------|------|--------|
| Parse Success       | 0.80 | 1.00 | 1.00 | 1.00 ✓ |
| Uncertainty Markers | 0.85 | 1.00 | 1.00 | 1.00 ✓ |
| Envelope Fields     | 0.00 | 0.15 | 0.20 | 1.00   |

Key findings:

  • Parse success improved from 80% to 100% through data quality interventions and training duration adjustments
  • Uncertainty marker presence reached 100%, ensuring reliable acknowledgment of assumptions
  • Envelope field compliance improved to 20% but requires further architectural exploration

The iteration process confirmed that structural compliance responds to targeted data interventions, while full schema enforcement requires further work on the training objective itself. Quantized models (Q4_K_M) maintain acceptable latency for interactive use.


Reliability & Safety

PocketGuide enforces uncertainty by design through required schema fields (uncertainty_notes, verification_steps). The system is a research prototype for demonstrating domain adaptation and structured output generation—not an authoritative travel information source.

Training data has a knowledge cutoff; visa requirements, border policies, and local conditions change. Users must verify time-sensitive or safety-critical information with official government sources.


Future Directions

Schema Compliance Improvements — Explore loss weighting, constrained decoding, or architectural changes to improve envelope field coverage beyond 20%.

JSON Repair — Post-processing to recover from minor format violations without full regeneration.

Larger Model Exploration — Adapter training on 13B+ base models to assess capacity vs efficiency trade-offs.

Server Mode — FastAPI wrapper for persistent inference service with request logging and performance measurement.

Multi-Domain Benchmarks — Expand evaluation beyond travel to test generalization of structured output methodology.