
Source: summerfang-jolli/code-to-doc-demo  Last Updated: 2/11/2026


Code-to-Documentation AI Agent System Design

Overview

This document outlines the design of an AI agent system that converts source code into documentation, using LangGraph for agent orchestration and PostgreSQL with pgvector as the vector store for retrieval-augmented generation (RAG).

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                     LangGraph Orchestrator                      │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐      ┌──────────────┐      ┌─────────────────┐ │
│  │    Code     │      │ Documentation│      │   Embedding     │ │
│  │  Analyzer   │─────▶│  Generator   │─────▶│   & Chunking    │ │
│  │    Agent    │      │    Agent     │      │     Agent       │ │
│  └─────────────┘      └──────────────┘      └─────────────────┘ │
│         │                    │                       │          │
│         ▼                    ▼                       ▼          │
│  ┌─────────────┐      ┌──────────────┐      ┌─────────────────┐ │
│  │  Metadata   │      │   Quality    │      │   PostgreSQL    │ │
│  │  Extractor  │      │  Validator   │      │     Vector      │ │
│  │    Agent    │      │    Agent     │      │    Storage      │ │
│  └─────────────┘      └──────────────┘      └─────────────────┘ │
│                                                      │          │
│                              ┌───────────────────────┘          │
│                              ▼                                  │
│  ┌─────────────┐      ┌──────────────┐                          │
│  │   Search    │      │  Retrieval   │                          │
│  │  & Query    │◀─────│    Agent     │                          │
│  │    Agent    │      │              │                          │
│  └─────────────┘      └──────────────┘                          │
└─────────────────────────────────────────────────────────────────┘

System Components

1. PostgreSQL Vector Database Schema

Core Tables:

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Projects table
CREATE TABLE projects (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    description TEXT,
    repository_url VARCHAR(500),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Code files table
CREATE TABLE code_files (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id UUID REFERENCES projects(id) ON DELETE CASCADE,
    file_path VARCHAR(1000) NOT NULL,
    file_type VARCHAR(50),        -- 'python', 'javascript', 'java', etc.
    content TEXT NOT NULL,
    content_hash VARCHAR(64),     -- SHA256 hash for change detection
    last_analyzed TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(project_id, file_path)
);

-- Code elements (functions, classes, modules, etc.)
CREATE TABLE code_elements (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    file_id UUID REFERENCES code_files(id) ON DELETE CASCADE,
    element_type VARCHAR(50),     -- 'function', 'class', 'method', 'variable', 'module'
    name VARCHAR(255) NOT NULL,
    signature TEXT,               -- Function signatures, class definitions
    docstring TEXT,               -- Existing docstrings
    start_line INTEGER,
    end_line INTEGER,
    complexity_score FLOAT,       -- Cyclomatic complexity
    dependencies JSONB,           -- List of dependencies/imports
    metadata JSONB,               -- Additional metadata (parameters, return types, etc.)
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Generated documentation
CREATE TABLE documentation (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    element_id UUID REFERENCES code_elements(id) ON DELETE CASCADE,
    doc_type VARCHAR(50),         -- 'overview', 'api', 'tutorial', 'example'
    title VARCHAR(500),
    content TEXT NOT NULL,
    generated_by VARCHAR(100),    -- Which LLM/agent generated this
    quality_score FLOAT,          -- Quality assessment score
    human_reviewed BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Vector embeddings for semantic search
CREATE TABLE document_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    documentation_id UUID REFERENCES documentation(id) ON DELETE CASCADE,
    chunk_index INTEGER,          -- For document chunking
    chunk_text TEXT NOT NULL,
    embedding vector(1536),       -- OpenAI ada-002 dimensions
    metadata JSONB,               -- Chunk metadata
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Search and query logs
CREATE TABLE search_queries (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    project_id UUID REFERENCES projects(id),
    query_text TEXT NOT NULL,
    query_embedding vector(1536),
    results_found INTEGER,
    response_time_ms INTEGER,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Indexes for performance
CREATE INDEX idx_code_files_project_type ON code_files(project_id, file_type);
CREATE INDEX idx_code_elements_file_type ON code_elements(file_id, element_type);
CREATE INDEX idx_doc_embeddings_vector ON document_embeddings
    USING ivfflat (embedding vector_cosine_ops);
CREATE INDEX idx_search_queries_embedding ON search_queries
    USING ivfflat (query_embedding vector_cosine_ops);
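
As a usage sketch of this schema, a nearest-neighbour lookup against document_embeddings could look like the following. It assumes a psycopg2 connection and a 1536-dimension query embedding produced by the Embedding & Chunking Agent; the function name is illustrative, not part of the repository:

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

def find_similar_chunks(conn, query_embedding: np.ndarray, limit: int = 10):
    """Return the documentation chunks nearest to a query embedding (cosine distance)."""
    register_vector(conn)  # adapt numpy arrays to the pgvector 'vector' column type
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT d.title,
                   de.chunk_text,
                   1 - (de.embedding <=> %s) AS similarity
            FROM document_embeddings de
            JOIN documentation d ON d.id = de.documentation_id
            ORDER BY de.embedding <=> %s
            LIMIT %s
            """,
            (query_embedding, query_embedding, limit),
        )
        return cur.fetchall()

# conn = psycopg2.connect("postgresql://user:pass@localhost:5432/code_to_doc")
# rows = find_similar_chunks(conn, query_embedding)  # embedding from the Embedding Agent

The idx_doc_embeddings_vector index defined above is what keeps this ORDER BY fast once the table grows.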

2. LangGraph Agent Architecture

Agent State Definition:

from typing import Dict, List, Optional, TypedDict

class CodeAnalysisState(TypedDict):
    # Input
    project_id: str
    file_path: str
    file_content: str

    # Analysis Results
    parsed_ast: Dict
    code_elements: List[Dict]
    dependencies: List[str]
    complexity_metrics: Dict

    # Documentation Generation
    generated_docs: List[Dict]
    doc_quality_scores: List[float]

    # Vector Storage
    embeddings: List[List[float]]
    chunk_mappings: List[Dict]
    storage_ids: List[str]

    # Search & Retrieval
    search_query: Optional[str]
    search_results: List[Dict]

    # Workflow State
    current_step: str
    errors: List[str]
    warnings: List[str]
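
A minimal sketch of how this state could be wired into a LangGraph workflow; the node names and stub bodies below are illustrative, not taken from the repository:

from langgraph.graph import StateGraph, END

def analyze_code(state: CodeAnalysisState) -> dict:
    # Each node returns a partial update that LangGraph merges into the shared state.
    return {"code_elements": [], "current_step": "analyze_code"}

def generate_docs(state: CodeAnalysisState) -> dict:
    return {"generated_docs": [], "current_step": "generate_docs"}

def embed_and_store(state: CodeAnalysisState) -> dict:
    return {"storage_ids": [], "current_step": "embed_and_store"}

workflow = StateGraph(CodeAnalysisState)
workflow.add_node("analyze_code", analyze_code)
workflow.add_node("generate_docs", generate_docs)
workflow.add_node("embed_and_store", embed_and_store)
workflow.set_entry_point("analyze_code")
workflow.add_edge("analyze_code", "generate_docs")
workflow.add_edge("generate_docs", "embed_and_store")
workflow.add_edge("embed_and_store", END)

app = workflow.compile()
# final_state = app.invoke({"project_id": "...", "file_path": "example.py", "file_content": source})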

Core Agents:

  1. Code Analyzer Agent

    • Parse the AST using ast (Python) or tree-sitter (multi-language); a minimal ast-based sketch follows this list
    • Extract functions, classes, methods, variables
    • Calculate complexity metrics
    • Identify dependencies and imports
  2. Metadata Extractor Agent

    • Extract type hints and annotations
    • Analyze function signatures
    • Identify design patterns
    • Extract existing docstrings
  3. Documentation Generator Agent

    • Generate comprehensive documentation using LLM
    • Create multiple doc types (API docs, tutorials, examples)
    • Follow documentation standards (Google, NumPy, Sphinx styles)
  4. Quality Validator Agent

    • Assess documentation quality
    • Check for completeness
    • Validate examples and code snippets
    • Score documentation usefulness
  5. Embedding & Chunking Agent

    • Split documents into semantic chunks
    • Generate embeddings using OpenAI or local models
    • Store vectors in PostgreSQL
  6. Search & Retrieval Agent

    • Semantic search using vector similarity
    • Hybrid search (semantic + keyword)
    • Context ranking and reranking
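
As a rough illustration of the Code Analyzer step for Python sources (the sketch referenced in the list above), the helper below walks a module with the standard-library ast module and collects the fields the code_elements table expects; the function name is made up for this example:

import ast
from typing import Dict, List

def analyze_python_source(source: str) -> Dict:
    """Collect functions, classes, and imports from a Python module."""
    tree = ast.parse(source)
    elements: List[Dict] = []
    dependencies: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            elements.append({
                "element_type": "class" if isinstance(node, ast.ClassDef) else "function",
                "name": node.name,
                "docstring": ast.get_docstring(node),
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
        elif isinstance(node, ast.Import):
            dependencies.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            dependencies.append(node.module)
    return {"code_elements": elements, "dependencies": sorted(set(dependencies))}

Multi-language support would swap ast.parse for the corresponding tree-sitter grammar.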

3. Technology Stack

Core Dependencies:

# LangGraph and LangChain
langgraph>=0.0.40
langchain>=0.1.0
langchain-openai
langchain-community

# Database and Vector Storage
psycopg2-binary>=2.9.0
pgvector>=0.2.0
sqlalchemy>=2.0.0
alembic                    # Database migrations

# Code Analysis
# (ast is part of the Python standard library; no package required)
tree-sitter>=0.20.0
tree-sitter-python
tree-sitter-javascript
gitpython

# Embeddings and ML
openai>=1.0.0
sentence-transformers      # Alternative to OpenAI embeddings
numpy>=1.24.0

# Text Processing
tiktoken                   # Token counting
nltk                       # Text processing
spacy                      # NLP processing

# API and Web
fastapi>=0.100.0
uvicorn
streamlit                  # Demo interface
pydantic>=2.0.0

# Utilities
python-dotenv
pyyaml
rich                       # Beautiful terminal output

4. Document Chunking Strategy

Semantic Chunking Approach:

from typing import Dict, List

class DocumentChunker:
    def __init__(self, chunk_size: int = 1000, overlap: int = 200):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def chunk_by_code_structure(self, content: str, elements: List[Dict]) -> List[Dict]:
        """Chunk based on code structure (functions, classes)."""
        pass

    def chunk_by_semantic_similarity(self, content: str) -> List[Dict]:
        """Chunk based on semantic similarity between sentences."""
        pass

    def adaptive_chunking(self, content: str, content_type: str) -> List[Dict]:
        """Adaptive chunking based on content type."""
        pass
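
One plausible way to fill in chunk_by_code_structure, assuming elements carries the name, element_type, start_line, and end_line fields produced by the Code Analyzer Agent; oversized elements fall back to fixed-size character windows with overlap:

from typing import Dict, List

def chunk_by_code_structure(content: str, elements: List[Dict],
                            chunk_size: int = 1000, overlap: int = 200) -> List[Dict]:
    """Emit one chunk per code element, splitting oversized elements with overlap."""
    lines = content.splitlines()
    chunks: List[Dict] = []
    for element in elements:
        body = "\n".join(lines[element["start_line"] - 1:element["end_line"]])
        for start in range(0, max(len(body), 1), chunk_size - overlap):
            piece = body[start:start + chunk_size]
            if piece:
                chunks.append({
                    "chunk_index": len(chunks),
                    "chunk_text": piece,
                    "metadata": {"element": element["name"],
                                 "element_type": element["element_type"],
                                 "start_char": start},
                })
    return chunks

Keeping each function or class in its own chunk preserves the code context that the retrieval step later depends on.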

5. Search and Retrieval Strategy

from typing import Dict, List

class SearchStrategy:
    def semantic_search(self, query: str, limit: int = 10) -> List[Dict]:
        """Vector similarity search."""
        pass

    def keyword_search(self, query: str, limit: int = 10) -> List[Dict]:
        """Full-text search."""
        pass

    def hybrid_search(self, query: str, semantic_weight: float = 0.7) -> List[Dict]:
        """Combine semantic and keyword search."""
        pass

    def contextual_rerank(self, results: List[Dict], context: str) -> List[Dict]:
        """Rerank results based on context."""
        pass
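
A small sketch of the merging step behind hybrid_search. It assumes semantic_search and keyword_search each return dicts carrying an "id" and a "score" normalized to [0, 1]; the helper name is illustrative:

from typing import Dict, List

def merge_hybrid_results(semantic_results: List[Dict], keyword_results: List[Dict],
                         semantic_weight: float = 0.7, limit: int = 10) -> List[Dict]:
    """Blend semantic and keyword scores per result id and rank by the combined score."""
    combined: Dict[str, Dict] = {}
    for result in semantic_results:
        combined[result["id"]] = {**result, "score": semantic_weight * result["score"]}
    for result in keyword_results:
        entry = combined.setdefault(result["id"], {**result, "score": 0.0})
        entry["score"] += (1 - semantic_weight) * result["score"]
    return sorted(combined.values(), key=lambda r: r["score"], reverse=True)[:limit]

Results that appear in both lists accumulate weight from each, so chunks matching the query both lexically and semantically rise to the top.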

Implementation Roadmap

Phase 1: Foundation (Weeks 1-2)

  1. Set up PostgreSQL with pgvector
  2. Create database schema and migrations
  3. Implement basic LangGraph workflow
  4. Build Code Analyzer Agent

Phase 2: Core Features (Weeks 2-3)

  1. Implement Documentation Generator Agent
  2. Add Embedding & Chunking Agent
  3. Create PostgreSQL vector storage layer
  4. Build basic search functionality

Phase 3: Enhancement (Weeks 3-4)

  1. Add Quality Validator Agent
  2. Implement hybrid search
  3. Create demo interface
  4. Add comprehensive testing

Phase 4: RAGFlow Integration (Weeks 4-5)

  1. Create RAGFlow integration layer
  2. Implement comparison framework
  3. Build unified demo interface
  4. Benchmark performance

Configuration and Environment

Environment Variables:

# Database
DATABASE_URL=postgresql://user:pass@localhost:5432/code_to_doc
PGVECTOR_ENABLED=true

# OpenAI (or alternative)
OPENAI_API_KEY=your_key_here
EMBEDDING_MODEL=text-embedding-ada-002
COMPLETION_MODEL=gpt-4

# LangGraph
LANGCHAIN_TRACING_V2=true
LANGCHAIN_API_KEY=your_langsmith_key

# Application
LOG_LEVEL=INFO
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
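
A minimal sketch of reading these settings at startup with python-dotenv (variable names match the list above; the fallback defaults are illustrative):

import os
from dotenv import load_dotenv

load_dotenv()  # pull variables from a local .env file into the process environment

DATABASE_URL = os.getenv("DATABASE_URL", "postgresql://localhost:5432/code_to_doc")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "text-embedding-ada-002")
COMPLETION_MODEL = os.getenv("COMPLETION_MODEL", "gpt-4")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "1000"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "200"))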

This design provides a solid foundation for understanding RAG internals before moving on to RAGFlow.