AutoC4 Project Documentation

Project Overview

AutoC4 is an AI-driven platform for codebase analysis and C4 architecture generation. It combines AST-based code parsing, AI-powered summarization, vector embeddings, and hybrid search to generate C4 architecture models automatically from source code repositories.

Architecture

Frontend Application

Technology Stack:

  • React 18.2.0 with TypeScript
  • Material-UI (MUI) v5.15.15 for UI components
  • Vite as the build tool and development server
  • React Router DOM v6.22.3 for navigation
  • TanStack React Query v5.28.9 for data fetching and state management
  • Axios as the HTTP client
  • React Markdown for rendering markdown content

Key Features:

  • Responsive dashboard with system status monitoring
  • Repository analysis interface supporting GitHub, GitLab, Bitbucket, and ZIP uploads
  • Real-time pipeline progress tracking
  • Interactive C4 model viewer with four levels (System Context, Containers, Components, Code)
  • Chat interface for natural language codebase queries
  • Modern dark/light theme with Material Design principles

Application Structure:

txt

Backend Application

Technology Stack:

  • NestJS framework with TypeScript
  • Tree-sitter for AST parsing (JavaScript, TypeScript, Python, Java)
  • Azure OpenAI for AI completions and embeddings
  • Azure AI Search for vector and hybrid search
  • Simple-git for Git repository operations
  • JSZip for archive handling
  • UUID for unique identifier generation

Core Services:

  1. IngestionService: Handles repository cloning and ZIP file extraction

    • Supports GitHub, GitLab, Bitbucket (public/private)
    • ZIP file upload and extraction
    • Repository validation and branch detection
  2. ParserService: AST-based code analysis

    • Multi-language support (JS, TS, Python, Java)
    • Extracts classes, functions, methods, imports, exports
    • Generates structured project representations
  3. SummarizationService: AI-powered code summarization

    • Azure OpenAI integration for code analysis
    • Generates summaries with complexity analysis
    • Extracts key features and dependencies
  4. EmbeddingService: Vector embedding generation

    • text-embedding-ada-002 model integration
    • Batch processing with rate limiting
    • 1536-dimensional vectors for semantic search
  5. SearchService: Azure AI Search integration

    • Hybrid search (keyword + vector)
    • Index management and document operations
    • Faceted search with filtering capabilities
  6. PipelineService: Orchestrates the analysis workflow

    • Multi-stage processing pipeline
    • Real-time progress tracking
    • Error handling and recovery
  7. AgentService: AI agent for C4 model generation

    • Tool-based architecture for codebase exploration
    • Structured C4 model output
    • Multi-level architecture analysis
  8. ChatService: RAG-based chat interface

    • Context-aware responses using retrieved code
    • Conversation history management
    • Source attribution and relevance scoring
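
The ChatService bullets above (retrieved code as context, conversation history, source attribution with relevance scores) can be illustrated with a minimal sketch of prompt assembly. The type names, the `buildRagMessages` function, and the `maxChunks` and `maxHistory` limits are all hypothetical; the actual ChatService implementation is not shown in this document.

```typescript
// Hypothetical shapes; the real ChatService types are not shown in this document.
interface RetrievedChunk {
  filePath: string;
  content: string;
  score: number; // relevance score returned by the search step
}

interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

interface PromptMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface RagPrompt {
  messages: PromptMessage[];
  sources: { filePath: string; score: number }[];
}

function buildRagMessages(
  question: string,
  chunks: RetrievedChunk[],
  history: ChatTurn[],
  maxChunks = 5,
  maxHistory = 6,
): RagPrompt {
  // Keep only the most relevant chunks, highest score first.
  const top = [...chunks].sort((a, b) => b.score - a.score).slice(0, maxChunks);

  // Number each chunk so the model can cite it as [1], [2], ...
  const context = top
    .map((c, i) => `[${i + 1}] ${c.filePath}\n${c.content}`)
    .join("\n\n");

  const system: PromptMessage = {
    role: "system",
    content:
      "Answer using only the code context below and cite sources by number.\n\n" + context,
  };

  // Replay only the most recent turns to bound the prompt size.
  const recent = maxHistory > 0 ? history.slice(-maxHistory) : [];

  return {
    messages: [system, ...recent, { role: "user", content: question }],
    sources: top.map(({ filePath, score }) => ({ filePath, score })),
  };
}
```

The returned `sources` array carries the file path and score of each chunk that was placed in the prompt, which is one way to support source attribution in the response.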

Data Flow and Processing Pipeline

Stage 1: Repository Ingestion

  • Repository URL validation and authentication
  • Git cloning or ZIP extraction to temporary directory
  • Metadata collection (size, file count, repository type)

Stage 2: Code Parsing

  • Tree-sitter AST parsing for supported languages
  • Extraction of code structures (classes, functions, methods)
  • Import/export relationship analysis
  • Generation of structured project representation

Stage 3: Code Summarization

  • AI-powered analysis of code components
  • Generation of natural language summaries
  • Complexity assessment and feature extraction
  • Dependency identification

Stage 4: Vector Embedding Generation

  • Conversion of summaries to vector embeddings
  • Batch processing for efficiency
  • 1536-dimensional vectors using Azure OpenAI
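
Batch processing with a pause between requests, as described for this stage and for the EmbeddingService, might look like the sketch below. `embedBatch` is a hypothetical stand-in for the real Azure OpenAI embeddings call, and the batch size and delay are illustrative defaults rather than values taken from the project.

```typescript
// embedBatch is a hypothetical stand-in for the real Azure OpenAI embeddings call.
type EmbedBatchFn = (texts: string[]) => Promise<number[][]>;

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be at least 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function embedAll(
  texts: string[],
  embedBatch: EmbedBatchFn,
  batchSize = 16, // illustrative default
  delayMs = 1000, // illustrative default
  expectedDims = 1536, // text-embedding-ada-002 vector size
): Promise<number[][]> {
  const vectors: number[][] = [];
  const batches = chunk(texts, batchSize);
  for (let i = 0; i < batches.length; i++) {
    const result = await embedBatch(batches[i]);
    for (const v of result) {
      if (v.length !== expectedDims) {
        throw new Error(`expected ${expectedDims} dimensions, got ${v.length}`);
      }
      vectors.push(v);
    }
    // Pause between batches (not after the last one) to stay under rate limits.
    if (i < batches.length - 1) await sleep(delayMs);
  }
  return vectors;
}
```

A fixed delay is the simplest form of rate limiting; a production service would more likely react to HTTP 429 responses with retries and backoff.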

Stage 5: Search Indexing

  • Azure AI Search index creation/update
  • Document upload with metadata
  • Vector search configuration
  • Full-text and hybrid search setup

Stage 6: C4 Model Generation

  • AI agent-based architecture analysis
  • Multi-level C4 model creation:
    • C1: System Context (actors, external systems)
    • C2: Containers (applications, databases)
    • C3: Components (modules, services)
    • C4: Code (classes, functions)
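
The six stages above run in sequence and report progress as they go. A minimal sketch of such an orchestrator follows; the stage names mirror the headings above, while `runPipeline`, `PipelineProgress`, and the handler signature are hypothetical and not taken from the actual PipelineService.

```typescript
// Stage names mirror the six stages above; everything else is hypothetical.
const STAGES = [
  "ingestion",
  "parsing",
  "summarization",
  "embedding",
  "indexing",
  "c4-generation",
] as const;
type StageName = (typeof STAGES)[number];

interface PipelineProgress {
  stage: StageName;
  status: "running" | "completed" | "failed";
  percent: number; // share of stages completed so far
  error?: string;
}

// Each handler reads from and writes to a shared context object.
type StageHandler = (context: Record<string, unknown>) => Promise<void>;

async function runPipeline(
  handlers: Record<StageName, StageHandler>,
  onProgress: (p: PipelineProgress) => void,
): Promise<Record<string, unknown>> {
  const context: Record<string, unknown> = {};
  for (let i = 0; i < STAGES.length; i++) {
    const stage = STAGES[i];
    const before = Math.round((i / STAGES.length) * 100);
    onProgress({ stage, status: "running", percent: before });
    try {
      await handlers[stage](context);
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      // Report the failure, then stop: later stages depend on this one.
      onProgress({ stage, status: "failed", percent: before, error });
      throw err;
    }
    const after = Math.round(((i + 1) / STAGES.length) * 100);
    onProgress({ stage, status: "completed", percent: after });
  }
  return context;
}
```

The `onProgress` callback is where a real service could push updates to the frontend's progress view or store them for the status endpoint.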

API Endpoints

Analysis API (/api/v1/analysis)

  • POST / - Create repository analysis
  • POST /upload - Upload ZIP file analysis
  • POST /branches - Get repository branches
  • GET /:id/status - Get analysis status
  • POST /:id/c4-model - Generate C4 model
  • GET /:id/c4-model - Retrieve C4 model
  • GET /:id/files - Get file structure

Chat API (/api/v1/chat)

  • POST / - Send chat message
  • GET /:id/info - Get analysis info
  • GET /:id/suggestions - Get suggested questions

System API (/api)

  • GET /health - System health check
  • GET / - Application information
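
As an illustration of how a client might use the Analysis API listed above, the sketch below creates an analysis and polls its status. The endpoint paths come from this document; the request field `repositoryUrl`, the response fields `id` and `status`, and the status values `completed` and `failed` are assumptions, since the actual payloads are not shown here.

```typescript
// Minimal fetch-like signature so the sketch can run against a stub;
// the global fetch in browsers and recent Node versions should also fit.
type FetchLike = (
  url: string,
  init?: { method?: string; headers?: Record<string, string>; body?: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

const ANALYSIS_BASE = "/api/v1/analysis";

async function createAnalysis(
  fetchFn: FetchLike,
  baseUrl: string,
  repositoryUrl: string,
): Promise<string> {
  const res = await fetchFn(`${baseUrl}${ANALYSIS_BASE}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ repositoryUrl }), // assumed field name
  });
  if (!res.ok) throw new Error(`create failed: HTTP ${res.status}`);
  const data = (await res.json()) as { id: string }; // assumed response shape
  return data.id;
}

async function waitForAnalysis(
  fetchFn: FetchLike,
  baseUrl: string,
  id: string,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetchFn(`${baseUrl}${ANALYSIS_BASE}/${encodeURIComponent(id)}/status`);
    if (!res.ok) throw new Error(`status failed: HTTP ${res.status}`);
    const { status } = (await res.json()) as { status: string }; // assumed shape
    // "completed" and "failed" are assumed terminal status values.
    if (status === "completed" || status === "failed") return status;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`analysis ${id} did not finish after ${maxAttempts} polls`);
}
```

Passing the fetch function as a parameter keeps the sketch testable without a running backend.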

Data Models

Analysis Request

typescript

Project Structure

typescript

C4 Model

typescript

Configuration

Environment Variables

bash

Frontend Configuration

bash

File System Structure

Temporary Analysis Storage

txt

Azure AI Search Indexes

  • Index naming: c4mcp-codebase-{analysis-id}
  • Vector dimensions: 1536 (text-embedding-ada-002)
  • Search algorithms: HNSW for vector search
  • Document schema includes code chunks, summaries, embeddings, metadata
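
To make the index settings above concrete, here is a sketch of an index definition object. The naming pattern, the 1536 dimensions, and the HNSW algorithm come from this document. The field names (`filePath`, `content`, `summary`, and so on) and the profile and algorithm names are hypothetical, and the property layout follows the Azure AI Search REST index schema as I understand it, so it should be checked against the official reference before use.

```typescript
const EMBEDDING_DIMENSIONS = 1536; // text-embedding-ada-002

// Index naming pattern from this document: c4mcp-codebase-{analysis-id}.
// Lowercasing is a precaution, since index names are expected to be lowercase.
function indexNameFor(analysisId: string): string {
  return `c4mcp-codebase-${analysisId.toLowerCase()}`;
}

// Field names below are hypothetical; only the vector settings follow the text above.
function buildIndexDefinition(analysisId: string) {
  return {
    name: indexNameFor(analysisId),
    fields: [
      { name: "id", type: "Edm.String", key: true, filterable: true },
      { name: "filePath", type: "Edm.String", searchable: true, filterable: true, facetable: true },
      { name: "language", type: "Edm.String", filterable: true, facetable: true },
      { name: "content", type: "Edm.String", searchable: true },
      { name: "summary", type: "Edm.String", searchable: true },
      {
        name: "embedding",
        type: "Collection(Edm.Single)",
        searchable: true,
        dimensions: EMBEDDING_DIMENSIONS,
        vectorSearchProfile: "default-profile",
      },
    ],
    vectorSearch: {
      // HNSW is the approximate nearest-neighbor algorithm named in this document.
      algorithms: [{ name: "default-hnsw", kind: "hnsw" }],
      profiles: [{ name: "default-profile", algorithm: "default-hnsw" }],
    },
  };
}
```

Creating one index per analysis, as the naming pattern suggests, keeps each codebase's documents isolated and makes cleanup a single index deletion.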

Supported Languages and File Types

Fully Supported (AST Parsing):

  • JavaScript (.js, .jsx)
  • TypeScript (.ts, .tsx)
  • Python (.py)
  • Java (.java)

Repository Sources:

  • GitHub (public/private)
  • GitLab (public/private)
  • Bitbucket (public/private)
  • ZIP file upload (max 100MB)
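
The extension list and upload limit above translate directly into a lookup table and a size check. This is a minimal sketch: the names `detectLanguage` and `isZipWithinLimit` are hypothetical, and it assumes that 100MB means 100 × 1024 × 1024 bytes.

```typescript
type SupportedLanguage = "javascript" | "typescript" | "python" | "java";

// Extensions with full AST parsing support, as listed above.
const EXTENSION_TO_LANGUAGE: Record<string, SupportedLanguage> = {
  ".js": "javascript",
  ".jsx": "javascript",
  ".ts": "typescript",
  ".tsx": "typescript",
  ".py": "python",
  ".java": "java",
};

// Assumption: the 100MB limit is interpreted as 100 MiB.
const MAX_ZIP_BYTES = 100 * 1024 * 1024;

function detectLanguage(filePath: string): SupportedLanguage | undefined {
  // Take the file name after the last slash or backslash.
  const fileName = filePath.split(/[\\/]/).pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return undefined; // no extension, or a dotfile such as ".gitignore"
  return EXTENSION_TO_LANGUAGE[fileName.slice(dot).toLowerCase()];
}

function isZipWithinLimit(sizeInBytes: number): boolean {
  return sizeInBytes > 0 && sizeInBytes <= MAX_ZIP_BYTES;
}
```

Files for which `detectLanguage` returns `undefined` would be skipped by AST parsing in this sketch; how the real ParserService treats them is not stated in this document.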

Key Features

  1. Multi-Source Repository Ingestion: Support for major Git providers with authentication
  2. Advanced Code Parsing: AST-based analysis with Tree-sitter for accurate code structure extraction
  3. AI-Powered Analysis: Azure OpenAI integration for intelligent code summarization and C4 generation
  4. Semantic Search: Vector embeddings with hybrid search capabilities
  5. Interactive C4 Models: Four-level architecture visualization with drill-down capabilities
  6. Natural Language Chat: RAG-based conversational interface for codebase exploration
  7. Real-time Processing: Live pipeline progress tracking with detailed status updates
  8. Comprehensive API: RESTful API with OpenAPI documentation

Technical Specifications

  • Frontend Build: Vite-based development and production builds
  • Backend Framework: NestJS with dependency injection and modular architecture
  • Database: No dedicated database; the backend is stateless and relies on Azure AI Search for persistence
  • Authentication: Personal Access Token support for private repositories
  • File Processing: Streaming file operations with configurable batch sizes
  • Error Handling: Comprehensive error boundaries and graceful degradation
  • Performance: Optimized for large codebases with efficient chunking and batching
