AutoC4 Project Documentation#

Project Overview##

AutoC4 is an AI-driven platform for codebase analysis and C4 architecture generation. It combines code parsing, AI-powered summarization, vector embeddings, and search to generate C4 architecture models automatically from source code repositories.

Architecture##

Frontend Application###

Technology Stack:

  • React 18.2.0 with TypeScript
  • Material-UI (MUI) v5.15.15 for UI components
  • Vite as the build tool and development server
  • React Router DOM v6.22.3 for navigation
  • TanStack React Query v5.28.9 for data fetching and state management
  • Axios as the HTTP client
  • React Markdown for rendering markdown content

Key Features:

  • Responsive dashboard with system status monitoring
  • Repository analysis interface supporting GitHub, GitLab, Bitbucket, and ZIP uploads
  • Real-time pipeline progress tracking
  • Interactive C4 model viewer with four levels (System Context, Containers, Components, Code)
  • Chat interface for natural language codebase queries
  • Modern dark/light theme with Material Design principles

Application Structure:

txt

Backend Application###

Technology Stack:

  • NestJS framework with TypeScript
  • Tree-sitter for AST parsing (JavaScript, TypeScript, Python, Java)
  • Azure OpenAI for AI completions and embeddings
  • Azure AI Search for vector and hybrid search
  • Simple-git for Git repository operations
  • JSZip for archive handling
  • UUID for unique identifier generation

Core Services:

  1. IngestionService: Handles repository cloning and ZIP file extraction

    • Supports GitHub, GitLab, Bitbucket (public/private)
    • ZIP file upload and extraction
    • Repository validation and branch detection
  2. ParserService: AST-based code analysis

    • Multi-language support (JS, TS, Python, Java)
    • Extracts classes, functions, methods, imports, exports
    • Generates structured project representations
  3. SummarizationService: AI-powered code summarization

    • Azure OpenAI integration for code analysis
    • Generates summaries with complexity analysis
    • Extracts key features and dependencies
  4. EmbeddingService: Vector embedding generation

    • text-embedding-ada-002 model integration
    • Batch processing with rate limiting
    • 1536-dimensional vectors for semantic search
  5. SearchService: Azure AI Search integration

    • Hybrid search (keyword + vector)
    • Index management and document operations
    • Faceted search with filtering capabilities
  6. PipelineService: Orchestrates the analysis workflow

    • Multi-stage processing pipeline
    • Real-time progress tracking
    • Error handling and recovery
  7. AgentService: AI agent for C4 model generation

    • Tool-based architecture for codebase exploration
    • Structured C4 model output
    • Multi-level architecture analysis
  8. ChatService: RAG-based chat interface

    • Context-aware responses using retrieved code
    • Conversation history management
    • Source attribution and relevance scoring
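
As an illustration of the ChatService item above, here is a minimal sketch of how retrieved code chunks, conversation history, and source attribution might be combined into one prompt. The `RetrievedChunk`, `ChatTurn`, and `RagPrompt` shapes, the function name, and the default limits are all assumptions made for this example; the actual service may be structured differently.

```typescript
// Hypothetical shape of a chunk returned by the search layer.
interface RetrievedChunk {
  filePath: string;
  content: string;
  score: number; // relevance score from search, higher is better
}

interface ChatTurn {
  role: "user" | "assistant";
  content: string;
}

interface RagPrompt {
  messages: { role: "system" | "user" | "assistant"; content: string }[];
  sources: { filePath: string; score: number }[];
}

// Build a chat prompt from the top-scoring chunks plus recent history.
function buildRagPrompt(
  question: string,
  chunks: RetrievedChunk[],
  history: ChatTurn[] = [],
  maxChunks = 3,
  maxHistory = 6,
): RagPrompt {
  // Keep only the most relevant chunks so the prompt stays small.
  const top = [...chunks].sort((a, b) => b.score - a.score).slice(0, maxChunks);

  // Number each chunk so the model can cite it as [1], [2], ...
  const context = top
    .map((c, i) => `[${i + 1}] ${c.filePath}\n${c.content}`)
    .join("\n\n");

  const system =
    "Answer using only the numbered code excerpts below and cite them by number.\n\n" +
    context;

  return {
    messages: [
      { role: "system", content: system },
      ...history.slice(-maxHistory),
      { role: "user", content: question },
    ],
    // Returned alongside the prompt so the UI can show where the answer came from.
    sources: top.map((c) => ({ filePath: c.filePath, score: c.score })),
  };
}
```

Returning `sources` separately from `messages` keeps attribution independent of whatever the model writes in its answer.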

Data Flow and Processing Pipeline##

Stage 1: Repository Ingestion###

  • Repository URL validation and authentication
  • Git cloning or ZIP extraction to temporary directory
  • Metadata collection (size, file count, repository type)
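
The URL validation step could look roughly like the sketch below, which maps a repository URL to one of the three supported providers. The `RepoRef` type, the `parseRepoUrl` name, and the host table are assumptions for illustration; self-hosted instances and GitLab subgroups are deliberately not handled.

```typescript
type RepoProvider = "github" | "gitlab" | "bitbucket";

interface RepoRef {
  provider: RepoProvider;
  owner: string;
  name: string;
}

// Host-to-provider table; self-hosted instances would need extra entries.
const PROVIDER_HOSTS = new Map<string, RepoProvider>([
  ["github.com", "github"],
  ["gitlab.com", "gitlab"],
  ["bitbucket.org", "bitbucket"],
]);

// Returns the parsed reference, or null when the input is not a supported
// HTTPS repository URL.
function parseRepoUrl(raw: string): RepoRef | null {
  let url: URL;
  try {
    url = new URL(raw.trim());
  } catch {
    return null; // not a syntactically valid URL
  }
  if (url.protocol !== "https:") return null;

  const provider = PROVIDER_HOSTS.get(url.hostname.replace(/^www\./, ""));
  if (!provider) return null;

  // Expect /owner/repo, optionally followed by ".git" or further path segments.
  const parts = url.pathname.split("/").filter((p) => p.length > 0);
  if (parts.length < 2) return null;

  const owner = parts[0];
  const name = parts[1].replace(/\.git$/, "");
  if (!owner || !name) return null;
  return { provider, owner, name };
}
```

Rejecting unknown hosts before cloning keeps the ingestion step from attempting network access to arbitrary URLs.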

Stage 2: Code Parsing###

  • Tree-sitter AST parsing for supported languages
  • Extraction of code structures (classes, functions, methods)
  • Import/export relationship analysis
  • Generation of structured project representation
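
To make the import/export relationship step concrete, the sketch below inverts per-file import lists into a map of dependents. It assumes the parser has already produced a `ParsedFile` record per file with project-relative import paths; that shape and the `buildDependents` name are hypothetical, and the Tree-sitter parsing itself is not shown.

```typescript
// Hypothetical parser output for one file: its path and the project-relative
// paths of the modules it imports (already resolved).
interface ParsedFile {
  path: string;
  imports: string[];
}

// Invert the import lists into a "who depends on me" map.
function buildDependents(files: ParsedFile[]): Map<string, string[]> {
  const dependents = new Map<string, string[]>();

  // Register every project file first so files with no dependents still appear.
  for (const file of files) {
    if (!dependents.has(file.path)) dependents.set(file.path, []);
  }

  for (const file of files) {
    for (const target of file.imports) {
      const list = dependents.get(target);
      // Imports that point outside the project (packages, stdlib) are skipped.
      if (list && list.indexOf(file.path) === -1) list.push(file.path);
    }
  }
  return dependents;
}
```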

Stage 3: Code Summarization###

  • AI-powered analysis of code components
  • Generation of natural language summaries
  • Complexity assessment and feature extraction
  • Dependency identification

Stage 4: Vector Embedding Generation###

  • Conversion of summaries to vector embeddings
  • Batch processing for efficiency
  • 1536-dimensional vectors using Azure OpenAI
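
A minimal sketch of batched embedding with a pause between requests follows. The `embedBatch` callback stands in for the real Azure OpenAI call, and the batch size and pause length are placeholder values, not the project's settings. A fixed pause is the simplest form of rate limiting; a production service would more likely react to the provider's rate-limit responses.

```typescript
// Split items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("batch size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Stand-in for the real embeddings call; it must return one vector per
// input text, in the same order.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

async function embedAll(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize = 16,
  pauseMs = 200,
): Promise<number[][]> {
  const vectors: number[][] = [];
  const batches = toBatches(texts, batchSize);
  for (let i = 0; i < batches.length; i++) {
    const result = await embedBatch(batches[i]);
    if (result.length !== batches[i].length) {
      throw new Error("embedding count does not match batch size");
    }
    vectors.push(...result);
    // Pause between batches (not after the last one) to stay under rate limits.
    if (i < batches.length - 1) await sleep(pauseMs);
  }
  return vectors;
}
```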

Stage 5: Search Indexing###

  • Azure AI Search index creation/update
  • Document upload with metadata
  • Vector search configuration
  • Full-text and hybrid search setup
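
Hybrid search merges a keyword ranking and a vector ranking into one result list. Azure AI Search documents Reciprocal Rank Fusion (RRF) as its merging method and performs it inside the service, so the sketch below is only an illustration of the idea, not code the project needs to run. The function name is made up for this example, and `k = 60` is the constant commonly used in the RRF literature.

```typescript
// Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) per
// document, with rank starting at 1. Documents missing from a list
// contribute nothing for that list.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Example: document ids ranked by keyword match and by vector similarity.
const keywordRanking = ["a", "b", "c"];
const vectorRanking = ["b", "d", "a"];
const fused = reciprocalRankFusion([keywordRanking, vectorRanking]);
// "b" and "a" appear in both lists, so they outrank "d" and "c".
```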

Stage 6: C4 Model Generation###

  • AI agent-based architecture analysis
  • Multi-level C4 model creation:
    • C1: System Context (actors, external systems)
    • C2: Containers (applications, databases)
    • C3: Components (modules, services)
    • C4: Code (classes, functions)
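
One way to represent the four levels in code is a flat element list in which each element points at its parent one level up. The types and helper names below are assumptions for illustration and are not taken from the project's actual C4 model definition.

```typescript
type C4Level = "context" | "container" | "component" | "code";

interface C4Element {
  id: string;
  name: string;
  level: C4Level;
  description: string;
  parentId?: string; // enclosing element one level up; absent at the context level
}

interface C4Relationship {
  sourceId: string;
  targetId: string;
  description: string;
}

interface C4Model {
  elements: C4Element[];
  relationships: C4Relationship[];
}

// Return the elements nested directly under `parentId` (one drill-down step).
function childrenOf(model: C4Model, parentId: string): C4Element[] {
  return model.elements.filter((e) => e.parentId === parentId);
}

// Report relationships whose endpoints are not defined in the model, which is
// a useful sanity check on AI-generated output.
function danglingRelationships(model: C4Model): C4Relationship[] {
  const ids = new Set(model.elements.map((e) => e.id));
  return model.relationships.filter(
    (r) => !ids.has(r.sourceId) || !ids.has(r.targetId),
  );
}
```

A flat list with parent references keeps the model easy to serialize as JSON, and drill-down between levels becomes a simple filter.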

API Endpoints##

Analysis API (/api/v1/analysis)###

  • POST / - Create repository analysis
  • POST /upload - Upload ZIP file analysis
  • POST /branches - Get repository branches
  • GET /:id/status - Get analysis status
  • POST /:id/c4-model - Generate C4 model
  • GET /:id/c4-model - Retrieve C4 model
  • GET /:id/files - Get file structure
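
A client might wrap these routes in small path builders and poll the status endpoint until the analysis finishes. In the sketch below, only the paths come from the list above. The `AnalysisStatus` payload, its `state` values, and the polling defaults are assumptions, and `getJson` is a caller-supplied function (for example a thin wrapper around Axios or `fetch`).

```typescript
const ANALYSIS_BASE = "/api/v1/analysis";

// Path builders for the routes listed above; ids are URL-encoded.
const analysisRoutes = {
  create: () => ANALYSIS_BASE,
  upload: () => `${ANALYSIS_BASE}/upload`,
  branches: () => `${ANALYSIS_BASE}/branches`,
  status: (id: string) => `${ANALYSIS_BASE}/${encodeURIComponent(id)}/status`,
  c4Model: (id: string) => `${ANALYSIS_BASE}/${encodeURIComponent(id)}/c4-model`,
  files: (id: string) => `${ANALYSIS_BASE}/${encodeURIComponent(id)}/files`,
};

// Assumed status payload; the real field names and values may differ.
interface AnalysisStatus {
  state: "running" | "completed" | "failed";
}

// Poll the status route until the analysis leaves the "running" state.
async function waitForAnalysis(
  id: string,
  getJson: (path: string) => Promise<AnalysisStatus>,
  intervalMs = 2000,
  maxAttempts = 150,
): Promise<AnalysisStatus> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const status = await getJson(analysisRoutes.status(id));
    if (status.state !== "running") return status;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`analysis ${id} did not finish after ${maxAttempts} polls`);
}
```

Passing `getJson` in as a parameter keeps the polling logic independent of the HTTP client and easy to exercise with a fake.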

Chat API (/api/v1/chat)###

  • POST / - Send chat message
  • GET /:id/info - Get analysis info
  • GET /:id/suggestions - Get suggested questions

System API (/api)###

  • GET /health - System health check
  • GET / - Application information

Data Models##

Analysis Request###

typescript

Project Structure###

typescript

C4 Model###

typescript

Configuration##

Environment Variables###

bash

Frontend Configuration###

bash

File System Structure##

Temporary Analysis Storage###

txt

Azure AI Search Indexes###

  • Index naming: c4mcp-codebase-{analysis-id}
  • Vector dimensions: 1536 (text-embedding-ada-002)
  • Search algorithms: HNSW for vector search
  • Document schema includes code chunks, summaries, embeddings, metadata
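
The settings above could translate into an index definition along the following lines. The overall JSON layout is modeled on the Azure AI Search REST index format and should be checked against the API version in use. The field names (`filePath`, `content`, `summary`, `embedding`), the profile and algorithm names, and the id validation rule are assumptions for illustration; only the naming pattern, the 1536 dimensions, and HNSW come from this document.

```typescript
const EMBEDDING_DIMENSIONS = 1536; // text-embedding-ada-002

// Build the per-analysis index name. The id check is an assumption: it keeps
// the name to lowercase letters, digits, and single dashes.
function indexNameFor(analysisId: string): string {
  const id = analysisId.toLowerCase();
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(id)) {
    throw new Error(`analysis id is not usable in an index name: ${analysisId}`);
  }
  return `c4mcp-codebase-${id}`;
}

interface IndexField {
  name: string;
  type: string;
  key?: boolean;
  searchable?: boolean;
  filterable?: boolean;
  facetable?: boolean;
  dimensions?: number;
  vectorSearchProfile?: string;
}

function buildIndexDefinition(analysisId: string) {
  const fields: IndexField[] = [
    { name: "id", type: "Edm.String", key: true, filterable: true },
    { name: "filePath", type: "Edm.String", filterable: true, facetable: true },
    { name: "content", type: "Edm.String", searchable: true },
    { name: "summary", type: "Edm.String", searchable: true },
    {
      name: "embedding",
      type: "Collection(Edm.Single)",
      searchable: true,
      dimensions: EMBEDDING_DIMENSIONS,
      vectorSearchProfile: "default-vector-profile",
    },
  ];
  return {
    name: indexNameFor(analysisId),
    fields,
    vectorSearch: {
      algorithms: [{ name: "default-hnsw", kind: "hnsw" }],
      profiles: [{ name: "default-vector-profile", algorithm: "default-hnsw" }],
    },
  };
}
```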

Supported Languages and File Types##

Fully Supported (AST Parsing):

  • JavaScript (.js, .jsx)
  • TypeScript (.ts, .tsx)
  • Python (.py)
  • Java (.java)
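
Selecting a parser for each file can start from a simple extension lookup, sketched below. The `Language` type and `languageFor` name are hypothetical; the extension table mirrors the list above.

```typescript
type Language = "javascript" | "typescript" | "python" | "java";

const LANGUAGE_BY_EXTENSION = new Map<string, Language>([
  [".js", "javascript"],
  [".jsx", "javascript"],
  [".ts", "typescript"],
  [".tsx", "typescript"],
  [".py", "python"],
  [".java", "java"],
]);

// Returns the language for AST parsing, or null when the file type is not supported.
function languageFor(filePath: string): Language | null {
  // Take the last path segment, accepting both "/" and "\" separators.
  const fileName = filePath.split(/[\\/]/).pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return null; // no extension, or a dotfile such as ".gitignore"
  return LANGUAGE_BY_EXTENSION.get(fileName.slice(dot).toLowerCase()) ?? null;
}
```

Files that map to `null` can still be listed in the file structure; they are simply skipped by the AST stage.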

Repository Sources:

  • GitHub (public/private)
  • GitLab (public/private)
  • Bitbucket (public/private)
  • ZIP file upload (max 100MB)

Key Features##

  1. Multi-Source Repository Ingestion: Support for GitHub, GitLab, and Bitbucket repositories (public or private, with token authentication) and ZIP uploads
  2. Advanced Code Parsing: AST-based analysis with Tree-sitter for accurate code structure extraction
  3. AI-Powered Analysis: Azure OpenAI integration for intelligent code summarization and C4 generation
  4. Semantic Search: Vector embeddings with hybrid search capabilities
  5. Interactive C4 Models: Four-level architecture visualization with drill-down capabilities
  6. Natural Language Chat: RAG-based conversational interface for codebase exploration
  7. Real-time Processing: Live pipeline progress tracking with detailed status updates
  8. Comprehensive API: RESTful API with OpenAPI documentation

Technical Specifications##

  • Frontend Build: Vite-based development and production builds
  • Backend Framework: NestJS with dependency injection and modular architecture
  • Database: No dedicated database; the backend is stateless and relies on Azure AI Search for persistence
  • Authentication: Personal Access Token support for private repositories
  • File Processing: Streaming file operations with configurable batch sizes
  • Error Handling: Comprehensive error boundaries and graceful degradation
  • Performance: Optimized for large codebases with efficient chunking and batching

Architecture Diagram
