AutoC4 Project Documentation

Project Overview

AutoC4 is a codebase analysis platform that automatically generates C4 architecture models from source code repositories. It combines AST-based code parsing, AI-generated summaries, vector embeddings, and hybrid search.

Architecture

Frontend Application

Technology Stack:

React 18.2.0 with TypeScript
Material-UI (MUI) v5.15.15 for UI components
Vite as the build tool and development server
React Router DOM v6.22.3 for navigation
TanStack React Query v5.28.9 for data fetching and state management
Axios as the HTTP client
React Markdown for rendering markdown content

Key Features:

Responsive dashboard with system status monitoring
Repository analysis interface supporting GitHub, GitLab, Bitbucket, and ZIP uploads
Real-time pipeline progress tracking
Interactive C4 model viewer with four levels (System Context, Containers, Components, Code)
Chat interface for natural language codebase queries
Modern dark/light theme with Material Design principles

Application Structure:

txt

Backend Application

Technology Stack:

NestJS framework with TypeScript
Tree-sitter for AST parsing (JavaScript, TypeScript, Python, Java)
Azure OpenAI for AI completions and embeddings
Azure AI Search for vector and hybrid search
Simple-git for Git repository operations
JSZip for archive handling
UUID for unique identifier generation

Core Services:

IngestionService: Handles repository cloning and ZIP file extraction
- Supports GitHub, GitLab, Bitbucket (public/private)
- ZIP file upload and extraction
- Repository validation and branch detection
ParserService: AST-based code analysis
- Multi-language support (JS, TS, Python, Java)
- Extracts classes, functions, methods, imports, exports
- Generates structured project representations
SummarizationService: AI-powered code summarization
- Azure OpenAI integration for code analysis
- Generates summaries with complexity analysis
- Extracts key features and dependencies
EmbeddingService: Vector embedding generation
- text-embedding-ada-002 model integration
- Batch processing with rate limiting
- 1536-dimensional vectors for semantic search
SearchService: Azure AI Search integration
- Hybrid search (keyword + vector)
- Index management and document operations
- Faceted search with filtering capabilities
PipelineService: Orchestrates the analysis workflow
- Multi-stage processing pipeline
- Real-time progress tracking
- Error handling and recovery
AgentService: AI agent for C4 model generation
- Tool-based architecture for codebase exploration
- Structured C4 model output
- Multi-level architecture analysis
ChatService: RAG-based chat interface
- Context-aware responses using retrieved code
- Conversation history management
- Source attribution and relevance scoring
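
The ChatService bullets describe retrieval-augmented generation: retrieved code is ranked by relevance, placed into the prompt, and attributed in the answer. A minimal sketch of the prompt-assembly step is shown here. The `RetrievedChunk` shape, the `buildRagPrompt` name, and the prompt wording are all hypothetical and are not taken from the AutoC4 source.

```typescript
// Hypothetical shape of a search hit; the real ChatService may use other fields.
interface RetrievedChunk {
  filePath: string;
  content: string;
  score: number; // relevance score returned by the search layer
}

// Keep the highest-scoring chunks, number them, and return the prompt
// together with the list of sources so the answer can be attributed.
function buildRagPrompt(
  question: string,
  chunks: RetrievedChunk[],
  maxChunks = 5, // assumed default, chosen only for illustration
): { prompt: string; sources: string[] } {
  const top = chunks
    .slice() // copy so the caller's array is not reordered
    .sort((a, b) => b.score - a.score)
    .slice(0, maxChunks);

  const context = top
    .map((c, i) => `[${i + 1}] ${c.filePath}\n${c.content}`)
    .join("\n\n");

  const prompt =
    "Answer the question using only the numbered code excerpts below. " +
    "Cite excerpts by number.\n\n" +
    `${context}\n\nQuestion: ${question}`;

  return { prompt, sources: top.map((c) => c.filePath) };
}
```

Returning `sources` alongside the prompt lets the caller show which files informed the answer without parsing the model output.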

Data Flow and Processing Pipeline

Stage 1: Repository Ingestion

Repository URL validation and authentication
Git cloning or ZIP extraction to temporary directory
Metadata collection (size, file count, repository type)
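
The URL validation step can be sketched as follows. This is an illustration under assumptions, not the IngestionService implementation: the `parseRepoUrl` name, the `RepoRef` shape, and the restriction to the public `github.com`, `gitlab.com`, and `bitbucket.org` hosts over HTTPS are all hypothetical. Self-hosted instances and nested GitLab groups are deliberately out of scope.

```typescript
// Hypothetical types and names, used only for illustration.
type RepoProvider = "github" | "gitlab" | "bitbucket";

interface RepoRef {
  provider: RepoProvider;
  owner: string;
  name: string;
}

// Assumption: only the public SaaS hosts are recognised.
const PROVIDER_HOSTS = new Map<string, RepoProvider>([
  ["github.com", "github"],
  ["gitlab.com", "gitlab"],
  ["bitbucket.org", "bitbucket"],
]);

// Returns the parsed reference, or null when the URL is not a supported repository URL.
function parseRepoUrl(raw: string): RepoRef | null {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return null; // not a syntactically valid URL
  }
  if (url.protocol !== "https:") return null;

  const provider = PROVIDER_HOSTS.get(url.hostname);
  if (!provider) return null;

  // Expect a path of the form /owner/name or /owner/name.git
  const [owner, repo] = url.pathname.split("/").filter(Boolean);
  if (!owner || !repo) return null;

  const name = repo.replace(/\.git$/, "");
  if (!name) return null;

  return { provider, owner, name };
}
```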

Stage 2: Code Parsing

Tree-sitter AST parsing for supported languages
Extraction of code structures (classes, functions, methods)
Import/export relationship analysis
Generation of structured project representation
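
To illustrate the import/export relationship analysis, the sketch below turns per-file import lists into dependency edges between files. It assumes the AST pass has already produced the import specifiers. The `ParsedFile` shape, the function names, and the list of candidate extensions are hypothetical, and only relative JavaScript/TypeScript-style imports are resolved.

```typescript
// Hypothetical output of the parsing stage for one file.
interface ParsedFile {
  path: string; // repository-relative path, using "/" separators
  imports: string[]; // module specifiers exactly as written in the source
}

// Resolve a relative specifier such as "../utils" against the importing file's directory.
function resolveRelative(fromFile: string, specifier: string): string {
  const segments = fromFile.split("/").slice(0, -1);
  for (const part of specifier.split("/")) {
    if (part === "." || part === "") continue;
    if (part === "..") segments.pop();
    else segments.push(part);
  }
  return segments.join("/");
}

// Assumed resolution order; real resolvers consult tsconfig/package.json as well.
const CANDIDATE_SUFFIXES = ["", ".ts", ".tsx", ".js", ".jsx", "/index.ts", "/index.js"];

// Returns [importer, imported] pairs for imports that resolve to a known file.
function buildImportEdges(files: ParsedFile[]): Array<[string, string]> {
  const known = new Set(files.map((f) => f.path));
  const edges: Array<[string, string]> = [];
  for (const file of files) {
    for (const spec of file.imports) {
      if (!spec.startsWith(".")) continue; // package imports are not file edges
      const base = resolveRelative(file.path, spec);
      const target = CANDIDATE_SUFFIXES.map((s) => base + s).filter((p) => known.has(p))[0];
      if (target) edges.push([file.path, target]);
    }
  }
  return edges;
}
```

Edges of this kind are one plausible input for the later component-level (C3) analysis, since they show which modules depend on which.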

Stage 3: Code Summarization

AI-powered analysis of code components
Generation of natural language summaries
Complexity assessment and feature extraction
Dependency identification

Stage 4: Vector Embedding Generation

Conversion of summaries to vector embeddings
Batch processing for efficiency
1536-dimensional vectors using Azure OpenAI
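
Batch processing with a pause between requests can be sketched as below. The embedding call is injected as `embedFn` so the example stays self-contained; in AutoC4 it would presumably wrap the Azure OpenAI embeddings call. The function names, the batch size of 16, and the 200 ms pause are assumptions chosen for illustration, not documented settings.

```typescript
// Injected embedding function: takes texts, returns one vector per text.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Embed texts batch by batch, waiting `pauseMs` between requests as a simple rate limit.
async function embedInBatches(
  texts: string[],
  embedFn: EmbedFn,
  batchSize = 16, // assumed value
  pauseMs = 200, // assumed value
): Promise<number[][]> {
  const vectors: number[][] = [];
  let first = true;
  for (const batch of chunk(texts, batchSize)) {
    if (!first) await sleep(pauseMs);
    first = false;
    const result = await embedFn(batch);
    if (result.length !== batch.length) {
      throw new Error("embedFn returned a different number of vectors than inputs");
    }
    for (const vector of result) vectors.push(vector);
  }
  return vectors;
}
```

A fixed pause is the simplest form of rate limiting; a production service would more likely back off in response to throttling errors from the API.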

Stage 5: Search Indexing

Azure AI Search index creation/update
Document upload with metadata
Vector search configuration
Full-text and hybrid search setup
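
Hybrid search merges a keyword ranking and a vector ranking into one result list. Azure AI Search does this server-side using Reciprocal Rank Fusion (RRF), so AutoC4 does not need to implement it. The sketch below only illustrates the idea. The function name is hypothetical, and `k = 60` is the constant commonly used with RRF.

```typescript
// Reciprocal Rank Fusion: each document scores 1 / (k + rank) in every
// ranking it appears in, and the per-ranking scores are summed.
function reciprocalRankFusion(
  rankings: string[][], // each inner array is a list of document ids, best first
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Example: "b" is ranked highly by both searches, so it ends up first.
const keywordHits = ["a", "b", "c"];
const vectorHits = ["b", "d", "a"];
const fused = reciprocalRankFusion([keywordHits, vectorHits]);
// fused order: b, a, d, c
```

Because RRF uses only ranks, it can combine keyword scores and vector similarities without having to put them on a common scale.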

Stage 6: C4 Model Generation

AI agent-based architecture analysis
Multi-level C4 model creation:
- C1: System Context (actors, external systems)
- C2: Containers (applications, databases)
- C3: Components (modules, services)
- C4: Code (classes, functions)

API Endpoints

Analysis API (`/api/v1/analysis`)

POST / - Create repository analysis
POST /upload - Upload ZIP file analysis
POST /branches - Get repository branches
GET /:id/status - Get analysis status
POST /:id/c4-model - Generate C4 model
GET /:id/c4-model - Retrieve C4 model
GET /:id/files - Get file structure

Chat API (`/api/v1/chat`)

POST / - Send chat message
GET /:id/info - Get analysis info
GET /:id/suggestions - Get suggested questions

System API (`/api`)

GET /health - System health check
GET / - Application information
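
For client code, the routes listed above can be collected in one small helper. The paths come from the endpoint lists; the `createApiPaths` name and the object layout are hypothetical. Request and response bodies are not shown because they are not specified here.

```typescript
// Builds full URLs for the documented routes from a base URL.
// Analysis ids are URL-encoded before being placed in the path.
function createApiPaths(baseUrl: string) {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const analysis = `${root}/api/v1/analysis`;
  const chat = `${root}/api/v1/chat`;
  const enc = (id: string) => encodeURIComponent(id);

  return {
    // Analysis API
    createAnalysis: analysis, // POST
    uploadZip: `${analysis}/upload`, // POST
    branches: `${analysis}/branches`, // POST
    status: (id: string) => `${analysis}/${enc(id)}/status`, // GET
    c4Model: (id: string) => `${analysis}/${enc(id)}/c4-model`, // POST to generate, GET to retrieve
    files: (id: string) => `${analysis}/${enc(id)}/files`, // GET
    // Chat API
    sendMessage: chat, // POST
    chatInfo: (id: string) => `${chat}/${enc(id)}/info`, // GET
    suggestions: (id: string) => `${chat}/${enc(id)}/suggestions`, // GET
    // System API
    health: `${root}/api/health`, // GET
    appInfo: `${root}/api`, // GET
  };
}
```

Keeping the paths in one place means a route change only has to be made once in frontend code.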

Data Models

Analysis Request

typescript

Project Structure

typescript

C4 Model

typescript

Configuration

Environment Variables

bash

Frontend Configuration

bash

File System Structure

Temporary Analysis Storage

txt

Azure AI Search Indexes

Index naming: c4mcp-codebase-{analysis-id}
Vector dimensions: 1536 (text-embedding-ada-002)
Search algorithms: HNSW for vector search
Document schema includes code chunks, summaries, embeddings, metadata
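
A per-analysis index along these lines could be described as in the sketch below. It is modeled on the JSON body of the Azure AI Search REST "Create Index" operation. The field names (`filePath`, `content`, `summary`, `embedding`, and so on), the profile and algorithm names, and the cosine metric are assumptions, not AutoC4's actual schema. Property names should be checked against the API version in use, and the JavaScript SDK names some of them differently.

```typescript
// Loosely typed field description, covering only the properties used here.
interface IndexField {
  name: string;
  type: string;
  key?: boolean;
  searchable?: boolean;
  filterable?: boolean;
  facetable?: boolean;
  dimensions?: number; // vector fields only
  vectorSearchProfile?: string; // vector fields only
}

const EMBEDDING_DIMENSIONS = 1536; // text-embedding-ada-002

// Index names follow the documented pattern; Azure AI Search requires lowercase names.
function indexNameFor(analysisId: string): string {
  return `c4mcp-codebase-${analysisId.toLowerCase()}`;
}

function buildIndexDefinition(analysisId: string) {
  const fields: IndexField[] = [
    { name: "id", type: "Edm.String", key: true, filterable: true },
    { name: "filePath", type: "Edm.String", searchable: true, filterable: true, facetable: true },
    { name: "language", type: "Edm.String", filterable: true, facetable: true },
    { name: "content", type: "Edm.String", searchable: true },
    { name: "summary", type: "Edm.String", searchable: true },
    {
      name: "embedding",
      type: "Collection(Edm.Single)",
      searchable: true,
      dimensions: EMBEDDING_DIMENSIONS,
      vectorSearchProfile: "hnsw-profile",
    },
  ];
  return {
    name: indexNameFor(analysisId),
    fields,
    vectorSearch: {
      algorithms: [{ name: "hnsw-config", kind: "hnsw", hnswParameters: { metric: "cosine" } }],
      profiles: [{ name: "hnsw-profile", algorithm: "hnsw-config" }],
    },
  };
}
```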

Supported Languages and File Types

Fully Supported (AST Parsing):

JavaScript (.js, .jsx)
TypeScript (.ts, .tsx)
Python (.py)
Java (.java)
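
The extension list above maps directly to a lookup that decides which files reach the parser. A minimal sketch follows; the `detectLanguage` name and the `Language` type are hypothetical, and matching extensions case-insensitively is an assumption.

```typescript
type Language = "javascript" | "typescript" | "python" | "java";

// Extensions taken from the supported-language list.
const EXTENSION_TO_LANGUAGE: { [ext: string]: Language | undefined } = {
  ".js": "javascript",
  ".jsx": "javascript",
  ".ts": "typescript",
  ".tsx": "typescript",
  ".py": "python",
  ".java": "java",
};

// Returns the language for a file path, or null when the file is not AST-parsed.
function detectLanguage(filePath: string): Language | null {
  const fileName = filePath.split("/").pop() ?? "";
  const dot = fileName.lastIndexOf(".");
  if (dot <= 0) return null; // no extension, or a dotfile such as ".gitignore"
  const ext = fileName.slice(dot).toLowerCase();
  return EXTENSION_TO_LANGUAGE[ext] ?? null;
}
```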

Repository Sources:

GitHub (public/private)
GitLab (public/private)
Bitbucket (public/private)
ZIP file upload (max 100 MB)

Key Features

Multi-Source Repository Ingestion: Support for major Git providers with authentication
Advanced Code Parsing: AST-based analysis with Tree-sitter for accurate code structure extraction
AI-Powered Analysis: Azure OpenAI integration for intelligent code summarization and C4 generation
Semantic Search: Vector embeddings with hybrid search capabilities
Interactive C4 Models: Four-level architecture visualization with drill-down capabilities
Natural Language Chat: RAG-based conversational interface for codebase exploration
Real-time Processing: Live pipeline progress tracking with detailed status updates
Comprehensive API: RESTful API with OpenAPI documentation

Technical Specifications

Frontend Build: Vite-based development and production builds
Backend Framework: NestJS with dependency injection and modular architecture
Database: No dedicated database; the backend is stateless and uses Azure AI Search for persistence
Authentication: Personal Access Token support for private repositories
File Processing: Streaming file operations with configurable batch sizes
Error Handling: Comprehensive error boundaries and graceful degradation
Performance: Optimized for large codebases with efficient chunking and batching

Architecture Diagram
