Week #3 #

Implemented MVP features #

Complete AI-Powered Legal Document Analysis Platform #

SmartClause is a functional MVP for legal document analysis, focused on the Russian legal market. The platform uses Retrieval-Augmented Generation (RAG) over a vector database of legal texts.

Core Implemented Features: #

Document Upload and Analysis Pipeline

Smart Upload Interface: Drag-and-drop document upload with progress tracking (up to 10MB files)
AI-Powered Risk Analysis: Comprehensive legal document analysis for risks, compliance issues, and recommendations
Detailed Results: Analysis with causes, risks, and actionable recommendations
Multiple File Format Support: Text files and structured documents
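
The 10MB upload limit can be checked before the analysis pipeline starts. The following is a minimal sketch, not the project's actual code: `validate_upload` is a hypothetical helper, the extension allow-list is an illustrative assumption (the report does not list the accepted formats), and 10MB is taken here to mean 10 * 1024 * 1024 bytes.

```python
from pathlib import Path

# Assumption: "10MB" is interpreted as 10 MiB.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024
# Assumption: illustrative allow-list; the real set of formats is not specified.
ALLOWED_EXTENSIONS = {".txt", ".pdf", ".docx"}


def validate_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of validation errors (empty when the upload is acceptable)."""
    errors = []
    if size_bytes <= 0:
        errors.append("file is empty")
    elif size_bytes > MAX_UPLOAD_BYTES:
        errors.append("file exceeds the 10MB limit")
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        errors.append("unsupported file type")
    return errors
```

Returning a list of messages rather than raising on the first problem lets the UI show all validation feedback at once.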

Semantic Legal Search Engine

Vector-based Search: Natural language queries against a comprehensive Civil Code database
Chunked Database: 190,000+ legal rules with 413,000+ optimized text chunks
BAAI/bge-m3 Embeddings: 1024-dimensional vectors for high-quality semantic understanding
Configurable Retrieval: Multiple distance functions (cosine, L2, inner product)
Advanced Retrieval Techniques: Vector Search together with BM25 via Reciprocal Rank Fusion
Structured Metadata: Articles organized by sections, chapters, and legal references
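
To make the three configurable distance functions concrete, the sketch below computes each of them in plain Python and maps them to the corresponding pgvector operators. The metric names (`cosine`, `l2`, `inner_product`) and the `distance` helper are assumptions for illustration; the report does not show how the service names or selects them.

```python
import math

# pgvector distance operators for the three metrics named above.
PGVECTOR_OPERATORS = {
    "cosine": "<=>",         # cosine distance
    "l2": "<->",             # Euclidean (L2) distance
    "inner_product": "<#>",  # negative inner product
}


def distance(a, b, metric="cosine"):
    """Compute the quantity pgvector orders by for the given metric."""
    dot = sum(x * y for x, y in zip(a, b))
    if metric == "l2":
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if metric == "inner_product":
        return -dot  # negated so that smaller still means "closer"
    if metric == "cosine":
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (norm_a * norm_b)
    raise ValueError(f"unknown metric: {metric}")
```

For unit-normalized embeddings all three metrics produce the same ranking, so the choice mainly matters when vectors are not normalized.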

Complete Web Application

Vue.js 3 Frontend: Modern responsive UI with the following main screens:
- Upload Screen: Intuitive document upload interface
- Results Screen: Comprehensive analysis results presentation
Full User Journey: Complete workflow from document upload to detailed analysis results

REST API Integration

Spring Boot Backend (Port 8000): Document processing and orchestration
FastAPI AI Service (Port 8001): RAG pipeline and LLM integration
Swagger Documentation: Interactive API documentation at /docs
Comprehensive Endpoints: Document analysis, semantic search, embedding generation

Production-Ready Infrastructure

Docker Compose Orchestration: Complete service deployment with one command
PostgreSQL with pgvector: Vector similarity search capabilities
Error Handling: Comprehensive validation and user feedback
Performance Optimization: Concurrent processing and retry mechanisms
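
A retry mechanism of the kind mentioned above is often built as exponential backoff around a flaky call such as an LLM or embedding request. The sketch below is an assumption about the general technique, not the project's implementation; `retry`, its parameters, and the injectable `sleep` argument are hypothetical.

```python
import time


def retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); on an exception wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.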

Functional User Journey: #

Document Upload: User drags and drops legal document onto web interface
Processing Initiation: System validates file and begins analysis pipeline
Real-time Feedback: User sees live processing status with progress updates
AI Analysis: Document analyzed using RAG with legal knowledge base
Results Presentation: Comprehensive analysis with risks and recommendations displayed

Demonstration of the working MVP #

MVP Demo

🌐 Live Demo: Platform deployed and accessible at SmartClause

API Documentation #

Analyzer API (Port 8001) #

GET /health: Health check with database connectivity status
POST /api/v1/retrieve-chunk: Semantic document chunk retrieval
POST /api/v1/retrieve-rules: Semantic retrieval of unique legal rules
POST /api/v1/analyze: Legal document analysis with risk assessment
POST /api/v1/embed: Text embedding generation
GET /api/v1/metrics/retrieval: Embedding quality metrics
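
As an illustration of calling the Analyzer API, the sketch below builds (but does not send) a request to the embedding endpoint using only the standard library. The host `localhost` and the JSON field name `text` are assumptions; the actual request schema is described in the Swagger documentation at /docs and may differ.

```python
import json
import urllib.request

ANALYZER_URL = "http://localhost:8001"  # port from the report; host is an assumption


def build_embed_request(text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to /api/v1/embed."""
    # Assumption: the JSON field name "text" is hypothetical.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ANALYZER_URL}/api/v1/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the service running, the request could be sent with `urllib.request.urlopen(build_embed_request("..."))`.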

Backend API (Port 8000) #

POST /api/v1/get_analysis: Document upload and analysis orchestration
GET /api/v1/health: Service health check

ML #

Model Architecture #

Embedding Model: BAAI/bge-m3

Model Type: Multilingual sentence transformer optimized for semantic similarity
Embedding Dimension: 1024-dimensional vectors
Language Support: Multilingual with strong Russian language capabilities
Use Case: Converting legal text chunks into semantic vectors for similarity search

LLM: Google Gemini 2.5 Flash

Integration: Via OpenRouter API for analysis generation
Purpose: Legal document analysis, risk assessment, and recommendation generation
Configuration: Optimized for legal domain with structured prompt engineering

Embedding Generation Process #

Legal Knowledge Base:

Dataset: Complete Russian Civil Code with comprehensive legal rules
Scale: 190,846 legal rules processed into 413,453 text chunks
Chunking Strategy: 800-character chunks with 500-character overlap for comprehensive coverage
Processing Pipeline: Rules → Chunks → Embeddings → Vector Database
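
The chunking strategy above amounts to a sliding window: with 800-character chunks and a 500-character overlap, each new chunk starts 300 characters after the previous one. A minimal sketch, assuming plain character-based splitting (the real script may also respect sentence or rule boundaries); `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 500) -> list[str]:
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # 300 characters with the report's settings
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

A large overlap increases the number of chunks per rule but reduces the chance that a relevant passage is cut in half at a chunk boundary.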

Embedding Generation Process:

Batch processing for efficiency:
Batch Size: 50-200 chunks per batch
Model: BAAI/bge-m3
Dimension: 1024
Processing: ~413k chunks with concurrent batch processing
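
Concurrent batch processing of this kind can be sketched with a thread pool. This is an illustration under assumptions, not the project's script: `embed_batch` stands for any callable that takes a list of strings and returns one vector per string (for example a wrapper around the BAAI/bge-m3 model), and `batched` and `embed_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(chunks, embed_batch, batch_size=100, max_workers=4):
    """Embed chunks batch by batch in worker threads, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order the batches were submitted.
        results = pool.map(embed_batch, batched(chunks, batch_size))
        return [vector for batch in results for vector in batch]
```

Because results come back in submission order, each vector can be matched to its chunk by position.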

Vector Database Configuration:

Storage: PostgreSQL with pgvector extension
Indexing: Optimized vector storage with efficient similarity search capabilities

Retrieval System #

Hybrid Retrieval Architecture:

Vector Search: Semantic similarity search using configurable distance functions (cosine, euclidean, dot product)
BM25 Integration: Traditional keyword-based search for exact term matching
Fusion Strategy: Reciprocal Rank Fusion (RRF) combining vector and BM25 results for optimal relevance
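
Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every ranked list in which it appears, so documents ranked highly by both vector search and BM25 rise to the top. A minimal sketch: the constant `k = 60` is a commonly used default and an assumption here, since the report does not state the value used, and `reciprocal_rank_fusion` is a hypothetical name.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids: score(d) = sum of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a vector ranking `["a", "b", "c"]` with a BM25 ranking `["b", "d", "a"]` puts `b` first, because it sits near the top of both lists. RRF uses only ranks, so the two retrievers' raw scores do not need to be on the same scale.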

Data Processing Scripts #

Link to processing code: analyzer/scripts/process_and_upload_datasets.py

Key Processing Features:

Unified script for embedding generation and database upload
Batch processing with memory optimization
Concurrent embedding generation
Progress tracking and error handling
Flexible configuration for different deployment scenarios

Model Artifacts #

Generated Artifacts:

chunks_with_embeddings.csv: ~1GB file with pre-generated embeddings (stored via Git LFS)
Vector Database: PostgreSQL tables with indexed embeddings for fast retrieval
Metadata Storage: Complete rule hierarchy and chunk relationships

Model Performance:

Embedding Quality: Optimized for legal domain semantic understanding
Retrieval Speed: Sub-second response times for similarity search
Scalability: Handles 400k+ vectors with efficient indexing

Internal demo #

Demo Results and Feedback #

Successfully Demonstrated:

Complete end-to-end document analysis workflow
Semantic search capabilities with relevant legal code retrieval

Key Strengths Identified:

Intuitive user interface with clear workflow
Fast and relevant semantic search results
Comprehensive legal analysis output

Areas for Future Enhancement:

User authentication and document history
Interactive chat interface for legal consultation

Technical Performance:

Platform handles large legal documents efficiently
Vector search provides relevant legal code matches
AI analysis generates comprehensive risk assessments
System demonstrates production-ready stability

Weekly commitments #

Individual contribution of each participant #

Alexander Malyy:

Reviewed team’s work and merged PRs
Optimized the processing pipeline (batch processing, async, concurrency, caching) (issue)
Co-authored PR #86 and PR #89
Created a report for week 3

Arthur Babkin:

Expanded legal dataset (issue)
Optimized prompt for LLM (issue)
Experimented with different document chunking techniques and reported the results (issue)

Vladimir Zhidkov:

Deployed the MVP (issue)

Ilsaf Abdulkhakov:

Fixed frontend issues (issue) and (issue)
Updated views (PR)
Recorded demo

Nikita Tsukanov:

Researched better retrieval techniques and analyzed the results (issue)
Created assessment data and criteria (issue)

Plan for Next Week #

Priority Tasks:

Message broker: Implement a message broker for communication between microservices
Finalize user flow: Define the user flow for the final product
Final UI/UX: Design the UI/UX for the final product
User Authentication: Implement user registration and login system

Confirmation of the code’s operability #

We confirm that the code in the main branch:

[+] Is in working condition.
[+] Runs via docker-compose (or another alternative described in the README.md).