Week #3 #
Implemented MVP features #
Complete AI-Powered Legal Document Analysis Platform #
SmartClause is a fully functional MVP that provides comprehensive legal document analysis capabilities focused on the Russian legal market. The platform leverages RAG (Retrieval-Augmented Generation) technology with legal vector databases.
Core Implemented Features: #
Document Upload and Analysis Pipeline
- Smart Upload Interface: Drag-and-drop document upload with progress tracking (up to 10MB files)
- AI-Powered Risk Analysis: Comprehensive legal document analysis for risks, compliance issues, and recommendations
- Detailed Results: Analysis with causes, risks, and actionable recommendations
- Multiple File Format Support: Text files and structured documents
Semantic Legal Search Engine
- Vector-based Search: Natural language queries against a comprehensive Civil Code database
- Chunked Database: 190,000+ legal rules with 413,000+ optimized text chunks
- BAAI/bge-m3 Embeddings: 1024-dimensional vectors for high-quality semantic understanding
- Configurable Retrieval: Multiple distance functions (cosine, L2, inner product)
- Advanced Retrieval Techniques: Vector Search together with BM25 via Reciprocal Rank Fusion
- Structured Metadata: Articles organized by sections, chapters, and legal references
Complete Web Application
- Vue.js 3 Frontend: Modern responsive UI with three main screens:
- Upload Screen: Intuitive document upload interface
- Results Screen: Comprehensive analysis results presentation
- Full User Journey: Complete workflow from document upload to detailed analysis results
REST API Integration
- Spring Boot Backend (Port 8000): Document processing and orchestration
- FastAPI AI Service (Port 8001): RAG pipeline and LLM integration
- Swagger Documentation: Interactive API documentation at /docs
- Comprehensive Endpoints: Document analysis, semantic search, embedding generation
Production-Ready Infrastructure
- Docker Compose Orchestration: Complete service deployment with one command
- PostgreSQL with pgvector: Vector similarity search capabilities
- Error Handling: Comprehensive validation and user feedback
- Performance Optimization: Concurrent processing and retry mechanisms
Functional User Journey: #
- Document Upload: User drags and drops legal document onto web interface
- Processing Initiation: System validates file and begins analysis pipeline
- Real-time Feedback: User sees live processing status with progress updates
- AI Analysis: Document analyzed using RAG with legal knowledge base
- Results Presentation: Comprehensive analysis with risks and recommendations displayed
Demonstration of the working MVP #
🌐 Live Demo: Platform deployed and accessible at SmartClause
API Documentation #
Analyzer API (Port 8001) #
- GET /health: Health check with database connectivity status
- POST /api/v1/retrieve-chunk: Semantic document chunk retrieval
- POST /api/v1/retrieve-rules: Semantic retrieval of unique legal rules
- POST /api/v1/analyze: Legal document analysis with risk assessment
- POST /api/v1/embed: Text embedding generation
- GET /api/v1/metrics/retrieval: Embedding quality metrics
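As an illustration of how a client might call the retrieval endpoint, the sketch below builds (without sending) a POST request for /api/v1/retrieve-chunk. The host `localhost`, the JSON field names `query` and `k`, and the helper name `build_retrieve_chunk_request` are assumptions made for this example; the actual request schema is the one shown in the Swagger documentation at /docs.

```python
import json
import urllib.request

# Assumption: the analyzer service is reachable locally on port 8001.
ANALYZER_URL = "http://localhost:8001"


def build_retrieve_chunk_request(query: str, k: int = 5) -> urllib.request.Request:
    """Build a POST request for /api/v1/retrieve-chunk.

    The payload fields ("query", "k") are hypothetical; check /docs for the real schema.
    """
    payload = json.dumps({"query": query, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url=f"{ANALYZER_URL}/api/v1/retrieve-chunk",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_retrieve_chunk_request("расторжение договора аренды")
# To send it against a running service: urllib.request.urlopen(request)
```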
Backend API (Port 8000) #
- POST /api/v1/get_analysis: Document upload and analysis orchestration
- GET /api/v1/health: Service health check
ML #
Model Architecture #
Embedding Model: BAAI/bge-m3
- Model Type: Multilingual sentence transformer optimized for semantic similarity
- Embedding Dimension: 1024-dimensional vectors
- Language Support: Multilingual with strong Russian language capabilities
- Use Case: Converting legal text chunks into semantic vectors for similarity search
LLM: Google Gemini 2.5 Flash
- Integration: Via OpenRouter API for analysis generation
- Purpose: Legal document analysis, risk assessment, and recommendation generation
- Configuration: Optimized for legal domain with structured prompt engineering
Embedding Generation Process #
Legal Knowledge Base:
- Dataset: Complete Russian Civil Code with comprehensive legal rules
- Scale: 190,846 legal rules processed into 413,453 text chunks
- Chunking Strategy: 800-character chunks with 500-character overlap for comprehensive coverage
- Processing Pipeline: Rules → Chunks → Embeddings → Vector Database
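The chunking step can be sketched as a sliding character window. This is a minimal illustration that assumes plain fixed-size windows with the reported 800/500 settings; the function name `chunk_text` is hypothetical, and the project's script may split differently (for example on sentence or article boundaries).

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 500) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # 300 characters with the reported settings
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Placeholder text standing in for a legal rule (2,000 characters).
rule_text = "".join(str(i % 10) for i in range(2000))
chunks = chunk_text(rule_text)
```

With an 800-character window and a 500-character overlap, each new chunk advances by only 300 characters, which is one reason the chunk count (413,453) is more than double the rule count (190,846).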
Embedding Generation Process:
# Batch processing for efficiency
Batch Size: 50-200 chunks per batch
Model: BAAI/bge-m3
Dimension: 1024
Processing: ~413k chunks with concurrent batch processing
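To give a sense of what the batch sizes above imply, the following sketch splits a list into batches and computes how many model calls the reported chunk count would need at each end of the 50-200 range. The helper `iter_batches` is a hypothetical name, not taken from the project's script.

```python
import math
from typing import Iterator, Sequence

TOTAL_CHUNKS = 413_453  # chunk count reported above


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Number of embedding calls needed at both ends of the reported batch-size range.
batches_at_50 = math.ceil(TOTAL_CHUNKS / 50)    # 8270
batches_at_200 = math.ceil(TOTAL_CHUNKS / 200)  # 2068

sample_batches = list(iter_batches([f"chunk-{i}" for i in range(7)], batch_size=3))
```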
Vector Database Configuration:
- Storage: PostgreSQL with pgvector extension
- Indexing: Optimized vector storage with efficient similarity search capabilities
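For reference, pgvector exposes one operator per distance function: `<->` for L2 distance, `<#>` for negative inner product, and `<=>` for cosine distance. The sketch below shows one way a configurable query could be assembled; the table name `rule_chunks`, the column name `embedding`, and the function `build_similarity_query` are hypothetical placeholders, not the project's actual schema.

```python
# pgvector distance operators (provided by the extension itself).
DISTANCE_OPERATORS = {
    "l2": "<->",             # Euclidean (L2) distance
    "inner_product": "<#>",  # negative inner product
    "cosine": "<=>",         # cosine distance
}


def build_similarity_query(distance: str = "cosine",
                           table: str = "rule_chunks",
                           column: str = "embedding") -> str:
    """Return a parameterized nearest-neighbour query string.

    Table and column names are hypothetical placeholders.
    """
    try:
        operator = DISTANCE_OPERATORS[distance]
    except KeyError:
        raise ValueError(f"unsupported distance function: {distance}") from None
    return (
        f"SELECT * FROM {table} "
        f"ORDER BY {column} {operator} %(query_vector)s::vector "
        f"LIMIT %(k)s"
    )


query = build_similarity_query("cosine")
```

The query vector and `k` are left as named parameters so that a database driver can bind them safely; only the operator, which comes from a fixed whitelist, is interpolated into the SQL text.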
Retrieval System #
Hybrid Retrieval Architecture:
- Vector Search: Semantic similarity search using configurable distance functions (cosine, euclidean, dot product)
- BM25 Integration: Traditional keyword-based search for exact term matching
- Fusion Strategy: Reciprocal Rank Fusion (RRF) combining vector and BM25 results for optimal relevance
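Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over every ranked list in which it appears. A minimal sketch follows; the constant k = 60 is the commonly used default rather than a value stated in this report, and the document IDs are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score, highest first; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["art-450", "art-451", "art-619"]  # hypothetical vector-search results
bm25_hits = ["art-619", "art-450", "art-310"]    # hypothetical BM25 results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# fused == ["art-450", "art-619", "art-451", "art-310"]
```

Because RRF uses only ranks, it needs no normalization between vector distances and BM25 scores, which sit on different scales.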
Data Processing Scripts #
Link to processing code: analyzer/scripts/process_and_upload_datasets.py
Key Processing Features:
- Unified script for embedding generation and database upload
- Batch processing with memory optimization
- Concurrent embedding generation
- Progress tracking and error handling
- Flexible configuration for different deployment scenarios
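The combination of concurrent embedding generation and error handling listed above could look roughly like the sketch below. It is not the project's script: `embed_with_retry`, `embed_all`, and the stand-in `fake_embed` are hypothetical names, and a thread pool with exponential backoff is just one reasonable choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Batch = list[str]
Vectors = list[list[float]]


def embed_with_retry(embed: Callable[[Batch], Vectors], batch: Batch,
                     attempts: int = 3, base_delay: float = 0.1) -> Vectors:
    """Call embed(batch), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return embed(batch)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


def embed_all(embed: Callable[[Batch], Vectors], batches: list[Batch],
              workers: int = 4) -> list[Vectors]:
    """Embed batches concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: embed_with_retry(embed, b), batches))


def fake_embed(batch: Batch) -> Vectors:
    """Stand-in for the real model: one 1024-dimensional zero vector per chunk."""
    return [[0.0] * 1024 for _ in batch]


results = embed_all(fake_embed, [["a", "b"], ["c"]])
```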
Model Artifacts #
Generated Artifacts:
- chunks_with_embeddings.csv: ~1GB file with pre-generated embeddings (stored via Git LFS)
- Vector Database: PostgreSQL tables with indexed embeddings for fast retrieval
- Metadata Storage: Complete rule hierarchy and chunk relationships
Model Performance:
- Embedding Quality: Optimized for legal domain semantic understanding
- Retrieval Speed: Sub-second response times for similarity search
- Scalability: Handles 400k+ vectors with efficient indexing
Internal demo #
Demo Results and Feedback #
Successfully Demonstrated:
- Complete end-to-end document analysis workflow
- Semantic search capabilities with relevant legal code retrieval
Key Strengths Identified:
- Intuitive user interface with clear workflow
- Fast and relevant semantic search results
- Comprehensive legal analysis output
Areas for Future Enhancement:
- User authentication and document history
- Interactive chat interface for legal consultation
Technical Performance:
- Platform handles large legal documents efficiently
- Vector search provides relevant legal code matches
- AI analysis generates comprehensive risk assessments
- System demonstrates production-ready stability
Weekly commitments #
Individual contribution of each participant #
Alexander Malyy:
- Reviewed the team’s work and merged PRs
- Optimized the processing pipeline (batch processing, async, concurrency, caching) (issue)
- Co-authored PR #86 and PR #89
- Wrote the week 3 report
Arthur Babkin:
- Expanded legal dataset (issue)
- Optimized prompt for LLM (issue)
- Experimented and provided report on different document chunking techniques (issue)
Vladimir Zhidkov:
- Deployed the MVP (issue)
Ilsaf Abdulkhakov:
Nikita Tsukanov:
- Researched improved retrieval techniques and analyzed the results (issue)
- Created assessment data and evaluation criteria (issue)
Plan for Next Week #
Priority Tasks:
- Message broker: Implement a message broker for communication between microservices
- Finalize user flow: Define the user flow for the final product
- Design the final UI/UX: Create the UI/UX design for the final product
- User Authentication: Implement user registration and login system
Confirmation of the code’s operability #
We confirm that the code in the main branch:
- [+] In working condition.
- [+] Run via docker-compose (or another alternative described in the README.md).