Building Production-Ready Real-Time Chat
Project Context
At HealthSIA, we built a real-time chat system for 5,000+ concurrent users. It supports:
- Patient-therapist communication
- Clinical staff coordination
- HIPAA-compliant messaging
- Multi-device support
Architecture Decisions
1. WebSocket vs HTTP Polling
Decision: WebSocket (Socket.io)
Why?
- Bidirectional communication was needed
- Lower latency (our requirement was under 200 ms)
- Reduced server load compared with repeated polling requests
- Built-in reconnection logic in Socket.io
Trade-offs:
- More complex infrastructure
- Stateful connections, which make horizontal scaling harder
- Redis pub/sub is needed to relay messages between servers
Architecture Diagram

2. Message Storage
Decision: MySQL with optimized schema
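CREATE TABLE messages (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  room_id VARCHAR(255) NOT NULL,
  sender_id BIGINT NOT NULL,
  -- recipient_id is required by idx_recipient_seen below
  recipient_id BIGINT NOT NULL,
  content TEXT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  seen_at TIMESTAMP NULL,
  -- Serves "latest messages in a room" queries
  INDEX idx_room_created (room_id, created_at),
  -- Serves "unread messages for a user" queries (seen_at IS NULL)
  INDEX idx_recipient_seen (recipient_id, seen_at)
);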
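One way the `idx_room_created` index might be used is keyset pagination over a room's history. The sketch below only builds a parameterized query; `buildHistoryQuery`, its parameters, and the page-size cap of 200 are hypothetical and not part of the original system. Running the query is left to whichever MySQL client is in use.

```javascript
// Hypothetical helper: builds a parameterized query for one page of room history.
// The (room_id, created_at) index can serve both the filter and the ordering.
function buildHistoryQuery(roomId, beforeTimestamp = null, limit = 50) {
  const params = [roomId];
  let sql = 'SELECT id, sender_id, content, created_at FROM messages WHERE room_id = ?';
  if (beforeTimestamp !== null) {
    // Keyset pagination: fetch only messages older than the oldest one already shown.
    sql += ' AND created_at < ?';
    params.push(beforeTimestamp);
  }
  sql += ' ORDER BY created_at DESC LIMIT ?';
  // Clamp the page size so a client cannot request an unbounded result set.
  params.push(Math.min(Math.max(1, limit), 200));
  return { sql, params };
}

const firstPage = buildHistoryQuery('room_12_34');
const nextPage = buildHistoryQuery('room_12_34', '2024-01-01 00:00:00', 20);
console.log(firstPage.sql, firstPage.params);
console.log(nextPage.sql, nextPage.params);
```

Paging on `created_at` alone can skip rows that share a timestamp; adding `id` as a tiebreaker would be one way to handle that.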
3. Scaling Strategy
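We use Redis pub/sub for inter-server communication, so a message emitted on one server reaches clients connected to any other server.
const redisAdapter = require('socket.io-redis');
const io = require('socket.io')(server, {
  // The Redis adapter relays room broadcasts between server instances
  adapter: redisAdapter({ host: process.env.REDIS_HOST })
});
// Event handlers are registered per socket, inside the connection callback
io.on('connection', (socket) => {
  socket.on('send_message', async (data) => {
    // Persist first so every delivered message also exists in MySQL
    const saved = await saveMessage(data);
    io.to(data.roomId).emit('new_message', saved);
  });
});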
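To make the fan-out concrete, here is a minimal in-memory sketch of the idea behind the Redis adapter: each server publishes room events to a shared channel and delivers incoming events to its own local clients. `FakeBroker` and `ChatServer` are hypothetical stand-ins (an `EventEmitter` plays the role of Redis); this is an illustration of the pattern, not the adapter's actual implementation.

```javascript
const { EventEmitter } = require('node:events');

// Stand-in for Redis pub/sub: every subscriber sees every published event.
class FakeBroker extends EventEmitter {}

class ChatServer {
  constructor(name, broker) {
    this.name = name;
    this.broker = broker;
    this.rooms = new Map(); // roomId -> Set of local delivery callbacks
    // Deliver events published by any server (including this one) to local clients.
    broker.on('room_event', ({ roomId, event, payload }) => {
      for (const deliver of this.rooms.get(roomId) ?? []) deliver(event, payload);
    });
  }

  // Registers a local client (represented by a callback) in a room.
  join(roomId, deliver) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Set());
    this.rooms.get(roomId).add(deliver);
  }

  // Similar in spirit to io.to(roomId).emit(event, payload).
  emitToRoom(roomId, event, payload) {
    this.broker.emit('room_event', { roomId, event, payload });
  }
}

const broker = new FakeBroker();
const serverA = new ChatServer('A', broker);
const serverB = new ChatServer('B', broker);

const received = [];
serverA.join('room_1_2', (event, payload) => received.push(['A', event, payload.text]));
serverB.join('room_1_2', (event, payload) => received.push(['B', event, payload.text]));

// A message sent through server A also reaches the client connected to server B.
serverA.emitToRoom('room_1_2', 'new_message', { text: 'hello' });
console.log(received);
```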
Key Implementation
Room-Based Architecture
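// Consistent room IDs: sorting makes the result independent of argument order
function generateRoomId(userId1, userId2) {
  const ids = [userId1, userId2].sort();
  return `room_${ids[0]}_${ids[1]}`;
}
// Join rooms once the client authenticates (runs inside the io.on('connection') callback)
socket.on('authenticate', async (token) => {
  try {
    const user = await verifyToken(token);
    const rooms = await getUserRooms(user.id);
    rooms.forEach(room => socket.join(room.id));
  } catch (err) {
    // Drop the connection on an invalid token instead of leaving an unhandled rejection
    socket.disconnect(true);
  }
});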
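Because the room ID encodes both participants, it can also back a simple permission check. The following is a minimal sketch assuming numeric user IDs and two-person rooms; `canAccessRoom` is a hypothetical helper, and `generateRoomId` is repeated only so the example runs on its own.

```javascript
// Copy of the room ID helper from the article, so this sketch is self-contained.
function generateRoomId(userId1, userId2) {
  const ids = [userId1, userId2].sort();
  return `room_${ids[0]}_${ids[1]}`;
}

// Hypothetical check: a user may access a direct-message room only if
// their ID is one of the two IDs encoded in the room name.
function canAccessRoom(roomId, userId) {
  const match = /^room_(\d+)_(\d+)$/.exec(roomId);
  if (!match) return false; // malformed or unexpected room name
  return match[1] === String(userId) || match[2] === String(userId);
}

const roomFromA = generateRoomId(42, 7);
const roomFromB = generateRoomId(7, 42);
console.log(roomFromA === roomFromB);     // both participants derive the same room
console.log(canAccessRoom(roomFromA, 7));  // participant
console.log(canAccessRoom(roomFromA, 99)); // outsider
```

A check like this could run before `socket.join` or before accepting a `send_message` event, so clients cannot post into rooms they do not belong to.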
Offline Message Handling
// Runs inside the io.on('connection') callback
socket.on('send_message', async (data) => {
  // Save before emitting so offline recipients can fetch the message later
  const message = await saveMessage(data);
  io.to(data.roomId).emit('new_message', message);
  // Queue a push notification if the recipient is offline
  const online = await checkUserOnline(data.recipientId);
  if (!online) {
    await queuePushNotification(data);
  }
});
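The article does not show how `checkUserOnline` works. One possible approach, sketched here under the assumption of a single server process, is a registry that maps each user to their connected sockets, which also accounts for multi-device use. `PresenceRegistry` and its method names are hypothetical; a multi-server deployment would need a shared store instead of process memory.

```javascript
// Hypothetical in-memory presence registry: userId -> Set of connected socket IDs.
class PresenceRegistry {
  constructor() {
    this.socketsByUser = new Map();
  }

  // Call when an authenticated socket connects.
  connect(userId, socketId) {
    if (!this.socketsByUser.has(userId)) this.socketsByUser.set(userId, new Set());
    this.socketsByUser.get(userId).add(socketId);
  }

  // Call from the socket's 'disconnect' handler.
  disconnect(userId, socketId) {
    const sockets = this.socketsByUser.get(userId);
    if (!sockets) return;
    sockets.delete(socketId);
    // Drop the entry once the user's last device has gone away.
    if (sockets.size === 0) this.socketsByUser.delete(userId);
  }

  // A user is online while at least one of their devices is connected.
  isOnline(userId) {
    return this.socketsByUser.has(userId);
  }
}

const presence = new PresenceRegistry();
presence.connect(7, 'phone-socket');
presence.connect(7, 'laptop-socket');
presence.disconnect(7, 'phone-socket');
const stillOnline = presence.isOnline(7); // the laptop is still connected
presence.disconnect(7, 'laptop-socket');
const nowOffline = !presence.isOnline(7);
console.log(stillOnline, nowOffline);
```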
Challenges & Solutions
Challenge 1: Message Ordering
Problem: Messages handled by different servers could reach clients out of order.
Solution:
- Microsecond-precision timestamps
- Client-side sorting
- Server sequence numbers as a fallback when timestamps tie
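The ordering rules above can be sketched as a comparator plus a sorted insert on the client. The field names `createdAtMicros` and `seq` are assumptions for illustration, as is de-duplication by `id`; the article does not specify the message shape.

```javascript
// Orders messages by timestamp first, then by server sequence number on ties.
// Assumes hypothetical fields: createdAtMicros (integer microseconds since epoch,
// which still fits safely in a JavaScript number) and seq (integer).
function compareMessages(a, b) {
  if (a.createdAtMicros !== b.createdAtMicros) {
    return a.createdAtMicros < b.createdAtMicros ? -1 : 1;
  }
  return a.seq - b.seq;
}

// Returns a new sorted list with `incoming` inserted; duplicates (same id) are ignored.
function insertSorted(messages, incoming) {
  if (messages.some((m) => m.id === incoming.id)) return messages;
  const index = messages.findIndex((m) => compareMessages(incoming, m) < 0);
  const at = index === -1 ? messages.length : index;
  return [...messages.slice(0, at), incoming, ...messages.slice(at)];
}

let timeline = [];
timeline = insertSorted(timeline, { id: 1, createdAtMicros: 1000, seq: 1 });
timeline = insertSorted(timeline, { id: 3, createdAtMicros: 1002, seq: 3 });
// Arrives late and shares a timestamp with message 1; seq breaks the tie.
timeline = insertSorted(timeline, { id: 2, createdAtMicros: 1000, seq: 2 });
// A redelivered copy of message 3 is dropped.
timeline = insertSorted(timeline, { id: 3, createdAtMicros: 1002, seq: 3 });
const orderedIds = timeline.map((m) => m.id);
console.log(orderedIds);
```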
Challenge 2: Connection Storms
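Problem: Every client tries to reconnect at the same moment after a deploy.
Solution:
- Exponential backoff on the client
- Connection rate limiting on the server
- A client-side queue for messages sent while disconnected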
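// Assumes the client was created with { reconnection: false }, so Socket.io's
// built-in reconnection does not run alongside this manual logic
let attempts = 0;
function scheduleReconnect() {
  // Exponential backoff capped at 30 s, with random jitter so clients spread out
  const delay = Math.min(1000 * Math.pow(2, attempts), 30000);
  setTimeout(() => socket.connect(), delay * (0.5 + Math.random() / 2));
  attempts++;
}
socket.on('connect', () => { attempts = 0; }); // reset after a successful connection
socket.on('disconnect', scheduleReconnect);
socket.on('connect_error', scheduleReconnect); // keep retrying if a retry itself fails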
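For the server side of the same problem, connection rate limiting could be done with a token bucket. The article does not say which algorithm was used, so treat this as one possible sketch; `TokenBucket` and its parameters are hypothetical, and the injected clock exists only to make the example deterministic.

```javascript
// Hypothetical token bucket: allows bursts up to `capacity`,
// then a sustained rate of `refillPerSecond` new connections.
class TokenBucket {
  constructor(capacity, refillPerSecond, now = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now; // clock function returning milliseconds
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Returns true if a connection may be accepted right now.
  tryAcquire() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    // Refill in proportion to elapsed time, never above capacity.
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = current;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Simulated clock so the example does not depend on real time.
let fakeTime = 0;
const bucket = new TokenBucket(3, 1, () => fakeTime);
const burst = [1, 2, 3, 4].map(() => bucket.tryAcquire());
fakeTime += 2000; // two seconds later, two tokens have been refilled
const afterWait = [1, 2, 3].map(() => bucket.tryAcquire());
console.log(burst, afterWait);
```

With Socket.io, a check like `bucket.tryAcquire()` could sit in an `io.use((socket, next) => ...)` middleware that calls `next(new Error(...))` when the bucket is empty, leaving the client's backoff to retry later.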
Production Metrics
After 6 months:
- Average latency: 145 ms
- P95 latency: 280 ms
- Uptime: 99.94%
- Peak connections: 8,500
- Messages/day: ~50,000
- Zero message loss
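For readers who want to track similar figures, the average and P95 can be derived from raw latency samples as sketched below. This uses the nearest-rank percentile definition, which is an assumption (the article does not say how its numbers were computed), and the sample values are made up for illustration, not production data.

```javascript
// Arithmetic mean of the samples.
function mean(samples) {
  if (samples.length === 0) throw new Error('no samples');
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Illustrative latencies in milliseconds (not taken from the article).
const latenciesMs = [120, 130, 140, 150, 160, 170, 180, 190, 200, 300];
const averageMs = mean(latenciesMs);
const p95Ms = percentile(latenciesMs, 95);
console.log(averageMs, p95Ms);
```

A single slow outlier barely moves the average but dominates the P95, which is why the two are worth reporting together.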
Key Learnings
- Start simple: run a single server first and add Redis when you need more than one
- Room architecture scales: rooms make both permissions and delivery easy
- Client resilience is critical: robust reconnection is essential
- Monitor everything: connections, latency, errors, and memory
Tools Used
- Socket.io: WebSocket library
- Redis: pub/sub
- MySQL: persistence
- Nginx: load balancing
- Docker and GCP: deployment
Conclusion
Building production chat requires careful architecture, planning for scale, and a focus on reliability. Start simple, monitor everything, and scale as needed.