Building Production-Ready Real-Time Chat

Project Context

At HealthSIA, we built a real-time chat system that supports 5,000+ concurrent users and enables:

•Patient-therapist communication
•Clinical staff coordination
•HIPAA-compliant messaging
•Multi-device support

Architecture Decisions

1. WebSocket vs HTTP Polling

Decision: WebSocket (Socket.io)

Why?

•Bidirectional communication needed
•Lower latency (<200ms requirement)
•Reduced server load
•Built-in reconnection logic

Trade-offs:

•More complex infrastructure
•Stateful connections (harder horizontal scaling)
•Need Redis pub/sub for multi-server

Architecture Diagram

[Figure: Chat Architecture]

2. Message Storage

Decision: MySQL with optimized schema

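CREATE TABLE messages (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  room_id VARCHAR(255) NOT NULL,
  sender_id BIGINT NOT NULL,
  recipient_id BIGINT NOT NULL,  -- required by idx_recipient_seen below
  content TEXT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  seen_at TIMESTAMP NULL,        -- NULL until the recipient reads the message

  -- Page through one room's history in chronological order
  INDEX idx_room_created (room_id, created_at),
  -- Find a recipient's unread messages (seen_at IS NULL)
  INDEX idx_recipient_seen (recipient_id, seen_at)
);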
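
The (room_id, created_at) index is what makes paging through a room's history cheap. As a minimal sketch of how a history query could be built against this schema: the helper name buildHistoryQuery, the page-size cap of 200, and the `?` placeholder style are assumptions for illustration, and the function only returns SQL and parameters without touching a database.

```javascript
// Hypothetical helper: builds a keyset-paginated history query that can
// use the (room_id, created_at) index. Nothing here touches a database.
function buildHistoryQuery(roomId, { before = null, limit = 50 } = {}) {
  // Clamp the page size so a client cannot request an unbounded scan
  const pageSize = Math.max(1, Math.min(Number(limit) || 50, 200));
  const params = [roomId];
  let sql =
    'SELECT id, room_id, sender_id, content, created_at, seen_at ' +
    'FROM messages WHERE room_id = ?';
  if (before !== null) {
    // Keyset pagination: fetch rows strictly older than the last one seen
    sql += ' AND created_at < ?';
    params.push(before);
  }
  sql += ' ORDER BY created_at DESC, id DESC LIMIT ?';
  params.push(pageSize);
  return { sql, params };
}

const firstPage = buildHistoryQuery('room_1_2');
const nextPage = buildHistoryQuery('room_1_2', {
  before: '2024-01-01 00:00:00',
  limit: 1000
});
console.log(firstPage.params, nextPage.params);
```

One caveat with this sketch: rows that share the cursor's timestamp can be skipped at a page boundary, so a stricter version would paginate on the pair (created_at, id).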

3. Scaling Strategy

Redis Pub/Sub for inter-server communication

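// Attach the Redis adapter so events emitted on one server reach
// sockets connected to the others
const io = require('socket.io')(server, {
  adapter: require('socket.io-redis')({
    host: process.env.REDIS_HOST
  })
});

io.on('connection', (socket) => {
  socket.on('send_message', async (data) => {
    // Persist first, then broadcast the stored row to everyone in the room
    const saved = await saveMessage(data);
    io.to(data.roomId).emit('new_message', saved);
  });
});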
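
To make the adapter's role concrete, here is a toy, in-memory simulation of the fan-out pattern. It uses neither Redis nor Socket.io: the Bus and ChatServer classes are hypothetical stand-ins that only illustrate why a message emitted on one server reaches a client connected to another.

```javascript
// Toy stand-in for a pub/sub channel (Redis plays this role in production)
class Bus {
  constructor() { this.subscribers = []; }
  subscribe(fn) { this.subscribers.push(fn); }
  publish(msg) { this.subscribers.forEach(fn => fn(msg)); }
}

// Toy stand-in for one chat server process holding its own local sockets
class ChatServer {
  constructor(name, bus) {
    this.name = name;
    this.bus = bus;
    this.rooms = new Map(); // roomId -> Set of client delivery callbacks
    // Every server listens on the shared bus and delivers to local clients only
    bus.subscribe(({ roomId, event, payload }) => {
      const clients = this.rooms.get(roomId) || new Set();
      clients.forEach(deliver => deliver(event, payload));
    });
  }
  join(roomId, deliver) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, new Set());
    this.rooms.get(roomId).add(deliver);
  }
  // Emitting publishes to the bus instead of writing to local sockets directly
  emitToRoom(roomId, event, payload) {
    this.bus.publish({ roomId, event, payload });
  }
}

const bus = new Bus();
const serverA = new ChatServer('A', bus);
const serverB = new ChatServer('B', bus);
const received = [];
serverB.join('room_1_2', (event, payload) => received.push({ event, payload }));

// The sender is connected to server A, the recipient to server B
serverA.emitToRoom('room_1_2', 'new_message', { content: 'hello' });
console.log(received);
```

Because every server publishes to the shared channel rather than to its own sockets, the calling code does not need to know which server holds the recipient's connection.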

Key Implementation

Room-Based Architecture

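// Consistent room IDs: sorting makes the result independent of argument order
// (the default sort compares IDs as strings, which is still deterministic)
function generateRoomId(userId1, userId2) {
  const ids = [userId1, userId2].sort();
  return `room_${ids[0]}_${ids[1]}`;
}

// Join rooms once the client has authenticated
socket.on('authenticate', async (token) => {
  try {
    const user = await verifyToken(token);
    const rooms = await getUserRooms(user.id);
    rooms.forEach(room => socket.join(room.id));
  } catch (err) {
    // Invalid or expired token: drop the connection instead of leaving it half-open
    socket.disconnect(true);
  }
});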
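
Because the room ID encodes both participants, it can also serve as a cheap permission check. The sketch below repeats generateRoomId so it runs on its own; the isRoomParticipant guard is a hypothetical addition that assumes numeric user IDs, not something taken from the production code.

```javascript
// Copy of generateRoomId from the section above so the sketch runs alone
function generateRoomId(userId1, userId2) {
  const ids = [userId1, userId2].sort();
  return `room_${ids[0]}_${ids[1]}`;
}

// Hypothetical guard: only the two participants encoded in a direct-message
// room ID may join or post to it. Assumes numeric user IDs.
function isRoomParticipant(userId, roomId) {
  const match = /^room_(\d+)_(\d+)$/.exec(roomId);
  if (!match) return false;
  return [match[1], match[2]].includes(String(userId));
}

const roomId = generateRoomId(42, 7);
// Same ID regardless of argument order
console.log(roomId === generateRoomId(7, 42));
console.log(isRoomParticipant(42, roomId));
console.log(isRoomParticipant(99, roomId));
```

A check like this could run before socket.join and before saving a message, so a client cannot post into a room by guessing its ID.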

Offline Message Handling

socket.on('send_message', async (data) => {
  // Persist before delivery so the message survives even if nobody is connected
  const message = await saveMessage(data);
  io.to(data.roomId).emit('new_message', message);
  
  // Queue a push notification if the recipient has no active connection
  const online = await checkUserOnline(data.recipientId);
  if (!online) {
    await queuePushNotification(data);
  }
});
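
The handler relies on checkUserOnline, which has to account for multi-device support: a user is offline only when their last device disconnects. A minimal single-server sketch of that bookkeeping follows; the PresenceTracker class and its method names are hypothetical, and a multi-server deployment would need to keep this state in a shared store rather than in process memory.

```javascript
// Hypothetical in-memory presence registry for a single server process
class PresenceTracker {
  constructor() {
    this.socketsByUser = new Map(); // userId -> Set of socket IDs
  }
  connect(userId, socketId) {
    if (!this.socketsByUser.has(userId)) this.socketsByUser.set(userId, new Set());
    this.socketsByUser.get(userId).add(socketId);
  }
  disconnect(userId, socketId) {
    const sockets = this.socketsByUser.get(userId);
    if (!sockets) return;
    sockets.delete(socketId);
    // Only the last device going away makes the user offline
    if (sockets.size === 0) this.socketsByUser.delete(userId);
  }
  isOnline(userId) {
    return this.socketsByUser.has(userId);
  }
}

const presence = new PresenceTracker();
presence.connect(7, 'phone-socket');
presence.connect(7, 'laptop-socket');
presence.disconnect(7, 'phone-socket');
const stillOnline = presence.isOnline(7);  // laptop is still connected
presence.disconnect(7, 'laptop-socket');
const nowOffline = !presence.isOnline(7);  // no devices left
console.log(stillOnline, nowOffline);
```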

Challenges & Solutions

Challenge 1: Message Ordering

Problem: Messages relayed through different servers could arrive out of order

Solution:

•Microsecond precision timestamps
•Client-side sorting
•Server sequence numbers as fallback
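
The three points above can be sketched as a single client-side comparator. The message shape here (createdAtMicros and seq fields) is an assumption for illustration; the actual field names are not given in this write-up.

```javascript
// Hypothetical message shape: { id, createdAtMicros, seq }
//   createdAtMicros: server timestamp in microseconds since the epoch
//   seq: server-assigned sequence number, used only to break ties
function compareMessages(a, b) {
  if (a.createdAtMicros !== b.createdAtMicros) {
    return a.createdAtMicros < b.createdAtMicros ? -1 : 1;
  }
  // Timestamps tie: fall back to the server sequence number
  return a.seq - b.seq;
}

// Returns a new array so the caller's arrival-order list is left untouched
function sortMessages(messages) {
  return [...messages].sort(compareMessages);
}

// Messages in the order they happened to arrive at the client
const arrived = [
  { id: 'd', createdAtMicros: 1700000000000300, seq: 4 },
  { id: 'c', createdAtMicros: 1700000000000200, seq: 3 },
  { id: 'a', createdAtMicros: 1700000000000100, seq: 1 },
  { id: 'b', createdAtMicros: 1700000000000200, seq: 2 }
];
const orderedIds = sortMessages(arrived).map(m => m.id);
console.log(orderedIds);
```

Microsecond epoch values of this size still fit safely in a JavaScript number, so no BigInt handling is needed in this sketch.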

Challenge 2: Connection Storms

Problem: Every client reconnects at the same time after a deploy

Solution:

•Exponential backoff
•Connection rate limiting
•Client-side queue

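// Assumes the client was created with reconnection: false, so this manual
// logic does not compete with Socket.io's built-in reconnection
let attempts = 0;
function scheduleReconnect() {
  // Exponential backoff capped at 30 s, with random jitter so clients
  // do not all retry at the same instant
  const base = Math.min(1000 * Math.pow(2, attempts), 30000);
  const delay = base / 2 + Math.random() * (base / 2);
  attempts++;
  setTimeout(() => socket.connect(), delay);
}
socket.on('connect', () => { attempts = 0; }); // reset after a successful connect
socket.on('disconnect', scheduleReconnect);
socket.on('connect_error', scheduleReconnect); // keep retrying if an attempt fails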
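
The client-side queue from the list above could look roughly like the following. This is a sketch under assumptions: the OutboundQueue class, its method names, and the size cap are hypothetical, and the transport is passed in as a plain function so the example runs without a socket.

```javascript
// Hypothetical outbound queue: buffers messages while disconnected and
// flushes them in order once the connection is back
class OutboundQueue {
  constructor(send, maxSize = 100) {
    this.send = send;       // function that actually transmits a message
    this.maxSize = maxSize; // cap so a long outage cannot exhaust memory
    this.pending = [];
    this.connected = false;
  }
  // Returns false when the message had to be rejected because the queue is full
  enqueue(message) {
    if (this.connected) {
      this.send(message);
      return true;
    }
    if (this.pending.length >= this.maxSize) return false;
    this.pending.push(message);
    return true;
  }
  setConnected(connected) {
    this.connected = connected;
    if (!connected) return;
    // Flush in FIFO order to preserve the order the user typed in
    while (this.pending.length > 0) this.send(this.pending.shift());
  }
}

const sent = [];
const queue = new OutboundQueue(message => sent.push(message), 2);
const results = [queue.enqueue('a'), queue.enqueue('b'), queue.enqueue('c')];
queue.setConnected(true); // connection restored: 'a' and 'b' are flushed
queue.enqueue('d');       // sent immediately while connected
console.log(results, sent);
```

With Socket.io, setConnected would be wired to the client's connect and disconnect events, and send would wrap socket.emit('send_message', ...).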

Production Metrics

After 6 months:

•Average latency: 145ms
•P95 latency: 280ms
•Uptime: 99.94%
•Peak connections: 8,500
•Messages/day: ~50,000
•Zero message loss
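
Average and P95 answer different questions, which is why both are tracked. The sketch below shows one common way to compute them, the nearest-rank percentile. The sample values are made up for illustration and are not measurements from this project; the function names are likewise hypothetical.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(samples) {
  if (samples.length === 0) throw new Error('no samples');
  return samples.reduce((sum, value) => sum + value, 0) / samples.length;
}

// Illustrative latencies in milliseconds (made-up data with one slow outlier)
const latenciesMs = [120, 130, 140, 150, 160, 110, 125, 135, 145, 300];
const summary = {
  mean: mean(latenciesMs),
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95)
};
console.log(summary);
```

A single slow request barely moves the mean but dominates the P95, so the tail metric is the one that reveals what the slowest users experience.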

Key Learnings

•Start Simple - Single server first, add Redis when needed
•Room Architecture Scales - Easy permissions and delivery
•Client Resilience Critical - Robust reconnection essential
•Monitor Everything - Connections, latency, errors, memory

Tools Used

•Socket.io - WebSocket library
•Redis - Pub/sub
•MySQL - Persistence
•Nginx - Load balancing
•Docker & GCP

Conclusion

Building production chat requires careful architecture, scalability planning, and reliability focus. Start simple, monitor everything, and scale as needed.
