Backend
January 5, 2025

Building Production-Ready Real-Time Chat: Architecture & Trade-offs

Deep dive into architectural decisions, trade-offs, and lessons learned while building a scalable real-time chat system for healthcare.

Sheryar Ahmed

Full-Stack Engineer | Building scalable systems

Project Context

At HealthSIA, we built a real-time chat system for 5,000+ concurrent users enabling:

Architecture Decisions

1. WebSocket vs HTTP Polling

Decision: WebSocket (Socket.io)

Why?

Trade-offs:

Architecture Diagram

[Figure: Chat Architecture]

2. Message Storage

Decision: MySQL, with a schema indexed for the two hot queries: message history per room and unseen messages per recipient

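CREATE TABLE messages (
  id BIGINT PRIMARY KEY AUTO_INCREMENT,
  room_id VARCHAR(255) NOT NULL,
  sender_id BIGINT NOT NULL,
  recipient_id BIGINT NOT NULL,  -- required by idx_recipient_seen below
  content TEXT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  seen_at TIMESTAMP NULL,  -- NULL until the recipient has read the message

  INDEX idx_room_created (room_id, created_at),  -- room history by time
  INDEX idx_recipient_seen (recipient_id, seen_at)  -- unseen messages per recipient
);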
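To illustrate the kind of query idx_room_created is meant to serve, here is a minimal sketch of a paginated history lookup. The helper name buildHistoryQuery, the page-size cap of 200, and the `?` placeholder style (as used by common Node.js MySQL clients) are assumptions for illustration, not part of the original system.

```javascript
// Hypothetical helper: builds a paginated "room history" query.
// Filtering on room_id and ordering by created_at matches idx_room_created.
function buildHistoryQuery(roomId, beforeCreatedAt = null, limit = 50) {
  const params = [roomId];
  let sql =
    'SELECT id, sender_id, content, created_at FROM messages WHERE room_id = ?';

  // Optional cursor: only rows older than the last one the client has seen.
  if (beforeCreatedAt !== null) {
    sql += ' AND created_at < ?';
    params.push(beforeCreatedAt);
  }

  // Clamp the page size so a client cannot request an unbounded result.
  sql += ' ORDER BY created_at DESC LIMIT ?';
  params.push(Math.min(Math.max(1, limit), 200));

  return { sql, params };
}

// Example: the 20 messages in a room older than a given timestamp.
const page = buildHistoryQuery('room_1_2', '2025-01-05 10:00:00', 20);
console.log(page.sql, page.params);
```

Because TIMESTAMP columns have one-second resolution by default, a production version would likely add id as a tie-breaker to the cursor so that messages sharing a timestamp are not skipped.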

3. Scaling Strategy

Decision: Redis Pub/Sub for inter-server communication, so a message emitted on one Socket.io server reaches clients connected to any other

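// `server` is the HTTP server instance created elsewhere
const io = require('socket.io')(server, {
  // Redis adapter: relays emits between Socket.io servers via Pub/Sub
  adapter: require('socket.io-redis')({
    host: process.env.REDIS_HOST
  })
});

// Handlers are registered per socket, inside the connection callback
io.on('connection', (socket) => {
  socket.on('send_message', async (data) => {
    // Persist first so every client receives the stored message
    const saved = await saveMessage(data);
    io.to(data.roomId).emit('new_message', saved);
  });
});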
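The fan-out that a Pub/Sub adapter provides can be modelled without Redis. The sketch below is a simplified in-memory illustration, not the adapter's actual implementation: a Node.js EventEmitter stands in for the Redis channel, and the ChatServer class and its method names are hypothetical.

```javascript
const { EventEmitter } = require('events');

// Stand-in for the Redis Pub/Sub channel shared by all servers.
const bus = new EventEmitter();

// Hypothetical model of one chat server and its locally connected clients.
class ChatServer {
  constructor(name) {
    this.name = name;
    this.rooms = new Map(); // roomId -> list of client inboxes (plain arrays)
    // Every server subscribes to the shared channel.
    bus.on('broadcast', ({ roomId, event, payload }) => {
      this.deliverLocal(roomId, event, payload);
    });
  }

  join(roomId, inbox) {
    if (!this.rooms.has(roomId)) this.rooms.set(roomId, []);
    this.rooms.get(roomId).push(inbox);
  }

  // Deliver only to clients connected to this server.
  deliverLocal(roomId, event, payload) {
    for (const inbox of this.rooms.get(roomId) || []) {
      inbox.push({ event, payload, via: this.name });
    }
  }

  // Publish once; each subscribed server then delivers to its own clients.
  emitToRoom(roomId, event, payload) {
    bus.emit('broadcast', { roomId, event, payload });
  }
}

const serverA = new ChatServer('A');
const serverB = new ChatServer('B');
const alice = []; // connected to server A
const bob = [];   // connected to server B
serverA.join('room_1_2', alice);
serverB.join('room_1_2', bob);

// A message handled by server A still reaches bob through server B.
serverA.emitToRoom('room_1_2', 'new_message', { content: 'hi' });
```

The point of the model is that the sending server never needs to know which server holds the recipient's connection; it publishes once and each server handles its own sockets.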

Key Implementation

Room-Based Architecture

// Consistent room IDs
function generateRoomId(userId1, userId2) {
  const ids = [userId1, userId2].sort();
  return `room_${ids[0]}_${ids[1]}`;
}

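// Join rooms on connect
socket.on('authenticate', async (token) => {
  try {
    const user = await verifyToken(token);
    const rooms = await getUserRooms(user.id);
    rooms.forEach(room => socket.join(room.id));
  } catch (err) {
    // Drop sockets whose token fails verification or whose rooms cannot load
    socket.disconnect(true);
  }
});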
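Two properties of this scheme are worth checking: the room ID is the same whichever user opens the conversation, and the ID itself can back a simple permission check. The sketch below repeats generateRoomId so it is self-contained; canAccessRoom is a hypothetical guard that assumes user IDs contain no underscores.

```javascript
// Same function as above, repeated so this sketch runs on its own.
function generateRoomId(userId1, userId2) {
  const ids = [userId1, userId2].sort();
  return `room_${ids[0]}_${ids[1]}`;
}

// Hypothetical guard: a user may only use a direct room whose ID names them.
function canAccessRoom(userId, roomId) {
  const match = /^room_([^_]+)_([^_]+)$/.exec(roomId);
  if (!match) return false;
  return match[1] === String(userId) || match[2] === String(userId);
}

// Either argument order yields the same room.
console.log(generateRoomId(7, 42) === generateRoomId(42, 7)); // true
console.log(canAccessRoom(7, generateRoomId(7, 42)));         // true
console.log(canAccessRoom(8, generateRoomId(7, 42)));         // false
```

Note that Array.prototype.sort without a comparator orders values as strings, so users 7 and 42 produce room_42_7 rather than room_7_42. The result is still consistent for both participants, which is all the scheme requires.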

Offline Message Handling

socket.on('send_message', async (data) => {
  const message = await saveMessage(data);
  io.to(data.roomId).emit('new_message', message);
  
  // Queue push notification if offline
  const online = await checkUserOnline(data.recipientId);
  if (!online) {
    await queuePushNotification(data);
  }
});
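The handler relies on checkUserOnline, which is not shown. One way such a check could work, sketched here as an assumption rather than the production implementation, is to count open sockets per user. PresenceTracker and its methods are hypothetical names.

```javascript
// Hypothetical in-memory presence tracker. It counts open sockets per user,
// so a user with several tabs stays online until the last one disconnects.
class PresenceTracker {
  constructor() {
    this.socketCounts = new Map(); // userId -> number of open sockets
  }

  connect(userId) {
    this.socketCounts.set(userId, (this.socketCounts.get(userId) || 0) + 1);
  }

  disconnect(userId) {
    const remaining = (this.socketCounts.get(userId) || 0) - 1;
    if (remaining > 0) {
      this.socketCounts.set(userId, remaining);
    } else {
      this.socketCounts.delete(userId);
    }
  }

  isOnline(userId) {
    return this.socketCounts.has(userId);
  }
}

const presence = new PresenceTracker();
presence.connect(42);    // first tab
presence.connect(42);    // second tab
presence.disconnect(42); // one tab closed, user is still online
console.log(presence.isOnline(42)); // true
```

A Map like this only sees connections on its own process. With several Socket.io servers the counts would need to live in a shared store such as Redis, which is why the check in the handler is awaited.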

Challenges & Solutions

Challenge 1: Message Ordering

Problem: Messages emitted by different servers could reach clients out of order

Solution:
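One common way to handle this, sketched here as an assumption rather than the exact production code, is to treat the database-assigned id from the messages table as the source of truth for order and have each client insert incoming messages by id. The function name insertInOrder is hypothetical, and ids are assumed to fit in a JavaScript number.

```javascript
// Hypothetical client-side helper: keeps `messages` sorted by id and
// ignores a message that has already been received.
function insertInOrder(messages, incoming) {
  // Binary search for the first position whose id is >= incoming.id.
  let lo = 0;
  let hi = messages.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (messages[mid].id < incoming.id) lo = mid + 1;
    else hi = mid;
  }

  // Same id already present: duplicate delivery, nothing to do.
  if (lo < messages.length && messages[lo].id === incoming.id) return messages;

  messages.splice(lo, 0, incoming);
  return messages;
}

// Messages arriving out of order (and once twice) end up sorted by id.
const timeline = [];
for (const id of [1, 3, 2, 3]) {
  insertInOrder(timeline, { id, content: `message ${id}` });
}
console.log(timeline.map((m) => m.id));
```

Ordering by id works because every message passes through saveMessage before it is broadcast, so the database has already serialized the writes.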

Challenge 2: Connection Storms

Problem: Every client reconnects at the same moment when servers restart during a deploy

Solution:

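let attempts = 0;
// Assumes the client's built-in reconnection is off ({ reconnection: false })
function scheduleReconnect() {
  // Exponential backoff: 1s, 2s, 4s, ... capped at 30s
  const delay = Math.min(1000 * Math.pow(2, attempts), 30000);
  setTimeout(() => socket.connect(), delay);
  attempts++;
}
socket.on('disconnect', scheduleReconnect);
socket.on('connect_error', scheduleReconnect); // failed retries never emit 'disconnect'
socket.on('connect', () => { attempts = 0; }); // reset after a successful connection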
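A fixed exponential schedule still has every client retry at the same instants after a deploy, so the storm repeats in smaller waves. A common refinement is to add random jitter so retries spread out over the whole window. The sketch below uses the "full jitter" variant; backoffDelay and its option names are hypothetical, and the random source is injectable only to make the function testable.

```javascript
// Hypothetical helper: exponential backoff with full jitter.
// Returns a delay in ms drawn uniformly from [0, min(capMs, baseMs * 2^attempt)).
function backoffDelay(
  attempt,
  { baseMs = 1000, capMs = 30000, random = Math.random } = {}
) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

// Each client picks its own delay, so reconnects no longer line up.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: retry in ${backoffDelay(attempt)} ms`);
}
```

The Socket.io client also ships its own reconnection logic with configurable delay and randomization, so a hand-rolled helper like this is only needed when that built-in behavior is disabled.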

Production Metrics

After 6 months:

Key Learnings

  1. Start Simple - Single server first, add Redis when needed
  2. Room Architecture Scales - Easy permissions and delivery
  3. Client Resilience Critical - Robust reconnection essential
  4. Monitor Everything - Connections, latency, errors, memory

Tools Used

Conclusion

Building production chat requires careful architecture, planning for scale, and a focus on reliability. Start simple, monitor everything, and scale as needed.
