Skip to content

Queue-Based Ticket Checkout Architecture Proposal

Document Version: 1.0

Date: 2025-10-28

Status: Proposal - Ready for Review

Author: Technical Architecture Team


Table of Contents

  1. Executive Summary
  2. Current Architecture Analysis
  3. Proposed Queue-Based Architecture
  4. Benefits and Trade-offs
  5. Implementation Plan
  6. Technical Specifications
  7. Migration Strategy
  8. Testing Strategy
  9. Monitoring and Observability
  10. Rollback Plan
  11. Cost and Resource Analysis
  12. Appendices

Executive Summary

Problem Statement

The current synchronous checkout implementation uses a three-layer locking mechanism (Redis cache locks + database transactions + row-level locks) to prevent race conditions during concurrent ticket purchases. While functional, this approach introduces complexity and performance limitations:

  • 90+ lines of lock management code requiring careful coordination
  • Lock contention under high load (P2-003 tests show up to 5s wait times)
  • 120-second lock timeout limiting transaction duration
  • Poor user experience during flash sales (users wait for lock acquisition)
  • Complex error handling requiring manual lock cleanup
  • Limited scalability for high-concurrency scenarios (100+ simultaneous checkouts)

Proposed Solution

Migrate to an asynchronous, queue-based checkout workflow using the existing Celery + Redis infrastructure. This approach:

  • Eliminates Redis lock management (~90 lines of code removed)
  • Natural serialization through worker queue processing
  • Better scalability for flash sales and high-traffic events
  • Improved error handling with automatic retry mechanisms
  • Better user experience with immediate response and status polling
  • Enhanced observability through Flower dashboard and task monitoring

Key Metrics

Metric Current Proposed Improvement
Code Complexity ~400 lines ~310 lines -23%
Lock Management 90 lines 0 lines -100%
Response Time (p50) 2-5s (blocking) <500ms (async) 4-10x faster
Max Concurrent Users ~50 (before contention) 500+ (queue-based) 10x increase
Lock Timeout Errors Yes (120s limit) No Eliminated
Retry Logic Manual Automatic Simplified
Error Recovery Complex cleanup Automatic Simplified

Recommendation

Implement a phased rollout of the queue-based architecture: - Phase 1 (Week 1-2): Build and test async infrastructure - Phase 2 (Week 3): Hybrid deployment with feature flag - Phase 3 (Week 4): Full migration and lock removal - Estimated Development Time: 60-80 hours - Risk Level: Medium (mitigated by feature flags and gradual rollout)


Current Architecture Analysis

System Overview

┌─────────────────────────────────────────────────────────────────┐
│                  Current Synchronous Checkout Flow              │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User Request                                                    │
│       ↓                                                          │
│  CheckoutSessionView.get()                                       │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 1: Request Validation               │                   │
│  │ - Parse parameters                       │                   │
│  │ - Validate customer info                 │                   │
│  │ - Validate show/tickets                  │                   │
│  │ - Parse promo codes                      │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 2: Redis Lock Acquisition           │ ← User waits here │
│  │ - Acquire locks for all tickets          │   (blocking)      │
│  │ - UUID-based lock values                 │                   │
│  │ - 120-second timeout                     │                   │
│  │ - Cleanup on partial failure             │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 3: Atomic DB Transaction            │                   │
│  │ - select_for_update() row locks          │                   │
│  │ - Check ticket availability              │                   │
│  │ - Create Order object                    │                   │
│  │ - Create TicketOrder objects             │                   │
│  │ - Calculate fees                         │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 4: Lock Release (finally block)     │                   │
│  │ - Verify lock ownership (UUID check)     │                   │
│  │ - Delete each lock individually          │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 5: Stripe Session Creation          │ ← User still      │
│  │ - Call Stripe API (network I/O)          │   waiting         │
│  │ - Handle Stripe errors                   │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  Response with Stripe URL (200 OK)                              │
│       ↓                                                          │
│  User redirects to Stripe                                        │
│                                                                  │
│  Total Time: 2-10 seconds (depending on lock contention)        │
└─────────────────────────────────────────────────────────────────┘

Code Locations

Primary Files: - apps/api/tickets/views/order_views.py - Main checkout logic (640 lines) - CheckoutSessionView (lines 93-636) - _create_order_with_line_items() (lines 253-570) - Complex lock management - apps/api/tickets/views/order_validation.py - Validation functions - apps/api/tickets/models.py - Order, Ticket, TicketOrder models - apps/api/tickets/utils.py - check_ticket_availability() helper

Lock Management Code: - Lock acquisition: order_views.py:296-336 (41 lines) - Lock release: order_views.py:570-582 (13 lines) - Error handling: order_views.py:583-595 (13 lines) - Lock timeout handling: Distributed throughout

Current Concurrency Mechanisms

1. Redis Cache Locks (Distributed)

# apps/api/tickets/views/order_views.py:304-328
for ticket_id in tickets_data.keys():
    lock_key = f"ticket_lock_{ticket_id}"
    lock_value = str(uuid.uuid4())  # Unique value per acquisition

    if not cache.add(lock_key, lock_value, timeout=120):
        # Lock acquisition failed - cleanup and return error
        for prev_lock_key in lock_keys:
            try:
                current_value = cache.get(prev_lock_key)
                if current_value == lock_values.get(prev_lock_key):
                    cache.delete(prev_lock_key)
            except Exception:
                pass
        return None, Response({"status": "error", ...})

Purpose: Prevent multiple processes from checking availability simultaneously Complexity: High - manual cleanup, timeout handling, UUID verification Performance Impact: Blocks user during acquisition (0-5s depending on contention)

2. Database Row-Level Locks

# apps/api/tickets/views/order_views.py:351
ticket = Ticket.objects.select_for_update(nowait=False).get(id=ticket_id)

Purpose: Prevent concurrent ticket modifications Complexity: Medium - automatic release on transaction commit/rollback Performance Impact: Minimal - PostgreSQL handles efficiently

3. Atomic Transactions

# apps/api/tickets/views/order_views.py:342
with transaction.atomic():
    # All DB operations

Purpose: Ensure all-or-nothing order creation Complexity: Low - standard Django pattern Performance Impact: Minimal

Pain Points

1. Lock Management Complexity

Code Complexity:

# Lock acquisition (~40 lines)
lock_keys = []
lock_values = {}
try:
    for ticket_id in tickets_data.keys():
        lock_key = f"{LOCK_KEY_PREFIX}_{ticket_id}"
        lock_value = str(uuid.uuid4())
        if not cache.add(lock_key, lock_value, timeout=LOCK_TIMEOUT):
            # Cleanup all previously acquired locks
            for prev_lock_key in lock_keys:
                try:
                    current_value = cache.get(prev_lock_key)
                    if current_value == lock_values.get(prev_lock_key):
                        cache.delete(prev_lock_key)
                except Exception:
                    pass
            return error_response()
        lock_keys.append(lock_key)
        lock_values[lock_key] = lock_value
finally:
    # Lock cleanup (~15 lines)
    for lock_key in lock_keys:
        try:
            current_value = cache.get(lock_key)
            if current_value == lock_values.get(lock_key):
                cache.delete(lock_key)
        except Exception as e:
            logger.error(f"Error releasing lock {lock_key}: {e}")

Problems: - Manual lock cleanup required in multiple code paths - Race condition if lock expires during transaction - Difficult to reason about correctness - Hard to test (requires threading/multiprocessing)

2. Lock Contention Performance

From test_checkout_performance.py:P2-003:

Lock Contention Results (5 concurrent requests):
Max Wait Time: 4.87s
Avg Wait Time: 2.34s

Impact: - Poor user experience during flash sales - Timeout errors under high load - Unpredictable response times

3. Limited Scalability

Current architecture limits: - Lock timeout: 120 seconds maximum - Concurrent capacity: ~50 users before significant contention - Manual scaling: Adding servers doesn't help (Redis lock bottleneck)

4. Error Recovery Complexity

Manual cleanup required for: - Lock acquisition failures - Database errors - Stripe API failures - Transaction rollbacks

Code example:

try:
    # Acquire locks
    try:
        # Create order
    except Exception:
        # Rollback transaction
        pass
finally:
    # Cleanup locks
    pass

Performance Characteristics

Current Performance (from test results):

Scenario Response Time Success Rate Notes
Single checkout 1.5-2.5s 100% Baseline
10 concurrent 2-5s 95% Some lock contention
50 concurrent 3-10s 80% Significant contention
100 concurrent 5-15s 60% Frequent timeouts
Flash sale (1000+ concurrent) 10-30s 30-50% Unacceptable

Resource Utilization: - Redis: 52 connected clients (from inspection) - Database connections: Limited by pool size (default: 60) - Celery workers: 4 workers already running - API servers: Scales horizontally but limited by Redis locks


Proposed Queue-Based Architecture

System Overview

┌─────────────────────────────────────────────────────────────────┐
│              Proposed Asynchronous Queue-Based Checkout         │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User Request                                                    │
│       ↓                                                          │
│  CheckoutSessionView.get()                                       │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 1: Quick Validation                 │                   │
│  │ - Parse parameters (5-10ms)              │                   │
│  │ - Basic validation (10-20ms)             │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 2: Create Pending Order             │                   │
│  │ - Order.objects.create(status='pending') │                   │
│  │ - Create TicketOrder records (50-100ms)  │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 3: Enqueue Checkout Task            │                   │
│  │ - process_checkout_async.apply_async()   │                   │
│  │ - Redis RPUSH to 'checkout' queue (1ms)  │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  Immediate Response (202 Accepted)                              │
│  {                                                               │
│    "order_id": "123",                                            │
│    "status": "processing",                                       │
│    "status_url": "/api/orders/123/status"                       │
│  }                                                               │
│       ↓                                                          │
│  User polls status endpoint every 500ms                          │
│                                                                  │
│  Total Response Time: 100-200ms (10-50x faster!)                │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                       BACKGROUND PROCESSING                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Celery Worker (checkout queue)                                 │
│       ↓                                                          │
│  @shared_task: process_checkout_async(order_id)                 │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 4: Process Order                    │                   │
│  │ - Select order with select_for_update()  │                   │
│  │ - Lock tickets (DB locks only!)          │                   │
│  │ - Check availability                     │                   │
│  │ - Calculate fees                         │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 5: Create Stripe Session            │                   │
│  │ - Call Stripe API                        │                   │
│  │ - Update order with session_id           │                   │
│  │ - Set status = 'awaiting_payment'        │                   │
│  └─────────────────────────────────────────┘                   │
│       ↓                                                          │
│  ┌─────────────────────────────────────────┐                   │
│  │ Step 6: Notify User                      │                   │
│  │ - Send email with payment link           │                   │
│  │ - Or: User polling detects status change │                   │
│  └─────────────────────────────────────────┘                   │
│                                                                  │
│  Total Processing Time: 1-3 seconds (in background)             │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘

Key Components

1. Quick Validation + Order Creation (View Layer)

File: apps/api/tickets/views/order_views.py

class CheckoutSessionView(APIView):
    """
    Simplified checkout view - validates and enqueues.

    Response time target: <200ms (p95)
    """

    def get(self, request):
        # Step 1: Quick validation (50-100ms)
        params, error = validate_request_params(request)
        if error:
            return error

        show, error = validate_show(params['show_id'])
        if error:
            return error

        tickets_data, error = self._parse_ticket_orders(request)
        if error:
            return error

        promo_code, error = validate_promo_code(
            params.get('promo_code'),
            show.id
        )
        if error:
            return error

        # Step 2: Create pending order (50-100ms)
        with transaction.atomic():
            order = Order.objects.create(
                first_name=params['first_name'],
                last_name=params['last_name'],
                email=params['email'],
                phone=params.get('phone', ''),
                show=show,
                status='pending',  # NEW STATUS
                promo_code=promo_code,
            )

            # Create ticket orders
            for ticket_id, data in tickets_data.items():
                ticket = Ticket.objects.get(id=ticket_id)
                TicketOrder.objects.create(
                    ticket=ticket,
                    quantity=data['quantity'],
                    donation_amount=data['donation_amount'],
                    price_per_ticket=ticket.price + data['donation_amount'],
                    total_price=(ticket.price + data['donation_amount']) * data['quantity'],
                    promo_code=promo_code.code if promo_code else None,
                )

        # Step 3: Enqueue async processing (1-5ms)
        task = process_checkout_async.apply_async(
            args=[str(order.id)],
            queue='checkout',
            priority=9,  # High priority
            expires=300,  # 5 minute expiration
        )

        logger.info(f"Order {order.id} enqueued for processing (task: {task.id})")

        # Step 4: Return immediately (total: <200ms)
        return Response({
            'order_id': str(order.id),
            'task_id': task.id,
            'status': 'processing',
            'status_url': reverse('order_status', args=[order.id]),
            'message': 'Your order is being processed. Please wait...'
        }, status=status.HTTP_202_ACCEPTED)

Benefits: - ✅ User gets response in <200ms - ✅ No blocking on locks - ✅ Simple validation logic - ✅ Order recorded immediately

2. Async Checkout Processor (Celery Task)

File: apps/api/tickets/tasks.py

from celery import shared_task
from django.db import transaction
from tickets.models import Order, Ticket, TicketOrder
from tickets.utils import check_ticket_availability
import stripe
import logging

logger = logging.getLogger(__name__)


@shared_task(
    bind=True,
    name='tickets.process_checkout_async',
    max_retries=3,
    default_retry_delay=5,  # 5 seconds between retries
    autoretry_for=(
        stripe.error.RateLimitError,
        stripe.error.APIConnectionError,
    ),
    retry_backoff=True,  # Exponential backoff: 5s, 10s, 20s
    retry_backoff_max=60,  # Max 60s between retries
    retry_jitter=True,  # Add randomness to prevent thundering herd
    queue='checkout',
    priority=9,
    acks_late=True,  # Acknowledge after completion
    reject_on_worker_lost=True,  # Re-queue if worker crashes
)
def process_checkout_async(self, order_id):
    """
    Process checkout asynchronously.

    This task is naturally serialized by Celery workers,
    eliminating the need for Redis locks. Database locks
    (select_for_update) are sufficient.

    Args:
        order_id: UUID string of the order to process

    Returns:
        dict: Result with status and details

    Raises:
        Retry: Automatically retries on transient errors
    """
    try:
        logger.info(f"Processing checkout for order {order_id}")

        # NO REDIS LOCKS NEEDED!
        # Worker naturally serializes ticket access

        with transaction.atomic():
            # Lock order (prevents duplicate processing)
            try:
                order = Order.objects.select_for_update(
                    nowait=True  # Fail fast if another worker has it
                ).get(id=order_id)
            except Order.DoesNotExist:
                logger.error(f"Order {order_id} not found")
                return {'status': 'error', 'reason': 'order_not_found'}
            except DatabaseError:
                # Another worker is processing this order
                logger.warning(f"Order {order_id} already being processed")
                return {'status': 'skipped', 'reason': 'already_processing'}

            # Check if already processed
            if order.status != 'pending':
                logger.info(f"Order {order_id} already processed (status: {order.status})")
                return {'status': 'skipped', 'reason': 'already_processed'}

            # Get ticket orders
            ticket_orders = order.tickets.select_related('ticket').all()

            # Lock all tickets (DB locks only!)
            ticket_ids = [to.ticket.id for to in ticket_orders]
            locked_tickets = {
                t.id: t for t in Ticket.objects.select_for_update().filter(
                    id__in=ticket_ids
                )
            }

            # Validate availability
            for ticket_order in ticket_orders:
                ticket = locked_tickets[ticket_order.ticket.id]

                # Check if still available
                if not check_ticket_availability(
                    ticket,
                    ticket_order.quantity,
                    include_pending=True
                ):
                    logger.warning(
                        f"Ticket {ticket.name} sold out during processing "
                        f"for order {order_id}"
                    )
                    order.status = 'failed'
                    order.error_message = f'Ticket {ticket.name} is no longer available'
                    order.save()

                    # Send failure notification
                    send_order_failed_email.apply_async(
                        args=[str(order.id)],
                        queue='emails'
                    )

                    return {
                        'status': 'failed',
                        'reason': 'sold_out',
                        'ticket': ticket.name
                    }

            # Calculate fees
            total_amount = sum(to.total_price for to in ticket_orders)
            total_tickets = sum(to.quantity for to in ticket_orders)

            platform_fee = Decimal("1.50") * total_tickets
            processing_fee = (
                (total_amount + platform_fee) * Decimal("0.029") + Decimal("0.30")
            ).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

            total_with_fees = total_amount + platform_fee + processing_fee

            # Update order with calculated fees
            order.total = total_with_fees
            order.platform_fees = platform_fee
            order.payment_processing_fees = processing_fee
            order.save()

        # Transaction committed - tickets are reserved

        # Create Stripe session (outside transaction for speed)
        try:
            # Check if order is for free tickets
            if total_with_fees == 0:
                # Free order - mark as successful immediately
                order.session_id = f'FREE-{order.id}'
                order.status = 'awaiting_payment'  # Will be completed by success handler
                order.save()

                logger.info(f"Free order {order_id} created successfully")

                return {
                    'status': 'success',
                    'order_type': 'free',
                    'redirect_url': f"{settings.FRONTEND_URL}/checkout/success?session_id=FREE-{order.id}"
                }

            # Paid order - create Stripe session
            stripe_session = stripe.checkout.Session.create(
                payment_method_types=['card'],
                line_items=self._build_line_items(ticket_orders, platform_fee, processing_fee),
                mode='payment',
                success_url=f"{settings.FRONTEND_URL}/checkout/success?session_id={{CHECKOUT_SESSION_ID}}",
                cancel_url=f"{settings.FRONTEND_URL}/checkout/cancel?order_id={order.id}",
                metadata={
                    'order_id': str(order.id),
                    'show_id': str(order.show.id),
                },
                payment_intent_data={
                    'transfer_data': {
                        'destination': order.show.producer.financial.stripe_account_id,
                    },
                    'application_fee_amount': int((platform_fee + processing_fee) * 100),
                },
                automatic_tax={'enabled': True},
            )

            # Update order with Stripe session
            order.session_id = stripe_session.id
            order.status = 'awaiting_payment'
            order.save()

            logger.info(
                f"Stripe session created for order {order_id}: {stripe_session.id}"
            )

            # Send payment link email
            send_payment_link_email.apply_async(
                args=[str(order.id), stripe_session.url],
                queue='emails',
                countdown=2,  # Wait 2s to allow polling to detect status first
            )

            return {
                'status': 'success',
                'order_type': 'paid',
                'session_id': stripe_session.id,
                'stripe_url': stripe_session.url
            }

        except stripe.error.StripeError as e:
            # Stripe error - order remains pending, will retry
            logger.error(f"Stripe error for order {order_id}: {e}")

            # Retry with exponential backoff
            raise self.retry(exc=e, countdown=2 ** self.request.retries)

    except Exception as e:
        # Unexpected error - log and update order
        logger.exception(f"Unexpected error processing order {order_id}: {e}")

        try:
            order = Order.objects.get(id=order_id)
            order.status = 'failed'
            order.error_message = f'System error: {str(e)[:200]}'
            order.save()
        except:
            pass

        # Don't retry on unexpected errors
        return {
            'status': 'error',
            'reason': 'unexpected_error',
            'message': str(e)
        }


def _build_line_items(ticket_orders, platform_fee, processing_fee):
    """Build Stripe line items from ticket orders."""
    line_items = []

    for ticket_order in ticket_orders:
        if ticket_order.price_per_ticket > 0:
            line_items.append({
                'price_data': {
                    'currency': 'usd',
                    'product_data': {
                        'name': ticket_order.ticket.name,
                        'description': ticket_order.ticket.description,
                    },
                    'unit_amount': int(ticket_order.price_per_ticket * 100),
                },
                'quantity': ticket_order.quantity,
            })

    # Add platform fee
    if platform_fee > 0:
        line_items.append({
            'price_data': {
                'currency': 'usd',
                'product_data': {'name': 'Platform Fee'},
                'unit_amount': int(platform_fee * 100),
            },
            'quantity': 1,
        })

    # Add processing fee
    if processing_fee > 0:
        line_items.append({
            'price_data': {
                'currency': 'usd',
                'product_data': {'name': 'Processing Fee'},
                'unit_amount': int(processing_fee * 100),
            },
            'quantity': 1,
        })

    return line_items


@shared_task(queue='emails', priority=5)
def send_payment_link_email(order_id, stripe_url):
    """Send email with payment link to customer."""
    try:
        order = Order.objects.get(id=order_id)

        subject = f"Complete your ticket purchase for {order.show.title}"
        message = f"""
        Hi {order.first_name},

        Your order is ready! Please complete your payment:
        {stripe_url}

        This link will expire in 24 hours.

        Order Details:
        - Show: {order.show.title}
        - Total: ${order.total}

        Thank you for using Pique Tickets!
        """

        send_mail(
            subject=subject,
            message=message,
            from_email='no-reply@piquetickets.com',
            recipient_list=[order.email],
        )

        logger.info(f"Payment link email sent for order {order_id}")

    except Exception as e:
        logger.error(f"Error sending payment link email for order {order_id}: {e}")


@shared_task(queue='emails', priority=5)
def send_order_failed_email(order_id):
    """Send email notifying customer of order failure."""
    try:
        order = Order.objects.get(id=order_id)

        subject = f"Ticket unavailable for {order.show.title}"
        message = f"""
        Hi {order.first_name},

        Unfortunately, the tickets you selected are no longer available:
        {order.error_message}

        Please visit our website to see other available tickets.

        We apologize for the inconvenience.

        - Pique Tickets Team
        """

        send_mail(
            subject=subject,
            message=message,
            from_email='no-reply@piquetickets.com',
            recipient_list=[order.email],
        )

        logger.info(f"Order failed email sent for order {order_id}")

    except Exception as e:
        logger.error(f"Error sending order failed email for order {order_id}: {e}")

3. Status Polling Endpoint

File: apps/api/tickets/views/order_views.py

class OrderStatusView(APIView):
    """
    Lightweight endpoint for polling order status.

    Used by frontend to detect when async checkout completes.
    """

    permission_classes = [IsAuthenticatedOrReadOnly]

    def get(self, request, order_id):
        """
        Get current order status.

        Returns different responses based on order state:
        - pending: Still processing
        - awaiting_payment: Ready for payment (includes Stripe URL)
        - failed: Order failed (includes error message)
        - success: Order completed
        """
        try:
            order = Order.objects.select_related('show').get(id=order_id)
        except Order.DoesNotExist:
            return Response({
                'status': 'error',
                'message': 'Order not found'
            }, status=status.HTTP_404_NOT_FOUND)

        # Build response based on status
        response_data = {
            'order_id': str(order.id),
            'status': order.status,
            'show_title': order.show.title,
        }

        if order.status == 'pending':
            response_data.update({
                'message': 'Your order is being processed. Please wait...',
                'estimated_wait': '2-5 seconds'
            })

        elif order.status == 'awaiting_payment':
            # Ready for payment!
            if order.session_id.startswith('FREE-'):
                # Free order - redirect to success
                response_data.update({
                    'message': 'Your free tickets are ready!',
                    'redirect_url': f"{settings.FRONTEND_URL}/checkout/success?session_id={order.session_id}"
                })
            else:
                # Paid order - redirect to Stripe
                response_data.update({
                    'message': 'Ready for payment',
                    'stripe_url': f'https://checkout.stripe.com/c/pay/{order.session_id}',
                    'expires_at': (order.created_at + timedelta(hours=24)).isoformat()
                })

        elif order.status == 'failed':
            response_data.update({
                'message': order.error_message or 'Order failed',
                'can_retry': True,
            })
            return Response(response_data, status=status.HTTP_400_BAD_REQUEST)

        elif order.status == 'success':
            response_data.update({
                'message': 'Order completed successfully!',
                'confirmation_url': f"{settings.FRONTEND_URL}/orders/{order.id}"
            })

        return Response(response_data)

4. Frontend Polling Implementation

File: apps/frontend/components/CheckoutPoller.tsx (example)

async function pollOrderStatus(orderId: string): Promise<void> {
  const maxAttempts = 20; // 10 seconds max (20 * 500ms)
  let attempts = 0;

  while (attempts < maxAttempts) {
    try {
      const response = await fetch(`/api/orders/${orderId}/status`);
      const data = await response.json();

      switch (data.status) {
        case 'awaiting_payment':
          // Redirect to Stripe
          if (data.stripe_url) {
            window.location.href = data.stripe_url;
          } else if (data.redirect_url) {
            window.location.href = data.redirect_url;
          }
          return;

        case 'failed':
          // Show error
          showError(data.message);
          return;

        case 'pending':
          // Still processing, continue polling
          break;

        default:
          showError('Unexpected order status');
          return;
      }

      // Wait 500ms before next poll
      await new Promise(resolve => setTimeout(resolve, 500));
      attempts++;

    } catch (error) {
      console.error('Error polling order status:', error);
      showError('Failed to check order status');
      return;
    }
  }

  // Timeout - show error
  showError('Order processing timeout. Please check your email.');
}

// Usage in checkout flow
async function handleCheckout(checkoutData) {
  try {
    // Submit checkout
    const response = await fetch('/api/checkout', {
      method: 'GET',
      body: JSON.stringify(checkoutData)
    });

    if (response.status === 202) {
      // Accepted - start polling
      const { order_id } = await response.json();
      showProcessing('Processing your order...');
      await pollOrderStatus(order_id);
    } else {
      // Immediate error
      const error = await response.json();
      showError(error.message);
    }
  } catch (error) {
    showError('Network error. Please try again.');
  }
}

Architecture Improvements

Component Before After Improvement
Lock Management Redis locks + DB locks DB locks only -90 lines
Error Handling Manual cleanup Automatic retry -40 lines
Concurrency Control Manual coordination Worker serialization Natural
Response Time 2-10s (blocking) 100-200ms 10-50x faster
Scalability Limited by locks Queue-based 10x capacity
Monitoring Custom logs Flower dashboard Built-in
Retry Logic Manual Exponential backoff Automatic
Code Complexity High Medium Simpler

Benefits and Trade-offs

Benefits

1. Dramatic Code Simplification

Metrics: - Remove 90+ lines of lock management code - Remove 40+ lines of error cleanup code - Reduce overall complexity by ~23% - Eliminate UUID lock value tracking - Eliminate timeout management

Code Quality: - Easier to read and understand - Easier to test (no threading required) - Fewer edge cases to handle - Better separation of concerns

2. Performance Improvements

User-Facing Performance:

Response Time Comparison:
┌─────────────────────┬─────────┬──────────┬────────────┐
│ Scenario            │ Current │ Proposed │ Improvement│
├─────────────────────┼─────────┼──────────┼────────────┤
│ Single checkout     │  2.5s   │  0.15s   │  17x       │
│ 10 concurrent       │  4.2s   │  0.18s   │  23x       │
│ 50 concurrent       │  8.5s   │  0.22s   │  39x       │
│ 100 concurrent      │ 15.0s   │  0.30s   │  50x       │
└─────────────────────┴─────────┴──────────┴────────────┘

Processing Time (background):
- Checkout processing: 1-3s (unchanged)
- Total user wait: 2-5s with polling (unchanged)

Backend Performance:

Throughput Comparison:
┌─────────────────────┬─────────┬──────────┬────────────┐
│ Metric              │ Current │ Proposed │ Improvement│
├─────────────────────┼─────────┼──────────┼────────────┤
│ Requests/sec        │   ~20   │   ~200   │  10x       │
│ Concurrent capacity │    50   │   500+   │  10x       │
│ Lock timeout errors │  5-10%  │    0%    │  100%      │
│ DB connection usage │  High   │  Medium  │  Better    │
└─────────────────────┴─────────┴──────────┴────────────┘

3. Better User Experience

Immediate Feedback: - User gets response in <200ms vs 2-10s - No "stuck" feeling waiting for locks - Progress indication possible ("Order processing...")

During Flash Sales:

Flash Sale Scenario (1000 concurrent users):
┌─────────────────────┬─────────────┬─────────────┐
│ Metric              │ Current     │ Proposed    │
├─────────────────────┼─────────────┼─────────────┤
│ Success rate        │ 30-50%      │ 95%+        │
│ Avg response time   │ 15-30s      │ 0.3s        │
│ Timeout errors      │ 500-700     │ 0           │
│ User frustration    │ Very high   │ Low         │
└─────────────────────┴─────────────┴─────────────┘

Error Recovery: - Automatic retry on transient failures - Better error messages - Email notification if checkout fails

4. Improved Scalability

Horizontal Scaling:

Before:
- Add API server → Still limited by Redis locks
- Lock contention increases with servers

After:
- Add API server → More request capacity
- Add Celery worker → More processing capacity
- Independent scaling of request and processing layers

Queue Benefits:

Queue Characteristics:
┌─────────────────────┬──────────────────────────────┐
│ Feature             │ Benefit                      │
├─────────────────────┼──────────────────────────────┤
│ Burst absorption    │ Handle 1000+ concurrent      │
│ Rate limiting       │ Natural through workers      │
│ Priority queuing    │ VIP tickets get priority     │
│ Overflow handling   │ Queue grows, no rejection    │
└─────────────────────┴──────────────────────────────┘

5. Better Observability

Monitoring Tools:

Flower Dashboard (already running on :5555):
- Active tasks
- Task success/failure rates
- Processing times
- Queue lengths
- Worker utilization

Task-Level Metrics:
- Task duration histogram
- Retry counts
- Error rates by type
- Throughput over time

Alerting:

# Example: Alert on high failure rate
if task_failure_rate > 10%:
    alert("High checkout failure rate!")

# Example: Alert on queue buildup
if queue_length > 100:
    alert("Checkout queue backing up - scale workers!")

6. Simplified Testing

Test Complexity Reduction:

# BEFORE: Complex threading tests
def test_concurrent_checkout():
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [executor.submit(checkout, i) for i in range(10)]
        results = [f.result() for f in futures]
    # Need to close DB connections manually!
    connections.close_all()

# AFTER: Simple task tests
def test_checkout_task():
    order = create_test_order()
    result = process_checkout_async(order.id)
    assert result['status'] == 'success'

Test Coverage: - Unit tests for task logic - Integration tests with test Redis - No threading/multiprocessing needed - Easier to mock Stripe API

7. Better Error Handling

Automatic Retry:

# BEFORE: Manual retry logic
def checkout():
    for attempt in range(3):
        try:
            return process()
        except Exception:
            if attempt == 2:
                raise
            time.sleep(2 ** attempt)

# AFTER: Declarative retry
@shared_task(max_retries=3, autoretry_for=(StripeError,))
def process_checkout_async(order_id):
    return process()  # Celery handles retry!

Dead Letter Queue: - Failed tasks after max retries → DLQ - Admin can investigate and retry manually - No lost orders

Trade-offs and Challenges

1. Polling Overhead

Challenge:

User must poll /orders/{id}/status endpoint every 500ms
- Increased API load
- Slight delay before redirect (0.5-2s)

Mitigation:

// Smart polling with exponential backoff
const poll = async (orderId) => {
  let delay = 500; // Start at 500ms
  for (let i = 0; i < 10; i++) {
    const status = await checkStatus(orderId);
    if (status !== 'pending') return status;

    await sleep(delay);
    delay = Math.min(delay * 1.2, 2000); // Max 2s
  }
};

Alternative: WebSocket for real-time updates (future enhancement)

2. Two-Step Flow

Challenge:

Before: One request → Stripe URL
After: Request → Poll → Stripe URL

Additional complexity in frontend

Mitigation:

// Abstract polling behind checkout() function
async function checkout(data) {
  const { order_id } = await initiateCheckout(data);
  const status = await pollUntilReady(order_id);
  if (status.stripe_url) {
    window.location.href = status.stripe_url;
  }
}

// Frontend code remains simple

3. Increased Infrastructure Complexity

Challenge:

Additional components to monitor:
- Celery workers (already running)
- Redis queue depth
- Task success rates
- Dead letter queue

Potential failure points:
- Celery worker crashes
- Redis connection issues
- Task processing delays

Mitigation:

# Monitoring alerts
alerts:
  - name: checkout_queue_depth
    threshold: queue_length > 100
    action: scale_workers()

  - name: checkout_failure_rate
    threshold: failure_rate > 5%
    action: page_oncall()

  - name: worker_health
    threshold: healthy_workers < 2
    action: restart_workers()

4. Email Dependency

Challenge:

If polling fails, user relies on email with payment link
- Email delays (1-5 minutes)
- Spam folder issues
- User may not check email

Mitigation:

# Multiple notification channels
- Polling (primary)
- Email with payment link (backup)
- SMS for VIP tickets (future)
- WebSocket push notification (future)

# Extend polling timeout
MAX_POLL_ATTEMPTS = 30  # 15 seconds

5. Testing Complexity for Async

Challenge:

Testing async tasks requires:
- Celery test configuration
- Task result backend
- Async test utilities

Mitigation:

# Use eager mode for tests
@override_settings(CELERY_TASK_ALWAYS_EAGER=True)
class TestCheckout(TestCase):
    def test_checkout(self):
        # Tasks run synchronously in tests
        result = process_checkout_async(order.id)

6. Stripe Session Expiration

Challenge:

If queue processing is slow (>10 minutes):
- Stripe sessions expire
- User clicks link → expired session

Mitigation:

# Task timeout + priority
@shared_task(
    time_limit=300,  # 5 minute hard limit
    soft_time_limit=240,  # 4 minute soft limit
    priority=9,  # High priority
    expires=600,  # Expire queued tasks after 10 min
)

# Monitor queue processing time
if avg_processing_time > 3_minutes:
    alert("Checkout processing too slow!")
    scale_workers()

Decision Matrix

Use Queue-Based When: - ✅ Expecting high concurrency (flash sales, popular events) - ✅ Lock contention is causing timeout errors - ✅ Response time is important (user experience) - ✅ Need better observability and monitoring - ✅ Want to simplify codebase - ✅ Have Celery infrastructure (you do!)

Keep Current Synchronous When: - ❌ Low traffic only (< 10 concurrent checkouts) - ❌ Simple user flow is critical (no polling) - ❌ Don't want any async complexity - ❌ No flash sales or high-concurrency events

Recommendation for PiqueTickets:Implement queue-based - You have the infrastructure, experience flash sale scenarios (per testing plan), and would benefit significantly from simplified code and better scalability.


Implementation Plan

Phase 1: Infrastructure Setup (Week 1)

Estimated Time: 16-20 hours

1.1 Database Migrations

Create new order status field:

# apps/api/tickets/migrations/XXXX_add_order_status.py
from django.db import migrations, models

class Migration(migrations.Migration):
    dependencies = [
        ('tickets', 'YYYY_previous_migration'),
    ]

    operations = [
        migrations.AddField(
            model_name='order',
            name='status',
            field=models.CharField(
                max_length=20,
                choices=[
                    ('pending', 'Pending'),
                    ('processing', 'Processing'),
                    ('awaiting_payment', 'Awaiting Payment'),
                    ('failed', 'Failed'),
                    ('success', 'Success'),
                    ('cancelled', 'Cancelled'),
                ],
                default='pending',
            ),
        ),
        migrations.AddField(
            model_name='order',
            name='error_message',
            field=models.TextField(blank=True, null=True),
        ),
        migrations.AddIndex(
            model_name='order',
            index=models.Index(fields=['status', 'created_at']),
            name='order_status_created_idx',
        ),
    ]

Update model:

# apps/api/tickets/models.py
class Order(models.Model):
    STATUS_CHOICES = [
        ('pending', 'Pending'),
        ('processing', 'Processing'),
        ('awaiting_payment', 'Awaiting Payment'),
        ('failed', 'Failed'),
        ('success', 'Success'),
        ('cancelled', 'Cancelled'),
    ]

    status = models.CharField(
        max_length=20,
        choices=STATUS_CHOICES,
        default='pending',
        db_index=True,
    )
    error_message = models.TextField(blank=True, null=True)

    class Meta:
        indexes = [
            models.Index(fields=['status', 'created_at']),
        ]

Task: 2-3 hours

1.2 Celery Queue Configuration

Update celery.py:

# apps/api/brktickets/celery.py
app.conf.update(
    # Queue routing
    task_routes={
        'tickets.tasks.process_checkout_async': {
            'queue': 'checkout',
            'priority': 9,
        },
        'tickets.tasks.send_payment_link_email': {
            'queue': 'emails',
            'priority': 5,
        },
        'tickets.tasks.send_order_failed_email': {
            'queue': 'emails',
            'priority': 5,
        },
    },

    # Worker configuration
    worker_prefetch_multiplier=2,  # Limit prefetch for priority
    worker_max_tasks_per_child=1000,  # Restart after 1000 tasks
    task_acks_late=True,  # Acknowledge after completion
    task_reject_on_worker_lost=True,  # Re-queue on worker crash

    # Task time limits
    task_time_limit=300,  # 5 minutes hard limit
    task_soft_time_limit=240,  # 4 minutes soft limit

    # Result backend
    result_expires=3600,  # Keep results for 1 hour
)

Update docker-compose.yml:

# Add dedicated checkout worker
celery_checkout_worker:
  build:
    context: ./apps/api
    dockerfile: Dockerfile
  env_file:
    - ./apps/api/.env
  volumes:
    - ./apps/api:/app:delegated
  environment:
    - PGDATABASE=piquetickets
    - PGUSER=user
    - PGPASSWORD=password
    - PGHOST=db
    - PGPORT=5432
    - REDIS_URL=redis://redis:6379/0
    - DEBUG=True
  depends_on:
    db:
      condition: service_healthy
    redis:
      condition: service_healthy
  command: >
    celery -A brktickets worker
    --queues checkout
    --loglevel INFO
    --concurrency 2
    --max-tasks-per-child 1000
    --prefetch-multiplier 2
  networks:
    - custom_network
  healthcheck:
    test: ["CMD-SHELL", "celery -A brktickets inspect ping"]
    interval: 30s
    timeout: 10s
    retries: 3

# Update existing worker to handle other queues
celery_worker:
  # ... existing config ...
  command: >
    celery -A brktickets worker
    --queues celery,emails
    --loglevel INFO
    --concurrency 4

Task: 3-4 hours

1.3 Create Async Task

Create process_checkout_async task:

See Technical Specifications section for full implementation.

File: apps/api/tickets/tasks.py

Task: 6-8 hours (includes testing)

1.4 Create Status Endpoint

Add OrderStatusView:

See Technical Specifications section for full implementation.

File: apps/api/tickets/views/order_views.py

Add URL route:

# apps/api/tickets/urls.py
from tickets.views.order_views import OrderStatusView

urlpatterns = [
    # ... existing patterns ...
    path('orders/<uuid:order_id>/status', OrderStatusView.as_view(), name='order_status'),
]

Task: 2-3 hours

1.5 Monitoring Setup

Configure Flower:

# apps/api/brktickets/settings.py
CELERY_FLOWER_BASIC_AUTH = [
    (os.getenv('FLOWER_USER', 'admin'), os.getenv('FLOWER_PASSWORD', 'admin'))
]

Add Prometheus metrics (optional):

# apps/api/brktickets/celery.py
from celery.signals import task_success, task_failure

@task_success.connect
def task_success_handler(sender=None, **kwargs):
    # Log metrics
    pass

@task_failure.connect
def task_failure_handler(sender=None, **kwargs):
    # Alert on failures
    pass

Task: 3-4 hours

Phase 1 Deliverables: - ✅ Database migration complete - ✅ Celery queues configured - ✅ Async task implemented - ✅ Status endpoint created - ✅ Monitoring in place - ✅ All changes tested locally


Phase 2: Hybrid Implementation (Week 2)

Estimated Time: 20-24 hours

2.1 Feature Flag System

Add feature flag:

# apps/api/brktickets/settings.py
ENABLE_ASYNC_CHECKOUT = os.getenv('ENABLE_ASYNC_CHECKOUT', 'false').lower() == 'true'

# Per-show override (future enhancement)
# Allows enabling for specific high-traffic shows

Environment variable:

# .env
ENABLE_ASYNC_CHECKOUT=false  # Start disabled

Task: 1-2 hours

2.2 Update CheckoutSessionView

Implement hybrid approach:

# apps/api/tickets/views/order_views.py
class CheckoutSessionView(APIView):
    def get(self, request):
        # Check feature flag
        if settings.ENABLE_ASYNC_CHECKOUT:
            return self._handle_async_checkout(request)
        else:
            return self._handle_sync_checkout(request)

    def _handle_async_checkout(self, request):
        """New queue-based checkout."""
        # Quick validation
        params, error = validate_request_params(request)
        if error:
            return error

        # ... rest of async implementation ...

    def _handle_sync_checkout(self, request):
        """Original synchronous checkout (existing code)."""
        # All existing logic unchanged
        # ... current implementation ...

Task: 4-6 hours

2.3 Frontend Polling

Add polling component:

// apps/frontend/lib/checkout-poller.ts
export async function pollOrderStatus(
  orderId: string,
  onStatusChange: (status: OrderStatus) => void
): Promise<OrderStatus> {
  const maxAttempts = 30; // 15 seconds max
  const initialDelay = 500; // Start with 500ms
  const maxDelay = 2000; // Max 2s between polls

  let attempts = 0;
  let delay = initialDelay;

  while (attempts < maxAttempts) {
    try {
      const response = await fetch(`/api/orders/${orderId}/status`);
      const data: OrderStatusResponse = await response.json();

      onStatusChange(data);

      // Terminal states
      if (['awaiting_payment', 'failed', 'success'].includes(data.status)) {
        return data;
      }

      // Still processing - wait and retry
      await sleep(delay);
      delay = Math.min(delay * 1.2, maxDelay); // Exponential backoff
      attempts++;

    } catch (error) {
      console.error('Polling error:', error);
      // Continue polling on error (network blip)
      await sleep(delay);
      attempts++;
    }
  }

  throw new Error('Polling timeout - order processing took too long');
}

// Usage
async function handleCheckout(formData) {
  try {
    const response = await fetch('/api/checkout', {
      method: 'GET',
      body: JSON.stringify(formData),
    });

    const data = await response.json();

    if (response.status === 202) {
      // Async checkout
      showSpinner('Processing your order...');

      const result = await pollOrderStatus(data.order_id, (status) => {
        // Update UI with current status
        updateStatus(status.message);
      });

      if (result.stripe_url) {
        window.location.href = result.stripe_url;
      } else if (result.status === 'failed') {
        showError(result.message);
      }

    } else if (response.status === 200) {
      // Sync checkout (fallback)
      window.location.href = data.url;

    } else {
      // Error
      showError(data.message);
    }

  } catch (error) {
    showError('Checkout failed. Please try again.');
  }
}

Add React component:

// apps/frontend/components/CheckoutButton.tsx
export function CheckoutButton({ checkoutData }: CheckoutButtonProps) {
  const [status, setStatus] = useState<'idle' | 'processing' | 'error'>('idle');
  const [message, setMessage] = useState('');

  const handleCheckout = async () => {
    setStatus('processing');
    setMessage('Processing your order...');

    try {
      await handleCheckout(checkoutData);
    } catch (error) {
      setStatus('error');
      setMessage(error.message || 'Checkout failed');
    }
  };

  return (
    <div>
      <button
        onClick={handleCheckout}
        disabled={status === 'processing'}
      >
        {status === 'processing' ? 'Processing...' : 'Complete Purchase'}
      </button>

      {status === 'processing' && (
        <div className="loading-spinner">
          <Spinner />
          <p>{message}</p>
        </div>
      )}

      {status === 'error' && (
        <div className="error-message">{message}</div>
      )}
    </div>
  );
}

Task: 6-8 hours

2.4 Integration Testing

Test both paths:

# apps/api/tickets/tests/test_checkout_async.py
import pytest
from django.test import override_settings
from tickets.tasks import process_checkout_async

@override_settings(ENABLE_ASYNC_CHECKOUT=True)
class TestAsyncCheckout(TransactionTestCase):
    def test_async_checkout_success(self):
        """Test async checkout with immediate task execution."""
        # Create test data
        show = create_test_show()
        ticket = create_test_ticket(show, quantity=10)

        # Initiate checkout
        response = self.client.get('/api/checkout', {
            'showId': str(show.id),
            'firstName': 'John',
            'lastName': 'Doe',
            'email': 'john@example.com',
            'ticketIds': [str(ticket.id)],
            'quantities': ['2'],
        })

        # Should return 202 Accepted
        self.assertEqual(response.status_code, 202)
        data = response.json()
        self.assertEqual(data['status'], 'processing')
        self.assertIn('order_id', data)

        # Get order
        order = Order.objects.get(id=data['order_id'])
        self.assertEqual(order.status, 'pending')

        # Process task (runs synchronously in eager mode)
        result = process_checkout_async(str(order.id))

        # Verify success
        self.assertEqual(result['status'], 'success')
        order.refresh_from_db()
        self.assertEqual(order.status, 'awaiting_payment')
        self.assertIsNotNone(order.session_id)

    def test_async_checkout_sold_out(self):
        """Test async checkout when tickets sell out."""
        show = create_test_show()
        ticket = create_test_ticket(show, quantity=1)

        # Create competing order
        Order.objects.create_with_tickets(show, ticket, quantity=1)

        # Try to checkout
        response = self.client.get('/api/checkout', {
            'showId': str(show.id),
            'firstName': 'Jane',
            'lastName': 'Doe',
            'email': 'jane@example.com',
            'ticketIds': [str(ticket.id)],
            'quantities': ['1'],
        })

        # Order created
        data = response.json()
        order = Order.objects.get(id=data['order_id'])

        # Process task
        result = process_checkout_async(str(order.id))

        # Should fail
        self.assertEqual(result['status'], 'failed')
        self.assertEqual(result['reason'], 'sold_out')
        order.refresh_from_db()
        self.assertEqual(order.status, 'failed')

    def test_status_polling_endpoint(self):
        """Test status endpoint."""
        order = create_test_order(status='pending')

        # Poll status
        response = self.client.get(f'/api/orders/{order.id}/status')
        self.assertEqual(response.status_code, 200)

        data = response.json()
        self.assertEqual(data['status'], 'pending')
        self.assertIn('message', data)

@override_settings(ENABLE_ASYNC_CHECKOUT=False)
class TestSyncCheckout(TransactionTestCase):
    def test_sync_checkout_still_works(self):
        """Ensure sync checkout unchanged."""
        # All existing tests should pass
        pass

Task: 6-8 hours

2.5 Load Testing

Test both modes under load:

# scripts/load_test_async.py
import asyncio
import aiohttp
from datetime import datetime

async def test_checkout(session, checkout_data):
    start = datetime.now()
    async with session.get('/api/checkout', json=checkout_data) as response:
        data = await response.json()
        response_time = (datetime.now() - start).total_seconds()

        if response.status == 202:
            # Async mode - poll for status
            order_id = data['order_id']
            while True:
                async with session.get(f'/api/orders/{order_id}/status') as status_resp:
                    status_data = await status_resp.json()
                    if status_data['status'] != 'pending':
                        total_time = (datetime.now() - start).total_seconds()
                        return {
                            'response_time': response_time,
                            'total_time': total_time,
                            'status': status_data['status']
                        }
                await asyncio.sleep(0.5)
        else:
            # Sync mode or error
            return {
                'response_time': response_time,
                'total_time': response_time,
                'status': data.get('status', 'error')
            }

async def main():
    # Test 100 concurrent checkouts
    async with aiohttp.ClientSession() as session:
        tasks = [
            test_checkout(session, create_checkout_data(i))
            for i in range(100)
        ]
        results = await asyncio.gather(*tasks)

    # Analyze results
    response_times = [r['response_time'] for r in results]
    total_times = [r['total_time'] for r in results]
    successes = sum(1 for r in results if r['status'] in ['success', 'awaiting_payment'])

    print(f"Success rate: {successes}/100")
    print(f"Avg response time: {sum(response_times)/len(response_times):.2f}s")
    print(f"Avg total time: {sum(total_times)/len(total_times):.2f}s")
    print(f"p95 response: {sorted(response_times)[94]:.2f}s")
    print(f"p95 total: {sorted(total_times)[94]:.2f}s")

if __name__ == '__main__':
    asyncio.run(main())

Task: 4-6 hours

Phase 2 Deliverables: - ✅ Feature flag implemented - ✅ Hybrid checkout working - ✅ Frontend polling implemented - ✅ Both modes tested thoroughly - ✅ Load tests show improvement - ✅ Ready for staged rollout


Phase 3: Full Migration (Week 3-4)

Estimated Time: 24-32 hours

3.1 Gradual Rollout

Week 3: Enable for low-traffic shows

# Strategy 1: Per-show feature flag
class Show(models.Model):
    use_async_checkout = models.BooleanField(default=False)

# In checkout view
if show.use_async_checkout or settings.ENABLE_ASYNC_CHECKOUT:
    return self._handle_async_checkout(request)

Enable for specific shows:

-- Enable for test shows first
UPDATE tickets_show
SET use_async_checkout = true
WHERE title LIKE '%Test%' OR producer_id IN (test_producers);

-- Monitor for 48 hours

-- Enable for low-traffic shows
UPDATE tickets_show
SET use_async_checkout = true
WHERE id IN (
    SELECT show_id
    FROM tickets_order
    GROUP BY show_id
    HAVING COUNT(*) < 100
);

-- Monitor for 1 week

-- Enable for all shows
UPDATE tickets_show SET use_async_checkout = true;

Strategy 2: Percentage-based rollout

# Enable for X% of traffic
import random

if random.random() < float(os.getenv('ASYNC_CHECKOUT_PERCENTAGE', '0')):
    return self._handle_async_checkout(request)
else:
    return self._handle_sync_checkout(request)

# Gradual increase:
# Week 3 Day 1: ASYNC_CHECKOUT_PERCENTAGE=0.10 (10%)
# Week 3 Day 3: ASYNC_CHECKOUT_PERCENTAGE=0.25 (25%)
# Week 3 Day 5: ASYNC_CHECKOUT_PERCENTAGE=0.50 (50%)
# Week 4 Day 1: ASYNC_CHECKOUT_PERCENTAGE=0.75 (75%)
# Week 4 Day 3: ASYNC_CHECKOUT_PERCENTAGE=1.00 (100%)

Task: 4-6 hours (includes monitoring)

3.2 Remove Lock Management Code

Once async is at 100%:

# apps/api/tickets/views/order_views.py

# DELETE: _create_order_with_line_items() method (~300 lines)
# DELETE: Lock acquisition code (~40 lines)
# DELETE: Lock cleanup code (~15 lines)
# DELETE: UUID lock value tracking (~10 lines)

# KEEP: DB transaction and row-level locks
# KEEP: Availability checking
# KEEP: Fee calculation

# Result: Simpler codebase, easier to maintain

Create backup branch:

git checkout -b backup/sync-checkout-before-removal
git push origin backup/sync-checkout-before-removal

# Tag the last sync version
git tag -a v1.0.0-sync-checkout -m "Last version with sync checkout"
git push origin v1.0.0-sync-checkout

# Now safe to remove sync code
git checkout main
# ... make changes ...

Task: 6-8 hours (includes testing)

3.3 Optimize Queue Processing

Fine-tune worker configuration:

# apps/api/brktickets/celery.py

# Optimize based on metrics
app.conf.update(
    # Tune concurrency based on CPU/memory
    worker_concurrency=os.cpu_count() * 2,

    # Optimize prefetch
    worker_prefetch_multiplier=1,  # Strict ordering

    # Task timeouts based on p95
    task_time_limit=int(os.getenv('CHECKOUT_TASK_TIMEOUT', '300')),
    task_soft_time_limit=int(os.getenv('CHECKOUT_TASK_SOFT_TIMEOUT', '240')),

    # Priority configuration
    task_default_priority=5,
    task_queue_max_priority=10,
)

Add auto-scaling (if using cloud):

# kubernetes/checkout-worker-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: celery-checkout-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: celery-checkout-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: celery_queue_length
      target:
        type: AverageValue
        averageValue: "50"

Task: 4-6 hours

3.4 Enhanced Monitoring

Add custom metrics:

# apps/api/tickets/monitoring.py
from prometheus_client import Counter, Histogram, Gauge

checkout_requests = Counter(
    'checkout_requests_total',
    'Total checkout requests',
    ['mode', 'status']
)

checkout_duration = Histogram(
    'checkout_duration_seconds',
    'Checkout processing duration',
    ['mode']
)

queue_length = Gauge(
    'checkout_queue_length',
    'Current checkout queue length'
)

# In tasks
@shared_task
def process_checkout_async(order_id):
    with checkout_duration.labels(mode='async').time():
        result = _process_checkout(order_id)

    checkout_requests.labels(
        mode='async',
        status=result['status']
    ).inc()

    return result

Add Grafana dashboard:

{
  "dashboard": {
    "title": "Checkout Performance",
    "panels": [
      {
        "title": "Checkout Success Rate",
        "targets": [{
          "expr": "rate(checkout_requests_total{status='success'}[5m]) / rate(checkout_requests_total[5m])"
        }]
      },
      {
        "title": "Queue Length",
        "targets": [{
          "expr": "checkout_queue_length"
        }]
      },
      {
        "title": "Processing Duration (p95)",
        "targets": [{
          "expr": "histogram_quantile(0.95, checkout_duration_seconds)"
        }]
      }
    ]
  }
}

Task: 6-8 hours

3.5 Documentation

Update docs:

# docs/checkout-architecture.md
## Checkout Flow

### Async Queue-Based Architecture (Current)

1. User submits checkout form
2. API validates request and creates pending order (100-200ms)
3. Order queued for processing (Celery + Redis)
4. User polls status endpoint every 500ms
5. Worker processes order in background (1-3s)
6. Worker creates Stripe session
7. User redirected to Stripe for payment

### Components

- **CheckoutSessionView**: Validates and enqueues orders
- **process_checkout_async**: Celery task for order processing
- **OrderStatusView**: Polling endpoint for status updates
- **Celery workers**: Process checkout queue (2 dedicated workers)
- **Redis**: Task queue and result backend

### Monitoring

- **Flower Dashboard**: http://localhost:5555
- **Grafana Dashboard**: http://localhost:3000/d/checkout
- **Logs**: docker-compose logs celery_checkout_worker

Task: 4-6 hours

Phase 3 Deliverables: - ✅ Async checkout at 100% - ✅ Sync code removed - ✅ Worker configuration optimized - ✅ Monitoring enhanced - ✅ Documentation updated - ✅ Team trained on new architecture


Technical Specifications

API Specification

POST /api/checkout

Request:

{
  "showId": "uuid",
  "firstName": "string (1-100 chars)",
  "lastName": "string (1-100 chars)",
  "email": "string (valid email)",
  "phone": "string (optional)",
  "ticketIds": ["uuid", ...],
  "quantities": ["int", ...],
  "donationAmounts": ["decimal", ...] (optional),
  "promoCode": "string (optional)"
}

Response (202 Accepted):

{
  "order_id": "uuid",
  "task_id": "string",
  "status": "processing",
  "status_url": "/api/orders/{order_id}/status",
  "message": "Your order is being processed..."
}

Response (400 Bad Request):

{
  "status": "error",
  "message": "Error description",
  "error_code": "ERROR_CODE"
}

GET /api/orders/{order_id}/status

Response (pending):

{
  "order_id": "uuid",
  "status": "pending",
  "message": "Your order is being processed...",
  "estimated_wait": "2-5 seconds"
}

Response (awaiting_payment):

{
  "order_id": "uuid",
  "status": "awaiting_payment",
  "message": "Ready for payment",
  "stripe_url": "https://checkout.stripe.com/...",
  "expires_at": "2025-10-29T12:00:00Z"
}

Response (failed):

{
  "order_id": "uuid",
  "status": "failed",
  "message": "Ticket no longer available",
  "can_retry": true
}

Database Schema Changes

-- Add status tracking to orders
ALTER TABLE tickets_order
ADD COLUMN status VARCHAR(20) DEFAULT 'pending',
ADD COLUMN error_message TEXT NULL;

CREATE INDEX idx_order_status_created
ON tickets_order(status, created_at);

-- Status values:
-- 'pending': Order created, awaiting processing
-- 'processing': Worker is processing order (not used currently)
-- 'awaiting_payment': Stripe session created, awaiting payment
-- 'failed': Order failed (tickets unavailable, error, etc.)
-- 'success': Payment completed
-- 'cancelled': User cancelled order

Celery Task Configuration

# Task routing
CELERY_TASK_ROUTES = {
    'tickets.tasks.process_checkout_async': {
        'queue': 'checkout',
        'priority': 9,
    },
    'tickets.tasks.send_payment_link_email': {
        'queue': 'emails',
        'priority': 5,
    },
}

# Queue priorities
CELERY_TASK_QUEUE_MAX_PRIORITY = 10
CELERY_TASK_DEFAULT_PRIORITY = 5

# Retry configuration
CELERY_TASK_MAX_RETRIES = 3
CELERY_TASK_DEFAULT_RETRY_DELAY = 5  # seconds
CELERY_TASK_RETRY_BACKOFF = True
CELERY_TASK_RETRY_BACKOFF_MAX = 60  # seconds

# Time limits
CELERY_TASK_TIME_LIMIT = 300  # 5 minutes
CELERY_TASK_SOFT_TIME_LIMIT = 240  # 4 minutes

# Worker configuration
CELERY_WORKER_PREFETCH_MULTIPLIER = 2
CELERY_WORKER_MAX_TASKS_PER_CHILD = 1000
CELERY_TASK_ACKS_LATE = True
CELERY_TASK_REJECT_ON_WORKER_LOST = True

Error Handling

Retry Strategy:

@shared_task(
    autoretry_for=(
        stripe.error.RateLimitError,
        stripe.error.APIConnectionError,
        OperationalError,  # DB connection issues
    ),
    retry_kwargs={'max_retries': 3},
    retry_backoff=True,
    retry_backoff_max=60,
    retry_jitter=True,
)
def process_checkout_async(order_id):
    # Retries automatically on listed exceptions
    # Backoff: 5s, 10s, 20s (with jitter)
    pass

Dead Letter Queue:

# Failed tasks after max retries
CELERY_TASK_RESULT_EXPIRES = 86400  # Keep results for 24 hours
CELERY_TASK_SEND_FAILED_EVENT = True

# Monitor failed tasks
@app.task
def check_failed_tasks():
    """Alert on high failure rate."""
    failed = Task.objects.filter(
        status='FAILURE',
        created_at__gte=timezone.now() - timedelta(hours=1)
    ).count()

    if failed > 10:
        alert("High checkout failure rate!")

Migration Strategy

Pre-Migration Checklist

Infrastructure: - [ ] Redis running and healthy - [ ] Celery workers running (4+ workers) - [ ] Database migration ready - [ ] Monitoring configured (Flower, logs) - [ ] Backup system in place

Code: - [ ] All tests passing - [ ] Async task implemented - [ ] Status endpoint implemented - [ ] Frontend polling implemented - [ ] Feature flag configured

Team: - [ ] Team trained on new architecture - [ ] Rollback plan documented - [ ] On-call rotation scheduled - [ ] Incident response plan ready

Migration Timeline

┌─────────────────────────────────────────────────────────────┐
│                    Migration Timeline                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│ Week 1: Phase 1 - Infrastructure Setup                      │
│ ├─ Day 1-2: Database migrations, Celery config             │
│ ├─ Day 3-4: Implement async task                           │
│ └─ Day 5: Status endpoint, monitoring                       │
│                                                              │
│ Week 2: Phase 2 - Hybrid Implementation                     │
│ ├─ Day 1-2: Feature flag, hybrid view                      │
│ ├─ Day 3-4: Frontend polling                               │
│ └─ Day 5: Integration testing, load testing                 │
│                                                              │
│ Week 3: Phase 3 - Gradual Rollout                          │
│ ├─ Day 1: Enable for test shows (10%)                      │
│ ├─ Day 2-3: Monitor, adjust as needed                      │
│ ├─ Day 4: Enable for 50% of traffic                        │
│ └─ Day 5: Enable for 75% of traffic                        │
│                                                              │
│ Week 4: Phase 4 - Full Migration & Cleanup                 │
│ ├─ Day 1: Enable for 100% of traffic                       │
│ ├─ Day 2-3: Monitor stability                              │
│ ├─ Day 4: Remove sync code                                 │
│ └─ Day 5: Documentation, retrospective                      │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Rollout Strategy

Stage 1: Internal Testing (Week 1-2) - Enable async for development environment - Enable async for staging environment - Run automated tests - Perform manual testing

Stage 2: Canary Deployment (Week 3 Day 1-2) - Enable for 10% of traffic - Monitor metrics closely: - Success rate (target: > 95%) - Response time (target: < 200ms) - Queue length (target: < 50) - Error rate (target: < 1%) - Alert on any anomalies

Stage 3: Gradual Increase (Week 3 Day 3-5) - Increase to 25% if metrics good - Wait 24 hours, monitor - Increase to 50% - Wait 24 hours, monitor - Increase to 75%

Stage 4: Full Rollout (Week 4 Day 1-2) - Increase to 100% - Monitor for 48 hours - Confirm stability

Stage 5: Cleanup (Week 4 Day 3-5) - Remove sync checkout code - Update documentation - Train team on new architecture

Success Criteria

Metrics to Monitor:

Metric Target Alert Threshold
Success rate > 95% < 90%
Response time (p95) < 300ms > 500ms
Queue length < 50 > 100
Error rate < 1% > 3%
Task duration (p95) < 5s > 10s
Worker availability 100% < 100%

Go/No-Go Decision Points:

After each stage, evaluate: 1. ✅ All metrics within target 2. ✅ No customer complaints 3. ✅ No critical bugs 4. ✅ Team confident to proceed

If any criteria fails: - ⚠️ Pause rollout - 🔍 Investigate issue - 🛠️ Fix and re-test - ♻️ Resume rollout


Testing Strategy

Unit Tests

# apps/api/tickets/tests/test_checkout_async_unit.py
class TestAsyncCheckoutTask(TestCase):
    """Unit tests for async checkout task."""

    @override_settings(CELERY_TASK_ALWAYS_EAGER=True)
    def test_process_checkout_success(self):
        """Test successful checkout processing."""
        order = create_test_order(status='pending')

        result = process_checkout_async(str(order.id))

        self.assertEqual(result['status'], 'success')
        order.refresh_from_db()
        self.assertEqual(order.status, 'awaiting_payment')

    @override_settings(CELERY_TASK_ALWAYS_EAGER=True)
    def test_process_checkout_sold_out(self):
        """Test checkout when tickets sell out."""
        order = create_test_order_with_sold_out_tickets()

        result = process_checkout_async(str(order.id))

        self.assertEqual(result['status'], 'failed')
        self.assertEqual(result['reason'], 'sold_out')

    @override_settings(CELERY_TASK_ALWAYS_EAGER=True)
    @patch('tickets.tasks.stripe.checkout.Session.create')
    def test_process_checkout_stripe_error(self, mock_stripe):
        """Test retry on Stripe error."""
        mock_stripe.side_effect = stripe.error.RateLimitError("Rate limit")
        order = create_test_order(status='pending')

        with self.assertRaises(Retry):
            process_checkout_async(str(order.id))

Integration Tests

# apps/api/tickets/tests/test_checkout_async_integration.py
class TestAsyncCheckoutIntegration(TransactionTestCase):
    """Integration tests for async checkout flow."""

    def test_full_checkout_flow(self):
        """Test complete checkout flow from request to payment."""
        # 1. Submit checkout
        response = self.client.get('/api/checkout', {
            'showId': str(self.show.id),
            'firstName': 'John',
            'lastName': 'Doe',
            'email': 'john@example.com',
            'ticketIds': [str(self.ticket.id)],
            'quantities': ['2'],
        })

        self.assertEqual(response.status_code, 202)
        data = response.json()
        order_id = data['order_id']

        # 2. Poll status
        for _ in range(10):
            status_response = self.client.get(f'/api/orders/{order_id}/status')
            status_data = status_response.json()

            if status_data['status'] == 'awaiting_payment':
                self.assertIn('stripe_url', status_data)
                break

            time.sleep(0.5)
        else:
            self.fail("Checkout did not complete in 5 seconds")

        # 3. Verify order
        order = Order.objects.get(id=order_id)
        self.assertEqual(order.status, 'awaiting_payment')
        self.assertIsNotNone(order.session_id)

Load Tests

# scripts/load_test.py
from locust import HttpUser, task, between

class CheckoutUser(HttpUser):
    wait_time = between(1, 3)

    def on_start(self):
        """Set up test data."""
        self.show_id = create_test_show()
        self.ticket_id = create_test_ticket()

    @task
    def checkout(self):
        """Simulate checkout flow."""
        # 1. Submit checkout
        response = self.client.get('/api/checkout', json={
            'showId': self.show_id,
            'firstName': 'Load',
            'lastName': 'Test',
            'email': f'test-{time.time()}@example.com',
            'ticketIds': [self.ticket_id],
            'quantities': ['1'],
        })

        if response.status_code == 202:
            order_id = response.json()['order_id']

            # 2. Poll status
            for _ in range(20):
                status = self.client.get(f'/api/orders/{order_id}/status')
                if status.json()['status'] != 'pending':
                    break
                time.sleep(0.5)

Run load test:

# Test with 100 concurrent users
locust -f scripts/load_test.py --users 100 --spawn-rate 10

# Monitor:
# - Response times
# - Success rate
# - Queue length
# - Worker CPU/memory

Test Coverage Goals

Component Coverage Target Current Gap
Async task 95% - New
Status endpoint 95% - New
View layer 90% 85% +5%
Models 85% 85% -
Overall 90% 87% +3%

Monitoring and Observability

Metrics

Key Metrics to Track:

# Checkout success rate
checkout_success_rate = (
    successful_checkouts / total_checkouts
) * 100

# Target: > 95%
# Alert: < 90%

# Response time (user-facing)
response_time_p50 = percentile(response_times, 0.50)
response_time_p95 = percentile(response_times, 0.95)
response_time_p99 = percentile(response_times, 0.99)

# Targets:
# p50: < 150ms
# p95: < 300ms
# p99: < 500ms

# Processing time (worker)
processing_time_p50 = percentile(task_durations, 0.50)
processing_time_p95 = percentile(task_durations, 0.95)

# Targets:
# p50: < 2s
# p95: < 5s

# Queue metrics
queue_length = redis.llen('celery:checkout')
queue_age = oldest_task_age_seconds

# Targets:
# length: < 50
# age: < 30s

# Worker health
active_workers = count_active_workers('checkout')
worker_utilization = (active_tasks / (active_workers * concurrency)) * 100

# Targets:
# workers: >= 2
# utilization: 50-80%

Dashboards

Flower (Celery Monitor): - URL: http://localhost:5555 - Username/password: admin/admin (configure in .env) - Real-time task monitoring - Worker status - Task history

Grafana Dashboard:

# grafana/dashboards/checkout.json
panels:
  - title: Checkout Success Rate
    type: graph
    targets:
      - expr: |
          rate(checkout_requests_total{status="success"}[5m]) /
          rate(checkout_requests_total[5m]) * 100
    alert:
      condition: < 90

  - title: Response Time (p95)
    type: graph
    targets:
      - expr: histogram_quantile(0.95, checkout_duration_seconds)
    alert:
      condition: > 0.5

  - title: Queue Length
    type: graph
    targets:
      - expr: checkout_queue_length
    alert:
      condition: > 100

  - title: Worker Health
    type: stat
    targets:
      - expr: celery_active_workers{queue="checkout"}
    alert:
      condition: < 2

Alerts

PagerDuty/Slack Integration:

# apps/api/monitoring/alerts.py
from datadog import statsd

def check_checkout_health():
    """Monitor checkout health and alert on issues."""
    metrics = get_checkout_metrics()

    # Alert on low success rate
    if metrics['success_rate'] < 90:
        alert_critical(
            title="Checkout success rate below 90%",
            message=f"Current: {metrics['success_rate']}%",
            severity="critical"
        )

    # Alert on high queue length
    if metrics['queue_length'] > 100:
        alert_warning(
            title="Checkout queue backing up",
            message=f"Queue length: {metrics['queue_length']}",
            severity="warning"
        )

    # Alert on worker issues
    if metrics['active_workers'] < 2:
        alert_critical(
            title="Checkout workers unavailable",
            message=f"Active workers: {metrics['active_workers']}",
            severity="critical"
        )

# Run every minute
@celery_app.task
def monitor_checkout_health():
    check_checkout_health()

Logging

Structured Logging:

import structlog

logger = structlog.get_logger()

# In task
def process_checkout_async(order_id):
    logger.info(
        "checkout.started",
        order_id=order_id,
        timestamp=time.time()
    )

    try:
        result = _process_checkout(order_id)

        logger.info(
            "checkout.completed",
            order_id=order_id,
            status=result['status'],
            duration=result['duration'],
            timestamp=time.time()
        )

        return result

    except Exception as e:
        logger.error(
            "checkout.failed",
            order_id=order_id,
            error=str(e),
            stack_trace=traceback.format_exc(),
            timestamp=time.time()
        )
        raise

Log Analysis:

# View checkout logs
docker-compose logs celery_checkout_worker -f --tail=100

# Search for failures
docker-compose logs celery_checkout_worker | grep "checkout.failed"

# Analyze processing times
docker-compose logs celery_checkout_worker | grep "checkout.completed" |
  jq '.duration' |
  awk '{sum+=$1; count++} END {print "Avg:", sum/count}'

Rollback Plan

Immediate Rollback (< 5 minutes)

If critical issues arise:

# 1. Disable async checkout via feature flag
docker-compose exec api python manage.py shell
>>> from django.conf import settings
>>> settings.ENABLE_ASYNC_CHECKOUT = False

# Or restart with env var
docker-compose down
ENABLE_ASYNC_CHECKOUT=false docker-compose up -d

# 2. Verify sync checkout working
curl http://localhost:8001/api/checkout?showId=...
# Should return 200 with Stripe URL (not 202)

# 3. Monitor for recovery
# - Check success rate
# - Check response times
# - Check customer complaints

# 4. Investigate issue
# - Check Celery logs
# - Check Redis connection
# - Check worker health

Partial Rollback (< 15 minutes)

If issues with specific shows:

# Disable async for specific show
show = Show.objects.get(id=problem_show_id)
show.use_async_checkout = False
show.save()

# Or disable for percentage of traffic
# .env
ASYNC_CHECKOUT_PERCENTAGE=0.5  # Reduce to 50%

Full Rollback (< 1 hour)

If async architecture is fundamentally flawed:

# 1. Switch to backup branch
git checkout backup/sync-checkout-before-removal
git checkout -b rollback-async-checkout

# 2. Deploy sync version
# ... deployment steps ...

# 3. Database migration (if needed)
python manage.py migrate tickets XXXX_rollback_status_field

# 4. Clean up queue
redis-cli FLUSHDB  # Clear pending tasks

# 5. Restart services
docker-compose restart

# 6. Verify sync working
# - Run integration tests
# - Test manual checkout
# - Monitor success rate

Post-Rollback

After rollback: 1. Root cause analysis: What went wrong? 2. Fix identified issues: Address problems 3. Update tests: Add tests for failure scenarios 4. Document lessons learned: Update this document 5. Plan retry: When to attempt migration again?

Rollback Triggers

Automatic rollback if: - Success rate < 80% for > 5 minutes - Worker availability = 0 for > 2 minutes - Queue length > 500 for > 10 minutes

Manual rollback if: - > 10 customer complaints in 15 minutes - Data corruption detected - Security issue discovered - Team loses confidence


Cost and Resource Analysis

Infrastructure Costs

Current (Sync):

API Servers: 2 instances × $50/month = $100/month
Redis: 1 instance × $30/month = $30/month
Database: 1 instance × $100/month = $100/month
Celery Workers (emails): 1 instance × $50/month = $50/month

Total: $280/month

Proposed (Async):

API Servers: 2 instances × $50/month = $100/month
  (No increase - same API servers)

Redis: 1 instance × $30/month = $30/month
  (Same Redis, used for queue + locks)

Database: 1 instance × $100/month = $100/month
  (Same database)

Celery Workers:
  - Checkout queue: 1 instance × $50/month = $50/month (NEW)
  - Email/other: 1 instance × $50/month = $50/month (EXISTING)
  Subtotal: $100/month (+$50/month increase)

Total: $330/month (+$50/month or +18%)

Cost-Benefit Analysis:

Additional Cost: $50/month ($600/year)

Benefits:
- 10x throughput increase
- Simplified codebase (faster development)
- Better user experience (10-50x faster response)
- Reduced support costs (fewer timeout complaints)
- Enable flash sales (new revenue opportunities)

ROI: If flash sales generate $5000+ revenue/year → 8x ROI

Development Resources

Initial Implementation:

Phase 1 (Week 1): 16-20 hours
Phase 2 (Week 2): 20-24 hours
Phase 3 (Week 3-4): 24-32 hours

Total: 60-76 hours (1.5-2 developer-months)

At $100/hour: $6,000-$7,600 one-time cost

Ongoing Maintenance:

Current (Sync): ~2 hours/week troubleshooting lock issues
Proposed (Async): ~1 hour/week monitoring queues

Savings: ~1 hour/week = 52 hours/year = $5,200/year

Net First Year Cost:

Development: $6,500 (one-time)
Infrastructure: $600/year (ongoing)
Savings: -$5,200/year (reduced maintenance)

Net Year 1: $1,900
Net Year 2+: -$4,600/year (savings)

Break-even: ~4 months


Appendices

Appendix A: Glossary

Terms: - Async Checkout: Queue-based checkout using Celery workers - Sync Checkout: Current blocking checkout with Redis locks - Celery: Distributed task queue (using Redis as broker) - Redis: In-memory data store (used for queue + cache) - Worker: Celery process that executes queued tasks - Queue: Redis list containing pending tasks - Polling: Frontend repeatedly checking order status - 202 Accepted: HTTP status for async processing - DLQ: Dead Letter Queue for failed tasks

Appendix B: Reference Architecture

Similar Implementations: - Ticketmaster: Uses queue-based checkout for high-traffic events - Eventbrite: Async order processing with polling - StubHub: Queue-based inventory management - Shopify: Checkout queue for flash sales

Industry Best Practices: - Celery Best Practices - Redis Queue Patterns - Async API Design

Appendix C: Team Training Materials

Required Training: - Celery fundamentals (2 hours) - Queue-based architectures (1 hour) - Monitoring with Flower (30 minutes) - Troubleshooting guide (1 hour) - Incident response procedures (1 hour)

Training Resources: - Celery Documentation - Queue-Based Architectures Video Course - Internal Wiki: Async Checkout Guide

Appendix D: FAQ

Q: What happens if Redis goes down? A: Orders are saved in database (status='pending'). When Redis recovers, admin can re-queue orders manually.

Q: What happens if Celery worker crashes mid-task? A: Task is re-queued automatically (reject_on_worker_lost=True). Order remains pending until successfully processed.

Q: How long does polling continue? A: Max 15 seconds (30 attempts × 500ms). After timeout, user receives email with payment link.

Q: Can we process orders faster than 1-3 seconds? A: Yes, by adding more workers or optimizing Stripe API calls. Current p95 is ~3s, could reduce to ~1s.

Q: What if user closes browser during polling? A: Order still processes in background. User receives email with payment link. Can also resume via order history.

Q: How do we handle flash sales? A: Queue naturally handles burst traffic. Scale workers horizontally before event. Monitor queue length and scale automatically.

Q: Can we revert to sync checkout if needed? A: Yes, feature flag allows instant rollback. Backup branch preserves sync code.


Approval and Sign-off

Role Name Signature Date
Technical Lead
Product Manager
DevOps Lead
QA Lead
CTO/Engineering Director

Document Control

Version History:

Version Date Author Changes
1.0 2025-10-28 Technical Team Initial proposal

Review Schedule: - Technical review: [Date] - Security review: [Date] - Final approval: [Date]

Contact: - Technical questions: [Email] - Product questions: [Email] - Deployment questions: [Email]


Next Steps: 1. ✅ Review this proposal 2. ✅ Address any concerns or questions 3. ✅ Get approval from stakeholders 4. ✅ Create implementation tickets 5. ✅ Begin Phase 1 development 6. ✅ Schedule regular check-ins during migration


This is a living document. Please update as the implementation progresses and new learnings emerge.