Cost-Optimized AWS Networking: Custom EC2 NAT vs Managed NAT Gateway

The AWS NAT Cost Challenge

AWS NAT Gateway provides managed network address translation for private subnet resources (databases, ECS tasks, Lambda functions) that require outbound internet access. At us-east-1 rates it costs $0.045 per hour ($32.40 per month) per gateway, typically one per Availability Zone, plus $0.045 per GB processed. A standard three-AZ production architecture therefore incurs a $97.20 monthly baseline ($1,166 annually) before data processing charges.

Organizations running multiple environments (development, staging, production) face costs scaling linearly with deployment footprint. Private subnet internet access remains non-negotiable for security patches, package dependencies, and API integrations. The question becomes: how much operational complexity is acceptable to reduce costs?

Key Results:

  • 81% cost reduction using per-AZ EC2 NAT ($18/month vs $97/month for 3-AZ)
  • 94% cost reduction using single EC2 NAT ($6/month vs $97/month)
  • Terraform-automated deployment eliminating manual configuration
  • High availability through Auto Scaling Groups

Architecture: VPC Networking Fundamentals

AWS VPC network segmentation separates internet-facing resources from internal systems through subnet classification and route table associations.

Public Subnets:

  • Internet-facing resources (load balancers, bastion hosts, NAT instances)
  • Route table default route (0.0.0.0/0) targets Internet Gateway
  • Resources can be assigned public IP addresses (auto-assigned or Elastic IP)

Private Subnets:

  • Internal resources (RDS, ECS tasks, Lambda functions)
  • Internet-bound traffic (0.0.0.0/0) routes through NAT; VPC-internal traffic stays local
  • Private IP addresses only; no inbound connections can be initiated from the internet

Route Table Logic:

Public subnet:

Destination        Target
10.0.0.0/16        local
0.0.0.0/0          igw-xxxxx

Private subnet:

Destination        Target
10.0.0.0/16        local
0.0.0.0/0          nat-xxxxx OR eni-xxxxx

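The default route determines subnet classification: a subnet is public because its route table sends 0.0.0.0/0 to an Internet Gateway. Both NAT patterns use an identical private route table structure; only the NAT target differs (nat-xxxxx for NAT Gateway, eni-xxxxx for an EC2 instance).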
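To make the lookup concrete, here is a minimal Python sketch of how a route table resolves a destination: the most specific matching CIDR (longest prefix) wins, so VPC-internal traffic stays local and everything else follows the 0.0.0.0/0 route. The route tables and IDs are hypothetical placeholders, and the sketch ignores VPC features such as prefix lists and gateway endpoints.

```python
from __future__ import annotations

import ipaddress

# Hypothetical route tables mirroring the ones above (IDs are placeholders).
PUBLIC_RT = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-xxxxx"}
PRIVATE_RT = {"10.0.0.0/16": "local", "0.0.0.0/0": "eni-xxxxx"}


def route_target(route_table: dict[str, str], destination: str) -> str:
    """Return the target of the most specific route matching destination."""
    addr = ipaddress.ip_address(destination)
    matches = [cidr for cidr in route_table
               if addr in ipaddress.ip_network(cidr)]
    if not matches:
        raise LookupError(f"no route for {destination}")
    # Longest prefix wins: the /16 local route beats the /0 default route.
    best = max(matches, key=lambda cidr: ipaddress.ip_network(cidr).prefixlen)
    return route_table[best]


def is_public(route_table: dict[str, str]) -> bool:
    """A subnet is public when its default route targets an Internet Gateway."""
    return route_table.get("0.0.0.0/0", "").startswith("igw-")
```

For example, route_target(PRIVATE_RT, "10.0.2.15") resolves to "local", while route_target(PRIVATE_RT, "151.101.1.69") resolves to the NAT ENI.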

NAT Gateway vs EC2 NAT Instance

NAT Gateway (Managed Service):

  • AWS-managed high-availability NAT service
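  • Zonal resource deployed in a public subnet; one per AZ keeps each AZ's egress independent
  • Scales automatically from 5 Gbps up to 100 Gbps with no capacity planning
  • $0.045/hour ($32.40/month) plus data processing ($0.045/GB)
  • No instances to patch, monitor, or fail over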

EC2 NAT Instance (Custom Implementation):

  • Standard EC2 instance with IP forwarding and iptables
  • source_dest_check = false lets the instance forward traffic that is not addressed to or from itself
  • iptables MASQUERADE rule rewrites outbound packet source IPs
  • Route table targets EC2 network interface (eni-xxxxx)
  • Operational overhead: OS patching, monitoring, failover automation

Both perform the same core translation: a private resource sends internet-bound traffic to the NAT, the NAT rewrites the source IP and forwards the packet to the internet, then rewrites the destination of each reply back to the originating private resource.
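That translation can be modeled as a connection table. The Python sketch below is a toy model of the behavior iptables MASQUERADE provides through Linux connection tracking, not its actual implementation: outbound packets get the NAT's address and a translated port, the mapping is remembered, and replies are rewritten back. Packets with no matching entry are dropped, which is why private resources stay unreachable from the internet. The class, addresses, and port range are hypothetical; the NAT address is the instance's private IP, since the Internet Gateway separately maps it to the Elastic IP.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class MasqueradeNat:
    """Toy source NAT: rewrites outbound sources and maps replies back."""

    def __init__(self, nat_ip: str, first_port: int = 40000):
        self.nat_ip = nat_ip
        self._ports = count(first_port)  # next translated source port
        self._outbound = {}  # (src_ip, src_port, dst_ip, dst_port) -> NAT port
        self._inbound = {}   # (NAT port, remote_ip, remote_port) -> (src_ip, src_port)

    def outbound(self, pkt: Packet) -> Packet:
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port)
        if key not in self._outbound:
            port = next(self._ports)
            self._outbound[key] = port
            self._inbound[(port, pkt.dst_ip, pkt.dst_port)] = (pkt.src_ip, pkt.src_port)
        # Same flow keeps the same translated port.
        return Packet(self.nat_ip, self._outbound[key], pkt.dst_ip, pkt.dst_port)

    def inbound(self, reply: Packet) -> Packet:
        key = (reply.dst_port, reply.src_ip, reply.src_port)
        if key not in self._inbound:
            raise PermissionError("unsolicited inbound packet dropped")
        orig_ip, orig_port = self._inbound[key]
        return Packet(reply.src_ip, reply.src_port, orig_ip, orig_port)
```

A reply from 151.101.1.69:443 to the NAT's translated port comes back addressed to the private resource that opened the connection; the same packet from an unknown peer is rejected.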

Implementation: Terraform EC2 NAT Configuration

Terraform configuration implements EC2-based NAT through dual-purpose instances serving as both ECS cluster members and network routers.

VPC and Route Table Configuration

# terraform/shared/vpc.tf
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.14.0"

  name = "${var.app_name}-vpc"
  cidr = var.vpc_cidr

  azs             = var.availability_zones
  public_subnets  = var.public_subnet_cidrs
  private_subnets = var.private_subnet_cidrs

  enable_nat_gateway   = false # Using custom EC2 NAT
  enable_dns_hostnames = true
}

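# Custom NAT routing
locals {
  # Map each AZ to its private route table (assumes one private subnet per AZ)
  private_rt_by_az = {
    for idx, az in var.availability_zones :
    az => module.vpc.private_route_table_ids[idx]
  }

  # AZ => public subnet ID for each NAT instance, first nat_instance_count AZs
  # (assumed definition; ec2.tf references it but it is not shown elsewhere)
  nat_public_by_az = {
    for idx, az in slice(var.availability_zones, 0, var.nat_instance_count) :
    az => module.vpc.public_subnets[idx]
  }

  single_nat        = var.nat_instance_count == 1
  single_nat_eni_id = local.single_nat ? aws_instance.ecs_instance[var.availability_zones[0]].primary_network_interface_id : null
}

# Default route for every private subnet: the single NAT ENI, or the same-AZ NAT ENI
resource "aws_route" "private_default_to_nat" {
  for_each               = local.private_rt_by_az
  route_table_id         = each.value
  destination_cidr_block = "0.0.0.0/0"
  network_interface_id   = local.single_nat ? local.single_nat_eni_id : aws_instance.ecs_instance[each.key].primary_network_interface_id
}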

Variable nat_instance_count determines routing:

  • = 1: All private subnets route to single instance (max cost savings)
  • = 3: Each private subnet routes to instance in same AZ (high availability)
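The conditional in the aws_route resource reduces to a small mapping. This Python sketch mirrors that logic to make the two modes explicit; the AZ names and ENI IDs are placeholders, and the guard rejecting counts other than 1 or one per AZ is an added assumption that the Terraform above does not enforce.

```python
from __future__ import annotations


def nat_route_targets(azs: list[str], nat_instance_count: int,
                      eni_by_az: dict[str, str]) -> dict[str, str]:
    """Return {az: NAT ENI ID} for each AZ's private default route.

    Mirrors the Terraform locals: with one NAT instance every AZ routes to the
    instance in the first AZ; otherwise each AZ routes to its own instance.
    """
    if nat_instance_count == 1:
        single_eni = eni_by_az[azs[0]]
        return {az: single_eni for az in azs}
    # Added guard (not in the Terraform): partial coverage would leave some
    # AZs without a same-AZ NAT instance.
    if nat_instance_count != len(azs):
        raise ValueError("nat_instance_count must be 1 or equal to the number of AZs")
    return {az: eni_by_az[az] for az in azs}
```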

EC2 NAT Instance Configuration

# terraform/shared/ec2.tf
resource "aws_instance" "ecs_instance" {
  for_each               = local.nat_public_by_az
  ami                    = var.ecs_optimized_ami # must be the arm64 ECS-optimized AMI (t4g is Graviton)
  instance_type          = "t4g.micro"
  subnet_id              = each.value
  vpc_security_group_ids = [aws_security_group.ecs_instance.id]
  iam_instance_profile   = aws_iam_instance_profile.ecs_instance.name

  source_dest_check = false  # Required for NAT

  user_data = base64encode(templatefile("${path.module}/user-data.sh", {
    cluster_name = "${var.app_name}-ecs-cluster"
    vpc_cidr     = var.vpc_cidr
  }))
}

resource "aws_eip" "nat" {
  for_each = aws_instance.ecs_instance
  instance = each.value.id
  domain   = "vpc"
}

User Data Script:

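#!/bin/bash
# Rendered by templatefile(): cluster_name and vpc_cidr are substituted by Terraform

# ECS membership
echo "ECS_CLUSTER=${cluster_name}" >> /etc/ecs/ecs.config

# Enable IP forwarding now and after reboots
echo "net.ipv4.ip_forward = 1" > /etc/sysctl.d/90-nat-forwarding.conf
sysctl -w net.ipv4.ip_forward=1

# NAT masquerading on the interface that carries the default route
IFACE=$(ip route show default | awk '{print $5; exit}')
iptables -t nat -A POSTROUTING -s ${vpc_cidr} ! -d ${vpc_cidr} -o "$IFACE" -j MASQUERADE

# Save the rule (requires the iptables-services package; the iptables
# service must be enabled for saved rules to be restored at boot)
service iptables save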

Key Configuration:

  1. source_dest_check = false: Allows packet forwarding (non-default EC2 behavior)
  2. IP Forwarding: Linux kernel parameter enabling routing between interfaces
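  3. iptables MASQUERADE: Rewrites outbound source IPs to the instance's private IP; the Internet Gateway then maps that address to the Elastic IP
  4. Dual Purpose: Instance handles both ECS workloads and NAT routing
  5. Security group: Must allow inbound traffic from the private subnets (VPC CIDR), or forwarded packets are dropped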
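The first three settings can be verified on a running instance. A minimal Python sketch, assuming you can read /proc on the instance, capture the output of iptables -t nat -S POSTROUTING as root, and call the EC2 API with a boto3-style client; the function names are hypothetical, while describe_instance_attribute with Attribute="sourceDestCheck" is a real boto3 EC2 call.

```python
from __future__ import annotations


def ip_forward_enabled(proc_value: str) -> bool:
    """proc_value is the content of /proc/sys/net/ipv4/ip_forward ("1" = on)."""
    return proc_value.strip() == "1"


def has_masquerade_rule(iptables_rules: str, vpc_cidr: str) -> bool:
    """Scan `iptables -t nat -S POSTROUTING` output for a MASQUERADE rule
    whose source is the VPC CIDR (Docker's own rules use other CIDRs)."""
    for line in iptables_rules.splitlines():
        tokens = line.split()
        if tokens[:2] != ["-A", "POSTROUTING"] or tokens[-2:] != ["-j", "MASQUERADE"]:
            continue
        if "-s" in tokens and tokens[tokens.index("-s") + 1] == vpc_cidr:
            return True
    return False


def source_dest_check_disabled(ec2, instance_id: str) -> bool:
    """ec2 is a boto3-style EC2 client, e.g. boto3.client("ec2")."""
    resp = ec2.describe_instance_attribute(
        InstanceId=instance_id, Attribute="sourceDestCheck"
    )
    return not resp["SourceDestCheck"]["Value"]
```

Keeping the checks as pure functions over captured text makes them easy to run from a health-check script or a test without touching the host.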

High Availability with Auto Scaling Groups

# One single-instance ASG per AZ: replaces the NAT instance when EC2 status checks fail
resource "aws_autoscaling_group" "ecs_nat" {
  for_each                  = local.nat_public_by_az
  name                      = "${var.app_name}-ecs-nat-asg-${each.key}"
  vpc_zone_identifier       = [each.value]
  desired_capacity          = 1
  min_size                  = 1
  max_size                  = 1
  health_check_type         = "EC2"
  health_check_grace_period = 300

  # aws_launch_template.ecs_nat (not shown) supplies the AMI, instance type, and user data
  launch_template {
    id      = aws_launch_template.ecs_nat[each.key].id
    version = "$Latest"
  }
}

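The Auto Scaling Group replaces an instance that fails EC2 status checks, with roughly 2-5 minutes of recovery. Replacement alone does not restore egress, though. The new instance gets a new network interface, so each private route still points at the terminated instance's ENI (a blackhole route); the Elastic IP is not re-associated; and launch templates cannot disable the source/destination check. The launch template's user data must therefore disable the check (aws ec2 modify-instance-attribute --no-source-dest-check), re-associate the Elastic IP (aws ec2 associate-address), and repoint the private routes (aws ec2 replace-route), with matching IAM permissions on the instance profile. Because those route changes happen outside Terraform, the aws_route resources also need lifecycle { ignore_changes = [network_interface_id] } so the next apply does not revert them. When the ASG owns the NAT instances, it should replace the standalone aws_instance resources rather than run alongside them.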
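What a replacement instance has to do at boot can be sketched in Python against a boto3-style EC2 client, passed in so the logic can be exercised without AWS. The function name and its inputs are hypothetical: in practice the route table IDs and Elastic IP allocation ID would be injected into the launch template's user data, and the instance and ENI IDs read from instance metadata. modify_instance_attribute, associate_address, and replace_route are real boto3 EC2 client methods, and the instance profile needs ec2:ModifyInstanceAttribute, ec2:AssociateAddress, and ec2:ReplaceRoute.

```python
from __future__ import annotations

DEFAULT_ROUTE = "0.0.0.0/0"


def claim_nat_role(ec2, instance_id: str, eni_id: str,
                   route_table_ids: list[str], allocation_id: str) -> None:
    """Point NAT-related state at a freshly launched replacement instance.

    ec2 is a boto3-style EC2 client, e.g. boto3.client("ec2").
    """
    # Launch templates cannot disable this check, so do it at boot.
    ec2.modify_instance_attribute(
        InstanceId=instance_id, SourceDestCheck={"Value": False}
    )
    # Attach the stable egress Elastic IP to this instance.
    ec2.associate_address(
        AllocationId=allocation_id, InstanceId=instance_id, AllowReassociation=True
    )
    # Repoint each private default route (blackholed once the old ENI was
    # deleted) at this instance's ENI.
    for rtb_id in route_table_ids:
        ec2.replace_route(
            RouteTableId=rtb_id,
            DestinationCidrBlock=DEFAULT_ROUTE,
            NetworkInterfaceId=eni_id,
        )
```

replace_route is used rather than create_route because Terraform created the route originally; after the old instance terminates, the route remains in a blackhole state and can be replaced in place.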

Cost Analysis and Trade-offs

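Three-AZ Production Environment

Figures compare hourly charges only, at us-east-1 on-demand rates. NAT Gateway's $0.045/GB data processing charge is excluded, so real savings grow with traffic; EBS root volumes add a small amount to the EC2 options.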

NAT Gateway:

Per AZ: $32.40/month
Total (3 AZs): $97.20/month
Annual: $1,166.40

EC2 NAT (Per-AZ):

t4g.micro: $6.13/month × 3 = $18.39/month
Annual: $220.68
Savings: $945.72/year (81% reduction)

EC2 NAT (Single Instance):

t4g.micro: $6.13/month
Annual: $73.56
Savings: $1,092.84/year (94% reduction)
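Trade-off: Single point of failure; losing the instance (or its AZ) cuts egress for every private subnet, and traffic from the other two AZs incurs cross-AZ data transfer charges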
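The arithmetic above is easy to rerun with your own numbers. A minimal Python sketch using the article's rates (us-east-1 on-demand; check current pricing for your Region). HOURS_PER_MONTH is 720 to reproduce the $32.40 NAT Gateway figure; the $6.13 t4g.micro figure works out to about $0.0084 per hour over 730 hours, and on that 730-hour basis a NAT Gateway costs $32.85 per month, so the percentages barely move. Unlike the tables above, the model also accepts NAT Gateway data processing volume; EBS volumes and data transfer out are left out.

```python
HOURS_PER_MONTH = 720        # reproduces the $32.40/month NAT Gateway figure
NAT_GW_HOURLY = 0.045        # USD per NAT Gateway-hour
NAT_GW_PER_GB = 0.045        # USD per GB processed by a NAT Gateway
T4G_MICRO_MONTHLY = 6.13     # USD per t4g.micro-month, as quoted above


def nat_gateway_monthly(gateways: int, gb_processed: float = 0.0) -> float:
    """Hourly charge for each gateway plus per-GB processing."""
    return gateways * NAT_GW_HOURLY * HOURS_PER_MONTH + gb_processed * NAT_GW_PER_GB


def ec2_nat_monthly(instances: int) -> float:
    """Fixed instance cost; EC2 NAT has no per-GB processing charge."""
    return instances * T4G_MICRO_MONTHLY


def reduction_pct(baseline: float, alternative: float) -> float:
    """Percentage saved by the alternative relative to the baseline."""
    return 100 * (1 - alternative / baseline)


if __name__ == "__main__":
    gw = nat_gateway_monthly(3)
    for label, cost in [("per-AZ EC2 NAT", ec2_nat_monthly(3)),
                        ("single EC2 NAT", ec2_nat_monthly(1))]:
        print(f"{label}: ${cost:.2f}/mo, saves ${12 * (gw - cost):.2f}/yr "
              f"({reduction_pct(gw, cost):.0f}%)")
```

Passing a monthly volume, for example nat_gateway_monthly(3, gb_processed=500), shows how quickly processing charges widen the gap.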

Operational Considerations

Monitoring:

  • CloudWatch metrics: Network throughput, CPU, status checks
  • Auto Scaling Group health checks trigger replacement
  • Alerts on instance failure → automated recovery

Performance:

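  • t4g.micro network bandwidth: 85 Mbps baseline, bursting up to 5 Gbps for limited periods (network I/O credits)
  • Adequate for light egress (OS patches, package downloads, API calls); measure production traffic before relying on it
  • Upgrade the instance type if sustained throughput exceeds the baseline
  • t4g instances launch in unlimited CPU credit mode by default, so sustained CPU above baseline (from NAT or ECS tasks) adds charges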
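Monthly egress volume translates into an average rate that can be compared with the instance baseline. A rough Python sketch; the 730-hour month and the default peak-to-average factor of 10 are assumptions, and real sizing should use measured CloudWatch NetworkOut data rather than monthly totals.

```python
HOURS_PER_MONTH = 730  # assumed billing-month length


def average_mbps(gb_per_month: float) -> float:
    """Average throughput implied by monthly egress volume (decimal GB and Mb)."""
    megabits = gb_per_month * 8 * 1000
    return megabits / (HOURS_PER_MONTH * 3600)


def may_exceed_baseline(gb_per_month: float, baseline_mbps: float,
                        peak_to_average: float = 10.0) -> bool:
    """True when the estimated peak rate exceeds the instance's baseline.

    peak_to_average is an assumed burstiness factor, not a measured value.
    """
    return average_mbps(gb_per_month) * peak_to_average > baseline_mbps
```

With the 85 Mbps baseline quoted above, 1,000 GB a month averages about 3 Mbps and stays under the threshold even at a 10x peak factor, while 5,000 GB a month (about 15 Mbps average) does not. Averages hide bursts, so the factor is only a crude stand-in for measured peaks.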

Security:

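  • Same inbound isolation as NAT Gateway: private resources still cannot receive connections initiated from the internet
  • Security groups on the NAT instance control which sources may use it (NAT Gateway does not support security groups)
  • Unlike NAT Gateway, the instance's OS needs patching, and ECS tasks on the same host share its attack surface
  • VPC Flow Logs for audit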

When to Use Each Pattern

Use NAT Gateway:

  • Zero operational overhead required
  • Budget allows managed services premium
  • Limited networking expertise
  • High throughput requirements (>5 Gbps)

Use EC2 NAT:

  • Cost optimization priority
  • Networking expertise available
  • Terraform/IaC already in use
  • Development/staging environments

Decision Matrix:

Factor          NAT Gateway   EC2 NAT (Per-AZ)   EC2 NAT (Single)
Cost (3 AZs)    $97/mo        $18/mo             $6/mo
Savings         Baseline      81%                94%
Recovery Time   <1 min        2-5 min            2-5 min
Overhead        Zero          Low                Low
Best For        Enterprise    Production         Dev/staging

Results and Lessons

Cost Savings:

  • Production (3 AZs): $945/year saved (81%)
  • Development (single NAT instance serving 3 AZs): $1,092/year saved (94%)
  • Both environments combined: $2,038/year saved

What Worked:

✅ Dual-purpose EC2 instances reduce total infrastructure count

✅ Terraform automation eliminates manual iptables configuration

✅ Auto Scaling Groups provide adequate high availability

✅ t4g.micro sufficient for typical workload egress needs

✅ Single NAT acceptable for non-production environments

✅ Fixed EC2 cost replaces NAT Gateway's per-GB data processing charges

Pitfalls to Avoid:

❌ Forgetting source_dest_check = false (routing won't work)

❌ Missing iptables MASQUERADE rule (no outbound connectivity)

❌ No monitoring/alerting on NAT instance health

❌ Undersized instance type for traffic volume

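❌ No Auto Scaling Group (manual recovery required)

❌ ASG replacement without repointing the private routes and Elastic IP to the new instance (egress blackholed)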

❌ Inconsistent security group rules between NAT instances

Conclusion

EC2-based NAT instances cut NAT costs by 81-94% compared to NAT Gateway (hourly charges, three-AZ VPC), while Auto Scaling Groups and Terraform automation restore a failed instance automatically within minutes.

Implementation Path:

  1. Assess current NAT Gateway costs and calculate savings
  2. Implement EC2 NAT in development environment first
  3. Configure Auto Scaling Groups with health checks
  4. Validate failover scenarios
  5. Roll out to production with per-AZ NAT instances
  6. Monitor throughput and adjust instance types if needed

When This Approach Makes Sense:

  • Cost-sensitive AWS deployments
  • Teams with networking expertise
  • Terraform-managed infrastructure
  • Development/staging environments (single NAT acceptable)

Organizations prioritizing cost efficiency achieve substantial savings in exchange for a few minutes of lost egress whenever a NAT instance is replaced. Terraform automation eliminates manual configuration complexity, making EC2 NAT practical for teams managing Infrastructure as Code deployments.
