scripts/bench-cold-start.py and
scripts/bench-http-outbound.py from our repo.
TL;DR — 30-iteration distribution
The 5-iteration version of this bench gave misleading results (notably, my early E2B numbers were thrown off by sample noise). All numbers below are from 30 sequential cold-start iterations per platform, run on the same machine in the same window.

| metric | Podflare (SDK ≥ 0.0.17) | E2B | Blaxel | Daytona |
|---|---|---|---|---|
| p50 (typical case) | 260 ms | 442 ms | 627 ms | 663 ms |
| p95 (1-in-20 worst) | 539 ms | 811 ms | 2,665 ms | 1,666 ms |
| p99 (1-in-100) | 814 ms | 4,135 ms | 3,719 ms | 7,770 ms |
| max (worst observed) | 862 ms | 5,460 ms | 4,096 ms | 10,063 ms |
| spread (p95 − p50) | 279 ms | 369 ms | 2,038 ms | 1,003 ms |
| errors (in 30 iter) | 0 | 0 | 0 | 0 |
| First exec inside sandbox (vsock vs in-VM HTTP) | ~46 ms | ~200 ms | ~225 ms | ~111 ms |
| HTTP outbound to GitHub /zen (median curl) | 89 ms | 25 ms | 85 ms | 93 ms |
| HTTP outbound to Cloudflare trace | 25 ms | 38 ms | 82 ms | 21 ms |
| Sandbox isolation | Firecracker | Firecracker | Firecracker | Docker + Sysbox |
| Open-source license | proprietary; SDK MIT | Apache-2.0 | proprietary | AGPL-3.0 |
Reproduce with scripts/bench-reliability.py. Each run takes 20–30 s
wall-clock per platform, runs 30 sequential
Sandbox.create() → exec("echo ready") → close() cycles, and reports
the full distribution.
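The core of that loop can be sketched as follows. The create/execute/close callables here are hypothetical stand-ins for each vendor's SDK calls (the real script wires in the actual SDKs); the percentile math mirrors the columns reported below.

```python
import statistics
import time

def bench_cold_start(create, execute, close, iterations=30):
    """Run sequential create -> exec -> close cycles and return the
    latency distribution in milliseconds. create/execute/close are
    stand-ins for the platform SDK calls."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        sandbox = create()
        execute(sandbox, "echo ready")
        t1 = time.perf_counter()       # stop the clock before teardown
        close(sandbox)                 # teardown is not timed
        samples.append((t1 - t0) * 1000.0)
    samples.sort()
    # quantiles(n=100) yields 99 cut points: q[89] = p90, q[94] = p95, q[98] = p99
    q = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "min": samples[0],
        "p50": statistics.median(samples),
        "p90": q[89],
        "p95": q[94],
        "p99": q[98],
        "max": samples[-1],
        "mean": statistics.fmean(samples),
    }
```

The spread column in the tables below is simply `p95 - p50` of this dict.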
Cold-start distribution — head-to-head
The metric customers actually feel: time from Sandbox.create()
until the first echo ready returns. Each row below is 30
sequential iterations per platform — enough samples that the tail
percentiles mean something.
| platform | min | p50 | p90 | p95 | p99 | max | mean |
|---|---|---|---|---|---|---|---|
| 🥇 Podflare us-west (SDK ≥ 0.0.17) | 235 ms | 260 ms | 343 ms | 539 ms | 814 ms | 862 ms | 298 ms |
| 🥈 E2B us-east | 380 ms | 442 ms | 682 ms | 811 ms | 4,135 ms | 5,460 ms | 652 ms |
| 🥉 Blaxel us-pdx-1 | 553 ms | 627 ms | 1,741 ms | 2,665 ms | 3,719 ms | 4,096 ms | 924 ms |
| Daytona | 440 ms | 663 ms | 1,023 ms | 1,666 ms | 7,770 ms | 10,063 ms | 1,051 ms |
- Podflare wins every percentile. p50 is 1.7× faster than E2B (the next-best), and the tail is bounded — p99 is 814 ms versus E2B’s 4,135 ms (5×) and Daytona’s 7,770 ms (10×). The previous version of this bench, with SDK 0.0.16, showed a 3,050 ms p95 for Podflare driven by occasional dropped TCP SYNs on the public internet. SDK 0.0.17 caps the connect timeout at 800 ms and retries on a fresh socket, which bounds that pathological tail.
- All four platforms had 0 errors across 30 iterations. The differentiator is latency distribution, not reliability in the traditional uptime sense.
Pick the metric that matters to your workload
Different production agents care about different parts of the distribution. The “best” platform depends on which tail kills you faster.

| your priority | winner | why |
|---|---|---|
| Fastest typical request (p50 — agent loop responsiveness) | 🏆 Podflare 260 ms | 1.7× faster than E2B, 2.5× faster than Daytona |
| Predictable latency (p95−p50 spread, “no surprises”) | 🏆 Podflare 279 ms | SDK 0.0.17’s SYN-drop retry caps the tail |
| Bounded worst case (max — circuit-breaker thresholds) | 🏆 Podflare 862 ms | 6× tighter than E2B, 12× tighter than Daytona |
| Lowest mean (cost-driven, billing-by-time workloads) | 🏆 Podflare 298 ms | Distribution shape: tight head, tight tail |
Why Podflare’s first_exec is 2–5× faster
The first exec() after create() looks identical from the SDK side,
but the underlying paths are different:
| platform | first_exec stack |
|---|---|
| Podflare | hostd ↔ Firecracker vsock UDS ↔ in-VM agent (binary protocol over UNIX socket) |
| Daytona | proxy → runner → daemon HTTP gin server (TCP + TLS over container bridge) |
| Blaxel | SDK → orchestrator → in-VM agent (HTTP + TLS) |
| E2B | client-proxy → orchestrator → ConnectRPC HTTP/1.1+H2 to envd in-VM (TCP + TLS) |
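The hop structure matters because every extra TCP+TLS leg adds handshake and serialization cost. As a local illustration only (a UNIX socketpair, not a reproduction of hostd’s vsock protocol, which requires a VM to measure), a kernel-local socket round trip costs on the order of microseconds, so the exec channels above are dominated by their network legs:

```python
import socket
import statistics
import time

def uds_round_trip_us(n=1000):
    """Median request/response round trip over a UNIX socketpair, in
    microseconds. A stand-in for the vsock/UDS hop: the messages mimic
    a tiny exec request and its reply."""
    a, b = socket.socketpair()
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        a.sendall(b"exec echo ready\n")   # "client" side sends the request
        b.recv(64)                        # "agent" side reads it...
        b.sendall(b"ready\n")             # ...and replies
        a.recv(64)
        samples.append((time.perf_counter() - t0) * 1e6)
    a.close()
    b.close()
    return statistics.median(samples)

print(f"UDS round trip: ~{uds_round_trip_us():.0f} us median")
```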
HTTP outbound — what your agent actually feels
Inside-the-sandbox curl against two reliable targets, 5 runs each.
(Earlier benches included httpbin.org but kept getting 10-second
outliers from all four platforms — that’s httpbin’s per-source-IP
rate limit, not platform speed.)
| target | Podflare us-west | Daytona | Blaxel us-pdx-1 | E2B us-east |
|---|---|---|---|---|
| Cloudflare trace | 25 ms | 21 ms | 82 ms | 38 ms |
| GitHub API /zen | 89 ms | 93 ms | 85 ms | 25 ms |
- GitHub-flavored workloads (the long tail of agent traffic — pip install, npm install, GitHub API, Hugging Face, etc.) all land within ~70 ms of each other. Geography is the whole story; pick a region close to GitHub’s Azure us-east peering.
- Cloudflare trace shows raw network speed. Daytona’s 21 ms is fastest because their colo happens to be one hop from Cloudflare’s Ashburn PoP. Differences here are small absolute numbers.
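The per-target measurement is simple: one curl per run reporting `%{time_total}`, then the median over the runs. A sketch of that approach (scripts/bench-http-outbound.py is the authoritative version; the function names here are illustrative):

```python
import statistics
import subprocess

def curl_total_ms(url):
    """One curl, returning total time in ms.
    curl's %{time_total} write-out variable is in seconds."""
    out = subprocess.run(
        ["curl", "-s", "-o", "/dev/null", "-w", "%{time_total}", url],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out) * 1000.0

def median_latency_ms(samples_ms):
    """Median of the collected runs; the table reports this per target."""
    return statistics.median(samples_ms)
```

Using the median rather than the mean is deliberate: one rate-limited or retried request would otherwise dominate a 5-run average.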
Out-of-the-box experience
| platform | base image | curl pre-installed | apk add curl time |
|---|---|---|---|
| Podflare | Ubuntu 24.04 (full) | ✓ | n/a |
| Daytona | Ubuntu (full) | ✓ | n/a |
| E2B | Ubuntu/Debian (full) | ✓ | n/a |
| Blaxel | Alpine 3.23 (157 binaries total) | ✗ | 5 seconds |
Blaxel’s minimal Alpine image means you apk add basics like curl,
python, and git before doing real work — every fresh sandbox pays
for that bootstrap.
Forking and persistent state
Cold-start isn’t the whole story. AI-agent workloads also fork (try N branches in parallel) and persist (resume a working session later). This is where the gap widens.

| capability | Podflare | Daytona | Blaxel | E2B |
|---|---|---|---|---|
| fork(n=5) from a running sandbox | ~80 ms p50 | not exposed | not exposed | not exposed |
| Persistent state across destroy | Spaces — full VM memory survives | container archive (storage only) | pause / resume | snapshot via docker commit |
| Diff snapshots (only dirty pages) | yes, ~24 ms | no | no | no |
| Multi-region edge router | 5 regions, haversine + failover | single region per cluster | 3 regions, manual pin | single region per cluster |
fork() is the genuinely differentiated primitive. Most LLM-agent
patterns (tree-of-thought, multi-attempt code synthesis) want N
children that all start from the parent’s exact mid-flight state. On
container platforms you’d docker commit (~seconds), then docker run it N times (~seconds × N). On Podflare that’s parent.fork(n=5) — a
copy-on-write diff snapshot + N parallel microVM spawns in 80 ms
p50, total.
See Performance for the breakdown of
what fork() does in those 80 ms.
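As an analogy only — this is os.fork(), not the Podflare SDK — the same copy-on-write property is what makes forking from mid-flight state cheap: children share the parent’s pages until they write to them.

```python
import os

def fork_children(n=5):
    """Analogy via os.fork(), NOT the Podflare SDK: each child gets a
    copy-on-write view of the parent's memory, so N children inheriting
    the parent's exact mid-flight state only pay for pages they dirty."""
    state = {"attempt": 0, "scratch": list(range(100_000))}  # mid-flight state
    pids = []
    for i in range(n):
        pid = os.fork()
        if pid == 0:                 # child: sees the parent's state instantly
            state["attempt"] = i     # dirties one page; the rest stays shared
            os._exit(0)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    return len(pids)

print(fork_children(), "children forked from shared state")
```

A microVM fork additionally has to duplicate device and vCPU state, which is where the diff-snapshot machinery comes in.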
Architecture comparison
| | Podflare | Daytona | Blaxel | E2B |
|---|---|---|---|---|
| Isolation | Firecracker microVM (KVM hypervisor) | Docker + Sysbox (kernel-shared) | Firecracker microVM | Firecracker microVM (KVM) |
| Bare-metal hosting | Hetzner + Latitude (5 regions) | unspecified cloud / k8s | unspecified | GCP + AWS bare-metal (single region) |
| Cold-start magic | Snapshot restore + warm pool + xfs reflink CoW | Pre-booted VMs in DB, atomic orgId-flip handoff | Minimal Alpine + ? | UFFD lazy mem + memory-prefetch + NBD rootfs |
| Warm pool primitive | pop_front() from a VecDeque of running VMs | DB UPDATE — flip orgId on a sentinel-org pre-booted sandbox | not documented | Snapshot resume per request |
| In-sandbox exec channel | vsock binary protocol | gin HTTP over TCP+TLS (port 2280) | HTTP over TCP+TLS | ConnectRPC over TCP+TLS (port 49983) |
| Edge router | Cloudflare Worker, haversine routing, 5 regions | Single regional proxy | Manual region pin | API gateway (single region) |
| Failover on origin 5xx | yes (next-nearest region) | not documented | not documented | not documented |
| fork() primitive | yes, ~80 ms | no | no | no |
| Persistent state across destroy | yes (Spaces, full memory) | container archive | pause/resume | container commit |
License
| platform | license | implication |
|---|---|---|
| Podflare | proprietary platform; SDKs MIT | use commercially without restrictions |
| Blaxel | proprietary | not self-hostable |
| E2B | Apache-2.0 (entire stack) | self-hostable; no viral terms |
| Daytona | AGPL-3.0 (entire stack) | self-hostable, but modifications must be open-sourced if you run as a commercial service |
Free-tier limits
| | Podflare free | E2B Hobby | Daytona Tier 1 | Blaxel free |
|---|---|---|---|---|
| RAM per sandbox | 1 GB | 8 GB | 8 GB (4 vCPU) | varies |
| Max concurrent | 10 | 20 | dynamic (pool-shared 20 GB) | per workspace |
| Max session lifetime | 30 min | 1 hour | not stated | not stated |
| Idle timeout | 5 min | not stated | not stated | not stated |
| Starter credit | none | $100 | $200 | not stated |
Production-choice ranking — by axis (30-iter)
| axis | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| p50 (typical request) | Podflare 260 ms | E2B 442 ms | Blaxel 627 ms | Daytona 663 ms |
| p95 (1-in-20) | Podflare 539 ms | E2B 811 ms | Daytona 1,666 ms | Blaxel 2,665 ms |
| p99 (1-in-100) | Podflare 814 ms | Blaxel 3,719 ms | E2B 4,135 ms | Daytona 7,770 ms |
| max (worst observed) | Podflare 862 ms | Blaxel 4,096 ms | E2B 5,460 ms | Daytona 10,063 ms |
| spread (p95 − p50) | Podflare 279 ms | E2B 369 ms | Daytona 1,003 ms | Blaxel 2,038 ms |
| HTTP outbound to GitHub | E2B 25 ms | Blaxel 85 ms | Podflare 89 ms | Daytona 93 ms |
| Persistence primitives | Podflare (full VM memory survives) | E2B / Blaxel (filesystem) | Daytona (archive) | — |
| Unique features | Podflare (fork(), multi-region, Spaces) | E2B (Apache 2.0) | Daytona (AGPL self-host) | Blaxel (minimal Alpine) |
When each one wins
- Latency-sensitive interactive agents (default case) → Podflare. Wins p50 (260 ms), p95 (539 ms), p99 (814 ms), and max (862 ms) — the only platform under 1 s at every percentile. Native fork() for tree-of-thought patterns. Persistent Spaces preserve full VM memory across restarts. Requires SDK ≥ 0.0.17 — older versions have a 3 s p95 tail caused by TCP SYN-retry on the public internet.
- GitHub-heavy workloads where outbound to Azure us-east matters → E2B. Their us-east colo wins HTTP outbound to GitHub at 25 ms (vs ours/others at 85–93 ms). Apache-2.0 lets you self-host for compliance.
- Self-hosted on existing Docker/k8s infrastructure → Daytona. Pay the AGPL toll only if you’ll never fork the runtime; the Docker/Sysbox isolation is meaningfully weaker than Firecracker if your threat model includes adversarial guest code.
- Minimum-image, minimum-RAM workloads with your own bootstrap → Blaxel. Alpine base + 627 ms p50 is fine if your workload pre-warms with its own deps. Smallest attack surface.
Reproduce these numbers
All bench scripts are in our repo. They take an SDK API key for each
platform and run identical workloads. A bench-reliability.py run does
30 sequential
Sandbox.create() → exec("echo ready") → close() cycles per platform
and prints the full distribution (min / p50 / p90 / p95 / p99 / max /
mean / spread). No special flags, no warmup-and-discard tricks.
If your numbers differ meaningfully from ours, send us the bench
output and the SDK version you ran — we treat regression reports as
P0. Our job is for these numbers to stay honest, not for our
marketing to claim things the bench can’t reproduce.
Methodology
- Date: April 2026
- Client: macOS laptop on residential wifi, west-coast US
- Podflare endpoint: api.podflare.ai with SDK 0.0.17 (client-side haversine — auto-routes to us-west from this machine)
- E2B endpoint: e2b_code_interpreter SDK default (single GCP/AWS region, likely us-east4)
- Daytona endpoint: daytona SDK default (single region per account; ours landed near IAD)
- Blaxel endpoint: blaxel SDK with BL_REGION=us-pdx-1 and image=blaxel/base-image:latest
- Sandbox spec: each platform’s default — 1 GB / 1 vCPU on all four
- Bench iterations: 30 cold starts per platform, sequential, no
parallelism. Each iteration is a complete
Sandbox.create() → exec("echo ready") → close()/kill()/delete() cycle. We report the full distribution because the 5-iteration version of this bench gave misleadingly noisy results — particularly for E2B, whose median moved from 2,504 ms (5 samples) to 442 ms (30 samples). Sample size matters.
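That last point is easy to demonstrate in isolation. The sketch below simulates a heavy-tailed cold-start distribution (the 450 ms mode and 10% outlier rate are invented for illustration, not fitted to any platform) and measures how much more a 5-sample median wanders than a 30-sample one:

```python
import random
import statistics

random.seed(7)  # deterministic for reproducibility

def cold_start_ms():
    """Simulated cold start: usually ~450 ms, with a 1-in-10
    multi-second outlier. Parameters are invented for illustration."""
    if random.random() < 0.1:
        return random.uniform(2000, 6000)
    return random.gauss(450, 60)

# For each sample size, repeat the whole bench 1000 times and see how
# much the reported median varies from run to run.
spread = {}
for n in (5, 30):
    medians = [statistics.median(cold_start_ms() for _ in range(n))
               for _ in range(1000)]
    spread[n] = statistics.pstdev(medians)

print(f"stdev of the run median: n=5 -> {spread[5]:.0f} ms, "
      f"n=30 -> {spread[30]:.0f} ms")
```

With 5 samples, a single bad draw can drag the median into the seconds; with 30, the median barely moves.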

