scripts/bench-cold-start.py and
scripts/bench-http-outbound.py from our repo.
TL;DR — 30-iteration distribution
The 5-iteration version of this bench gave misleading results (notably, my early E2B numbers were thrown off by sample noise). All numbers below are from 30 sequential cold-start iterations per platform, run on the same machine in the same window.

| metric | Podflare (SDK ≥ 0.0.20) | E2B | Blaxel | Daytona |
|---|---|---|---|---|
| p50 (typical case) | 153 ms | 442 ms | 627 ms | 663 ms |
| p95 (1-in-20 worst) | 170 ms | 811 ms | 2,665 ms | 1,666 ms |
| p99 (1-in-100) | 236 ms | 4,135 ms | 3,719 ms | 7,770 ms |
| max (worst observed) | 263 ms | 5,460 ms | 4,096 ms | 10,063 ms |
| spread (p95 − p50) | 17 ms | 369 ms | 2,038 ms | 1,003 ms |
| errors (in 30 iter) | 0 | 0 | 0 | 0 |
| First exec inside sandbox (vsock vs in-VM HTTP) | ~46 ms | ~200 ms | ~225 ms | ~111 ms |
| HTTP outbound to GitHub /zen (median curl) | 89 ms | 25 ms | 85 ms | 93 ms |
| HTTP outbound to Cloudflare trace | 25 ms | 38 ms | 82 ms | 21 ms |
| Sandbox isolation | Podflare Pod microVM | Firecracker microVM | Firecracker microVM | Docker + Sysbox |
| Open-source license | proprietary; SDK MIT | Apache-2.0 | proprietary | AGPL-3.0 |
scripts/bench-reliability.py.
Each run takes 20–30 s wall-clock per platform, runs 30 sequential
Sandbox.create() → exec("echo ready") → close() cycles, and reports
the full distribution.
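
For readers who want to check the distribution columns against their own raw timings, here is a minimal sketch of how they could be computed. It assumes linear interpolation between ranks (the same default NumPy uses); the bench script's actual percentile method isn't stated here, and `percentile` and `summarize` are illustrative names, not functions from the repo.

```python
import math
from statistics import mean


def percentile(sorted_ms, q):
    """Linear-interpolation percentile of an ascending list, q in [0, 100]."""
    if not sorted_ms:
        raise ValueError("need at least one sample")
    rank = (len(sorted_ms) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    frac = rank - lo
    return sorted_ms[lo] + (sorted_ms[hi] - sorted_ms[lo]) * frac


def summarize(samples_ms):
    """Return min/p50/p90/p95/p99/max/mean plus the p95 - p50 spread."""
    s = sorted(samples_ms)
    stats = {
        "min": s[0],
        "p50": percentile(s, 50),
        "p90": percentile(s, 90),
        "p95": percentile(s, 95),
        "p99": percentile(s, 99),
        "max": s[-1],
        "mean": mean(s),
    }
    # Spread is the "no surprises" number: how far the tail sits from typical.
    stats["spread"] = stats["p95"] - stats["p50"]
    return stats
```

With only 30 samples, p99 falls between the two slowest iterations, so a single outlier moves it a lot; that is worth keeping in mind when comparing tails.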
Cold-start distribution — head-to-head
The metric customers actually feel: time from Sandbox.create()
until the first echo ready returns. Each row below is 30
sequential iterations per platform — enough samples that the tail
percentiles mean something.
| platform | min | p50 | p90 | p95 | p99 | max | mean |
|---|---|---|---|---|---|---|---|
| 🥇 Podflare api.podflare.ai (SDK ≥ 0.0.20) | 143 ms | 153 ms | 165 ms | 170 ms | 236 ms | 263 ms | 173 ms |
| 🥈 E2B us-east | 418 ms | 467 ms | 720 ms | 750 ms | 852 ms | 888 ms | 509 ms |
| 🥉 Blaxel us-pdx-1 | 553 ms | 627 ms | 1,741 ms | 2,665 ms | 3,719 ms | 4,096 ms | 924 ms |
| Daytona | 439 ms | 713 ms | 1,120 ms | 1,130 ms | 1,136 ms | 1,137 ms | 722 ms |
- Podflare wins every percentile. p50 is 3.0× faster than E2B (the next-best), p95 is 4.4×, p99 is 3.6×, and max is bounded under 270 ms while E2B’s max is 888 ms and Blaxel’s is 4 s. The previous version of this bench, with SDK 0.0.17, showed a 1,741 ms p99 for Podflare, caused by the SDK’s own connect=0.8s + retries=2 compounding on slow TLS handshakes (three chained ConnectTimeouts add up to ~2.9 s). SDK 0.0.19 widened connect to 2.5 s and dropped retries to 1, killing that self-inflicted tail. SDK 0.0.20 then defaulted to api.podflare.ai (Cloudflare-edge-routed), which is faster than direct-to-origin from most residential callers because the CF edge PoP is closer than any single region.
- All four platforms had 0 errors across 30 iterations. The differentiator is latency distribution, not reliability in the traditional uptime sense.
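
To see why a tight connect timeout plus retries produces a long tail, here is a toy model of the arithmetic. It is a sketch, not the SDK's code: `connect_latency` is a hypothetical helper, and the 0.25 s backoff is an assumption picked so that three timeouts land on the ~2.9 s quoted above (the SDK's real backoff isn't stated).

```python
def connect_latency(handshake_s, connect_timeout_s, retries, backoff_s=0.0):
    """Model total time spent connecting.

    handshake_s: how long the handshake would take on each successive attempt
                 (the last value is reused if there are more attempts).
    Returns (total_seconds, succeeded).
    """
    total = 0.0
    attempts = retries + 1
    for i in range(attempts):
        needed = handshake_s[i] if i < len(handshake_s) else handshake_s[-1]
        if needed <= connect_timeout_s:
            return total + needed, True
        total += connect_timeout_s      # this attempt burned the full timeout
        if i < attempts - 1:
            total += backoff_s          # assumed wait before the next attempt
    return total, False


# A link whose TLS handshake consistently takes 1.0 s:
old = connect_latency([1.0], connect_timeout_s=0.8, retries=2, backoff_s=0.25)
new = connect_latency([1.0], connect_timeout_s=2.5, retries=1, backoff_s=0.25)
```

In this model the old settings spend about 2.9 s and still fail, while the wider timeout simply waits out the 1.0 s handshake and succeeds on the first attempt.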
Pick the metric that matters to your workload
Different production agents care about different parts of the distribution. The “best” platform depends on which tail kills you faster.

| your priority | winner | why |
|---|---|---|
| Fastest typical request (p50 — agent loop responsiveness) | 🏆 Podflare 153 ms | 3.0× faster than E2B, 4.7× faster than Daytona |
| Predictable latency (p95−p50 spread, “no surprises”) | 🏆 Podflare 17 ms | SDK 0.0.19’s tuned connect timeout caps the tail |
| Bounded worst case (max — circuit-breaker thresholds) | 🏆 Podflare 263 ms | 3.4× tighter than E2B, 4.3× tighter than Daytona |
| Lowest mean (cost-driven, billing-by-time workloads) | 🏆 Podflare 173 ms | Distribution shape: tight head, tight tail |
Why Podflare’s first_exec is 2–5× faster
The first exec() after create() looks identical from the SDK side,
but the underlying paths are different:
| platform | first_exec stack |
|---|---|
| Podflare | hostd ↔ Pod vsock UDS ↔ in-VM agent (binary protocol over UNIX socket) |
| Daytona | proxy → runner → daemon HTTP gin server (TCP + TLS over container bridge) |
| Blaxel | SDK → orchestrator → in-VM agent (HTTP + TLS) |
| E2B | client-proxy → orchestrator → ConnectRPC HTTP/1.1+H2 to envd in-VM (TCP + TLS) |
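
To make the "binary protocol over UNIX socket" idea concrete, here is a minimal sketch of length-prefixed framing over a local socket pair. It illustrates the general technique only: the 4-byte header, the `ok:` reply, and every function name are assumptions for the example, not Podflare's wire format, and `socket.socketpair()` stands in for the vsock UDS.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length (assumed format)


def send_frame(sock, payload: bytes) -> None:
    """Write one frame: length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_exact(sock, n: int) -> bytes:
    """Read exactly n bytes, looping because recv() may return short reads."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-frame")
        buf.extend(chunk)
    return bytes(buf)


def recv_frame(sock) -> bytes:
    """Read one frame: header first, then that many payload bytes."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


def roundtrip(command: bytes) -> bytes:
    """Send a command host -> agent and return the agent's framed reply."""
    host, agent = socket.socketpair()  # connected local pair, no TCP or TLS
    try:
        send_frame(host, command)
        request = recv_frame(agent)
        send_frame(agent, b"ok:" + request)
        return recv_frame(host)
    finally:
        host.close()
        agent.close()
```

Compared with the HTTP stacks in the table, a path like this skips TCP connection setup, TLS, and header parsing, which is the kind of saving that shows up in a first-exec number.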
HTTP outbound — what your agent actually feels
Inside-the-sandbox curl against two reliable targets, 5 runs each.
(Earlier benches included httpbin.org but kept getting 10-second
outliers from all four platforms — that’s httpbin’s per-source-IP
rate limit, not platform speed.)
| target | Podflare us-west | Daytona | Blaxel us-pdx-1 | E2B us-east |
|---|---|---|---|---|
| Cloudflare trace | 25 ms | 21 ms | 82 ms | 38 ms |
| GitHub API /zen | 89 ms | 93 ms | 85 ms | 25 ms |
- GitHub-flavored workloads (the long tail of agent traffic: pip install, npm install, GitHub API, Hugging Face, etc.) all land within ~70 ms of each other. Geography is the whole story; pick a region close to GitHub’s Azure us-east peering.
- Cloudflare trace shows raw network speed. Daytona’s 21 ms is fastest because their colo happens to be one hop from Cloudflare’s Ashburn PoP. Differences here are small absolute numbers.
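
A rough sketch of how a "median curl" figure can be taken from inside a sandbox. It relies on curl's real `-w '%{time_total}'` output; the function names, the 15 s cap, and the injectable `measure` parameter are assumptions for illustration, not the repo's bench-http-outbound.py.

```python
import statistics
import subprocess


def curl_total_ms(url: str, timeout_s: int = 15) -> float:
    """Run one curl request and return curl's own time_total in milliseconds."""
    out = subprocess.run(
        ["curl", "-s", "-o", "/dev/null", "-w", "%{time_total}",
         "--max-time", str(timeout_s), url],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out) * 1000.0


def median_ms(samples_ms):
    """Median is used so a single slow request does not dominate 5 runs."""
    return statistics.median(samples_ms)


def bench(url: str, runs: int = 5, measure=curl_total_ms) -> float:
    """Measure `url` sequentially `runs` times and return the median in ms."""
    return median_ms([measure(url) for _ in range(runs)])
```

Calling `bench("https://api.github.com/zen")` would be one way to probe the GitHub target (that exact URL is our guess at what "/zen" refers to). The median matters here: one 10-second outlier in five runs, like the httpbin ones mentioned above, leaves the median untouched but would wreck a mean.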
Out-of-the-box experience
| platform | base image | curl pre-installed | apk add curl time |
|---|---|---|---|
| Podflare | Ubuntu 24.04 (full) | ✓ | n/a |
| Daytona | Ubuntu (full) | ✓ | n/a |
| E2B | Ubuntu/Debian (full) | ✓ | n/a |
| Blaxel | Alpine 3.23 (157 binaries total) | ✗ | 5 seconds |
Blaxel’s minimal Alpine image means running apk add for
basics like curl, python, git before doing real work; every
fresh sandbox pays for that bootstrap.
Forking and persistent state
Cold-start isn’t the whole story. AI-agent workloads also fork (try N branches in parallel) and persist (resume a working session later). This is where the gap widens.

| capability | Podflare | Daytona | Blaxel | E2B |
|---|---|---|---|---|
| fork(n=5) from a running sandbox | ~80 ms p50 | not exposed | not exposed | not exposed |
| Persistent state across destroy | Spaces — full VM memory survives | container archive (storage only) | pause / resume | snapshot via docker commit |
| Diff snapshots (only dirty pages) | yes, ~24 ms | no | no | no |
| Multi-region edge router | 5 regions, haversine + failover | single region per cluster | 3 regions, manual pin | single region per cluster |
fork() is the genuinely differentiated primitive. Most LLM-agent
patterns (tree-of-thought, multi-attempt code synthesis) want N
children that all start from the parent’s exact mid-flight state. On
container platforms you’d docker commit (~seconds) and docker run N (~seconds × N). On Podflare that’s parent.fork(n=5) — a
copy-on-write diff snapshot + N parallel microVM spawns in 80 ms
p50, total.
See Performance for the breakdown of
what fork() does in those 80 ms.
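
The fork-then-pick-the-best pattern can be sketched without any platform at all. The example below uses `FakeSandbox`, an in-memory stand-in that is explicitly not the Podflare SDK: its state is a plain dict and its `fork` is a deep copy. Only the `fork(n=...)` call shape comes from the text above; `explore` and `score` are hypothetical names.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


class FakeSandbox:
    """In-memory stand-in, NOT the Podflare SDK: state is a plain dict."""

    def __init__(self, state=None):
        self.state = dict(state or {})

    def fork(self, n):
        # Each child starts from an independent copy of the parent's state.
        return [FakeSandbox(copy.deepcopy(self.state)) for _ in range(n)]


def explore(parent, candidates, score):
    """Fork one child per candidate, apply it, and return the best child."""
    children = parent.fork(n=len(candidates))

    def attempt(pair):
        child, candidate = pair
        child.state["attempt"] = candidate   # mutate the child only
        return score(child.state), child

    # Children are independent, so the attempts can run in parallel.
    with ThreadPoolExecutor(max_workers=len(children)) as pool:
        results = list(pool.map(attempt, zip(children, candidates)))
    return max(results, key=lambda r: r[0])[1]
```

The property that matters for tree-of-thought style agents is visible even in the toy: every child sees the parent's mid-flight state, and nothing a child does leaks back into the parent or its siblings.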
Architecture comparison
| | Podflare | Daytona | Blaxel | E2B |
|---|---|---|---|---|
| Isolation | Podflare Pod microVM (KVM hypervisor) | Docker + Sysbox (kernel-shared) | Firecracker microVM | Firecracker microVM (KVM) |
| Bare-metal hosting | Hetzner + Latitude (5 regions) | unspecified cloud / k8s | unspecified | GCP + AWS bare-metal (single region) |
| Cold-start magic | Snapshot restore + warm pool + xfs reflink CoW | Pre-booted VMs in DB, atomic orgId-flip handoff | Minimal Alpine + ? | UFFD lazy mem + memory-prefetch + NBD rootfs |
| Warm pool primitive | pop_front() from a VecDeque of running VMs | DB UPDATE — flip orgId on a sentinel-org pre-booted sandbox | not documented | Snapshot resume per request |
| In-sandbox exec channel | vsock binary protocol | gin HTTP over TCP+TLS (port 2280) | HTTP over TCP+TLS | ConnectRPC over TCP+TLS (port 49983) |
| Edge router | Cloudflare Worker, haversine routing, 5 regions | Single regional proxy | Manual region pin | API gateway (single region) |
| Failover on origin 5xx | yes (next-nearest region) | no documented | no documented | no documented |
| fork() primitive | yes, ~80 ms | no | no | no |
| Persistent state across destroy | yes (Spaces, full memory) | container archive | pause/resume | container commit |
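
The "haversine routing with failover" rows describe a simple idea: rank origins by great-circle distance to the caller and fall through to the next-nearest when one answers with a 5xx. A minimal sketch follows; the region names and coordinates are made up for illustration, and `route` and its `send` callback are hypothetical, not the edge router's code.

```python
import math

# Hypothetical region names and (lat, lon) coordinates, for illustration only.
REGIONS = {
    "us-west": (45.6, -121.2),
    "us-east": (39.0, -77.5),
    "eu-central": (50.1, 8.7),
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def ranked_regions(client):
    """Region names ordered nearest-first for a client at (lat, lon)."""
    return sorted(REGIONS, key=lambda name: haversine_km(client, REGIONS[name]))


def route(client, send):
    """Try regions nearest-first; move on when send() returns a 5xx status."""
    status = None
    for region in ranked_regions(client):
        status = send(region)
        if status < 500:
            return region, status
    return None, status  # every region failed
```

Distance is only a proxy for latency, which is why the failover step matters: the nearest origin is the default choice, not a guarantee.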
License
| | license | implication |
|---|---|---|
| Podflare | proprietary platform; SDKs MIT | use commercially without restrictions |
| Blaxel | proprietary | not self-hostable |
| E2B | Apache-2.0 (entire stack) | self-hostable; no viral terms |
| Daytona | AGPL-3.0 (entire stack) | self-hostable, but modifications must be open-sourced if you run as a commercial service |
Free-tier limits
| | Podflare free | E2B Hobby | Daytona Tier 1 | Blaxel free |
|---|---|---|---|---|
| RAM per sandbox | 1 GB | 8 GB | 8 GB (4 vCPU) | varies |
| Max concurrent | 10 | 20 | dynamic (pool-shared 20 GB) | per workspace |
| Max session lifetime | 30 min | 1 hour | not stated | not stated |
| Idle timeout | 5 min | not stated | not stated | not stated |
| Starter credit | none | $100 | $200 | not stated |
Production-choice ranking — by axis (30-iter)
| axis | 1st | 2nd | 3rd | 4th |
|---|---|---|---|---|
| p50 (typical request) | Podflare 153 ms | E2B 467 ms | Blaxel 627 ms | Daytona 713 ms |
| p95 (1-in-20) | Podflare 170 ms | E2B 750 ms | Daytona 1,130 ms | Blaxel 2,665 ms |
| p99 (1-in-100) | Podflare 236 ms | E2B 852 ms | Daytona 1,136 ms | Blaxel 3,719 ms |
| max (worst observed) | Podflare 263 ms | E2B 888 ms | Daytona 1,137 ms | Blaxel 4,096 ms |
| spread (p95 − p50) | Podflare 17 ms | E2B 283 ms | Daytona 417 ms | Blaxel 2,038 ms |
| HTTP outbound to GitHub | E2B 25 ms | Blaxel 85 ms | Podflare 89 ms | Daytona 93 ms |
| Persistency primitives | Podflare (full VM memory survives) | E2B / Blaxel (filesystem) | Daytona (archive) | — |
| Unique features | Podflare (fork(), multi-region, Spaces) | E2B (Apache 2.0) | Daytona (AGPL self-host) | Blaxel (minimal Alpine) |
When each one wins
- Latency-sensitive interactive agents (default case) → Podflare. Wins p50 (153 ms), p95 (170 ms), p99 (236 ms), and max (263 ms); it is the only platform under 300 ms at every percentile. Native fork() for tree-of-thought patterns. Persistent Spaces preserve full VM memory across restarts. Requires SDK ≥ 0.0.20: 0.0.17 through 0.0.18 have a ~1.7 s p99 tail caused by the SDK’s own tight-connect + retries=2 compounding (fixed in 0.0.19); 0.0.20 then defaulted to api.podflare.ai for edge-routed latency.
- GitHub-heavy workloads where outbound to Azure us-east matters → E2B. Their us-east colo wins HTTP outbound to GitHub at 25 ms (vs ours/others at 85–93 ms). Apache-2.0 lets you self-host for compliance.
- Self-hosted on existing Docker/k8s infrastructure → Daytona. Pay the AGPL toll only if you’ll never fork the runtime; the Docker/Sysbox isolation is meaningfully weaker than a microVM if your threat model includes adversarial guest code.
- Minimum-image, minimum-RAM workloads with your own bootstrap → Blaxel. Alpine base + 627 ms p50 is fine if your workload pre-warms with its own deps. Smallest attack surface.
Reproduce these numbers
All bench scripts are in our repo. They take an SDK API key for each platform and run identical workloads.

Each bench-reliability.py run does 30 sequential
Sandbox.create() → exec("echo ready") → close() cycles per platform
and prints the full distribution (min / p50 / p90 / p95 / p99 / max /
mean / spread). No special flags, no warmup-and-discard tricks.
If your numbers differ meaningfully from ours, send us the bench
output and the SDK version you ran — we treat regression reports as
P0. Our job is for these numbers to stay honest, not for our
marketing to claim things the bench can’t reproduce.
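
If you want to sanity-check the shape of the loop before running the real script, a sketch along these lines is enough. It is not the repo's bench-reliability.py: `create_and_exec` and `close` are hypothetical callables you would wire to a platform SDK, and only the create-to-first-exec span is timed, matching the metric defined earlier.

```python
import time


def run_bench(create_and_exec, close, iterations=30):
    """Time sequential cold starts; return (latencies_ms, error_count).

    create_and_exec: callable that creates a sandbox, runs the first exec,
                     and returns a handle (hypothetical, supplied by caller).
    close:           callable that tears the handle down (not timed).
    """
    latencies_ms = []
    errors = 0
    for _ in range(iterations):
        start = time.perf_counter()
        try:
            handle = create_and_exec()
        except Exception:
            errors += 1          # failed iterations are counted, not timed
            continue
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
        close(handle)            # teardown happens outside the timed span
    return latencies_ms, errors
```

Every iteration is kept, including the first, in line with the no warmup-and-discard rule above.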
Methodology
- Date: April 2026
- Client: macOS laptop on residential wifi, west-coast US
- Podflare endpoint: api.podflare.ai with SDK 0.0.20 (Cloudflare-edge-routed; haversine-picks the nearest origin server-side per-request; from this machine that’s us-west)
- E2B endpoint: e2b_code_interpreter SDK default (single GCP/AWS region, likely us-east4)
- Daytona endpoint: daytona SDK default (single region per account; ours landed near IAD)
- Blaxel endpoint: blaxel SDK with BL_REGION=us-pdx-1 and image=blaxel/base-image:latest
- Sandbox spec: each platform’s default: 1 GB / 1 vCPU on all four
- Bench iterations: 30 cold starts per platform, sequential, no parallelism. Each iteration is a complete Sandbox.create() → exec("echo ready") → close()/kill()/delete() cycle. We report the full distribution because the 5-iteration version of this bench gave misleadingly noisy results, particularly for E2B, whose median moved from 2,504 ms (5 samples) to 442 ms (30 samples). Sample size matters.
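
The sample-size point can be illustrated with a small simulation. This is a sketch on synthetic data only: the log-normal shape and its parameters are assumptions standing in for a heavy-tailed latency distribution, not a fit to any platform's measurements, and `median_spread` is a hypothetical helper.

```python
import random
import statistics


def median_spread(n_samples, trials=2000, seed=0):
    """Std-dev of the sample median across repeated synthetic benches."""
    rng = random.Random(seed)
    medians = []
    for _ in range(trials):
        # Synthetic heavy-tailed latencies (log-normal), not measured data.
        run = [rng.lognormvariate(6.0, 0.8) for _ in range(n_samples)]
        medians.append(statistics.median(run))
    return statistics.stdev(medians)


spread_5 = median_spread(5)
spread_30 = median_spread(30)
```

Under these assumptions the median of a 5-sample run wanders noticeably more from run to run than the median of a 30-sample run, which is the effect that made the earlier 5-iteration numbers unreliable.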

