I help companies get AI agents to production — and keep them there.

I was the Founding AI Engineer at AutoGPT, where I built the product intelligence, evaluation, and agent orchestration systems for one of the most widely-used open-source AI agent platforms ever built. I've co-authored AI research with Yann LeCun at Meta and Adam Tauman Kalai at Microsoft — on the exact problems I now solve for clients: benchmarking AI agents and testing them safely in the real world.

Before agents, I built and led engineering teams across cybersecurity (BAE Applied Intelligence, OSCP certified), quantitative trading (Dataffirm — EQT-backed, team of 30, TB-scale ML infrastructure), search (DeepCrawl), and distributed systems (Minima Global — scaled to 600K users).

That's not a typical background. It means I've shipped production AI systems across security, finance, search, and infrastructure — and I understand the compliance, performance, and reliability constraints that kill most agent deployments.

A background unlike most

I've spent years at the center of the AI agent ecosystem. Not as a researcher publishing papers from a lab — as an engineer shipping product while also publishing the research.

At AutoGPT I owned the full AI product lifecycle: prompt design, model integration, agent orchestration, and the evaluation systems that told us whether any of it actually worked. I built the feedback loops, the regression testing, the experiment infrastructure. I watched thousands of real-world agent use cases succeed and fail, and I learned to tell the difference in the first five minutes.

Before that, I built a governed ML lifecycle for a quantitative trading firm — data ingestion, feature engineering, model training, evaluation, and promotion into live trading strategies. The kind of pipeline where “it mostly works” isn't good enough, because the models are making real bets with real money.

Sound familiar?

  • “We built a great demo but can't get it past security review”
  • “Our agent works in testing but hallucinates in production”
  • “We're spending $40K/month on API calls for an agent that could be smarter”
  • “We don't know how to evaluate whether our agent is actually working”
  • “We need this in production in 8 weeks and our team has never built agents before”

I don't do science projects. I ship production systems. I'll push you toward simple architectures, measurable outcomes, and the fastest path to something that works reliably with real users and real data.

What I actually do

Agent architecture and deployment

I design and build AI agent systems that connect to messy real-world environments — legacy APIs, internal tools, compliance workflows, enterprise systems nobody wants to touch. Proper resilience patterns, monitoring, and graceful failure handling. The boring, critical work that separates demos from production.
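To make “graceful failure handling” concrete: here's a minimal sketch of the kind of retry-with-fallback wrapper I mean around a flaky tool call. The names (`with_retries`, `fallback`) are illustrative, not a specific library.

```python
import random
import time

def with_retries(fn, *, attempts=3, base_delay=0.5, fallback=None):
    """Call fn(); on failure, retry with jittered exponential backoff,
    then degrade to a fallback instead of crashing the whole agent run."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                break
            # Jittered exponential backoff avoids hammering a failing API.
            time.sleep(base_delay * (2 ** attempt) * (0.5 + random.random()))
    if fallback is not None:
        return fallback()  # graceful degradation path
    raise RuntimeError("tool call failed after retries and no fallback given")
```

The interesting part isn't the backoff math; it's that every external call an agent makes gets an explicit answer to “what happens when this fails?”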

Evaluation and monitoring

This is my specialism. I built the evaluation systems at AutoGPT and co-authored two papers on agent benchmarking and safety testing. Most teams have no systematic way to know if their agent is working. I build evaluation frameworks that catch failures before your users do, and monitoring systems that detect silent behavioral drift before it costs you money.
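As a rough illustration of “systematic”: a regression gate can start as simple as the sketch below, which runs the agent over fixed cases and fails the build if the pass rate drops. `agent`, the cases, and the threshold are stand-ins, not any particular framework.

```python
def run_evals(agent, cases, threshold=0.9):
    """cases: list of (prompt, check) pairs where check(output) -> bool.
    Returns (pass_rate, gate_passed)."""
    results = []
    for prompt, check in cases:
        try:
            output = agent(prompt)
            results.append(bool(check(output)))
        except Exception:
            results.append(False)  # a crash counts as a failed case
    pass_rate = sum(results) / len(results)
    return pass_rate, pass_rate >= threshold
```

Real evaluation frameworks add LLM-as-judge scoring, trace capture, and per-case diffing, but the principle is the same: a fixed suite, a number, and a gate.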

Fine-tuning and agent training

I distill expensive frontier model calls into small, self-hosted specialist models that cut inference costs by 60–80%. For teams that want agents that improve over time, I build reinforcement learning pipelines with verifiable rewards — the same techniques behind DeepSeek-R1 and GPT-5, applied to your specific domain.
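Distillation starts with data capture. A hypothetical sketch: wrap your frontier-model calls so every prompt/completion pair lands in a JSONL file you can later fine-tune a small model on. `frontier_call` is a placeholder for whatever client you use, and the record shape is an assumption, not a specific vendor schema.

```python
import json

def capture_for_distillation(frontier_call, prompt, log_path):
    """Proxy a frontier-model call and log the pair as training data."""
    completion = frontier_call(prompt)
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
    return completion
```

Once you have a few thousand of these pairs from production traffic, training a small specialist on them is the cheap part.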

Security and compliance

OSCP certified, with years at BAE Applied Intelligence. I build AI systems that pass enterprise security review on the first try. Per-user access controls, audit trails, PII detection, and compliance-ready documentation — baked in from day one, not bolted on after. Particularly valuable in regulated industries, but useful anywhere a security team has veto power over AI deployments.
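“Baked in from day one” means things like this hypothetical audit-trail wrapper: every sensitive agent action records who triggered it and when, before it runs. The names are illustrative, and a production system would write to an append-only store, not an in-memory list.

```python
import time

def audited(action, sink):
    """Wrap an agent action so every invocation is logged with actor and time."""
    def wrapper(user, *args, **kwargs):
        sink.append({
            "ts": time.time(),
            "user": user,
            "action": action.__name__,
        })
        return action(*args, **kwargs)
    return wrapper
```

Retrofitting this onto an agent after security review asks for it is painful; wiring it in at the tool-call boundary on day one is nearly free.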

An investment in not wasting the next 6 months

I've watched companies burn quarters and budgets on AI agent projects that were doomed from the architecture decision in week one. I've watched teams hire five engineers when they needed one person who'd seen the problem before.

I've built engineering teams from scratch three times, including to 30 at Dataffirm and to 15 at Minima. I've migrated infrastructure, cut cloud costs by seven figures, and scaled systems to hundreds of thousands of users. I know what production pressure feels like, and I know the difference between architecture that compounds and architecture that collapses.

I can't promise you'll find product-market fit. I can promise I'll tell you which parts are genuinely hard and which are easier than you think. I'll keep your team focused on what matters instead of what's interesting. And I'll make sure you don't repeat the mistakes I've already watched a thousand teams make.

My job is to compress your learning curve from months to weeks.

How to work with me

Discovery Sprint

I start every relationship here. A 1–2 week paid engagement where I audit your current AI infrastructure, identify the real blockers, and deliver a concrete implementation roadmap. You walk away with a plan regardless of whether we continue.

£4,000–£8,000 / $6,000–$12,000

Then, three ways to continue:

1. Advisory

Weekly strategy call, architecture review on anything before you build it, and async access for quick technical questions. Best for teams that have engineering capacity but need expert direction.

£5,000/month / $8,000/month

2. Embedded Expert

I'm in your codebase 2–3 days a week. Architecture, code review, pair programming, security design. Your team gets better while we build. 3-month minimum.

£12,000–£18,000/month / $18,000–$25,000/month

3. Full Delivery

I own the AI agent workstream end-to-end. Architecture, build, testing, deployment, monitoring. I bring in specialists as needed. You get an outcome, not hours. 6-month minimum.

£25,000–£35,000/month / $35,000–$50,000/month

I don't bill hourly. I don't do free trials. The Discovery Sprint is how we figure out if there's a fit — it's designed to be valuable even if we never work together again.

If that's too much right now

I publish technical writeups on the problems I solve: agent evaluation, production monitoring, RAG architecture, fine-tuning economics. Follow me and read those first. When the timing is right, you'll know where to find me.

I work remotely from Spain and overlap well with UK, EU, and US East Coast hours. I've led distributed teams my entire career — including an engineering org of 30 and an open-source project with contributors across every timezone.