How we translate talk into a ledger
The conviction rubric, the deterministic engine, the timing model, and the audit hall — all of it published. Nobody asked. We ledger anyway.
PunditPoorly takes every public-market call from the podcasts we cover, scores each one against a rubric, sizes the resulting synthetic position by conviction × specificity, and runs all of it through a deterministic portfolio engine. Performance compounds (or doesn't) over time. Every record is auditable. Nothing is investment advice.
Two scores per call
Each prediction gets two published scores from 0 to 100, scored independently because they capture different things:
- Claim Strength — how forcefully the view was stated. Personal stake (“I bought”), commitment language (“screaming buy”), explicit horizons, and contrarian framing all push it up. Heavy hedging and disclaimer bombs push it down.
- Specificity — how testable the call is. Named instrument is worth most; explicit time horizon, named catalyst, price target, and position-size signals add on top.
Combining them into a single “conviction score” loses information. A loud-and-vague call (cs 85, sp 10) and a quiet-and-specific call (cs 55, sp 85) have different shapes. Position size multiplies them: position_size = (cs × sp / 10000) × unit_size. Vague-but-loud is naturally suppressed: the first call sizes at 8.5% of a unit, the second at about 47%.
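A minimal sketch of the sizing formula above, using the two example calls from the text. The function name and the $10,000 unit size are illustrative assumptions, not values from the engine.

```python
def position_size(cs: float, sp: float, unit_size: float) -> float:
    """Size a synthetic position from the two 0-100 rubric scores.

    cs = Claim Strength, sp = Specificity. The product is normalised by
    100 * 100 so a perfect 100/100 call receives exactly one unit.
    """
    if not (0 <= cs <= 100 and 0 <= sp <= 100):
        raise ValueError("scores must be between 0 and 100")
    return (cs * sp / 10_000) * unit_size


UNIT_SIZE = 10_000.0  # hypothetical unit size in dollars

loud_and_vague = position_size(85, 10, UNIT_SIZE)      # about $850
quiet_and_specific = position_size(55, 85, UNIT_SIZE)  # about $4,675
```

Because the scores multiply rather than add, a call cannot buy size with volume alone: either score near zero drags the position toward zero.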
Rubric v2.3 weights
v2.3 update (2026-05-22): bear-direction calls now require explicit short language (“I'm short”, “puts on”, “shorted”) in the personal-stake or transcript-excerpt before they can become tradeable INITIATEs. Closes a v2.2 hole where panel-wide bear theses (“growth stocks will get hammered”) were spawning phantom shorts that subsequently lost money in a +240% SPY bull run. See §12 for impact.
Claim Strength — 7 factors
Captures how forcefully a view was stated. The factors, ranked roughly by how much they contribute to the final score:
- Directional clarity — explicit recommendation, forecast, avoidance, or relative call (vs vague gesture)
- Commitment language — “absolute layup” vs “could be”
- Personal stake — “I bought” outranks “we own” outranks “considering”. We have an explicit anti-shield rule that penalizes “we”-framing relative to “I”-framing — talking about a fund position is editorially weaker than personal conviction.
- Voluntariness — volunteered outranks prompted outranks reactive agreement
- Counter-arg acknowledged — engaging the bear thesis (“I know X but Y because Z”) signals real conviction. Distinct from disclaimer bombs (next section).
- Time-frame committed — explicit horizons signal accountability
- Contrarian framing — “everyone's wrong” framing
The exact point allocations, the anti-shield discount factor, and the historical reweighting decisions (e.g., why we dropped the cross-episode persistence factor in v2.1) live in the Trader tier's methodology research bundle. Personal stake and commitment language are the two biggest single contributors; contrarian framing is the smallest.
Specificity — 5 factors
Captures how testable the call is:
- Instrument named — specific ticker outranks sector ETF outranks vague theme (no points)
- Time horizon stated — specific date/window outranks quarter/year outranks vague (“eventually”)
- Catalyst named — “before Q3 earnings” or “if Fed cuts” outranks none
- Price target / threshold — “buy under $40” or “$200 PT”
- Position-size signal — “$100M” or “X% of NAV”
Time horizon and instrument carry most of the specificity weight — they're the two signals that consistently appear in real pod commentary. Price target was downweighted after empirical analysis showed it was extremely rare across the first 100+ scored predictions. Exact weights are gated to Analyst tier.
Counter-arg vs disclaimer bomb
These look similar but point in opposite directions. The test: does the speaker engage the bear thesis itself, or just punt on accountability?
- Counter-arg acknowledged — speaker engages the opposing thesis and refutes/contextualizes it. Signals thoughtful conviction; gets a credit to claim strength. “I know everyone's worried about Meta's capex, but I think the AdTech moat absorbs it.”
- Disclaimer bomb — speaker hedges responsibility/liability without engaging the opposing thesis. Reduces actionability without reducing conviction; gets a penalty (the symmetric counterpart to the counter-arg credit). “Not investment advice, but I bought a ton of NVDA last week.”
Both can co-occur in the same sentence. We score both.
Trade execution timing
The engine simulates a realistic retail follower: someone who hears a call after-hours, can't trade in the dark, sets a market order for the next session, and gets filled at the open of the next available trading day. Updated in v5: v4 filled at the next day's close; v5 switched to the next day's open because that's what a market-on-open order actually executes at.
- If the episode released before the open on a trading day → that day's open
- If released after market open, after-hours, on weekends, or on holidays → the next trading day's open
In practice, episodes that drop Friday evening execute at Monday's open; mid-week evening pods at the following day's open. We don't publish the exact ET cutoff or per-pod release-pattern calibration table — those are operational details a replicator would need.
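The two timing rules above can be sketched as follows. This is an illustration under stated assumptions: the 09:30 ET cutoff, the function names, and the caller-supplied holiday set are all hypothetical, since the real cutoff and calibration table are not published. Release times are taken as naive datetimes already expressed in Eastern Time.

```python
from datetime import date, datetime, time, timedelta

# Assumption: the regular 09:30 ET open. The engine's real cutoff is unpublished.
MARKET_OPEN_ET = time(9, 30)


def is_trading_day(d: date, holidays: frozenset) -> bool:
    """Weekdays that are not in the supplied holiday set."""
    return d.weekday() < 5 and d not in holidays


def next_trading_day(d: date, holidays: frozenset) -> date:
    """First trading day strictly after d."""
    d += timedelta(days=1)
    while not is_trading_day(d, holidays):
        d += timedelta(days=1)
    return d


def execution_date(released_et: datetime, holidays: frozenset = frozenset()) -> date:
    """Date whose opening price fills the order for a call released at released_et."""
    d = released_et.date()
    if is_trading_day(d, holidays) and released_et.time() < MARKET_OPEN_ET:
        return d  # released before the open: same-day open
    return next_trading_day(d, holidays)  # everything else: next trading day's open


friday_evening = execution_date(datetime(2024, 3, 1, 19, 0))   # fills Monday 2024-03-04
midweek_premarket = execution_date(datetime(2024, 3, 6, 7, 0))  # fills the same day
```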
We use raw OHLC prices (yfinance auto_adjust=False) for execution and price display (matches what shows on Yahoo Finance / thinkorswim). Mark-to-market valuations use Adj Close ratios to capture dividend reinvestment without distorting historical execution prices. Storing both prevents the “every dividend retroactively shifts every historical price” problem.
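One way to read the raw-versus-adjusted split is sketched below: the fill is recorded at the raw price, and the position is then grown by the ratio of Adj Close values. The function name and the sample prices are assumptions for illustration; the engine's actual valuation code is not published.

```python
def mark_to_market(
    shares: float,
    raw_entry_price: float,
    adj_close_entry: float,
    adj_close_now: float,
) -> float:
    """Value a position from its raw fill price and an Adj Close ratio.

    The cost basis uses the raw (unadjusted) fill, so the recorded execution
    price never changes. The Adj Close ratio carries price moves plus
    reinvested dividends from entry to now.
    """
    total_return_ratio = adj_close_now / adj_close_entry
    return shares * raw_entry_price * total_return_ratio


# Hypothetical numbers: filled at a raw $100 when Adj Close was $95;
# Adj Close is now $114, a 1.2x total return.
value_now = mark_to_market(shares=10, raw_entry_price=100.0,
                           adj_close_entry=95.0, adj_close_now=114.0)
```

Only the ratio of adjusted prices enters the valuation, so a later dividend that rescales the whole Adj Close history leaves both the recorded fill and the ratio intact.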
Live-event two-date tracking
Some predictions are made on live shows that publish to podcast/YouTube days or weeks later — All-In Summit, “Live from X”, “All-In DC”. For these, the call was knowable to the live audience on the LIVE date, but to the public-podcast audience only on the PUBLISH date. We track both:
- call_date_live — when the call was actually made (live broadcast date OR pod release date if not a live event)
- call_date_published — when the public could realistically extract the call from a transcribable source
The engine uses call_date_published for the model portfolio — a realistic follower couldn't have known about a Summit call until the pod dropped weeks later. Trading at the live-date price would be cheating the simulation.
The UI shows BOTH dates with a small 📡 LIVE badge on rows where they diverge. The live-date price is shown for comparison only — it's editorially interesting (which hosts have the biggest pre-publication price moves on their calls?) but doesn't enter the engine.
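The two-date record described above could be modelled roughly like this. The field names call_date_live and call_date_published come from the text; the class name, the two properties, and the sample dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CallDates:
    call_date_live: date       # when the call was actually made
    call_date_published: date  # when the public could first hear it

    @property
    def engine_date(self) -> date:
        """The model portfolio only ever trades off the published date."""
        return self.call_date_published

    @property
    def show_live_badge(self) -> bool:
        """The LIVE badge appears only on rows where the two dates diverge."""
        return self.call_date_live != self.call_date_published


# Hypothetical live-event call published three weeks after it was made.
summit_call = CallDates(date(2024, 9, 9), date(2024, 9, 30))
# Ordinary episode: both dates are the release date.
regular_call = CallDates(date(2024, 9, 13), date(2024, 9, 13))
```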
Position lifecycle + decay
- INITIATE — first explicit position event for a given (speaker, ticker). Sizing depends on portfolio mode (see §09 + §13).
- ADD — re-affirmation that escalates the position (host says “loading up more” or similar). Increases the position; in concentrated/equal-weighted modes triggers a full rebalance.
- HOLD — same-direction re-mention that doesn't escalate. Resets the decay clock ONLY if the re-mention is directionally aligned with the existing position (v5.3+ fix). A neutral or contrary HOLD does not reset.
- REAFFIRM — a bullish OBSERVE on a long position (or bearish OBSERVE on a short). No trade event, but logs as proof of continued conviction and resets the decay clock. Stale chatter that mentions the ticker without taking a side does NOT reset.
- REDUCE / EXIT / REVERSE — explicit trim / sell / direction-flip language.
- STOP_LOSS — when configured, intraday Low touching the stop price triggers a sell-to-SPY at the stop price. Overnight gap-down below stop fills at the next day's open.
- TRIM_AT_GAIN — when configured, position hits +X% milestone triggers a fractional trim. Each milestone fires once.
- TIME_DECAY_HARVEST — when configured, linear unwind over the last N months of the hold window (e.g. 1/9 of original shares per month over the last 9 of an 18-month hold).
- Hard decay (EXPIRY) — at decay_expiry_days from last directionally-aligned mention, position auto-closes at last known price. Default 365d in the raw engine; preset-dependent (Conservative 365d, Balanced 540d, Let It Ride 540d with 3-month soft tail).
- Re-up after decay — new INITIATE or directionally-aligned ADD re-opens at current market price and resets the clock.
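Several of the lifecycle rules above are small enough to sketch directly. The helpers below are illustrative assumptions (names, string labels, and the long-only stop logic are not from the engine): a direction-aware clock reset, the hard-expiry check, the stop-loss fill price, and the linear time-decay harvest.

```python
from datetime import date
from typing import Optional


def resets_decay_clock(position_side: str, mention_direction: str) -> bool:
    """Only a directionally aligned mention resets the clock.

    position_side is "long" or "short"; mention_direction is "bull",
    "bear", or "neutral". Neutral or contrary mentions do not reset.
    """
    aligned = {"long": "bull", "short": "bear"}
    return aligned[position_side] == mention_direction


def is_expired(last_aligned_mention: date, today: date,
               decay_expiry_days: int = 365) -> bool:
    """Hard decay: position auto-closes once the window has elapsed."""
    return (today - last_aligned_mention).days >= decay_expiry_days


def stop_fill_price(day_open: float, day_low: float, stop: float) -> Optional[float]:
    """Fill price for a stop on a long position, or None if not triggered."""
    if day_open < stop:
        return day_open  # gapped down overnight: fills at the open
    if day_low <= stop:
        return stop      # intraday low touched the stop: fills at the stop
    return None


def harvest_shares_per_month(original_shares: float, unwind_months: int) -> float:
    """Linear unwind: equal slices of the original shares each month."""
    return original_shares / unwind_months
```

With an 18-month hold and a 9-month unwind, `harvest_shares_per_month(900, 9)` sells 100 shares a month, matching the 1/9-per-month example in the list.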
Aggregate vs per-host
We compute aggregates two ways and surface both:
- Sum-of-host-books — each host gets their own $100K bucket; we sum the final values. The conservative, apples-to-apples framing — every host's outperformance is measured against their own SPY mirror, then totaled. Used on the homepage hero because the math ties directly to the host cards below it.
- Shared bucket — all hosts compete for one pooled bucket (4-bestie $400K, 5-bestie $500K, 6-host cross-pod $600K). Capital flows where the highest-conviction calls land. Under realistic risk management (BALANCED preset) this beats sum-of-books substantially because freed capital from trimmed winners gets redeployed into newer high-conviction calls. The BYOB chooser headline number is shared-bucket; the host cards below it show each host's individual book.
Both views are robust to attribution errors: if we wrongly attribute a Sacks call to Chamath, both hosts contribute to the same ticker bucket and the aggregate is unchanged. So:
- Aggregate = mostly accurate even under attribution noise
- Per-host = best read of who said what; flag corrections at any 🚩 button
The audit hall
Every record has a 🚩 button. Pick from a structured correction type — speaker, direction, ticker, sarcasm, not-a-real-call, conviction too high/low, position intent, wrong horizon, live event missing/wrong. Paste a 50–300 char quote from the transcript that supports the fix.
Submitted corrections go through a moderator queue. Merged corrections credit the contributor and update the relevant Book overnight. Every record has a full edit-history — even rejected corrections leave a breadcrumb. Top contributors get rep, badges, and our gratitude.
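A correction submission as described above has two checkable parts: a structured type and a 50–300 character supporting quote. The sketch below validates both; the type identifiers are hypothetical slugs derived from the list in the text, not the site's real enum.

```python
# Hypothetical slugs for the correction types listed in the text.
CORRECTION_TYPES = frozenset({
    "speaker", "direction", "ticker", "sarcasm", "not_a_real_call",
    "conviction_too_high", "conviction_too_low", "position_intent",
    "wrong_horizon", "live_event",
})


def validate_correction(correction_type: str, quote: str) -> list:
    """Return a list of problems; an empty list means the submission can be queued."""
    problems = []
    if correction_type not in CORRECTION_TYPES:
        problems.append(f"unknown correction type: {correction_type!r}")
    length = len(quote.strip())
    if not 50 <= length <= 300:
        problems.append(f"quote must be 50-300 characters, got {length}")
    return problems
```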
The trend line of attribution accuracy converging from ~85% → 99% over time is itself the brand asset. We publish the misses too.
Portfolio engine evolution
The portfolio engine is the result of five major revisions since v2. We don't publish the full architectural changelog or the per-version sizing formulas — those are part of the Analyst-tier methodology research bundle. What we do publish is the problem each revision was solving, so you can audit whether the engine is honest about what it's modeling:
- v2: wire up the real transcription + extraction pipeline. Naive sizing, fixed dollar amounts, no decay — just enough to ship a portfolio at all.
- v3: synthetic shorts (so bearish calls can lose up to ~100% of position size, matching the typical retail-short experience) and a decay clock so “called once and never mentioned again” positions don't haunt the book forever.
- v4: on-demand funding + cycling SPY mirror. Solved the “how do you fund years of reinvestment without a fixed bankroll” problem, then exposed a deeper one: when host and benchmark each accumulate external capital based on their own performance, you're no longer comparing apples-to-apples. That motivated v5.
- v5.0: the engine we still use today. SPY-as-cash — every pundit starts with a SPY base, each call sells some SPY and buys the host's pick, each close sells the pick and buys SPY back. No external capital injection, no benchmark drift, no idle cash. Same $100K starting capital, same time period, same dividend treatment for both host and benchmark — the gap is pure stock-selection alpha.
- v5.1 → v5.4 (incremental): aggregate-cohort fixes, a more honest active-window clock per host, composable risk-management knobs (stops, gain harvests, time-decay glides), direction-aware decay (the clock resets only when the host re-affirms direction), and the portfolio modes that power the BYOB concentrated/equal-weighted presets.
The full v5.4 spec — sizing formula, conviction-tier breakpoints, proportional-shrink derivation, strict-bears filter, and the complete commit-by-commit changelog — is part of the Analyst tier research bundle.
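The SPY-as-cash idea from v5.0 can be illustrated with a toy book. This is a sketch only: the class, its method names, and the prices are assumptions, and it raises an error when a call cannot be funded, whereas the real engine's proportional-shrink handling is part of the gated spec.

```python
from dataclasses import dataclass, field


@dataclass
class SpyCashBook:
    spy_shares: float
    positions: dict = field(default_factory=dict)  # ticker -> shares held

    @classmethod
    def start(cls, capital: float, spy_price: float) -> "SpyCashBook":
        """Every book begins fully invested in SPY; there is no idle cash."""
        return cls(spy_shares=capital / spy_price)

    def open_position(self, ticker: str, dollars: float,
                      pick_price: float, spy_price: float) -> None:
        """A call sells some SPY and buys the host's pick with the proceeds."""
        spy_needed = dollars / spy_price
        if spy_needed > self.spy_shares:
            raise ValueError("not enough SPY to fund this call")
        self.spy_shares -= spy_needed
        self.positions[ticker] = self.positions.get(ticker, 0.0) + dollars / pick_price

    def close_position(self, ticker: str, pick_price: float, spy_price: float) -> None:
        """A close sells the pick and buys SPY back."""
        shares = self.positions.pop(ticker)
        self.spy_shares += shares * pick_price / spy_price

    def value(self, prices: dict, spy_price: float) -> float:
        held = sum(sh * prices[t] for t, sh in self.positions.items())
        return self.spy_shares * spy_price + held


# Hypothetical prices: one call on ticker "XYZ" that beats SPY while it is held.
book = SpyCashBook.start(100_000, spy_price=400.0)
book.open_position("XYZ", 10_000, pick_price=50.0, spy_price=400.0)
book.close_position("XYZ", pick_price=75.0, spy_price=500.0)
host_value = book.value({}, spy_price=500.0)
benchmark_value = (100_000 / 400.0) * 500.0  # same capital, SPY only
```

In this toy run the host book ends at $127,500 against a $125,000 benchmark; the $2,500 gap comes entirely from the pick outrunning SPY over the holding period.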
Sizing alternatives we tested
Conviction-tier percentages are universal defaults — every host gets the same sizing curve regardless of how often they speak. We explored alternatives:
- Per-host calibrated tiers — scale each host's tier %s by their typical deployment so prolific speakers get smaller per-position weight than highly selective ones. Available as a comparison panel on host detail pages but superseded by the BYOB presets (particularly BALANCED), which capture the same intuition more directly via conviction floor + risk-management knobs rather than per-host scaling.
- Half-Kelly per tier — computed as a sanity check. Sample sizes per tier are too small to trust literally, but the shape of the half-Kelly answer was consistent with the universal defaults we adopted.
- Flat sizing — discarded because it dilutes high-conviction calls against low-conviction ones at the same weight.
- Full Kelly — mathematically optimal under accurate p/payoff but ruinous if those parameters drift. Rejected as too aggressive for synthetic tracking.
The exact conviction-tier breakpoints and the per-host calibration formulae are part of the Analyst tier's methodology research bundle.
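For reference, the Kelly fraction for a simple binary bet with win probability p and payoff ratio b is f* = p − (1 − p) / b, and half-Kelly stakes half of that. The sketch below uses this textbook form with made-up inputs; how the engine estimated p and b per tier is not published, so treat it as an illustration of the sanity check rather than the actual computation.

```python
def kelly_fraction(p_win: float, payoff_ratio: float) -> float:
    """Textbook Kelly fraction for a binary bet: f* = p - (1 - p) / b.

    p_win is the probability of winning; payoff_ratio (b) is the amount
    won per unit staked. A negative result means the bet has no edge.
    """
    if not 0.0 <= p_win <= 1.0:
        raise ValueError("p_win must be a probability")
    if payoff_ratio <= 0.0:
        raise ValueError("payoff_ratio must be positive")
    return p_win - (1.0 - p_win) / payoff_ratio


def half_kelly(p_win: float, payoff_ratio: float) -> float:
    """Half the Kelly stake, floored at zero when there is no edge."""
    return max(0.0, kelly_fraction(p_win, payoff_ratio)) / 2.0


# Hypothetical tier: 60% hit rate at even payoff.
full = kelly_fraction(0.60, 1.0)  # about 0.20 of bankroll
half = half_kelly(0.60, 1.0)      # about 0.10 of bankroll
```

Halving the stake gives up some expected growth in exchange for much less sensitivity to a mis-estimated p or b, which is the drift risk cited above for rejecting full Kelly.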
Aggregate cohort treatments
For “what does the panel as a whole look like” we compute multiple parallel variants and surface them across pages:
- A. Shared bucket — all hosts in the cohort compete for one pooled bucket ($400K for 4-bestie, $500K for 5-bestie, $600K for 6-host cross-pod). Positions keyed by (speaker, ticker) so two hosts on the same ticker stay separate. Headline number on the BYOB chooser and cross-pod pages.
- B. Sum-of-independent-host-books — each host gets their own $100K; we sum the final values. No competition for capital. Headline number on the homepage (math ties to the 4 host cards displayed there).
- C. Per-host calibrated tiers (legacy) — each host's tier %s scale by their peak-deployment ratio. Available as a toggle on host detail pages but no longer the default; superseded by the BALANCED preset.
Finding (v5.0 → v5.4 shift): under the original v5 SPY-as-cash methodology (no stop loss, no harvest), shared bucket DID underperform sum-of-books — the constant Y-variant shrinking eroded capital across competing calls. That's no longer true. Under BALANCED methodology (stop loss + gain harvest + time-decay glide), shared bucket dramatically outperforms sum-of-books because freed capital from trimmed winners gets redeployed into newer high-conviction calls.
The exact shared-bucket vs sum-of-books deltas per cohort drift as new episodes ingest, so we don't freeze the current numbers into the methodology copy. Live deltas show on each cohort's overview page: All-In, BG2, cross-pod.
Sum-of-books remains the more conservative framing (no assumed rebalance authority across speakers) which is why the homepage leads with it. Shared bucket is the “PM with discretion to allocate across hosts” framing — bigger numbers, requires active management.
Strict bears filter (2026-05-21 fix)
The bug we found: v2.2 rubric tagged 70 panel-wide bear theses as SHORT-INITIATE events when the speaker was just making a macro argument (“growth stocks will get hammered by rates”) without disclosing an actual short position. Real-life retail followers don't short on commentary — they short when the speaker explicitly says “I'm short.”
The fix: at engine load time, any direction="bear" + position_intent="INITIATE" pred WITHOUT explicit short language ("I'm short", "shorted", "puts on", etc.) in either personal_stake or transcript_excerpt is reclassified to OBSERVE — commentary, not a trade. 60 of 70 bear-INITIATEs (86%) got reclassified.
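The load-time filter described above could look roughly like this. The field names (direction, position_intent, personal_stake, transcript_excerpt) come from the text; the function name, the dict-based record, and the three-phrase list are assumptions, and the real phrase list is longer (the text says "etc.").

```python
# Phrases quoted in the text; the production list is presumably longer.
SHORT_PHRASES = ("i'm short", "shorted", "puts on")


def apply_strict_bears(pred: dict) -> dict:
    """Reclassify bear INITIATEs that lack explicit short language to OBSERVE."""
    if pred.get("direction") != "bear" or pred.get("position_intent") != "INITIATE":
        return pred  # the filter only touches bear-direction INITIATEs
    text = " ".join(
        str(pred.get(key) or "") for key in ("personal_stake", "transcript_excerpt")
    ).lower().replace("\u2019", "'")  # normalise curly apostrophes
    if any(phrase in text for phrase in SHORT_PHRASES):
        return pred  # explicit short disclosed: stays tradeable
    return {**pred, "position_intent": "OBSERVE"}  # commentary, not a trade


macro_take = apply_strict_bears({
    "direction": "bear", "position_intent": "INITIATE",
    "personal_stake": None,
    "transcript_excerpt": "growth stocks will get hammered by rates",
})
real_short = apply_strict_bears({
    "direction": "bear", "position_intent": "INITIATE",
    "personal_stake": "I\u2019m short the name",
    "transcript_excerpt": "",
})
```

Returning a new dict rather than mutating the input keeps the original extraction intact, which suits a filter that runs on every engine load.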
Impact: dramatic story shift. Chamath went from appearing to trail SPY by ~$213K (−213pp under v4 + phantom shorts) to outperforming under v5: +33pp under our default BALANCED methodology, and +95pp under the data-best conviction-floor 60 + INITIATE promotion variant from the autoresearch sweep. Every host gained dozens of percentage points because we stopped manufacturing fake short positions that lost money in a +240% SPY bull market. The full ladder under BALANCED today:
- Calacanis +62pp · Chamath +33pp · Friedberg +1pp · Sacks +1pp · Brad +13pp (on All-In appearances only)
- 4-bestie aggregate: +143pp shared bucket · 5-bestie: +149pp · 6-host cross-pod: +143pp ($2.28M from $600K)
- Brad cross-pod solo (his unified All-In + BG2 book): +79pp default engine, +89pp under CONCENTRATED
The picture is no longer “everyone roughly tied with SPY” — under realistic risk-managed methodologies the besties handily beat SPY at the cohort level; individual hosts vary based on which preset suits their style (see the BYOB chooser).
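The "pp" figures in this section are percentage-point gaps against the SPY mirror. With equal starting capital that is the dollar gap divided by the starting bucket, which is why a roughly $213K shortfall on a $100K book reads as −213pp. The sketch below shows the arithmetic; the function name and the final values are illustrative, not real book values.

```python
def excess_pp(host_final: float, spy_final: float,
              starting_capital: float = 100_000.0) -> float:
    """Host return minus SPY return, in percentage points.

    Both books start from the same capital, so the difference in returns
    equals the dollar gap divided by the starting capital.
    """
    return (host_final - spy_final) / starting_capital * 100.0


# Illustrative only: a $100K SPY mirror after a +240% run is worth $340K;
# a host book that ended at $127K would trail it by $213K.
trailing = excess_pp(host_final=127_000.0, spy_final=340_000.0)
# A host book ending $33K ahead of the same mirror.
leading = excess_pp(host_final=373_000.0, spy_final=340_000.0)
```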
Lesson: bear-direction calls need a higher classification bar than bull-direction calls because the universe of "I'm worried about X" commentary vastly exceeds the universe of actual short positions. This is now enforced two ways: rubric v2.3 bakes the explicit-short-language requirement directly into the extractor prompt (so new episodes get clean classifications from the start), and the engine's strict-bears load-time filter remains as a defense-in-depth for any older extractions that pre-date the rubric update.
Methodology presets (BYOB)
The engine's parameters (hold window, stop loss, harvest rules, conviction floor, portfolio mode) are independently composable. We ship 5 named presets representing distinct investor philosophies. Readers can swap between them on the BYOB chooser to see the same data interpreted through different rules.
- Conservative — risk-averse: tighter loss control, shorter hold window, faster time-decay harvest, higher conviction floor. The follower who wants stops to do real work.
- Balanced (default) — data-driven optimum from our trim sweep. Moderate stop, medium hold, stepped gain harvest, time decay in the final stretch. Wins for the 4-bestie and 5-bestie aggregate cohorts. +143pp vs SPY on the 4-bestie shared bucket.
- Let It Ride — the besties' own mantra. No stop. No gain harvest. Only unwind if the host has gone silent long enough that the position is effectively orphaned. Wins for long-runway individual hosts (Friedberg, Brad on BG2). Data-best for BG2.
- Concentrated — single-position max conviction. 100% of bucket in the host's first INITIATE; new picks trim existing positions by conviction weight. High variance: rewards selective hosts (Brad +94pp, Calacanis +80pp); destroys prolific ones (Chamath −98pp because his second-ever INITIATE was MILE, which got delisted at total loss). Educational, not a recommendation.
- Equal-Weighted — every INITIATE = 1/N of bucket, ignoring conviction. The “what if you couldn't tell which calls were strong?” counterfactual. Same first-pick risk as Concentrated.
The exact numeric recipe for each preset (stop %, hold window in months, gain milestones, conviction floor, decay glide) lives in the Analyst-tier methodology research bundle. The qualitative shape above is what drives the cohort numbers you see on the chooser; the underlying parameters are what would let a competitor replicate the engine end-to-end.
Per-host calibration: methodology choices vary by cohort. Aggregates win with BALANCED (harvest the winners, free capital). Individual long-runway hosts win with LET IT RIDE (don't cap your compounders). Brad specifically wins concentrated; Chamath specifically loses concentrated. The methodology page on each host surfaces their best-fit preset alongside the universal default.
Shipped post-v5.0
Roadmap items now live (formerly “future research”):
- Auto-research scenario sweep ✓ shipped. 4,032 engine runs across (hold window × conviction floor × INITIATE threshold × cs/sp weighting). Outputs at data/allin/autoresearch/v1/. Refined to a trim-parameter-only sweep (scripts/allin/trim_sweep.py, 182 configs × 8 cohorts) which derived the BALANCED preset's +50/+100 sell-50% start-12mo recipe as the universal winner.
- Stop loss + gain harvest + time-decay glide ✓ shipped in engine v5.3. All three independently composable; BALANCED uses gain+time-decay together (combined mode beats gain-only by +8pp on aggregates).
- Direction-aware last_mention reset ✓ shipped in v5.3+. Decay clock only resets when the host is still directionally aligned (a bullish OBSERVE on a long position resets; a neutral mention does not).
- Portfolio modes ✓ shipped in v5.4: concentrated (conviction-weighted rebalance, no SPY base) and equal-weighted (1/N rebalance ignoring conviction). Powers the 2 new presets.
- BG2 podcast integration ✓ shipped at /pundit-poorly/bg2. 43 episodes, 64 graded calls. Separate trim sweep confirmed: NO trim wins on BG2 (Brad's book is long-runway AI compounders).
- Cross-pod Brad Gerstner ✓ shipped at /pundit-poorly/cross-pod/gerstner. Combines 15 All-In + 58 BG2 calls (minus 2 dedup) into one unified book: +78.8pp vs SPY default-engine, +88.8pp under CONCENTRATED preset, over 4 years.
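The gain-harvest rule (each milestone fires once and triggers a fractional trim) can be sketched as below, using the +50% / +100% milestones and the 50% trim mentioned for BALANCED. Two points are assumptions: that each trim sells a fraction of the shares remaining at that moment rather than of the original position, and that a single price jump past both milestones fires both. The function name and signature are hypothetical.

```python
def apply_gain_trims(entry_price: float, price: float, shares: float,
                     fired: set, milestones=(0.5, 1.0),
                     trim_fraction: float = 0.5):
    """Fire any gain milestones reached for the first time.

    fired is mutated in place to remember which milestones already fired.
    Returns (shares_remaining, shares_sold_now).
    """
    gain = price / entry_price - 1.0
    sold = 0.0
    for milestone in milestones:
        if gain >= milestone and milestone not in fired:
            trim = shares * trim_fraction  # assumption: fraction of remaining shares
            shares -= trim
            sold += trim
            fired.add(milestone)
    return shares, sold


fired = set()
shares, sold_first = apply_gain_trims(100.0, 160.0, 100.0, fired)  # +60%: first milestone
shares, sold_again = apply_gain_trims(100.0, 170.0, shares, fired)  # still +50% tier: nothing
shares, sold_second = apply_gain_trims(100.0, 210.0, shares, fired)  # +110%: second milestone
```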
What's deferred for future research
Items scoped but not yet built:
- Real-time pipeline — scripts ready (scripts/poll_rss_and_dispatch.py) for Windows Task Scheduler to run every 15 min, fire pipeline when new episode detected. Today's latency: ~90-120 min on local CPU whisper. With OpenAI Whisper API swap: ~20 min episode-to-live.
- Friedberg “implicit beneficiary” detector — map his (and others') theme-level excitement about scientific breakthroughs to public-equity beneficiaries. Requires editorial vetting workflow (multi-LLM council with citation verification) before publish.
- Conference transcript miner — Sohn / Liquidity Summit / Milken / JPM Healthcare etc. Public YouTube only (no paid transcript subscriptions). Two layers: explicit pitches (easy, plug into existing engine) and implicit theme-to-equity mapping (depends on the vetting workflow above).
- Per-host alerts (paid tier) — stop hits, trim fires, INITIATE detected, time-decay unwind starting. Email / SMS / webhook delivery configured to user's selected preset.
- Synthetic portfolio tracker (paid tier) — “If you'd been following METHODOLOGY since you signed up, you'd be at $X today.” Requires Supabase auth + per-user preset persistence.
- BYO podcast backtest (paid tier) — user submits a podcast / YouTube link, LLM extracts predictions in our schema, we run the engine on theirs. Approval required to avoid junk submissions. Insights surface to all readers post-vetting.
- Position-lifecycle area chart — current composition pie shows a snapshot; an area chart showing position weights stacked over time would visualize how the book evolved.
What we won't do
- Recommend you buy or sell anything. Nothing here is investment advice.
- Republish full transcripts. Receipt-style quotes only (per copyright / fair use).
- Hide the misses. Failed positions, expired theses, wrong-direction calls — all visible.
- Score sarcasm or humor framing as real conviction. Jokes are excluded entirely (is_joke=true).
- Trade off live-event prices the public couldn't access yet. Engine waits for the published version.
- Promise accuracy beyond what we've measured. Cohen's κ targets are published; quarterly revalidation runs include the audit results.
Nobody asked. We ledger anyway. Rubric v2.3 / engine v5.4 (SPY-as-cash + concentrated + equal-weighted portfolio modes, stop-loss + gain-harvest + time-decay glide, direction-aware decay reset) / 5 methodology presets. Last revised 2026-05-23. Rubric versions and engine versions evolve; old extractions are re-scored when methodology changes — we publish the diff. Source on GitHub.