Conference Data to Model Features: Mining JPM Presentations to Build Domain Embeddings
Practical guide to extracting, cleaning, and embedding JPM conference slides and transcripts into searchable corpora for healthcare intelligence.
Stop hunting slides: turn JPM decks and transcripts into actionable model features
If you've ever spent weeks cleaning PDFs, aligning slide timestamps with noisy transcripts, or tuning search relevance for domain queries, this guide is for you. Teams building searchable corpora from conference data (think: J.P. Morgan Healthcare decks and session transcripts) face repeatable engineering and ML problems: noisy inputs, multimodal content, regulatory constraints, and ambiguous relevance signals. Here I show a reproducible pipeline, real-world trade-offs, and 2026 best practices for turning conference assets into high-value domain embeddings and features for competitive intelligence and research.
Executive summary — what you’ll get
This walkthrough gives a practical, production-ready path to:
- Ingest slides, slide images, and session transcripts (including vendor tools and open-source options).
- Clean and normalize text, extract structured features (tables, speakers, company mentions), and de-identify sensitive data for healthcare.
- Choose embedding strategies (text, image, multimodal), chunking methods, and vector DB architecture for scalable search.
- Evaluate retrieval (recall/precision/MRR) and implement hybrid reranking for high-precision results.
- Implement domain-specific NLP (scispaCy/UMLS) and privacy controls for health-related corpora.
Why conference data matters in 2026
Conference decks and transcripts are gold for competitive intelligence: first mentions of modalities, deal signals, roadmap timelines on slides, and verbatim Q&A that captures sentiment and positioning. In late 2025 and early 2026, the industry trend is clear: multimodal analysis and domain-adaptive embeddings outperform generic approaches for healthcare research. Regulatory scrutiny around sensitive health information also rose in 2025, so building compliant pipelines (and the ability to audit or purge records) is non-negotiable.
High-level pipeline
- Acquire: collect slide decks (PDF, PPTX), event programs, and transcript audio/text.
- Parse: extract slide text, notes, tables, and images; transcribe audio & align timestamps.
- Clean & normalize: fix OCR errors, remove boilerplate, de-identify PHI.
- Feature-extract: NER, keyphrases, slide structure, tables → structured rows.
- Embed & index: choose embeddings (text/image/multimodal) and vector DB + hybrid index.
- Evaluate & tune: build queries, measure recall/precision, tune chunk sizes & reranker.
Step 1 — Acquisition: sources and practical considerations
Sources for JPM-style conference data include public investor decks, speaker slides posted on company sites, recorded sessions from conference platforms, and official transcripts (where available). Your acquisition strategy should:
- Respect terms of use and copyright — use public materials or obtain explicit permission.
- Collect metadata: event, session, speaker, company, slide number, and presentation time.
- Prefer original PPTX when available — it preserves structure (text boxes, tables, notes).
Tools & tips
- For PDFs/PPTX: PyMuPDF / python-pptx for deterministic extraction.
- For recordings: open-source speech-to-text like WhisperX, or commercial APIs for higher accuracy if budget allows.
- When only videos exist: frame capture + OCR for embedded text using Tesseract or PaddleOCR; consider pre-filtering frames by scene change.
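To make the PPTX-first advice concrete, here is a minimal extraction sketch using python-pptx; the record fields mirror the metadata listed above and are illustrative, not a fixed schema.
from pptx import Presentation
def extract_slides(pptx_path, session_id):
    # Walk every slide, collecting text frames plus speaker notes into one record per slide
    deck = Presentation(pptx_path)
    records = []
    for i, slide in enumerate(deck.slides, start=1):
        texts = [shape.text_frame.text for shape in slide.shapes if shape.has_text_frame]
        notes = slide.notes_slide.notes_text_frame.text if slide.has_notes_slide else ''
        records.append({'session_id': session_id, 'slide_number': i,
                        'text': '\n'.join(texts), 'notes': notes})
    return records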
Step 2 — Parsing and cleaning: fix the ugly stuff
Slides and transcripts are noisy. Common issues: fragmented bullet text, OCR misreads (drug names are often mangled), and transcript drift in Q&A. Clean early — cleaner inputs yield far better embeddings.
Key cleaning tasks
- Normalize whitespace & punctuation and fix broken bullet concatenations.
- Spell-correct domain terms using a whitelist (company names, drug names, modality terms). Consider a finite dictionary built from domain resources (FDA labels, UMLS).
- Align transcripts to slides using timestamps or dynamic time warping on slide-change events; keep 1–2 sentence overlap windows.
- De-identify PHI using rule-based and model-based filters (names, IDs). Log redaction actions for audits.
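The whitelist-based correction above can be prototyped with the standard library alone; this sketch uses difflib fuzzy matching against a tiny illustrative whitelist, whereas a production pipeline would use a trie or Aho-Corasick matcher over the full domain dictionary.
import difflib
WHITELIST = ['pembrolizumab', 'lenalidomide', 'CAR-T']  # in practice, tens of thousands of domain terms
def correct_token(token, whitelist=WHITELIST, cutoff=0.85):
    # Snap near-miss OCR variants to the closest whitelist term; leave clean tokens untouched
    match = difflib.get_close_matches(token, whitelist, n=1, cutoff=cutoff)
    return match[0] if match else token
print(correct_token('pembro1izumab'))  # OCR '1' for 'l' -> 'pembrolizumab'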
Example: transcript alignment snippet (Python)
from bisect import bisect_right
# timestamps: list of slide start times (seconds)
# words: list of (time, word) tuples from STT
def assign_words_to_slides(timestamps, words):
    slides = [[] for _ in timestamps]
    for t, w in words:
        idx = bisect_right(timestamps, t) - 1
        if idx >= 0:
            slides[idx].append(w)
    return [' '.join(s) for s in slides]
Step 3 — Feature extraction: beyond plain text
Slide text is valuable, but tables, figure captions, and images contain structured signals. Extract these as features to improve retrieval and enable structured queries (e.g., "Which companies mentioned CAR-T timelines in 2026?").
What to extract
- Slide title and section headers — often the strongest relevance signal.
- Tables → structured rows — capture column names and units; convert to CSV-like rows for indexing.
- Figures & captions — extract image and caption text; compute image embeddings.
- Speaker & company mentions — index as metadata for faceted search.
- Keywords & entities — use scispaCy or domain NER to extract drugs, targets, modalities.
Tools for biomedical NER
- scispaCy (UMLS linking) — maps terms to concepts.
- Custom gazetteers — include company names, drug candidates, and modality terms.
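A hedged sketch of the scispaCy path, following the library's documented pattern for UMLS linking; it assumes scispaCy and the en_core_sci_sm model are installed.
import spacy
nlp = spacy.load('en_core_sci_sm')
nlp.add_pipe('scispacy_linker', config={'resolve_abbreviations': True, 'linker_name': 'umls'})
linker = nlp.get_pipe('scispacy_linker')
doc = nlp('Company X reported Phase 2 data for its anti-CD19 CAR-T candidate.')
for ent in doc.ents:
    # Each entity carries candidate UMLS concepts as (CUI, score) pairs
    for cui, score in ent._.kb_ents[:1]:
        print(ent.text, cui, round(score, 2), linker.kb.cui_to_entity[cui].canonical_name)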
Step 4 — Chunking & indexing strategies
Chunking controls recall/precision. Too large and you dilute matches; too small and you lose context. For conference data a hybrid approach works best.
Recommended chunk types
- Slide-level chunk — use title + bullets + short transcript window (best for slide-centric queries).
- Timestamped passage — 30–60 second transcript windows aligned to slide changes (good for Q&A search).
- Table rows — index as tiny chunks with structured fields.
- Image+caption — multimodal chunk storing both image and text.
Chunk metadata
- session_id, slide_number, speaker, company, timestamp, modality_tags, confidence_scores
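A slide-level chunk is then just a small record carrying this metadata alongside the text; the sketch below uses a dataclass with illustrative field values.
from dataclasses import dataclass, field
@dataclass
class SlideChunk:
    session_id: str
    slide_number: int
    speaker: str
    company: str
    timestamp: float
    text: str  # title + bullets + short transcript window
    modality_tags: list = field(default_factory=list)
    confidence_scores: dict = field(default_factory=dict)
chunk = SlideChunk(session_id='jpm2026-sess-042', slide_number=7, speaker='CEO',
                   company='Company X', timestamp=1834.5,
                   text='CAR-T timelines for 2026\n- Pivotal readout expected H2',
                   modality_tags=['CAR-T'])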
Step 5 — Embedding strategies: text, image, multimodal
In 2026 the top gains for healthcare corpora come from domain-adaptive and multimodal embeddings. Options:
- Pretrained text embeddings — OpenAI/Anthropic embeddings for throughput, or open models (SentenceTransformers, locally fine-tuned models) for cost control and privacy.
- Domain-adapted embeddings — fine-tune or adapter-tune on biomedical corpora (PubMed, internal slide text) to capture domain semantics.
- Image embeddings — CLIP / ViT-based models for figures, combined with OCR captions.
- Multimodal fusion — concatenate or learn a cross-modal encoder to produce unified embeddings for slide+image chunks.
Practical embedding recipe
- Compute text embeddings for slide-title + content using an SBERT or API-backed model.
- Compute image embeddings for each figure using CLIP; build a separate embedding for the caption.
- For a multimodal chunk, produce a weighted average (e.g., 0.7*text + 0.3*image, with weights tuned) or train a fusion model on labeled pairs.
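A minimal sketch of the weighted-average option: averaging only makes sense when text and image vectors share a space, so this example uses CLIP for both sides (via sentence-transformers); the 0.7/0.3 weights and the figure path are illustrative.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer
clip = SentenceTransformer('clip-ViT-B-32')  # one model, shared text/image embedding space
text_emb = clip.encode(['CAR-T timelines for 2026: pivotal readout expected H2'])
img_emb = clip.encode([Image.open('figure_slide7.png')])  # illustrative path
fused = 0.7 * text_emb + 0.3 * img_emb
fused = fused / np.linalg.norm(fused, axis=1, keepdims=True)  # re-normalize before indexing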
Step 6 — Indexing and retrieval architecture
Choose a vector DB that fits your scale and SLA. In 2026, product choices are mature: FAISS (fast, self-hosted), Milvus, Qdrant, Weaviate, and managed vendors (Pinecone, RedisVector). For healthcare, plan for encrypted-at-rest and VPC deployments.
Hybrid search (best practice)
Combine vector search with a sparse model (BM25) for exact token matches. This improves precision for queries with exact drug names or identifiers. Typical flow:
- Run BM25 to get a top-K sparse candidate set.
- Run vector search across the same index or pre-filtered ids to get dense candidates.
- Merge candidate sets and pass to a cross-encoder re-ranker for final scoring.
Cross-encoder reranker
Use a cross-encoder (sentence pair classifier) fine-tuned on labeled QA data to order results. This reduces false positives from approximate nearest neighbors.
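The flow above might look like the following sketch, using rank_bm25 and a public MS MARCO cross-encoder as stand-ins for the Elasticsearch + FAISS + fine-tuned reranker stack described later.
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder
docs = ['Company X CAR-T timeline slide ...', 'Company Y RNA editing pipeline ...']
bm25 = BM25Okapi([d.lower().split() for d in docs])
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
def hybrid_search(query, dense_ids, k=50, top_n=5):
    # Sparse candidates: top-k BM25 ids for exact-token matches (drug names, identifiers)
    scores = bm25.get_scores(query.lower().split())
    sparse_ids = sorted(range(len(docs)), key=lambda i: -scores[i])[:k]
    # Merge with dense candidates (e.g., FAISS ids), preserving order and dropping duplicates
    candidates = list(dict.fromkeys(sparse_ids + list(dense_ids)))
    # Cross-encoder scores each (query, doc) pair jointly for the final ordering
    pair_scores = reranker.predict([(query, docs[i]) for i in candidates])
    return sorted(zip(candidates, pair_scores), key=lambda x: -x[1])[:top_n]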
Step 7 — Evaluation and tuning
Measure retrieval outcomes with curated ground truth:
- Recall@K — did the relevant slide appear in the top-K?
- Precision@K — are top results relevant?
- MRR (Mean Reciprocal Rank) — ranks matter for analyst workflows.
Build small test sets: 200–500 queries that reflect analyst intents (e.g., competitive intel, technology mentions, timeline extraction). Tune chunk size, embedding model, and reranker to optimize recall while keeping top-5 precision high.
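A minimal harness for these metrics, assuming each query has a curated set of relevant chunk ids and a ranked list of retrieved ids:
def recall_at_k(retrieved, relevant, k=10):
    # Fraction of the relevant set that shows up in the top-k results
    return len(set(retrieved[:k]) & set(relevant)) / max(len(relevant), 1)
def mrr(all_retrieved, all_relevant):
    # Mean reciprocal rank of the first relevant hit per query (0 if none retrieved)
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        rank = next((i + 1 for i, doc_id in enumerate(retrieved) if doc_id in relevant), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(all_retrieved)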
Concrete JPM Healthcare case study — end-to-end
Below is a condensed, reproducible pipeline used for a JPM Healthcare corpus of ~1,200 decks + 400 session recordings collected during the 2026 season.
Ingest & parse
- Downloaded PPTX where available, fell back to PDF. Used python-pptx to get slide text and speaker notes; PyMuPDF for PDFs.
- For recordings, used WhisperX for base STT and aligned tokens to slide-change timestamps via audio cues.
Cleaning & features
- Applied a domain whitelist of ~25k terms (company names, approved drug candidates). Spell-corrected using a fast trie-based matcher.
- Extracted tables into structured rows using Camelot plus heuristics for multi-line cells.
- Ran scispaCy to tag biomedical entities and map them to UMLS concepts.
Embeddings & index
- Text: SBERT-base tuned on PubMed + internal slide text (fewer spurious matches on domain tokens).
- Images: CLIP ViT embeddings for figures; concatenated with text embedding and L2-normalized.
- Index: FAISS HNSW for speed in a self-hosted cluster; daily bulk refreshes with delta updates for new decks. For larger teams, consider managed scale patterns such as Mongoose.Cloud auto-sharding blueprints to reduce ops overhead.
Retrieval & evaluation
- Hybrid search: BM25 (Elasticsearch) pre-filter + FAISS dense search.
- Reranker: DistilRoBERTa cross-encoder fine-tuned on 2k labeled QA pairs from analysts.
- Results: Recall@10 improved 18% after domain adaptation; top-3 precision > 82% for target queries.
Code example — minimal embedding & FAISS index (Python)
from sentence_transformers import SentenceTransformer
import faiss
model = SentenceTransformer('all-mpnet-base-v2')
texts = ['Slide: CAR-T timelines for 2026', 'Company X pipeline summary']
emb = model.encode(texts, convert_to_numpy=True)
# build FAISS index (HNSW over L2; with L2-normalized vectors this ranks like cosine similarity)
d = emb.shape[1]
index = faiss.IndexHNSWFlat(d, 32)
faiss.normalize_L2(emb)
index.add(emb)
# query
q = model.encode(['CAR-T timeline'], convert_to_numpy=True)
faiss.normalize_L2(q)
D, I = index.search(q, 5)
print(I)
Scalability & cost — practical guidance for 2026
Plan for growth: conference corpora grow each year. Key levers:
- Batch vs streaming ingestion — use daily batch loads for decks and streaming ingestion for live transcripts during conferences.
- GPU usage — embedding at scale: 100k slides → ~1–2 GPU-hours using optimized SBERT; image embeddings add cost.
- Storage — vector indices (float32) are heavy; use PQ or quantization to reduce footprint (~8–16x savings) with small recall loss.
- Managed vs self-hosted — managed vendors reduce ops but add per-API costs. For 1M vectors, managed offerings often run in the low thousands of dollars per month versus amortized self-hosted infrastructure costs. Consider distributed file-system tradeoffs (see reviews of distributed file systems) and auto-sharding strategies when sizing your cluster.
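As one example of the quantization lever, FAISS can build an IVF-PQ index through its index_factory string; the nlist and PQ parameters below are illustrative starting points to be tuned against recall targets.
import faiss
import numpy as np
d = 768  # embedding dimension
xb = np.random.rand(100_000, d).astype('float32')  # stand-in for slide embeddings
index = faiss.index_factory(d, 'IVF1024,PQ64')  # 64-byte codes instead of 3072-byte float32 vectors
index.train(xb)  # IVF/PQ indexes need a training pass before adding vectors
index.add(xb)
index.nprobe = 16  # probe more inverted lists to trade latency for recall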
Security, privacy, and regulatory checks for healthcare
Healthcare conference data often contains PHI in case studies or Q&As. Key controls:
- De-identification — remove or hash names, IDs, and PHI before embedding. Keep raw audio/text in an auditable, restricted store.
- Encryption & access controls — VPC, IAM-based role controls; audit all model calls.
- Data retention — implement retention policies aligned with legal/compliance teams. If you anticipate high media volume, review edge storage for media-heavy one-pagers and edge-native storage patterns for cost-effective retention.
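A rule-based sketch of the de-identification and audit-logging controls; the patterns are illustrative, and a real pipeline would pair them with model-based PHI detection before anything reaches the embedding step.
import hashlib
import logging
import re
logging.basicConfig(filename='redaction_audit.log', level=logging.INFO)
PATTERNS = {'mrn': re.compile(r'\bMRN[:\s]*\d{6,10}\b'),
            'email': re.compile(r'\b[\w.+-]+@[\w-]+\.[\w.]+\b')}
def redact(text, doc_id):
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            token_hash = hashlib.sha256(match.group().encode()).hexdigest()[:12]
            # Audit trail records a hash, never the raw PHI
            logging.info('doc=%s type=%s hash=%s', doc_id, label, token_hash)
            text = text.replace(match.group(), f'[REDACTED-{label.upper()}]')
    return text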
Evaluation recipes & troubleshooting
If relevance is low, try these steps in order:
- Increase chunk size to restore context for ambiguous queries.
- Add domain-adaptive fine-tuning or adapter tuning on a small set of annotated slide pairs.
- Introduce BM25 filtering to handle exact token matches (drug names, gene IDs).
- Use a cross-encoder reranker; sometimes a light cross-encoder gives the biggest lift per CPU minute. Also consider local, resilient inference patterns in case of cloud interruptions — see guidance on Edge AI reliability.
2026 trends & future-proofing
As of 2026, expect the following trends to shape conference-data pipelines:
- Multimodal domain embeddings will become standard — models that natively fuse text, tables, and images outperform concatenation heuristics.
- Privacy-preserving embeddings — homomorphic techniques and on-device embeddings will reduce data egress risks for sensitive health data.
- Standardized evaluation datasets for conference corpora will appear, enabling cross-team benchmarking of recall for event-based queries.
- Regulatory controls around clinical claims and data handling will tighten; build audit logs and redaction capabilities now. If you operate at scale, evaluate automating legal & compliance checks as part of your CI and data ingestion pipelines.
Actionable takeaways
- Start with structure: preserve slide titles and table rows as first-class indexable features.
- Combine sparse + dense search: BM25 + vector search reduces false positives on exact tokens.
- Domain adapt: fine-tune or adapter-tune your embedding model on a small in-domain dataset for big quality gains.
- Invest in reranking: a lightweight cross-encoder often beats costly embedding experimentation.
- Comply early: put de-identification and audit trails in ingestion; retrofitting is expensive.
"In 2026, the competitive edge will be earned by teams that turn ephemeral conference signals into indexed, auditable, and semantically rich corpora."
Final checklist before production
- Acquisition pipeline with metadata extraction
- OCR + transcript alignment with manual QA sampling
- Domain NER and whitelist-based corrections
- Embeddings (text + image) and tuned chunking strategy
- Hybrid index + cross-encoder reranker + evaluation harness
- Security & compliance controls and retention policy
Call to action
Ready to prototype? Start with a 2-week spike: ingest 50 decks, transcribe 10 sessions, and run the hybrid retrieval flow above. If you want a reproducible repo and a checklist tailored to your infra (FAISS vs Pinecone, GPU budget, compliance needs), I can prepare a starter project with scripts, docker configs, and a small labeled test set based on JPM-style queries. Reach out to request the repo or a consulting session — let’s turn those slides into strategic features.
Related Reading
- Edge Datastore Strategies for 2026: Cost-Aware Querying, Short-Lived Certificates, and Quantum Pathways
- Review: Distributed File Systems for Hybrid Cloud in 2026 — Performance, Cost, and Ops Tradeoffs
- News: Mongoose.Cloud Launches Auto-Sharding Blueprints for Serverless Workloads
- Edge AI Reliability: Designing Redundancy and Backups for Raspberry Pi-based Inference Nodes