Why we're building this differently
Most AI tools applied to supplement analysis use large language models — the same technology behind consumer chatbots — to read studies and produce scores. We tried this approach first. It didn't meet our standard.
General models hallucinate citations. They produce different scores on different runs. They confuse salt weight with elemental weight, miss standardized extract percentages, and can't reliably distinguish a dose used in an animal study from one used in a human RCT. When your business decisions depend on clinical accuracy, "usually right" doesn't cut it.
So we started building something purpose-made. It's not finished — it's an active engineering effort that improves measurably every week. Here's how it works and where it stands.
Six specialized models, each under continuous refinement
Instead of one general-purpose AI, we're training a pipeline of six specialized biomedical language models. Each is trained on clinical literature and optimized for a single analytical task. Some are further along than others — we're transparent about that because the trajectory matters more than a snapshot.
1. Study retrieval. Determines which published studies are relevant to a specific ingredient at a specific dose. Trained on biomedical retrieval benchmarks. Replacing our previous third-party API approach with a model we control.
2. Clinical data extraction. Extracts structured data from clinical text — doses, populations, outcomes, effect sizes. Currently combining trained models with rule-based extraction for supplement-specific dose formats. Custom fine-tuning is ongoing.
3. Ingredient entity recognition. Resolves ingredient names — branded extracts, chemical synonyms, salt forms — to canonical identities. Trained on multiple biomedical NER datasets. Expanding coverage for supplement-specific branded terminology.
4. Evidence quality assessment. Evaluates study methodology to weight stronger evidence more heavily. Built from established risk-of-bias frameworks. This is one of the harder tasks — no off-the-shelf solution exists, so we're building from research-stage components.
5. Claim verification. Determines whether health claims are supported, contradicted, or unaddressed by clinical evidence. Trained on scientific fact-checking datasets. Adapting for supplement-specific claim language.
6. Ingredient normalization. Maps ingredient variants to canonical forms — salt conversions, elemental weight calculations, branded extract resolution. Combines learned embeddings with a curated reference dictionary we're growing with each analysis.
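To make the salt-conversion problem concrete, here is a minimal sketch of an elemental-weight calculation for two common magnesium forms. The conversion fractions are standard stoichiometric values (elemental mass over compound molar mass); the registry and function names are illustrative, not our production code.

```python
# Sketch: converting a labeled salt dose to elemental mineral content.
# Fractions are stoichiometric (elemental mass / compound molar mass);
# the dictionary and function names here are illustrative only.

ELEMENTAL_FRACTION = {
    "magnesium oxide": 24.305 / 40.304,     # MgO, ~60% elemental Mg
    "magnesium citrate": 72.915 / 451.114,  # Mg3(C6H5O7)2, ~16% elemental Mg
}

def elemental_dose_mg(form: str, labeled_dose_mg: float) -> float:
    """Return elemental mineral content (mg) for a labeled salt dose."""
    return labeled_dose_mg * ELEMENTAL_FRACTION[form.lower()]

# 500 mg of magnesium oxide carries far more elemental magnesium than
# 500 mg of magnesium citrate -- exactly the distinction a scoring
# system has to get right before comparing a label to a clinical dose.
oxide = elemental_dose_mg("magnesium oxide", 500)      # ~301.5 mg
citrate = elemental_dose_mg("magnesium citrate", 500)  # ~80.8 mg
```

This is why a model that reads "500 mg magnesium" off a label, without resolving the salt form, can be off by a factor of nearly four.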
Each model runs on our own serving infrastructure — not third-party API calls. This means no data leaves our environment for model inference, latency is measured in milliseconds, and there's no per-call cost that scales with your usage. We're continuously expanding which tasks are handled by our own models versus external APIs.
Deterministic scoring: math, not language models
This is a foundational architectural decision we made early and won't change: every numeric score in The Clinical Index is computed by a deterministic mathematical formula. Not approximated by AI. Not predicted by a language model. Calculated.
The Supplement Clinical Score
Each ingredient receives a score from 0–100 based on three weighted components: the quality and consistency of published evidence, whether the formulated dose matches doses shown effective in human clinical trials, and the bioavailability characteristics of the specific ingredient form used. The weights are fixed. The formula is the same for every product. As our models improve, the data feeding into this formula gets more accurate — but the formula itself is stable and auditable.
The AI models extract structured data from studies. The scoring engine applies fixed formulas to that data. Same inputs, same outputs, every time. The models are improving. The math doesn't change.
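The shape of that separation can be sketched in a few lines. The three components mirror the ones described above (evidence quality, dose match, bioavailability); the specific weights and function name here are hypothetical stand-ins, not the production formula.

```python
# Sketch of a deterministic weighted score. The weights below are
# hypothetical placeholders -- the point is the structure: a fixed
# formula over model-extracted inputs, with no model in the loop.

WEIGHTS = {"evidence": 0.5, "dose_match": 0.3, "bioavailability": 0.2}

def clinical_score(evidence: float, dose_match: float, bioavailability: float) -> float:
    """Fixed formula: each component in [0, 100], output in [0, 100]."""
    raw = (WEIGHTS["evidence"] * evidence
           + WEIGHTS["dose_match"] * dose_match
           + WEIGHTS["bioavailability"] * bioavailability)
    return round(raw, 1)

# Same inputs, same output, every run -- no sampling, no temperature.
assert clinical_score(80, 70, 90) == clinical_score(80, 70, 90)
```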
How we train, test, and ship improvements
Building a model is the straightforward part. Knowing whether it actually works on real supplement data — and proving it works better than what it's replacing — is where most of our engineering effort goes.
Training on clinical literature
Our models are trained on peer-reviewed biomedical literature — tens of millions of abstracts and full-text articles from clinical journals. This isn't general web text. The models learn the structure and vocabulary of clinical research: how RCTs report dose-response data, how meta-analyses describe pooled effects, how study authors characterize limitations.
Fine-tuning for supplement-specific tasks
Clinical NLP models are a starting point. We fine-tune each model on task-specific datasets — annotated RCT abstracts, labeled chemical entities, expert-curated evidence quality judgments. Where public benchmarks exist, we validate against them. Where they don't — and for supplements, they often don't — we're building our own annotated datasets.
Benchmarking against real products
Every model update is tested against a curated benchmark of 100+ supplement products with expert-verified scores. This benchmark is growing weekly as we verify more products and add more edge cases — proprietary blends, branded extract forms, underdosed formulas, products with strong evidence. A model that improves average accuracy but regresses on any known edge case does not ship.
Automated validation gates
No model reaches production without passing automated testing. We track a composite accuracy metric that combines entity recognition precision, evidence scoring correlation, and end-to-end pipeline accuracy. If the new model doesn't measurably beat the current production model, it doesn't deploy. Every model version is tagged and logged.
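The gate logic can be sketched as follows. The metric names, the equal-weight composite, and the comparison rule are illustrative assumptions; the principle they encode is the one stated above: a candidate ships only if it measurably beats production and regresses on no known edge case.

```python
# Sketch of a deployment gate. Metric names and the equal-weight
# composite are illustrative assumptions, not the production metric.

def composite(metrics: dict) -> float:
    """Hypothetical composite: mean of the three tracked metrics."""
    return (metrics["entity_precision"]
            + metrics["evidence_correlation"]
            + metrics["pipeline_accuracy"]) / 3

def should_deploy(candidate: dict, production: dict,
                  candidate_edge: dict, production_edge: dict) -> bool:
    if composite(candidate) <= composite(production):
        return False  # must measurably beat the current model
    # Any regression on a known edge case blocks the release.
    return all(candidate_edge[case] >= production_edge[case]
               for case in production_edge)
```

An improvement in average accuracy is not enough: if the candidate scores lower on even one benchmarked edge case (a proprietary blend, an underdosed formula), the gate returns False and the candidate is logged rather than shipped.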
This process runs continuously, not quarterly. We're running hundreds of fine-tuning experiments per optimization cycle — testing configurations a human engineer would never have time to try manually. When an improvement is validated, it deploys. When it isn't, we log the failure and the system moves to the next hypothesis.
Evidence depth, not evidence shortcuts
For every ingredient in a formulation, our pipeline retrieves and analyzes published clinical research from PubMed — the U.S. National Library of Medicine's database of over 36 million biomedical citations.
We don't skim abstracts. Our pipeline retrieves full-text articles when available, targeting the sections that matter: methods, intervention details, and results. This is where the actual evidence lives, and it's what most automated tools skip entirely.
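As a rough illustration of what retrieval against PubMed looks like, here is a sketch that builds a query URL for NCBI's public E-utilities `esearch` endpoint. The endpoint and parameters shown are the standard public ones; the query construction and the example ingredient are illustrative, and our production pipeline is considerably more involved.

```python
# Sketch: building a PubMed search request against NCBI's public
# E-utilities esearch endpoint. Query style is illustrative only.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(ingredient: str, retmax: int = 30) -> str:
    # Restrict to human RCTs, mirroring the evidence weighted most heavily.
    term = (f'{ingredient}[Title/Abstract] AND '
            f'"randomized controlled trial"[Publication Type]')
    params = {"db": "pubmed", "term": term,
              "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}?{urlencode(params)}"

url = pubmed_search_url("ashwagandha")
```

The returned PubMed IDs are what make every downstream citation independently verifiable: anyone can paste an ID into PubMed and read the study we read.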
15–30 studies per primary ingredient (increasing as our retrieval improves)
8–15 studies per supporting ingredient (coverage expanding)
3 full-text retrieval sources (cascaded for maximum coverage)
100% of citations verifiable in PubMed (always)
Our evidence index updates on a regular cadence. Newly published studies are retrieved and processed automatically, which means analyses run next month draw from a larger and more current body of evidence than analyses run today.
The improvement flywheel
The most important thing about this system isn't where it is today — it's the rate at which it improves. We've designed the architecture around a self-reinforcing cycle:
Continuous improvement cycle
New clinical data flows in
Our collection pipeline indexes newly published studies from PubMed and retrieves full-text where available. This runs on a regular schedule. The evidence base grows automatically.
Each analysis sharpens the system
Every product we analyze generates data about what the models got right and what they missed. Ingredient aliases we hadn't seen before get added to our registry. Edge cases become benchmark test cases.
Hundreds of experiments, run autonomously
An automated optimization loop runs through model configurations — training parameters, architectural variations, data augmentation strategies — each evaluated against the same immutable benchmark. This happens regularly, not annually.
Only validated improvements deploy
Improvements that pass our benchmark gates ship to production. Improvements that don't are logged and inform the next round. Every deployed model version is tagged for traceability.
This means the platform you sign up for isn't a static tool. The analysis quality improves week over week, the evidence base grows daily, and the ingredient knowledge base expands with every product we process. Early customers get the steepest part of the improvement curve.
What we're working on now
We believe in being transparent about where the platform stands today and where it's heading. Here's our current focus:
Replacing remaining third-party API dependencies with our own trained models running on our infrastructure. This reduces cost, improves latency, and gives us full control over data handling.
Growing from 100+ to 500+ expert-verified products, with emphasis on categories where evidence is contested or formulation practices vary widely.
Building a model that evaluates study methodology quality — blinding, randomization, conflict of interest — to weight stronger evidence more heavily. No off-the-shelf solution exists for this task; we're constructing it from research-stage components.
Expanding our ingredient recognition to handle the long tail of branded extract names (standardized extracts, patented forms, proprietary blends) that general biomedical models don't know about.
This page is a living document. As capabilities ship, we update it. If you want to follow our progress more closely, our changelog tracks every meaningful improvement to the analysis engine.
What this means for your brand
Analysis that improves without action
As our models get more accurate and our evidence base grows, every analysis you run benefits — automatically, at no additional cost.
Scores you can defend
Deterministic, reproducible, traceable to specific studies. Built for the conversation with the retailer, the regulator, or the legal team.
Formulation data protected
Model inference runs on our infrastructure. We're actively reducing external API dependencies to keep more of the analysis pipeline in-house.
Early-adopter advantage
Customers who work with us now help shape what we build. Your edge cases become our benchmark tests. Your feedback drives our model priorities.
Details people ask about
Is the platform live today?
Yes. The core analysis pipeline — evidence retrieval, scoring, and dossier generation — is in production and serving customers. What's actively evolving is the depth and accuracy of the analysis: we're training better models, expanding our evidence index, and growing our validation benchmark. The architecture is stable. The intelligence is improving.

What are the models trained on?
Peer-reviewed biomedical literature indexed by the U.S. National Library of Medicine. Our training data includes tens of millions of abstracts and full-text articles from clinical journals. We do not train on web scrapes, Wikipedia, or user-generated content.

How do you validate accuracy?
We maintain a curated benchmark dataset of 100+ (and growing) supplement products with expert-verified scores. Every model update is tested against this benchmark using a composite metric combining entity recognition accuracy, evidence scoring correlation, and end-to-end pipeline accuracy.

Are the scores generated by a large language model?
No. Large language models are used for specific tasks — evidence synthesis, narrative generation, complex reasoning about study findings — but every numeric score is computed by deterministic formulas. This is a deliberate architectural choice for reproducibility. We're also actively migrating more of the pipeline to our own trained models to reduce external AI dependencies.

How often does the system improve?
Continuously. Our evidence index updates on a regular cadence with newly published studies. Our model optimization loop runs regularly, testing hundreds of configurations per cycle. When an improvement passes our validation gates, it deploys automatically. We track every model version for traceability.

Does our formulation data leave your environment?
Model inference for our trained models runs on our own infrastructure. We currently use external AI APIs for some natural language tasks (evidence synthesis, narrative generation) — those calls contain published study text, not your proprietary formulation details. We're progressively moving more of the pipeline in-house.

Can we verify the citations ourselves?
Every score traces back to specific published studies via PubMed identifiers. You can look up any cited study directly. We also provide the specific data points extracted from each study — dose, population, outcome, effect size — so you can verify our interpretation.
Try the analysis yourself
Upload a supplement label or enter a formulation. See the evidence retrieved, the data extracted, and the scores computed — with full citations.
Start a verification