Why we're building this differently
Most AI tools applied to supplement analysis use large language models — the same technology behind consumer chatbots — to read studies and produce scores. We tried this approach first. It didn't meet our standard.
General models hallucinate citations. They produce different scores on different runs. They confuse salt weight with elemental weight, miss standardized extract percentages, and can't reliably distinguish a dose used in an animal study from one used in a human RCT. When your business decisions depend on clinical accuracy, "usually right" doesn't cut it.
So we started building something purpose-made. It's not finished — it's an active engineering effort that improves measurably every week. Here's how it works and where it stands.
Six specialized models, each under continuous refinement
Instead of one general-purpose AI, we're training a pipeline of six specialized biomedical language models. Each is trained on clinical literature and optimized for a single analytical task. Some are further along than others — we're transparent about that because the trajectory matters more than a snapshot.
1. Study retrieval. Determines which published studies are relevant to a specific ingredient at a specific dose. Trained on biomedical retrieval benchmarks. Replacing our previous third-party API approach with a model we control.
2. Clinical data extraction. Extracts structured data from clinical text — doses, populations, outcomes, effect sizes. Currently combining trained models with rule-based extraction for supplement-specific dose formats. Custom fine-tuning is ongoing.
3. Ingredient entity recognition. Resolves ingredient names — branded extracts, chemical synonyms, salt forms — to canonical identities. Trained on multiple biomedical NER datasets. Expanding coverage for supplement-specific branded terminology.
4. Evidence quality assessment. Evaluates study methodology to weight stronger evidence more heavily. Built from established risk-of-bias frameworks. This is one of the harder tasks — no off-the-shelf solution exists, so we're building from research-stage components.
5. Claim verification. Determines whether health claims are supported, contradicted, or unaddressed by clinical evidence. Trained on scientific fact-checking datasets. Adapting for supplement-specific claim language.
6. Ingredient normalization. Maps ingredient variants to canonical forms — salt conversions, elemental weight calculations, branded extract resolution. Combines learned embeddings with a curated reference dictionary we're growing with each analysis.
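To make the salt-conversion problem concrete, here is a minimal sketch of an elemental-weight calculation for two common magnesium forms. The conversion fractions are standard stoichiometric values (elemental mass over compound molar mass); the registry and function names are illustrative, not our production code.

```python
# Sketch: converting a labeled salt dose to elemental mineral content.
# Fractions are stoichiometric (elemental mass / compound molar mass);
# the dictionary and function names here are illustrative only.

ELEMENTAL_FRACTION = {
    "magnesium oxide": 24.305 / 40.304,     # MgO, ~60% elemental Mg
    "magnesium citrate": 72.915 / 451.114,  # Mg3(C6H5O7)2, ~16% elemental Mg
}

def elemental_dose_mg(form: str, labeled_dose_mg: float) -> float:
    """Return elemental mineral content (mg) for a labeled salt dose."""
    return labeled_dose_mg * ELEMENTAL_FRACTION[form.lower()]

# 500 mg of magnesium oxide carries far more elemental magnesium than
# 500 mg of magnesium citrate -- exactly the distinction a scoring
# system has to get right before comparing a label to a clinical dose.
oxide = elemental_dose_mg("magnesium oxide", 500)      # ~301.5 mg
citrate = elemental_dose_mg("magnesium citrate", 500)  # ~80.8 mg
```

This is why a model that reads "500 mg magnesium" off a label, without resolving the salt form, can be off by a factor of nearly four.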
Each model runs on our own serving infrastructure — not third-party API calls. This means no data leaves our environment for model inference, latency is measured in milliseconds, and there's no per-call cost that scales with your usage. We're continuously expanding which tasks are handled by our own models versus external APIs.
Deterministic scoring: math, not language models
This is a foundational architectural decision we made early and won't change: every numeric score in The Clinical Index is computed by a deterministic mathematical formula. Not approximated by AI. Not predicted by a language model. Calculated.
The Supplement Clinical Score
Each ingredient receives a score from 0–100 based on three weighted components: the quality and consistency of published evidence, whether the formulated dose matches doses shown effective in human clinical trials, and the bioavailability characteristics of the specific ingredient form used. The weights are fixed. The formula is the same for every product. As our models improve, the data feeding into this formula gets more accurate — but the formula itself is stable and auditable.
The AI models extract structured data from studies. The scoring engine applies fixed formulas to that data. Same inputs, same outputs, every time. The models are improving. The math doesn't change.
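The shape of that separation can be sketched in a few lines. The three components mirror the ones described above (evidence quality, dose match, bioavailability); the specific weights and function name here are hypothetical stand-ins, not the production formula.

```python
# Sketch of a deterministic weighted score. The weights below are
# hypothetical placeholders -- the point is the structure: a fixed
# formula over model-extracted inputs, with no model in the loop.

WEIGHTS = {"evidence": 0.5, "dose_match": 0.3, "bioavailability": 0.2}

def clinical_score(evidence: float, dose_match: float, bioavailability: float) -> float:
    """Fixed formula: each component in [0, 100], output in [0, 100]."""
    raw = (WEIGHTS["evidence"] * evidence
           + WEIGHTS["dose_match"] * dose_match
           + WEIGHTS["bioavailability"] * bioavailability)
    return round(raw, 1)

# Same inputs, same output, every run -- no sampling, no temperature.
assert clinical_score(80, 70, 90) == clinical_score(80, 70, 90)
```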
How we train, test, and ship improvements
Building a model is the straightforward part. Knowing whether it actually works on real supplement data — and proving it works better than what it's replacing — is where most of our engineering effort goes.
Training on clinical literature
Our models are trained on peer-reviewed biomedical literature — tens of millions of abstracts and full-text articles from clinical journals. This isn't general web text. The models learn the structure and vocabulary of clinical research: how RCTs report dose-response data, how meta-analyses describe pooled effects, how study authors characterize limitations.
Fine-tuning for supplement-specific tasks
Clinical NLP models are a starting point. We fine-tune each model on task-specific datasets — annotated RCT abstracts, labeled chemical entities, expert-curated evidence quality judgments. Where public benchmarks exist, we validate against them. Where they don't — and for supplements, they often don't — we're building our own annotated datasets.
Benchmarking against real products
Every model update is tested against a curated benchmark of 100+ supplement products with expert-verified scores. This benchmark is growing weekly as we verify more products and add more edge cases — proprietary blends, branded extract forms, underdosed formulas, products with strong evidence. A model that improves average accuracy but regresses on any known edge case does not ship.
Automated validation gates
No model reaches production without passing automated testing. We track a composite accuracy metric that combines entity recognition precision, evidence scoring correlation, and end-to-end pipeline accuracy. If the new model doesn't measurably beat the current production model, it doesn't deploy. Every model version is tagged and logged.
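The gate logic can be sketched as follows. The metric names, the equal-weight composite, and the comparison rule are illustrative assumptions; the principle they encode is the one stated above: a candidate ships only if it measurably beats production and regresses on no known edge case.

```python
# Sketch of a deployment gate. Metric names and the equal-weight
# composite are illustrative assumptions, not the production metric.

def composite(metrics: dict) -> float:
    """Hypothetical composite: mean of the three tracked metrics."""
    return (metrics["entity_precision"]
            + metrics["evidence_correlation"]
            + metrics["pipeline_accuracy"]) / 3

def should_deploy(candidate: dict, production: dict,
                  candidate_edge: dict, production_edge: dict) -> bool:
    if composite(candidate) <= composite(production):
        return False  # must measurably beat the current model
    # Any regression on a known edge case blocks the release.
    return all(candidate_edge[case] >= production_edge[case]
               for case in production_edge)
```

An improvement in average accuracy is not enough: if the candidate scores lower on even one benchmarked edge case (a proprietary blend, an underdosed formula), the gate returns False and the candidate is logged rather than shipped.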
This process runs continuously, not quarterly. We're running hundreds of fine-tuning experiments per optimization cycle — testing configurations a human engineer would never have time to try manually. When an improvement is validated, it deploys. When it isn't, we log the failure and the system moves to the next hypothesis.
Evidence depth, not evidence shortcuts
For every ingredient in a formulation, our pipeline retrieves and analyzes published clinical research from PubMed — the U.S. National Library of Medicine's database of over 36 million biomedical citations.
We don't skim abstracts. Our pipeline retrieves full-text articles when available, targeting the sections that matter: methods, intervention details, and results. This is where the actual evidence lives, and it's what most automated tools skip entirely.
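As a rough illustration of what retrieval against PubMed looks like, here is a sketch that builds a query URL for NCBI's public E-utilities `esearch` endpoint. The endpoint and parameters shown are the standard public ones; the query construction and the example ingredient are illustrative, and our production pipeline is considerably more involved.

```python
# Sketch: building a PubMed search request against NCBI's public
# E-utilities esearch endpoint. Query style is illustrative only.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(ingredient: str, retmax: int = 30) -> str:
    # Restrict to human RCTs, mirroring the evidence weighted most heavily.
    term = (f'{ingredient}[Title/Abstract] AND '
            f'"randomized controlled trial"[Publication Type]')
    params = {"db": "pubmed", "term": term,
              "retmax": retmax, "retmode": "json"}
    return f"{EUTILS}?{urlencode(params)}"

url = pubmed_search_url("ashwagandha")
```

The returned PubMed IDs are what make every downstream citation independently verifiable: anyone can paste an ID into PubMed and read the study we read.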
15–30 studies per primary ingredient (increasing as our retrieval improves)
8–15 studies per supporting ingredient (coverage expanding)
3 full-text retrieval sources (cascaded for maximum coverage)
100% of citations verifiable in PubMed (always)
Our evidence index updates on a regular cadence. Newly published studies are retrieved and processed automatically, which means analyses run next month draw from a larger and more current body of evidence than analyses run today.
The improvement flywheel
The most important thing about this system isn't where it is today — it's the rate at which it improves. We've designed the architecture around a self-reinforcing cycle:
Continuous improvement cycle
New clinical data flows in
Our collection pipeline indexes newly published studies from PubMed and retrieves full-text where available. This runs on a regular schedule. The evidence base grows automatically.
Each analysis sharpens the system
Every product we analyze generates data about what the models got right and what they missed. Ingredient aliases we hadn't seen before get added to our registry. Edge cases become benchmark test cases.
Hundreds of experiments, run autonomously
An automated optimization loop runs through model configurations — training parameters, architectural variations, data augmentation strategies — each evaluated against the same immutable benchmark. This happens regularly, not annually.
Only validated improvements deploy
Improvements that pass our benchmark gates ship to production. Improvements that don't are logged and inform the next round. Every deployed model version is tagged for traceability.
This means the platform you sign up for isn't a static tool. The analysis quality improves week over week, the evidence base grows daily, and the ingredient knowledge base expands with every product we process. Early customers get the steepest part of the improvement curve.
What we're working on now
We believe in being transparent about where the platform stands today and where it's heading. Here's our current focus:
Replacing remaining third-party API dependencies with our own trained models running on our infrastructure. This reduces cost, improves latency, and gives us full control over data handling.
Growing from 100+ to 500+ expert-verified products, with emphasis on categories where evidence is contested or formulation practices vary widely.
Building a model that evaluates study methodology quality — blinding, randomization, conflict of interest — to weight stronger evidence more heavily. No off-the-shelf solution exists for this task; we're constructing it from research-stage components.
Expanding our ingredient recognition to handle the long tail of branded extract names (standardized extracts, patented forms, proprietary blends) that general biomedical models don't know about.
This page is a living document. As capabilities ship, we update it. If you want to follow our progress more closely, our changelog tracks every meaningful improvement to the analysis engine.
What this means for your brand
Analysis that improves without action
As our models get more accurate and our evidence base grows, every analysis you run benefits — automatically, at no additional cost.
Scores you can defend
Deterministic, reproducible, traceable to specific studies. Built for the conversation with the retailer, the regulator, or the legal team.
Formulation data protected
Model inference runs on our infrastructure. We're actively reducing external API dependencies to keep more of the analysis pipeline in-house.
Early-adopter advantage
Customers who work with us now help shape what we build. Your edge cases become our benchmark tests. Your feedback drives our model priorities.
Details people ask about
Is the platform live today?
Yes. The core analysis pipeline — evidence retrieval, scoring, and dossier generation — is in production and serving customers. What's actively evolving is the depth and accuracy of the analysis: we're training better models, expanding our evidence index, and growing our validation benchmark. The architecture is stable. The intelligence is improving.

What are the models trained on?
Peer-reviewed biomedical literature indexed by the U.S. National Library of Medicine. Our training data includes tens of millions of abstracts and full-text articles from clinical journals. We do not train on web scrapes, Wikipedia, or user-generated content.

How do you validate accuracy?
We maintain a curated benchmark dataset of 100+ (and growing) supplement products with expert-verified scores. Every model update is tested against this benchmark using a composite metric combining entity recognition accuracy, evidence scoring correlation, and end-to-end pipeline accuracy.

Are the scores generated by a large language model?
No. Large language models are used for specific tasks — evidence synthesis, narrative generation, complex reasoning about study findings — but every numeric score is computed by deterministic formulas. This is a deliberate architectural choice for reproducibility. We're also actively migrating more of the pipeline to our own trained models to reduce external AI dependencies.

How often does the system improve?
Continuously. Our evidence index updates on a regular cadence with newly published studies. Our model optimization loop runs regularly, testing hundreds of configurations per cycle. When an improvement passes our validation gates, it deploys automatically. We track every model version for traceability.

Does our formulation data leave your environment?
Model inference for our trained models runs on our own infrastructure. We currently use external AI APIs for some natural language tasks (evidence synthesis, narrative generation) — those calls contain published study text, not your proprietary formulation details. We're progressively moving more of the pipeline in-house.

Can we verify the citations ourselves?
Every score traces back to specific published studies via PubMed identifiers. You can look up any cited study directly. We also provide the specific data points extracted from each study — dose, population, outcome, effect size — so you can verify our interpretation.
Try the analysis yourself
Upload a supplement label or enter a formulation. See the evidence retrieved, the data extracted, and the scores computed — with full citations.
Start a verification