
SDS Extraction Accuracy Methodology

Scoring framework, cohort design, adjudication process, and error taxonomy used to evaluate SDS extraction quality before production release.

Last updated: 2026-03-10

Methodology objective

This methodology exists to produce repeatable, audit-ready quality measurements for SDS extraction across document layouts, languages, and scan conditions. Scores are computed at field level and rolled up with explicit weighting so release decisions are comparable across benchmark cycles.

Scoring definitions

Each field outcome is assigned to one class with a fixed score contribution:

  • Exact match: normalized output equals adjudicated ground truth. Contribution: 1.00.
  • Partial match: semantic match with minor token/unit deviation after normalization. Contribution: 0.50 to 0.90.
  • Wrong value: field extracted but semantically incorrect. Contribution: 0.00.
  • Missing value: required field absent with no valid null reason. Contribution: 0.00.
  • Correct null: ground truth confirms the field is absent in the source document. Excluded from the denominator.

Primary and secondary metrics

  • Primary: Weighted field-level F1 by section.
  • Secondary: Section coverage rate and low-confidence field rate.
  • Operational: Warning rate per document and mean number of fields requiring human review.
  • Stability: Drift delta versus previous benchmark run at cohort and section levels.
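The primary metric rollup can be illustrated with a short sketch. The section names and weights below are hypothetical; the document specifies only that sections are rolled up with explicit weighting:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall for one section."""
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def global_weighted_f1(section_f1: dict[str, float],
                       weights: dict[str, float]) -> float:
    """Roll section-level F1 up to a single global score using the
    explicit per-section weights (keys assumed to align)."""
    total_weight = sum(weights[s] for s in section_f1)
    return sum(section_f1[s] * weights[s] for s in section_f1) / total_weight
```

Because the weights are explicit, two benchmark cycles that use the same weight table produce directly comparable global scores.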

Cohort design and sampling rules

  • Sampling window: rolling 6-month document set with a fixed cohort holdout.
  • Supplier diversity: cap of 12% per supplier family to prevent skew.
  • Language balance: minimum of 40 files per supported language in the benchmark set.
  • Scan quality quotas: separate quotas for native PDFs, clean scans, and noisy scans.
  • Domain balance: chemicals, coatings, agrochemicals, cleaners, and industrial intermediates.
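The supplier-diversity cap can be enforced at sampling time. The following is a minimal sketch, assuming documents arrive as `(doc_id, supplier)` pairs; the function name and fixed seed are illustrative:

```python
import random
from collections import Counter

def sample_with_supplier_cap(docs, n, cap_share=0.12, seed=42):
    """Draw a benchmark sample of size n while keeping any single
    supplier family at or below cap_share of the sample (the 12% cap)."""
    rng = random.Random(seed)  # fixed seed keeps the cohort reproducible
    pool = docs[:]
    rng.shuffle(pool)
    cap = max(1, int(n * cap_share))
    counts = Counter()
    sample = []
    for doc_id, supplier in pool:
        if len(sample) == n:
            break
        if counts[supplier] < cap:
            counts[supplier] += 1
            sample.append(doc_id)
    return sample
```

In practice the same pattern repeats for language and scan-quality quotas, with one counter per quota dimension.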

Normalization and comparison rules

  1. Whitespace, punctuation, and case normalization are applied before matching.
  2. Units are converted to canonical forms for concentration, flash point, and exposure limits.
  3. Controlled vocabularies are applied for GHS classes, hazard statements, and transport modes.
  4. Date values are normalized to ISO format before scoring.
  5. Table rows are matched with row-level alignment checks, not plain text concatenation.
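Steps 1, 2, and 4 above can be sketched as small normalization helpers. The regexes, the Fahrenheit-to-Celsius rule, and the accepted date layouts are illustrative assumptions, not the production canonicalization tables:

```python
import re
from datetime import datetime

def normalize_text(value: str) -> str:
    """Step 1: whitespace, punctuation, and case normalization."""
    value = re.sub(r"[^\w\s.%-]", " ", value.lower())
    return re.sub(r"\s+", " ", value).strip()

def normalize_flash_point(value: str) -> str:
    """Step 2 sketch: convert Fahrenheit flash points to a canonical
    Celsius form; non-Fahrenheit values pass through unchanged."""
    m = re.match(r"\s*(-?\d+(?:\.\d+)?)\s*°?\s*F\b", value, re.IGNORECASE)
    if m:
        celsius = (float(m.group(1)) - 32) * 5 / 9
        return f"{celsius:.1f} C"
    return value

def normalize_date(value: str) -> str:
    """Step 4: parse common date layouts into ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d", "%d %B %Y"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value  # leave unparseable values for adjudication
```

Normalization runs before comparison, so a partial match is judged on the canonical forms, not on raw extracted strings.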

Error taxonomy

  • E1 Extraction miss: field not returned even though present in the source. Typical root cause: layout fragmentation or OCR dropout.
  • E2 Classification confusion: value mapped to the wrong controlled category. Typical root cause: ambiguous wording or locale variants.
  • E3 Normalization error: correct text extracted but normalized incorrectly. Typical root cause: unit parsing or date conversion defect.
  • E4 Table structure error: rows or columns collapsed or shifted in the output. Typical root cause: complex multi-line table formatting.
  • E5 Context leakage: header/footer or adjacent-section text captured as a value. Typical root cause: template noise in scanned documents.
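Adjudicated errors labeled with this taxonomy roll up into the per-class shares reported in the benchmark output. A minimal sketch of that rollup, assuming each error carries one class label:

```python
from collections import Counter

ERROR_CLASSES = ("E1", "E2", "E3", "E4", "E5")

def error_share(labels: list[str]) -> dict[str, float]:
    """Fraction of adjudicated errors falling into each taxonomy
    class; feeds the error_taxonomy_share field of the report."""
    counts = Counter(labels)
    total = sum(counts[c] for c in ERROR_CLASSES)
    return {c: counts[c] / total if total else 0.0 for c in ERROR_CLASSES}
```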

Release gates and acceptance thresholds

  • Global weighted F1 >= 0.94. If failed: block release and trigger a root-cause review.
  • Critical section F1 (sections 2, 3, 8, 14) >= 0.92. If failed: block release for affected cohorts.
  • Low-confidence field rate <= 4.5%. If failed: enable stricter human-review routing.
  • Noisy-scan warning rate <= 15%. If failed: tune the OCR and table parser before rollout.
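The four gates above can be checked mechanically against a benchmark report. The dictionary keys below mirror the report schema later in this document, except `critical_section_f1` and `noisy_scan_warning_rate`, which are assumed names for illustration:

```python
def evaluate_gates(report: dict) -> list[str]:
    """Return the names of failed release gates; an empty list
    means the release may proceed."""
    failures = []
    if report["global_weighted_f1"] < 0.94:
        failures.append("global_weighted_f1")
    if any(score < 0.92 for score in report.get("critical_section_f1", {}).values()):
        failures.append("critical_section_f1")
    if report["low_confidence_field_rate"] > 0.045:
        failures.append("low_confidence_field_rate")
    if report.get("noisy_scan_warning_rate", 0.0) > 0.15:
        failures.append("noisy_scan_warning_rate")
    return failures
```

Encoding the gates as code rather than a checklist keeps pass/fail decisions identical across benchmark cycles.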

Adjudication and audit process

  • Each sampled file is labeled independently by two reviewers.
  • Disagreements are escalated to a domain adjudicator with written rationale.
  • All adjudication outcomes are versioned and linked to benchmark run IDs.
  • Rule changes are logged and applied prospectively to prevent silent metric drift.

Report output format

{
  "run_id": "bench_2026_q1_r2",
  "sample_size": 720,
  "global_weighted_f1": 0.952,
  "section_scores": {
    "hazard_identification": 0.972,
    "composition": 0.951,
    "transport_information": 0.931
  },
  "low_confidence_field_rate": 0.036,
  "error_taxonomy_share": {
    "E1_extraction_miss": 0.27,
    "E2_classification_confusion": 0.21,
    "E3_normalization_error": 0.18,
    "E4_table_structure_error": 0.22,
    "E5_context_leakage": 0.12
  }
}
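Before a report feeds downstream gating, it is worth validating its shape. The following is a minimal sketch (not a full schema check) that verifies the run-level keys shown above are present and that the taxonomy shares sum to roughly 1.0:

```python
import json

REQUIRED_KEYS = {
    "run_id", "sample_size", "global_weighted_f1",
    "section_scores", "low_confidence_field_rate", "error_taxonomy_share",
}

def validate_report(raw: str) -> dict:
    """Parse a benchmark report and check run-level invariants."""
    report = json.loads(raw)
    missing = REQUIRED_KEYS - report.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    share = sum(report["error_taxonomy_share"].values())
    if abs(share - 1.0) > 0.01:
        raise ValueError(f"taxonomy shares sum to {share}, expected ~1.0")
    return report
```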

Update log

  • 2026-03-10: Replaced generic governance copy with formal scoring rules, cohort design, and release gates.
  • 2026-03-10: Added explicit error taxonomy and adjudication workflow for auditability.
  • 2026-03-10: Added canonical benchmark report schema with run-level metrics.

FAQ

Why use weighted field-level F1 instead of document-level pass rate only?

Document-level pass rates can hide critical section failures. Weighted field-level scoring keeps high-risk sections visible and makes release gates more defensible.

How are partial matches treated for multilingual fields?

Partial matches are scored with language-aware normalization rules, then reviewed through adjudication when semantic meaning is uncertain.

Can a customer run this methodology on a private holdout set?

Yes. The same scoring and gating rules can be applied to customer-owned datasets without changing the benchmark rubric.

Next steps

Need this methodology applied to your own data? Request a validation plan and receive cohort-level scoring with release recommendations.