SDS Extraction Accuracy Methodology
Scoring framework, cohort design, adjudication process, and error taxonomy used to evaluate SDS extraction quality before production release.
Last updated: 2026-03-10
Methodology objective
This methodology exists to produce repeatable, audit-ready quality measurements for SDS extraction across document layouts, languages, and scan conditions. Scores are computed at field level and rolled up with explicit weighting so release decisions are comparable across benchmark cycles.
Scoring definitions
| Outcome class | Rule | Score contribution |
|---|---|---|
| Exact match | Normalized output equals adjudicated ground truth. | 1.00 |
| Partial match | Semantic match with minor token/unit deviation after normalization. | 0.50 to 0.90 |
| Wrong value | Field extracted but semantically incorrect. | 0.00 |
| Missing value | Required field absent with no valid null reason. | 0.00 |
| Correct null | Ground truth confirms field is absent in source document. | Excluded from denominator |
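The scoring rules above can be sketched as a small scoring function. This is a minimal illustration, not the production scorer: the outcome labels and the default partial-credit value are assumptions, and the only rules it encodes are the score contributions and the exclusion of correct nulls from the denominator.

```python
# Illustrative field-level scoring. Outcome labels and the default
# partial-credit value are assumptions for this sketch.

def field_score(outcome: str, partial_credit: float = 0.75):
    """Return the score contribution for one field, or None if the
    field is excluded from the denominator (correct null)."""
    if outcome == "exact_match":
        return 1.0
    if outcome == "partial_match":
        # Partial credit is clamped to the 0.50-0.90 band from the table.
        return min(max(partial_credit, 0.50), 0.90)
    if outcome in ("wrong_value", "missing_value"):
        return 0.0
    if outcome == "correct_null":
        return None  # excluded from the denominator
    raise ValueError(f"unknown outcome class: {outcome}")

def field_accuracy(outcomes: list) -> float:
    """Mean score over all fields that count toward the denominator."""
    scores = [s for s in (field_score(o) for o in outcomes) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a document with one exact match, one correct null, and one wrong value scores 0.5: the correct null drops out of the denominator entirely rather than counting as a hit or a miss.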
Primary and secondary metrics
- Primary: Weighted field-level F1 by section.
- Secondary: Section coverage rate and low-confidence field rate.
- Operational: Warning rate per document and mean number of fields routed to human review per document.
- Stability: Drift delta versus previous benchmark run at cohort and section levels.
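The primary metric can be sketched as a per-section F1 with a weighted rollup. The section names and weights below are placeholders; the actual weighting scheme is fixed per benchmark cycle so that runs stay comparable.

```python
# Hedged sketch of the section-weighted F1 rollup. Section weights
# are placeholders, not the production weighting scheme.

def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def weighted_f1(section_counts: dict, weights: dict) -> float:
    """Roll section-level F1 scores up into one weighted global score."""
    total_weight = sum(weights[s] for s in section_counts)
    return sum(weights[s] * f1(*counts) for s, counts in section_counts.items()) / total_weight
```

Because the rollup is weighted, a failure concentrated in a high-weight section (e.g. hazard identification) drags the global score down even when low-weight sections are near perfect.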
Cohort design and sampling rules
| Design element | Method |
|---|---|
| Sampling window | Rolling 6-month document set with fixed cohort holdout. |
| Supplier diversity | Cap of 12% per supplier family to prevent skew. |
| Language balance | Minimum 40 files per supported language in benchmark set. |
| Scan quality quotas | Separate quotas for native PDFs, clean scans, and noisy scans. |
| Domain balance | Chemicals, coatings, agrochemicals, cleaners, and industrial intermediates. |
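The quota rules in the table lend themselves to an automated pre-flight check before a benchmark set is frozen. The sketch below encodes only the 12% supplier cap and the 40-file language minimum; the file-record field names are assumptions.

```python
# Minimal quota check for a candidate benchmark set. The supplier cap and
# language minimum come from the table; record field names are assumptions.
from collections import Counter

def check_cohort(files: list, supplier_cap: float = 0.12,
                 min_per_language: int = 40) -> list:
    """Return a list of quota violations; an empty list means the set passes."""
    issues = []
    n = len(files)
    for supplier, count in Counter(f["supplier_family"] for f in files).items():
        if count / n > supplier_cap:
            issues.append(f"supplier family {supplier} exceeds {supplier_cap:.0%} cap")
    for language, count in Counter(f["language"] for f in files).items():
        if count < min_per_language:
            issues.append(f"language {language} below {min_per_language}-file minimum")
    return issues
```

Running this before freezing the cohort keeps supplier skew and language undersampling from silently biasing the benchmark.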
Normalization and comparison rules
- Whitespace, punctuation, and case normalization are applied before matching.
- Units are converted to canonical forms for concentration, flash point, and exposure limits.
- Controlled vocabularies are applied for GHS classes, hazard statements, and transport modes.
- Date values are normalized to ISO format before scoring.
- Table rows are matched with row-level alignment checks, not plain text concatenation.
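A simplified illustration of the normalization pass is below. It assumes Celsius as the canonical flash-point unit and ISO 8601 for dates; the regex patterns and the set of accepted date formats are deliberately reduced examples, not the full production rule set.

```python
# Illustrative normalization helpers. Canonical units, patterns, and the
# accepted date formats here are simplified assumptions.
import re
from datetime import datetime

def normalize_text(value: str) -> str:
    """Collapse whitespace, strip trailing punctuation noise, lowercase."""
    return re.sub(r"\s+", " ", value).strip(" .,;").lower()

def normalize_flash_point(value: str) -> str:
    """Convert a flash point to the canonical form 'NN.N °C'."""
    m = re.match(r"(-?\d+(?:\.\d+)?)\s*°?\s*([CF])\b", value.strip(), re.I)
    if not m:
        return value  # leave unparseable values for adjudication
    temp, unit = float(m.group(1)), m.group(2).upper()
    if unit == "F":
        temp = (temp - 32) * 5 / 9
    return f"{temp:.1f} °C"

def normalize_date(value: str) -> str:
    """Normalize common date layouts to ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value
```

Normalization runs before matching, so "212 °F" and "100.0 °C" compare as exact matches rather than being penalized as a unit deviation.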
Error taxonomy
| Error class | Definition | Typical root cause |
|---|---|---|
| E1 Extraction miss | Field not returned even though present in source. | Layout fragmentation or OCR dropout. |
| E2 Classification confusion | Value mapped to wrong controlled category. | Ambiguous wording or locale variants. |
| E3 Normalization error | Correct text extracted but normalized incorrectly. | Unit parsing or date conversion defect. |
| E4 Table structure error | Rows/columns collapsed or shifted in output. | Complex multi-line table formatting. |
| E5 Context leakage | Header/footer or adjacent section text captured as value. | Template noise in scanned documents. |
Release gates and acceptance thresholds
| Gate | Threshold | Action if failed |
|---|---|---|
| Global weighted F1 | >= 0.94 | Block release and trigger root-cause review. |
| Critical section F1 (2, 3, 8, 14) | >= 0.92 | Block release for affected cohorts. |
| Low-confidence field rate | <= 4.5% | Enable stricter human-review routing. |
| Noisy-scan warning rate | <= 15% | Tune OCR and table parser before rollout. |
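The gate table maps directly onto a small evaluation routine. The thresholds below mirror the table; the `RunMetrics` field names are assumptions chosen to match the report schema rather than a real production interface.

```python
# Sketch of the release-gate evaluation. Thresholds mirror the table;
# the RunMetrics field names are assumptions.
from dataclasses import dataclass

@dataclass
class RunMetrics:
    global_weighted_f1: float
    critical_section_f1: dict   # F1 per critical section (2, 3, 8, 14)
    low_confidence_field_rate: float
    noisy_scan_warning_rate: float

def evaluate_gates(m: RunMetrics) -> list:
    """Return the actions triggered by failed gates; empty means release."""
    actions = []
    if m.global_weighted_f1 < 0.94:
        actions.append("block release; trigger root-cause review")
    if any(score < 0.92 for score in m.critical_section_f1.values()):
        actions.append("block release for affected cohorts")
    if m.low_confidence_field_rate > 0.045:
        actions.append("enable stricter human-review routing")
    if m.noisy_scan_warning_rate > 0.15:
        actions.append("tune OCR and table parser before rollout")
    return actions
```

Gates are evaluated independently, so a run can pass the global F1 gate and still be blocked by a single critical section falling below 0.92.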
Adjudication and audit process
- Each sampled file is labeled independently by two reviewers.
- Disagreements are escalated to a domain adjudicator with written rationale.
- All adjudication outcomes are versioned and linked to benchmark run IDs.
- Rule changes are logged and applied prospectively to prevent silent metric drift.
Report output format
```json
{
  "run_id": "bench_2026_q1_r2",
  "sample_size": 720,
  "global_weighted_f1": 0.952,
  "section_scores": {
    "hazard_identification": 0.972,
    "composition": 0.951,
    "transport_information": 0.931
  },
  "low_confidence_field_rate": 0.036,
  "error_taxonomy_share": {
    "E1_extraction_miss": 0.27,
    "E2_classification_confusion": 0.21,
    "E3_normalization_error": 0.18,
    "E4_table_structure_error": 0.22,
    "E5_context_leakage": 0.12
  }
}
```
Update log
- 2026-03-10: Replaced generic governance copy with formal scoring rules, cohort design, and release gates.
- 2026-03-10: Added explicit error taxonomy and adjudication workflow for auditability.
- 2026-03-10: Added canonical benchmark report schema with run-level metrics.
FAQ
Why use weighted field-level F1 instead of document-level pass rate only?
Document-level pass rates can hide critical section failures. Weighted field-level scoring keeps high-risk sections visible and makes release gates more defensible.
How are partial matches treated for multilingual fields?
Partial matches are scored with language-aware normalization rules, then reviewed through adjudication when semantic meaning is uncertain.
Can a customer run this methodology on a private holdout set?
Yes. The same scoring and gating rules can be applied to customer-owned datasets without changing the benchmark rubric.
Related pages
- Benchmark results by section and cohort
- Schema versioning and migration policy
- Field coverage matrix
- SDS extraction API
- API docs
- Request a validation plan for your SDS corpus