SDS Extraction Accuracy Methodology
Scoring framework, cohort design, adjudication process, and error taxonomy used to evaluate SDS extraction quality before production release.
Last updated: 2026-03-10
Methodology objective
This methodology exists to produce repeatable, audit-ready quality measurements for SDS extraction across document layouts, languages, and scan conditions. Scores are computed at field level and rolled up with explicit weighting so release decisions are comparable across benchmark cycles.
Scoring definitions
| Outcome class | Rule | Score contribution |
|---|---|---|
| Exact match | Normalized output equals adjudicated ground truth. | 1.00 |
| Partial match | Semantic match with minor token/unit deviation after normalization. | 0.50 to 0.90 |
| Wrong value | Field extracted but semantically incorrect. | 0.00 |
| Missing value | Required field absent with no valid null reason. | 0.00 |
| Correct null | Ground truth confirms field is absent in source document. | Excluded from denominator |
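The scoring rules above can be sketched as a small scoring function. This is a minimal illustration, not the production scorer: the outcome labels and the default partial-credit value are assumptions, and the only rules it encodes are the score contributions and the exclusion of correct nulls from the denominator.

```python
# Illustrative field-level scoring. Outcome labels and the default
# partial-credit value are assumptions for this sketch.

def field_score(outcome: str, partial_credit: float = 0.75):
    """Return the score contribution for one field, or None if the
    field is excluded from the denominator (correct null)."""
    if outcome == "exact_match":
        return 1.0
    if outcome == "partial_match":
        # Partial credit is clamped to the 0.50-0.90 band from the table.
        return min(max(partial_credit, 0.50), 0.90)
    if outcome in ("wrong_value", "missing_value"):
        return 0.0
    if outcome == "correct_null":
        return None  # excluded from the denominator
    raise ValueError(f"unknown outcome class: {outcome}")

def field_accuracy(outcomes: list) -> float:
    """Mean score over all fields that count toward the denominator."""
    scores = [s for s in (field_score(o) for o in outcomes) if s is not None]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a document with one exact match, one correct null, and one wrong value scores 0.5: the correct null drops out of the denominator entirely rather than counting as a hit or a miss.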
Primary and secondary metrics
- Primary: Weighted field-level F1 by section.
- Secondary: Section coverage rate and low-confidence field rate.
- Operational: Warning rate per document and mean number of fields routed to human review per document.
- Stability: Drift delta versus previous benchmark run at cohort and section levels.
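The primary metric can be sketched as a per-section F1 with a weighted rollup. The section names and weights below are placeholders; the actual weighting scheme is fixed per benchmark cycle so that runs stay comparable.

```python
# Hedged sketch of the section-weighted F1 rollup. Section weights
# are placeholders, not the production weighting scheme.

def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def weighted_f1(section_counts: dict, weights: dict) -> float:
    """Roll section-level F1 scores up into one weighted global score."""
    total_weight = sum(weights[s] for s in section_counts)
    return sum(weights[s] * f1(*counts) for s, counts in section_counts.items()) / total_weight
```

Because the rollup is weighted, a failure concentrated in a high-weight section (e.g. hazard identification) drags the global score down even when low-weight sections are near perfect.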
Cohort design and sampling rules
| Design element | Method |
|---|---|
| Sampling window | Rolling 6-month document set with fixed cohort holdout. |
| Supplier diversity | Cap of 12% per supplier family to prevent skew. |
| Language balance | Minimum 40 files per supported language in benchmark set. |
| Scan quality quotas | Separate quotas for native PDFs, clean scans, and noisy scans. |
| Domain balance | Chemicals, coatings, agrochemicals, cleaners, and industrial intermediates. |
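The quota rules in the table lend themselves to an automated pre-flight check before a benchmark set is frozen. The sketch below encodes only the 12% supplier cap and the 40-file language minimum; the file-record field names are assumptions.

```python
# Minimal quota check for a candidate benchmark set. The supplier cap and
# language minimum come from the table; record field names are assumptions.
from collections import Counter

def check_cohort(files: list, supplier_cap: float = 0.12,
                 min_per_language: int = 40) -> list:
    """Return a list of quota violations; an empty list means the set passes."""
    issues = []
    n = len(files)
    for supplier, count in Counter(f["supplier_family"] for f in files).items():
        if count / n > supplier_cap:
            issues.append(f"supplier family {supplier} exceeds {supplier_cap:.0%} cap")
    for language, count in Counter(f["language"] for f in files).items():
        if count < min_per_language:
            issues.append(f"language {language} below {min_per_language}-file minimum")
    return issues
```

Running this before freezing the cohort keeps supplier skew and language undersampling from silently biasing the benchmark.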
Normalization and comparison rules
- Whitespace, punctuation, and case normalization are applied before matching.
- Units are converted to canonical forms for concentration, flash point, and exposure limits.
- Controlled vocabularies are applied for GHS classes, hazard statements, and transport modes.
- Date values are normalized to ISO format before scoring.
- Table rows are matched with row-level alignment checks, not plain text concatenation.
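A simplified illustration of the normalization pass is below. It assumes Celsius as the canonical flash-point unit and ISO 8601 for dates; the regex patterns and the set of accepted date formats are deliberately reduced examples, not the full production rule set.

```python
# Illustrative normalization helpers. Canonical units, patterns, and the
# accepted date formats here are simplified assumptions.
import re
from datetime import datetime

def normalize_text(value: str) -> str:
    """Collapse whitespace, strip trailing punctuation noise, lowercase."""
    return re.sub(r"\s+", " ", value).strip(" .,;").lower()

def normalize_flash_point(value: str) -> str:
    """Convert a flash point to the canonical form 'NN.N °C'."""
    m = re.match(r"(-?\d+(?:\.\d+)?)\s*°?\s*([CF])\b", value.strip(), re.I)
    if not m:
        return value  # leave unparseable values for adjudication
    temp, unit = float(m.group(1)), m.group(2).upper()
    if unit == "F":
        temp = (temp - 32) * 5 / 9
    return f"{temp:.1f} °C"

def normalize_date(value: str) -> str:
    """Normalize common date layouts to ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%d.%m.%Y", "%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value
```

Normalization runs before matching, so "212 °F" and "100.0 °C" compare as exact matches rather than being penalized as a unit deviation.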
Error taxonomy
| Error class | Definition | Typical root cause |
|---|---|---|
| E1 Extraction miss | Field not returned even though present in source. | Layout fragmentation or OCR dropout. |
| E2 Classification confusion | Value mapped to wrong controlled category. | Ambiguous wording or locale variants. |
| E3 Normalization error | Correct text extracted but normalized incorrectly. | Unit parsing or date conversion defect. |
| E4 Table structure error | Rows/columns collapsed or shifted in output. | Complex multi-line table formatting. |
| E5 Context leakage | Header/footer or adjacent section text captured as value. | Template noise in scanned documents. |
Release gates and acceptance thresholds
| Gate | Threshold | Action if failed |
|---|---|---|
| Global weighted F1 | >= 0.94 | Block release and trigger root-cause review. |
| Critical section F1 (2, 3, 8, 14) | >= 0.92 | Block release for affected cohorts. |
| Low-confidence field rate | <= 4.5% | Enable stricter human-review routing. |
| Noisy-scan warning rate | <= 15% | Tune OCR and table parser before rollout. |
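The gate table maps directly onto a small evaluation routine. The thresholds below mirror the table; the `RunMetrics` field names are assumptions chosen to match the report schema rather than a real production interface.

```python
# Sketch of the release-gate evaluation. Thresholds mirror the table;
# the RunMetrics field names are assumptions.
from dataclasses import dataclass

@dataclass
class RunMetrics:
    global_weighted_f1: float
    critical_section_f1: dict   # F1 per critical section (2, 3, 8, 14)
    low_confidence_field_rate: float
    noisy_scan_warning_rate: float

def evaluate_gates(m: RunMetrics) -> list:
    """Return the actions triggered by failed gates; empty means release."""
    actions = []
    if m.global_weighted_f1 < 0.94:
        actions.append("block release; trigger root-cause review")
    if any(score < 0.92 for score in m.critical_section_f1.values()):
        actions.append("block release for affected cohorts")
    if m.low_confidence_field_rate > 0.045:
        actions.append("enable stricter human-review routing")
    if m.noisy_scan_warning_rate > 0.15:
        actions.append("tune OCR and table parser before rollout")
    return actions
```

Gates are evaluated independently, so a run can pass the global F1 gate and still be blocked by a single critical section falling below 0.92.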
Adjudication and audit process
- Each sampled file is labeled independently by two reviewers.
- Disagreements are escalated to a domain adjudicator with written rationale.
- All adjudication outcomes are versioned and linked to benchmark run IDs.
- Rule changes are logged and applied prospectively to prevent silent metric drift.
Report output format
```json
{
  "run_id": "bench_2026_q1_r2",
  "sample_size": 720,
  "global_weighted_f1": 0.952,
  "section_scores": {
    "hazard_identification": 0.972,
    "composition": 0.951,
    "transport_information": 0.931
  },
  "low_confidence_field_rate": 0.036,
  "error_taxonomy_share": {
    "E1_extraction_miss": 0.27,
    "E2_classification_confusion": 0.21,
    "E3_normalization_error": 0.18,
    "E4_table_structure_error": 0.22,
    "E5_context_leakage": 0.12
  }
}
```
Update log
- 2026-03-10: Replaced generic governance copy with formal scoring rules, cohort design, and release gates.
- 2026-03-10: Added explicit error taxonomy and adjudication workflow for auditability.
- 2026-03-10: Added canonical benchmark report schema with run-level metrics.
FAQ
Why use weighted field-level F1 instead of document-level pass rate only?
Document-level pass rates can hide critical section failures. Weighted field-level scoring keeps high-risk sections visible and makes release gates more defensible.
How are partial matches treated for multilingual fields?
Partial matches are scored with language-aware normalization rules, then reviewed through adjudication when semantic meaning is uncertain.
Can a customer run this methodology on a private holdout set?
Yes. The same scoring and gating rules can be applied to customer-owned datasets without changing the benchmark rubric.
Related pages
- Benchmark results by section and cohort
- Schema versioning and migration policy
- Field coverage matrix
- SDS extraction API
- API docs
- Request a validation plan for your SDS corpus