Evidence

SDS Extraction Benchmark by Section, Language, and Scan Quality

Public benchmark results for extraction quality, split by section, language cohort, and scan quality so implementation teams can set realistic controls before go-live.

Last updated: 2026-03-10

Benchmark scope and sample composition

The current benchmark run covers 720 SDS documents from chemical manufacturing, distribution, paints/coatings, and industrial cleaning categories. Files are sampled across recurring supplier layouts to avoid single-template overfitting.

| Cohort dimension | Composition |
| --- | --- |
| Total files | 720 SDS files |
| Regions | US (37%), EU (41%), APAC (22%) |
| Languages | English, German, French, Spanish, Italian, Portuguese |
| File types | Native PDFs (56%), scanned PDFs (44%) |
| Scan quality split | Clean 29%, moderate noise 11%, noisy/low DPI 4% |
| Labeling process | Dual human annotation with adjudication for conflicts |

Methodology summary

  1. Every file is mapped to a controlled section/field schema before scoring.
  2. Scores are computed at field level (exact, partial, missing, wrong) and aggregated by section.
  3. Section scores are weighted by operational importance for downstream compliance workflows.
  4. Final benchmark reports include confidence distributions and warning-rate breakdowns by cohort (a minimal scoring sketch follows this list).
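
As an illustration of steps 2 and 3, the sketch below aggregates field-level outcomes (exact, partial, missing, wrong) into per-section scores and a weighted overall score. The field records, credit values, and section weights are illustrative assumptions, not the benchmark's published weighting scheme.

```python
from collections import defaultdict

# Hypothetical field-level outcomes from adjudicated annotation:
# one record per (section, field) labeled exact, partial, missing, or wrong.
field_outcomes = [
    {"section": "1. Identification", "field": "product_name", "outcome": "exact"},
    {"section": "1. Identification", "field": "supplier_name", "outcome": "partial"},
    {"section": "14. Transport information", "field": "un_number", "outcome": "wrong"},
]

# Illustrative per-outcome credit and operational section weights.
CREDIT = {"exact": 1.0, "partial": 0.5, "missing": 0.0, "wrong": 0.0}
SECTION_WEIGHTS = {"1. Identification": 1.0, "14. Transport information": 1.2}

def section_scores(outcomes):
    """Average field-level credit within each section."""
    totals, counts = defaultdict(float), defaultdict(int)
    for rec in outcomes:
        totals[rec["section"]] += CREDIT[rec["outcome"]]
        counts[rec["section"]] += 1
    return {sec: totals[sec] / counts[sec] for sec in totals}

def weighted_overall(scores, weights):
    """Weight section scores by operational importance and normalize."""
    numerator = sum(score * weights.get(sec, 1.0) for sec, score in scores.items())
    denominator = sum(weights.get(sec, 1.0) for sec in scores)
    return numerator / denominator

scores = section_scores(field_outcomes)
print(scores)                                     # per-section mean credit
print(weighted_overall(scores, SECTION_WEIGHTS))  # weighted overall score
```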

Section-level results

| SDS section | Field F1 | Coverage | Warning rate |
| --- | --- | --- | --- |
| 1. Identification | 0.988 | 99.4% | 1.1% |
| 2. Hazard identification | 0.972 | 98.9% | 2.4% |
| 3. Composition | 0.951 | 96.8% | 5.3% |
| 4. First aid measures | 0.964 | 97.7% | 3.6% |
| 7. Handling and storage | 0.958 | 97.2% | 4.1% |
| 8. Exposure controls/PPE | 0.946 | 95.9% | 6.0% |
| 9. Physical properties | 0.939 | 95.1% | 6.7% |
| 14. Transport information | 0.931 | 94.3% | 7.2% |
| 15. Regulatory information | 0.928 | 93.8% | 7.8% |
| 16. Other information | 0.944 | 96.0% | 5.5% |
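
One plausible way to derive field F1 and coverage from outcome counts is sketched below. The mapping of partial, wrong, and missing outcomes onto precision and recall is an assumption for illustration, not the benchmark's exact scoring definition.

```python
def field_metrics(exact, partial, wrong, missing, partial_credit=0.5):
    """Illustrative mapping from field outcomes to precision, recall, F1,
    and coverage: emitted-but-wrong values count against precision,
    missing values count against recall, partial matches earn half credit."""
    true_positive = exact + partial_credit * partial
    emitted = exact + partial + wrong            # fields the extractor produced
    gold = exact + partial + wrong + missing     # fields present in the source
    precision = true_positive / emitted if emitted else 0.0
    recall = true_positive / gold if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    coverage = emitted / gold if gold else 0.0
    return {"precision": round(precision, 3), "recall": round(recall, 3),
            "f1": round(f1, 3), "coverage": round(coverage, 3)}

# Example: a section with mostly exact extractions and a handful of errors.
print(field_metrics(exact=940, partial=30, wrong=15, missing=15))
```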

Language slices

| Language | Files | Weighted field F1 | Low-confidence fields (<0.80) |
| --- | --- | --- | --- |
| English | 266 | 0.962 | 2.2% |
| German | 144 | 0.956 | 2.8% |
| French | 110 | 0.949 | 3.3% |
| Spanish | 92 | 0.947 | 3.6% |
| Italian | 63 | 0.943 | 3.9% |
| Portuguese | 45 | 0.938 | 4.2% |
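
The low-confidence column can be reproduced on a customer dataset as the share of extracted fields whose confidence score falls below 0.80. The sketch below assumes hypothetical field records with a `confidence` value attached to each extraction.

```python
from collections import defaultdict

# Hypothetical extracted-field records with per-field model confidence.
fields = [
    {"language": "English", "confidence": 0.97},
    {"language": "English", "confidence": 0.72},
    {"language": "German", "confidence": 0.91},
]

def low_confidence_share(records, threshold=0.80):
    """Share of fields below the confidence threshold, per language slice."""
    low, total = defaultdict(int), defaultdict(int)
    for rec in records:
        total[rec["language"]] += 1
        if rec["confidence"] < threshold:
            low[rec["language"]] += 1
    return {lang: low[lang] / total[lang] for lang in total}

print(low_confidence_share(fields))  # e.g. {'English': 0.5, 'German': 0.0}
```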

Scan-quality cohorts

| Scan cohort | Definition | Weighted field F1 | Warning rate |
| --- | --- | --- | --- |
| Native PDF | Digital text layer intact | 0.968 | 2.1% |
| Clean scan | 300 DPI+, low skew, low blur | 0.952 | 4.4% |
| Moderate noise | Rotation/table bleed or partial blur | 0.936 | 7.3% |
| Noisy scan | <200 DPI, handwritten marks, heavy artifacting | 0.901 | 13.8% |
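
Cohort assignment can be automated from basic scan measurements taken in a preprocessing pass. The sketch below uses illustrative thresholds for DPI, skew, and blur that roughly follow the qualitative definitions above; the exact thresholds used in this benchmark are not published here.

```python
from dataclasses import dataclass

@dataclass
class ScanProfile:
    # Hypothetical per-document measurements from a preprocessing pass.
    has_text_layer: bool
    dpi: int
    skew_degrees: float
    blur_score: float        # 0 = sharp, 1 = heavily blurred (illustrative scale)
    has_handwriting: bool

def assign_cohort(p: ScanProfile) -> str:
    """Map a document to a scan-quality cohort using illustrative cutoffs."""
    if p.has_text_layer:
        return "Native PDF"
    if p.dpi < 200 or p.has_handwriting or p.blur_score > 0.6:
        return "Noisy scan"
    if p.dpi >= 300 and p.skew_degrees < 2.0 and p.blur_score < 0.2:
        return "Clean scan"
    return "Moderate noise"

print(assign_cohort(ScanProfile(False, 240, 4.5, 0.3, False)))  # -> "Moderate noise"
```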

Observed failure patterns

| Failure pattern | Share of flagged errors | Mitigation rule |
| --- | --- | --- |
| Composition table row merges | 31% | Require row-count validation against source table boundaries. |
| Transport code confusion (ADR/IMDG) | 23% | Enable transport-specific cross-field checks before write-back. |
| Localized synonym drift | 19% | Use language-specific synonym dictionaries at parse time. |
| Header/footer bleed into fields | 15% | Apply layout denoising and confidence-based warning escalation. |
| Unit normalization mismatch | 12% | Normalize units before downstream matching and policy checks. |
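
As an example of the first mitigation rule, the sketch below flags composition tables whose extracted row count does not match the row count detected in the source layout. The record shape and the `source_row_count` signal are hypothetical stand-ins for whatever the upstream layout analysis provides.

```python
def validate_composition_rows(extracted_rows, source_row_count, tolerance=0):
    """Flag composition tables whose extracted row count drifts from the
    row count detected in the source layout, instead of silently
    accepting merged rows."""
    delta = abs(len(extracted_rows) - source_row_count)
    if delta > tolerance:
        return {
            "warning": "composition_row_count_mismatch",
            "extracted_rows": len(extracted_rows),
            "source_rows": source_row_count,
        }
    return None

# Example: two ingredient rows merged into one during extraction.
rows = [{"name": "Ethanol; Isopropanol", "cas": "64-17-5", "pct": "30-60"}]
print(validate_composition_rows(rows, source_row_count=2))
```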

How to use this benchmark in deployment planning

  • Set section-specific acceptance thresholds instead of one global accuracy target (a configuration sketch follows this list).
  • Treat noisy scan cohorts as separate QA lanes with stricter human review routing.
  • Monitor language-slice drift monthly to catch regional template shifts early.
  • Tie warning-rate SLOs to operational queues so reviewers are not overloaded.
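
A minimal configuration sketch for the first two bullets is shown below: per-section acceptance thresholds plus a separate, stricter lane for noisy scans. The threshold values and queue names are illustrative, not recommended defaults.

```python
# Illustrative policy, not the benchmark's published thresholds: per-section
# acceptance F1 values and a stricter lane for noisy-scan documents.
SECTION_F1_THRESHOLDS = {
    "3. Composition": 0.95,
    "14. Transport information": 0.93,
    "15. Regulatory information": 0.93,
}
DEFAULT_THRESHOLD = 0.96

def route_document(section_f1: dict, scan_cohort: str) -> str:
    """Route a document to auto-accept or a review queue based on
    section scores and scan cohort (queue names are hypothetical)."""
    if scan_cohort == "Noisy scan":
        return "strict-review-lane"
    failing = [
        sec for sec, f1 in section_f1.items()
        if f1 < SECTION_F1_THRESHOLDS.get(sec, DEFAULT_THRESHOLD)
    ]
    return "human-review" if failing else "auto-accept"

print(route_document({"3. Composition": 0.94, "1. Identification": 0.99}, "Clean scan"))
```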

Update log

  • 2026-03-10: Published full cohort composition, section scores, language slices, and scan-quality cohorts.
  • 2026-03-10: Added failure-pattern taxonomy with mitigation controls used in implementation reviews.
  • 2026-03-10: Added deployment guidance for thresholding and reviewer routing.

FAQ

Why are Section 14 and Section 15 scores lower than Section 1 or Section 2?

Transport and regulatory sections have higher layout variability across suppliers and jurisdictions, so they generate more low-confidence cases in multilingual and noisy-scan cohorts.

Are these benchmark values field-level or document-level?

The reported values are weighted field-level metrics aggregated by section and cohort. Document-level pass rates are tracked separately for implementation gating.

Can teams reproduce this benchmark for their own suppliers?

Yes. The same scoring rules can be run on customer-specific datasets to produce a custom benchmark slice before production rollout.

Need a benchmark run on your own SDS corpus? Request a benchmark review and receive a cohort-level readiness report.