SDS Extraction Benchmark by Section, Language, and Scan Quality
Public benchmark results for extraction quality, split by section, language cohort, and scan quality so implementation teams can set realistic controls before go-live.
Last updated: 2026-03-10
Benchmark scope and sample composition
The current benchmark run covers 720 SDS documents from chemical manufacturing, distribution, paints/coatings, and industrial cleaning categories. Files are sampled across recurring supplier layouts to avoid single-template overfitting.
| Cohort dimension | Composition |
|---|---|
| Total files | 720 SDS files |
| Regions | US (37%), EU (41%), APAC (22%) |
| Languages | English, German, French, Spanish, Italian, Portuguese |
| File types | Native PDFs (56%), scanned PDFs (44%) |
| Scan quality split | Clean 29%, moderate noise 11%, noisy/low DPI 4% (shares of all files, together the 44% scanned) |
| Labeling process | Dual human annotation with adjudication for conflicts |
Methodology summary
- Every file is mapped to a controlled section/field schema before scoring.
- Each field is scored as exact, partial, missing, or wrong, and field scores are aggregated by section (a minimal scoring sketch follows this list).
- Section scores are weighted by operational importance for downstream compliance workflows.
- Final benchmark reports include confidence distributions and warning rates broken out by cohort.
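To make the scoring rules above concrete, here is a minimal Python sketch of field-level scoring with weighted section aggregation. The outcome labels, the partial-credit value, the `FieldResult` structure, and the weights are illustrative assumptions, not the benchmark's internal implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical outcome labels for this sketch; the benchmark's own schema
# and adjudication tooling are not published here.
EXACT, PARTIAL, MISSING, WRONG = "exact", "partial", "missing", "wrong"

@dataclass
class FieldResult:
    section: str        # e.g. "14. Transport information"
    outcome: str        # one of EXACT / PARTIAL / MISSING / WRONG
    confidence: float   # extractor confidence in [0, 1]

def section_f1(results, partial_credit=0.5):
    """Aggregate field outcomes into a per-section F1-style score.

    Treats EXACT as a true positive, PARTIAL as partial credit,
    WRONG as a false positive, and MISSING as a false negative.
    """
    by_section = defaultdict(lambda: {"tp": 0.0, "fp": 0.0, "fn": 0.0})
    for r in results:
        bucket = by_section[r.section]
        if r.outcome == EXACT:
            bucket["tp"] += 1
        elif r.outcome == PARTIAL:
            bucket["tp"] += partial_credit
            bucket["fp"] += 1 - partial_credit
        elif r.outcome == WRONG:
            bucket["fp"] += 1
        elif r.outcome == MISSING:
            bucket["fn"] += 1
    scores = {}
    for section, b in by_section.items():
        precision = b["tp"] / (b["tp"] + b["fp"]) if (b["tp"] + b["fp"]) else 0.0
        recall = b["tp"] / (b["tp"] + b["fn"]) if (b["tp"] + b["fn"]) else 0.0
        scores[section] = (2 * precision * recall / (precision + recall)
                           if (precision + recall) else 0.0)
    return scores

def weighted_overall(scores, weights):
    """Weight section scores by operational importance (weights are illustrative)."""
    total = sum(weights.get(s, 1.0) for s in scores)
    return sum(f1 * weights.get(s, 1.0) for s, f1 in scores.items()) / total
```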
Section-level results
| SDS section | Field F1 | Coverage | Warning rate |
|---|---|---|---|
| 1. Identification | 0.988 | 99.4% | 1.1% |
| 2. Hazard identification | 0.972 | 98.9% | 2.4% |
| 3. Composition | 0.951 | 96.8% | 5.3% |
| 4. First aid measures | 0.964 | 97.7% | 3.6% |
| 7. Handling and storage | 0.958 | 97.2% | 4.1% |
| 8. Exposure controls/PPE | 0.946 | 95.9% | 6.0% |
| 9. Physical properties | 0.939 | 95.1% | 6.7% |
| 14. Transport information | 0.931 | 94.3% | 7.2% |
| 15. Regulatory information | 0.928 | 93.8% | 7.8% |
| 16. Other information | 0.944 | 96.0% | 5.5% |
Language slices
| Language | Files | Weighted field F1 | Low-confidence fields (<0.80) |
|---|---|---|---|
| English | 266 | 0.962 | 2.2% |
| German | 144 | 0.956 | 2.8% |
| French | 110 | 0.949 | 3.3% |
| Spanish | 92 | 0.947 | 3.6% |
| Italian | 63 | 0.943 | 3.9% |
| Portuguese | 45 | 0.938 | 4.2% |
Scan-quality cohorts
| Scan cohort | Definition | Weighted field F1 | Warning rate |
|---|---|---|---|
| Native PDF | Digital text layer intact | 0.968 | 2.1% |
| Clean scan | 300 DPI+, low skew, low blur | 0.952 | 4.4% |
| Moderate noise | Rotation/table bleed or partial blur | 0.936 | 7.3% |
| Noisy scan | <200 DPI, handwritten marks, heavy artifacting | 0.901 | 13.8% |
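Teams that want to reproduce this cohort split on their own corpus can use a simple classifier along these lines. The `PageProfile` features and the skew/blur cut-offs are assumptions to tune against your own preprocessing stack; only the DPI boundaries mirror the definitions in the table above.

```python
from dataclasses import dataclass

@dataclass
class PageProfile:
    # Hypothetical per-document features; how skew, blur, and artifacts are
    # measured depends on your preprocessing pipeline.
    has_text_layer: bool
    dpi: int
    skew_degrees: float
    blur_score: float      # 0 = sharp, 1 = fully blurred
    heavy_artifacts: bool  # handwriting, stamps, bleed-through, etc.

def scan_cohort(p: PageProfile) -> str:
    """Assign a document to one of the benchmark's scan-quality cohorts.

    The skew and blur thresholds are illustrative assumptions.
    """
    if p.has_text_layer:
        return "native_pdf"
    if p.dpi < 200 or p.heavy_artifacts:
        return "noisy_scan"
    if p.dpi >= 300 and p.skew_degrees <= 2.0 and p.blur_score <= 0.3:
        return "clean_scan"
    return "moderate_noise"
```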
Observed failure patterns
| Failure pattern | Share of flagged errors | Mitigation rule |
|---|---|---|
| Composition table row merges | 31% | Require row-count validation against source table boundaries. |
| Transport code confusion (ADR/IMDG) | 23% | Enable transport-specific cross-field checks before write-back. |
| Localized synonym drift | 19% | Use language-specific synonym dictionaries at parse time. |
| Header/footer bleed into fields | 15% | Apply layout denoising and confidence-based warning escalation. |
| Unit normalization mismatch | 12% | Normalize units before downstream matching and policy checks. |
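The first two mitigation rules translate naturally into pre-write-back validation checks. The sketch below shows one possible shape for them; the field names (such as `un_number_adr`) and warning labels are hypothetical, not a published rule set.

```python
def validate_composition_rows(extracted_rows, detected_row_boundaries, tolerance=0):
    """Flag composition tables whose extracted row count does not match the
    number of row boundaries detected in the source layout.

    `detected_row_boundaries` is assumed to come from a layout-analysis step,
    with header rows already excluded on both sides.
    """
    expected = len(detected_row_boundaries)
    actual = len(extracted_rows)
    if abs(expected - actual) > tolerance:
        return {
            "warning": "composition_row_count_mismatch",
            "expected_rows": expected,
            "extracted_rows": actual,
        }
    return None

def check_transport_consistency(fields):
    """Cross-field check for Section 14: the UN number reported for road (ADR)
    and sea (IMDG) transport should normally agree before write-back.
    Field names here are illustrative, not a fixed schema.
    """
    un_adr = fields.get("un_number_adr")
    un_imdg = fields.get("un_number_imdg")
    if un_adr and un_imdg and un_adr != un_imdg:
        return {
            "warning": "transport_un_number_conflict",
            "adr": un_adr,
            "imdg": un_imdg,
        }
    return None
```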
How to use this benchmark in deployment planning
- Set section-specific acceptance thresholds instead of one global accuracy target (see the configuration sketch after this list).
- Treat noisy scan cohorts as separate QA lanes with stricter human review routing.
- Monitor language-slice drift monthly to catch regional template shifts early.
- Tie warning-rate SLOs to operational queues so reviewers are not overloaded.
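One way to express the first two recommendations is a small routing function driven by per-section thresholds and the scan cohort label. The threshold values and queue names below are placeholders to adapt to your own corpus, not recommended settings.

```python
# Illustrative per-section acceptance thresholds; tune against your own data.
SECTION_THRESHOLDS = {
    "3. Composition": 0.95,
    "14. Transport information": 0.93,
    "default": 0.96,
}

def route_document(doc):
    """Decide whether a document is auto-accepted or sent to human review.

    `doc` is assumed to expose a scan cohort label and per-section scores;
    noisy scans always go to a stricter review lane.
    """
    if doc["scan_cohort"] == "noisy_scan":
        return "strict_review_queue"
    for section, score in doc["section_scores"].items():
        threshold = SECTION_THRESHOLDS.get(section, SECTION_THRESHOLDS["default"])
        if score < threshold:
            return "standard_review_queue"
    return "auto_accept"
```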
Update log
- 2026-03-10: Published full cohort composition, section scores, language slices, and scan-quality cohorts.
- 2026-03-10: Added failure-pattern taxonomy with mitigation controls used in implementation reviews.
- 2026-03-10: Added deployment guidance for thresholding and reviewer routing.
FAQ
Why are Section 14 and Section 15 scores lower than Section 1 or Section 2?
Transport and regulatory sections have higher layout variability across suppliers and jurisdictions, so they generate more low-confidence cases in multilingual and noisy-scan cohorts.
Are these benchmark values field-level or document-level?
The reported values are weighted field-level metrics aggregated by section and cohort. Document-level pass rates are tracked separately for implementation gating.
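For illustration, a document-level pass rate can be derived from the same per-section scores with a simple gate like the one below; the 0.95 cut-off is an arbitrary example, not the gating value used in the benchmark.

```python
def document_pass_rate(docs, min_section_score=0.95):
    """Share of documents where every scored section clears a minimum score.

    This is a separate gate from the weighted field-level metrics reported
    in the tables above.
    """
    passed = sum(
        1 for d in docs
        if all(score >= min_section_score for score in d["section_scores"].values())
    )
    return passed / len(docs) if docs else 0.0
```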
Can teams reproduce this benchmark for their own suppliers?
Yes. The same scoring rules can be run on customer-specific datasets to produce a custom benchmark slice before production rollout.
Related pages
- Accuracy methodology and scoring rules
- Schema versioning and migration policy
- Field coverage matrix
- SDS extraction API
- API docs
- Request a benchmark review for your corpus