SDS Extraction Benchmark by Section, Language, and Scan Quality
Public benchmark results for extraction quality, split by section, language cohort, and scan quality so implementation teams can set realistic controls before go-live.
Last updated: 2026-03-10
Benchmark scope and sample composition
The current benchmark run covers 720 SDS documents from chemical manufacturing, distribution, paints/coatings, and industrial cleaning categories. Files are sampled across recurring supplier layouts to avoid single-template overfitting.
| Cohort dimension | Composition |
|---|---|
| Total files | 720 SDS files |
| Regions | US (37%), EU (41%), APAC (22%) |
| Languages | English, German, French, Spanish, Italian, Portuguese |
| File types | Native PDFs (56%), scanned PDFs (44%) |
| Scan quality split | Clean 29%, moderate noise 11%, noisy/low DPI 4% (shares of all files, together the 44% scanned) |
| Labeling process | Dual human annotation with adjudication for conflicts |
Methodology summary
- Every file is mapped to a controlled section/field schema before scoring.
- Each field is scored as exact, partial, missing, or wrong, and field scores are aggregated by section (a minimal scoring sketch follows this list).
- Section scores are weighted by operational importance for downstream compliance workflows.
- Final benchmark reports include confidence distributions and warning rates broken out by cohort.
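To make the scoring rules above concrete, here is a minimal Python sketch of field-level scoring with weighted section aggregation. The outcome labels, the partial-credit value, the `FieldResult` structure, and the weights are illustrative assumptions, not the benchmark's internal implementation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical outcome labels for this sketch; the benchmark's own schema
# and adjudication tooling are not published here.
EXACT, PARTIAL, MISSING, WRONG = "exact", "partial", "missing", "wrong"

@dataclass
class FieldResult:
    section: str        # e.g. "14. Transport information"
    outcome: str        # one of EXACT / PARTIAL / MISSING / WRONG
    confidence: float   # extractor confidence in [0, 1]

def section_f1(results, partial_credit=0.5):
    """Aggregate field outcomes into a per-section F1-style score.

    Treats EXACT as a true positive, PARTIAL as partial credit,
    WRONG as a false positive, and MISSING as a false negative.
    """
    by_section = defaultdict(lambda: {"tp": 0.0, "fp": 0.0, "fn": 0.0})
    for r in results:
        bucket = by_section[r.section]
        if r.outcome == EXACT:
            bucket["tp"] += 1
        elif r.outcome == PARTIAL:
            bucket["tp"] += partial_credit
            bucket["fp"] += 1 - partial_credit
        elif r.outcome == WRONG:
            bucket["fp"] += 1
        elif r.outcome == MISSING:
            bucket["fn"] += 1
    scores = {}
    for section, b in by_section.items():
        precision = b["tp"] / (b["tp"] + b["fp"]) if (b["tp"] + b["fp"]) else 0.0
        recall = b["tp"] / (b["tp"] + b["fn"]) if (b["tp"] + b["fn"]) else 0.0
        scores[section] = (2 * precision * recall / (precision + recall)
                           if (precision + recall) else 0.0)
    return scores

def weighted_overall(scores, weights):
    """Weight section scores by operational importance (weights are illustrative)."""
    total = sum(weights.get(s, 1.0) for s in scores)
    return sum(f1 * weights.get(s, 1.0) for s, f1 in scores.items()) / total
```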
Section-level results
| SDS section | Field F1 | Coverage | Warning rate |
|---|---|---|---|
| 1. Identification | 0.988 | 99.4% | 1.1% |
| 2. Hazard identification | 0.972 | 98.9% | 2.4% |
| 3. Composition | 0.951 | 96.8% | 5.3% |
| 4. First aid measures | 0.964 | 97.7% | 3.6% |
| 7. Handling and storage | 0.958 | 97.2% | 4.1% |
| 8. Exposure controls/PPE | 0.946 | 95.9% | 6.0% |
| 9. Physical properties | 0.939 | 95.1% | 6.7% |
| 14. Transport information | 0.931 | 94.3% | 7.2% |
| 15. Regulatory information | 0.928 | 93.8% | 7.8% |
| 16. Other information | 0.944 | 96.0% | 5.5% |
Language slices
| Language | Files | Weighted field F1 | Low-confidence fields (<0.80) |
|---|---|---|---|
| English | 266 | 0.962 | 2.2% |
| German | 144 | 0.956 | 2.8% |
| French | 110 | 0.949 | 3.3% |
| Spanish | 92 | 0.947 | 3.6% |
| Italian | 63 | 0.943 | 3.9% |
| Portuguese | 45 | 0.938 | 4.2% |
Scan-quality cohorts
| Scan cohort | Definition | Weighted field F1 | Warning rate |
|---|---|---|---|
| Native PDF | Digital text layer intact | 0.968 | 2.1% |
| Clean scan | 300 DPI+, low skew, low blur | 0.952 | 4.4% |
| Moderate noise | Rotation/table bleed or partial blur | 0.936 | 7.3% |
| Noisy scan | <200 DPI, handwritten marks, heavy artifacting | 0.901 | 13.8% |
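Teams that want to reproduce this cohort split on their own corpus can use a simple classifier along these lines. The `PageProfile` features and the skew/blur cut-offs are assumptions to tune against your own preprocessing stack; only the DPI boundaries mirror the definitions in the table above.

```python
from dataclasses import dataclass

@dataclass
class PageProfile:
    # Hypothetical per-document features; how skew, blur, and artifacts are
    # measured depends on your preprocessing pipeline.
    has_text_layer: bool
    dpi: int
    skew_degrees: float
    blur_score: float      # 0 = sharp, 1 = fully blurred
    heavy_artifacts: bool  # handwriting, stamps, bleed-through, etc.

def scan_cohort(p: PageProfile) -> str:
    """Assign a document to one of the benchmark's scan-quality cohorts.

    The skew and blur thresholds are illustrative assumptions.
    """
    if p.has_text_layer:
        return "native_pdf"
    if p.dpi < 200 or p.heavy_artifacts:
        return "noisy_scan"
    if p.dpi >= 300 and p.skew_degrees <= 2.0 and p.blur_score <= 0.3:
        return "clean_scan"
    return "moderate_noise"
```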
Observed failure patterns
| Failure pattern | Share of flagged errors | Mitigation rule |
|---|---|---|
| Composition table row merges | 31% | Require row-count validation against source table boundaries. |
| Transport code confusion (ADR/IMDG) | 23% | Enable transport-specific cross-field checks before write-back. |
| Localized synonym drift | 19% | Use language-specific synonym dictionaries at parse time. |
| Header/footer bleed into fields | 15% | Apply layout denoising and confidence-based warning escalation. |
| Unit normalization mismatch | 12% | Normalize units before downstream matching and policy checks. |
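The first two mitigation rules translate naturally into pre-write-back validation checks. The sketch below shows one possible shape for them; the field names (such as `un_number_adr`) and warning labels are hypothetical, not a published rule set.

```python
def validate_composition_rows(extracted_rows, detected_row_boundaries, tolerance=0):
    """Flag composition tables whose extracted row count does not match the
    number of row boundaries detected in the source layout.

    `detected_row_boundaries` is assumed to come from a layout-analysis step,
    with header rows already excluded on both sides.
    """
    expected = len(detected_row_boundaries)
    actual = len(extracted_rows)
    if abs(expected - actual) > tolerance:
        return {
            "warning": "composition_row_count_mismatch",
            "expected_rows": expected,
            "extracted_rows": actual,
        }
    return None

def check_transport_consistency(fields):
    """Cross-field check for Section 14: the UN number reported for road (ADR)
    and sea (IMDG) transport should normally agree before write-back.
    Field names here are illustrative, not a fixed schema.
    """
    un_adr = fields.get("un_number_adr")
    un_imdg = fields.get("un_number_imdg")
    if un_adr and un_imdg and un_adr != un_imdg:
        return {
            "warning": "transport_un_number_conflict",
            "adr": un_adr,
            "imdg": un_imdg,
        }
    return None
```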
How to use this benchmark in deployment planning
- Set section-specific acceptance thresholds instead of one global accuracy target (see the configuration sketch after this list).
- Treat noisy scan cohorts as separate QA lanes with stricter human review routing.
- Monitor language-slice drift monthly to catch regional template shifts early.
- Tie warning-rate SLOs to operational queues so reviewers are not overloaded.
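One way to express the first two recommendations is a small routing function driven by per-section thresholds and the scan cohort label. The threshold values and queue names below are placeholders to adapt to your own corpus, not recommended settings.

```python
# Illustrative per-section acceptance thresholds; tune against your own data.
SECTION_THRESHOLDS = {
    "3. Composition": 0.95,
    "14. Transport information": 0.93,
    "default": 0.96,
}

def route_document(doc):
    """Decide whether a document is auto-accepted or sent to human review.

    `doc` is assumed to expose a scan cohort label and per-section scores;
    noisy scans always go to a stricter review lane.
    """
    if doc["scan_cohort"] == "noisy_scan":
        return "strict_review_queue"
    for section, score in doc["section_scores"].items():
        threshold = SECTION_THRESHOLDS.get(section, SECTION_THRESHOLDS["default"])
        if score < threshold:
            return "standard_review_queue"
    return "auto_accept"
```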
Update log
- 2026-03-10: Published full cohort composition, section scores, language slices, and scan-quality cohorts.
- 2026-03-10: Added failure-pattern taxonomy with mitigation controls used in implementation reviews.
- 2026-03-10: Added deployment guidance for thresholding and reviewer routing.
FAQ
Why are Section 14 and Section 15 scores lower than Section 1 or Section 2?
Transport and regulatory sections have higher layout variability across suppliers and jurisdictions, so they generate more low-confidence cases in multilingual and noisy-scan cohorts.
Are these benchmark values field-level or document-level?
The reported values are weighted field-level metrics aggregated by section and cohort. Document-level pass rates are tracked separately for implementation gating.
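For illustration, a document-level pass rate can be derived from the same per-section scores with a simple gate like the one below; the 0.95 cut-off is an arbitrary example, not the gating value used in the benchmark.

```python
def document_pass_rate(docs, min_section_score=0.95):
    """Share of documents where every scored section clears a minimum score.

    This is a separate gate from the weighted field-level metrics reported
    in the tables above.
    """
    passed = sum(
        1 for d in docs
        if all(score >= min_section_score for score in d["section_scores"].values())
    )
    return passed / len(docs) if docs else 0.0
```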
Can teams reproduce this benchmark for their own suppliers?
Yes. The same scoring rules can be run on customer-specific datasets to produce a custom benchmark slice before production rollout.
Related pages
- Accuracy methodology and scoring rules
- Schema versioning and migration policy
- Field coverage matrix
- SDS extraction API
- API docs
- Request a benchmark review for your corpus