Integration Guide

SDS Extraction API Docs

SafetyDataSheetAPI converts SDS and MSDS PDFs into structured JSON, XML, and CSV with confidence scores, warnings, and schema versioning for downstream compliance systems.

Bearer auth | JSON + XML + CSV | Confidence + warnings | Bulk + webhook flows


Maintained by the SafetyDataSheetAPI team at DscvryAI.

Last updated: 2026-03-11

What it does

Turns supplier SDS files into structured records instead of plain OCR text blocks.

Who it is for

Integration teams connecting SDS workflows to EHS, ERP, PLM, product stewardship, or shared services systems.

When bulk matters

Use asynchronous bulk queues for supplier backfills, nightly syncs, or multi-file onboarding projects.

What to review first

Authentication, schema version, exception routing, residency requirements, and downstream field mapping.

Quick answers to common integration questions

These answer blocks are written for procurement, engineering, and implementation teams that need the short version before reading the full reference.

What does the SDS extraction API return?

Each successful request returns a request ID, confidence score, warning metadata, and synchronized JSON, XML, and CSV outputs built from the same extraction run.

How do I authenticate requests?

Production requests use a bearer token in the Authorization header. Keys are scoped by environment, throughput tier, and deployment model.

Does it handle scanned and multilingual SDS PDFs?

Yes. OCR-assisted extraction supports scanned SDS files, and multilingual SDS inputs can be supplied with an optional language_hint.

How do low-confidence fields surface?

Low-certainty extraction surfaces through the confidence score plus warnings so teams can route exceptions to QA or regulatory review instead of silently accepting weak data.

When should I use bulk extraction and webhooks?

Use bulk plus webhooks for supplier migrations, nightly refreshes, and queued ingestion where synchronous single-file calls would create unnecessary wait states.

How do schema versions protect integrations?

Schema versioning lets teams pin output contracts, plan mapping changes deliberately, and roll forward without breaking downstream ERP, EHS, or PLM workflows.

Quickstart

A minimal production-oriented integration has four parts: authenticate, upload the SDS, store the structured outputs, and route exceptions deliberately.

1. Authenticate every request. Send a bearer token in the Authorization header.
2. Upload the SDS file. Send the document as multipart form-data to POST /extract-sds.
3. Store the response artifacts. Persist the request ID, structured JSON, XML, CSV, and any warning metadata together.
4. Route exceptions. Treat confidence thresholds and warnings as workflow signals, not as optional decoration.

curl -X POST "https://api.safetydatasheetapi.com/v1/extract-sds" \
  -H "Authorization: Bearer <api_key>" \
  -F "file=@acetone-sds.pdf" \
  -F "language_hint=en" \
  -F "schema_version=2026-01"
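
For teams calling the API from application code rather than a shell, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the function name `build_extract_request` is hypothetical, the `application/pdf` part type is an assumption, and the URL and form field names are copied from the curl example above. It builds the request without sending it, so it can be inspected before any network call.

```python
import urllib.request
import uuid

# URL taken from the curl example above.
API_URL = "https://api.safetydatasheetapi.com/v1/extract-sds"


def build_extract_request(api_key, filename, file_bytes,
                          language_hint="en", schema_version="2026-01"):
    """Build (but do not send) a multipart POST mirroring the curl call."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields, using the same names as the -F flags in the curl example.
    for name, value in (("language_hint", language_hint),
                        ("schema_version", schema_version)):
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The file part. The application/pdf content type is an assumption.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/pdf\r\n\r\n'.encode("utf-8")
        + file_bytes + b"\r\n"
    )
    # Closing boundary terminates the multipart body.
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and decode the response body as JSON.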

Endpoints

The production API, public trial endpoints, and analytics collector serve different purposes. Keep evaluation traffic and production ingestion logic separate.

Method	Path	Purpose	Use when
`POST`	`/extract-sds`	Extract structured SDS data from a single document.	Primary synchronous production ingestion.
`POST`	`/extract-sds/bulk`	Submit multiple SDS files and receive asynchronous results.	Queued ingestion, backfills, and large supplier batches.
`POST`	`/api/sample-upload`	Website sample upload endpoint with Turnstile and rate limiting.	Controlled public evaluation, not production integration.
`POST`	`/api/sample-output-access`	Captures lead details before copy or download from the public sample output viewer.	Website evaluation flow only.
`POST`	`/api/contact`	Enterprise implementation inquiry endpoint.	Requesting onboarding, architecture, or procurement review.
`POST`	`/api/analytics-event`	Public CTA analytics collector.	Tracking commercial page interactions, not extraction workflows.

Authentication header

Authorization: Bearer <api_key>

Output contract

Every successful extraction response is designed to be both machine-usable and reviewable. Teams typically store the raw response envelope alongside their preferred format.

Artifact	What it contains	Why teams keep it
JSON output	Normalized SDS fields and nested sections for application logic.	Primary contract for EHS, ERP, PLM, and stewardship integrations.
XML output	Structured representation for systems that prefer XML interchange.	Supports legacy enterprise interfaces and governed data exchange patterns.
CSV output	Flat field-value export generated from the same extraction run.	Useful for review operations, audits, or spreadsheet-based reconciliation.
`confidence_score`	Overall extraction confidence for the request.	Supports automated acceptance thresholds and exception routing.
`warnings`	Signals about ambiguous text, low OCR quality, or incomplete fields.	Keeps uncertain output visible instead of silently blending into production data.
`schema_version`	Versioned contract marker for the structured response model.	Lets downstream teams pin mappings and manage upgrades deliberately.
`request_id`	Trace identifier for the extraction request.	Helps support, retries, and audit-oriented reconciliation.

{
  "request_id": "req_8hy2n3",
  "output_formats": ["JSON", "XML", "CSV"],
  "outputs": {
    "json": {
      "schema_version": "2026-01",
      "product_identification": {
        "product_name": "Acetone",
        "supplier_name": "Example Chemicals Ltd."
      },
      "hazards_identification": {
        "ghs_classification": ["Flammable Liquid - Category 2"],
        "h_statements": ["H225 Highly flammable liquid and vapour"]
      },
      "transport_information": {
        "un_number": "UN1090"
      },
      "exposure_controls_ppe": {
        "ppe": ["Protective gloves", "Eye protection"]
      },
      "revision_metadata": {
        "revision_date": "2024-01-15"
      }
    },
    "xml": "<sds_extraction>...</sds_extraction>",
    "csv": "field,value"
  },
  "confidence_score": 0.97,
  "warnings": [],
  "processing_ms": 2410
}
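
One way to keep the request ID, all three outputs, and the warning metadata together is to map the response envelope onto a single record before storing it. The sketch below assumes the envelope has the shape of the sample response above. The `ExtractionRecord` class and `to_record` function are hypothetical names, and the sample envelope is a trimmed copy of that response.

```python
import json
from dataclasses import dataclass


@dataclass
class ExtractionRecord:
    """One storable row per extraction request (hypothetical shape)."""
    request_id: str
    schema_version: str
    confidence_score: float
    warnings: list
    json_output: dict
    xml_output: str
    csv_output: str


def to_record(envelope: dict) -> ExtractionRecord:
    """Collect the artifacts from one response envelope into one record."""
    outputs = envelope["outputs"]
    return ExtractionRecord(
        request_id=envelope["request_id"],
        # In the sample response, schema_version lives inside outputs.json.
        schema_version=outputs["json"]["schema_version"],
        confidence_score=float(envelope["confidence_score"]),
        warnings=list(envelope.get("warnings", [])),
        json_output=outputs["json"],
        xml_output=outputs["xml"],
        csv_output=outputs["csv"],
    )


# Trimmed copy of the sample response above.
SAMPLE_ENVELOPE = json.loads("""
{
  "request_id": "req_8hy2n3",
  "outputs": {
    "json": {"schema_version": "2026-01",
             "product_identification": {"product_name": "Acetone"}},
    "xml": "<sds_extraction>...</sds_extraction>",
    "csv": "field,value"
  },
  "confidence_score": 0.97,
  "warnings": []
}
""")

record = to_record(SAMPLE_ENVELOPE)
print(record.request_id, record.schema_version, record.confidence_score)
# prints: req_8hy2n3 2026-01 0.97
```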

Need version-aware mapping rules? Review the schema versioning guide.

Bulk processing and webhooks

Once SDS extraction is driven by queues of files rather than individual user uploads, bulk ingestion is the safer operating model.

Flow	Best for	Operational pattern
Synchronous single-file extraction	Interactive validation, low-volume upload flows, and immediate review.	Client uploads one file and receives structured output directly in the response.
Bulk extraction	Supplier migrations, nightly refreshes, and large backlog cleanup.	Client submits multiple files, tracks the batch, and consumes results asynchronously.
Webhook completion	Event-driven systems that do not want polling loops.	Receive completion and warning summaries as extraction jobs finish.

{
  "event": "extraction.completed",
  "request_id": "req_8hy2n3",
  "status": "success",
  "confidence_score": 0.97,
  "warnings": []
}
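
A receiver for this payload might look like the sketch below. It assumes the body has exactly the fields shown in the sample event. The function name `handle_webhook`, the returned labels, and the de-duplication by `request_id` are illustrative choices, since delivery and retry behavior are not specified here. Signature verification is left out for the same reason.

```python
import json


def handle_webhook(raw_body: bytes, seen_request_ids: set) -> str:
    """Classify one webhook delivery as ignored, duplicate, review, or accepted."""
    event = json.loads(raw_body)
    # Only the event name shown in the sample payload is handled.
    if event.get("event") != "extraction.completed":
        return "ignored"
    request_id = event["request_id"]
    # Assumption: a delivery may arrive more than once, so de-duplicate by request ID.
    if request_id in seen_request_ids:
        return "duplicate"
    seen_request_ids.add(request_id)
    # Any non-success status or non-empty warnings list goes to human review.
    if event.get("status") != "success" or event.get("warnings"):
        return "review"
    return "accepted"
```

In a real service the `seen_request_ids` set would be a durable store, so that a restart does not cause completed jobs to be processed twice.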

Production throughput, queue sizing, and SLA-backed lanes are agreed during implementation rather than published as fixed limits in these docs.

Errors and review routing

Error handling is only one part of operational safety. The other part is deciding when apparently successful output still deserves human review.

Status	Code	Meaning	Recommended response
`400`	`invalid_document`	File is unsupported or not parseable as an SDS input.	Reject the file and request a new source document.
`401`	`unauthorized`	Missing or invalid API token.	Rotate or correct credentials before retrying.
`422`	`low_text_quality`	Extraction is incomplete because text quality is too weak.	Route the file to manual review or request a cleaner source copy.
`429`	`rate_limited`	Plan throughput or endpoint policy was exceeded.	Retry with backoff or move the workload into the agreed bulk lane.
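
The table above can be turned into a small dispatch map so that each error leads to one defined action. The sketch below is illustrative: the action labels, the `escalate` fallback, and the backoff parameters are assumptions, and the caller is expected to supply the status and code, because the error response body format is not shown here. The retry helper uses exponential backoff with full jitter, a common general-purpose pattern and not a documented requirement of this API.

```python
import random

# Action labels paraphrase the "Recommended response" column above.
ERROR_ACTIONS = {
    (400, "invalid_document"): "reject_file",
    (401, "unauthorized"): "fix_credentials",
    (422, "low_text_quality"): "manual_review",
    (429, "rate_limited"): "retry_with_backoff",
}


def action_for(status: int, code: str) -> str:
    """Look up the action for an error; unknown pairs are escalated."""
    return ERROR_ACTIONS.get((status, code), "escalate")


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 60.0,
                   rng=random.random):
    """Delays in seconds: attempt n waits rng() * min(cap, base * 2**n)."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]
```

Only `rate_limited` is retried automatically here. The other three errors need a different file, different credentials, or a human decision, so retrying them unchanged would not help.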

Store the request_id even when the extraction looks clean.
Route documents with warnings into a defined QA or regulatory workflow.
Use schema version checks before shipping parsed fields downstream.
Treat the public sample endpoint and production API as separate operating models.
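
The checks above can be combined into one routing decision per response. This is a sketch under stated assumptions: the `route` function and its return labels are hypothetical, the pinned version is the one used in the examples on this page, and the 0.90 threshold is a placeholder to be tuned per workflow, not a recommended value.

```python
PINNED_SCHEMA_VERSION = "2026-01"  # version used in the examples on this page
MIN_CONFIDENCE = 0.90              # hypothetical threshold; tune per workflow


def route(envelope: dict) -> str:
    """Decide what happens to one successful extraction response."""
    # Check the schema version first so unmapped fields never ship downstream.
    schema_version = envelope["outputs"]["json"]["schema_version"]
    if schema_version != PINNED_SCHEMA_VERSION:
        return "hold_schema_mismatch"
    # Any warning sends the document to QA or regulatory review.
    if envelope.get("warnings"):
        return "review"
    # A clean but low-confidence result is also reviewed.
    if envelope["confidence_score"] < MIN_CONFIDENCE:
        return "review"
    return "accept"
```

Whatever the outcome, the `request_id` should be stored alongside the decision so that support and audit queries can trace it later.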

FAQ

These questions mirror the schema markup and are phrased for assistant-style extraction.

What does the SDS extraction API return?

Each successful request returns a request ID, confidence score, warning metadata, and synchronized JSON, XML, and CSV outputs built from the same extraction run.

How do I authenticate requests?

All production API requests use a bearer token in the Authorization header. Keys are scoped by environment, throughput tier, and deployment model.

When should I use bulk extraction and webhooks?

Use bulk extraction and webhook completion for supplier backlogs, nightly refresh jobs, and queued ingestion where synchronous single-file requests would create unnecessary wait states.

How do low-confidence fields surface?

Low-certainty extraction surfaces through the confidence score plus warnings so teams can route documents or specific fields into QA, regulatory review, or exception handling queues.

Does the API support scanned and multilingual SDS PDFs?

Yes. The API supports scanned SDS files with OCR-assisted extraction and multilingual SDS inputs, with warning metadata available when text quality or layout affects confidence.

How do schema versions protect integrations?

Schema versioning lets integration teams pin output contracts, plan downstream mappings deliberately, and roll forward to new field models without breaking existing ERP, EHS, or PLM workflows.

Update Log

Recent changes that affect implementation, extraction review, or evaluation paths.

2026-03-11: Reworked this page into answer-first integration guidance with Q&A blocks.
2026-03-11: Added bulk, webhook, schema, and review-routing sections for assistant extraction.
2026-03-09: Repositioned docs as the canonical technical hub for API evaluation.

Implementation review set

Read this page together with the trust and evidence pages below.

Benchmark: See section, language, and scan-quality performance framing.
Accuracy methodology: Review how extraction quality is measured and interpreted.
Security: Map transport, access, isolation, and retention controls.
Data residency: Review regional processing, retention, and deletion options.
Schema versioning: Plan field mapping upgrades without breaking downstream systems.
Implementation plan: Discuss architecture, throughput, and rollout sequencing.