AI Data by CreamWorks Inc.
Ethical AI datasets and evaluation suites for safer, more capable models.
We provide domain-specific corpora, custom labeling, and red-team-style evaluation data. Small, clean, and compliant by design.
Documented sources • author signals • collection logs
PII scrubbing • policy filters • eval coverage
Japanese / English (Korean on the roadmap)
Offerings
1) Curated Datasets (Licensable)
Clean domain corpora for pretraining, SFT, RAG, and evaluation.
- Consumer reviews (beauty/health) — balanced & de-duplicated
- How-to & troubleshooting knowledge — stepwise, procedural
- Travel & local guides — structured POI attributes
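To make "de-duplicated" concrete, here is a minimal Python sketch of exact-match de-duplication over normalized text. It is an illustration only and does not describe our production pipeline; the function names and the normalization rules (NFKC folding, case folding, whitespace collapsing) are assumptions chosen for the example.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold width/case variants and collapse whitespace before hashing."""
    # NFKC maps full-width Latin characters (common in Japanese text) to ASCII.
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def dedupe(docs: list[str]) -> list[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

This catches only duplicates that are identical after normalization. Near-duplicate detection (for example, shingling with MinHash) would be a separate step.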
2) Custom Labeling
Human-in-the-loop annotation with clear rubrics.
- Safety labels (toxicity, bias, harm categories)
- Intent, entities, aspects, sentiment
- Evaluation item authoring (MCQ, rubric-graded freeform)
3) Evaluation Suites
Benchmarks aligned to practical use cases.
- Factuality & retrieval grounding (RAG)
- Actionability & stepwise reasoning
- Safety guardrail adversarials (red-team prompts)
Example Datasets (Snapshot)
| Name | Language | Size (approx.) | Schema | Primary Use |
|---|---|---|---|---|
| Consumer-Reviews-Beauty-JP | JA | ~120K docs | { product, claim, aspect, sentiment, source_meta } | SFT / Aspect-RAG / Sentiment |
| HowTo-Procedural-EN | EN | ~80K steps | { task, steps[], constraints, risk_flags } | Planning / Toolformer-style supervision |
| Travel-POI-Guides-JPEN | JA/EN | ~45K entries | { poi, geo, hours, tags[], narrative, tips } | RAG grounding / Multilingual eval |
Numbers are indicative; final specs shared upon request.
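As a rough guide to consuming a schema like the ones above, the sketch below models one HowTo-Procedural-EN record in Python. The field names come from the table; the field types (lists of strings for `steps`, `constraints`, and `risk_flags`) and the `parse_howto` helper are assumptions for illustration, since final specs are shared on request.

```python
from dataclasses import dataclass, field


@dataclass
class HowToRecord:
    task: str
    steps: list[str]
    constraints: list[str] = field(default_factory=list)  # assumed type
    risk_flags: list[str] = field(default_factory=list)   # assumed type


def parse_howto(raw: dict) -> HowToRecord:
    """Validate a raw record and convert it to a typed HowToRecord."""
    missing = {"task", "steps", "constraints", "risk_flags"} - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if not isinstance(raw["steps"], list) or not raw["steps"]:
        raise ValueError("steps must be a non-empty list")
    return HowToRecord(
        task=str(raw["task"]),
        steps=[str(s) for s in raw["steps"]],
        constraints=[str(c) for c in raw["constraints"]],
        risk_flags=[str(r) for r in raw["risk_flags"]],
    )
```

Validating at load time surfaces malformed records before they reach a training or evaluation job.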
Compliance & Ethics
- GDPR/CCPA-aligned collection and processing principles
- PII removal, sensitive attribute minimization, and blocklists
- robots.txt and website ToS respected; opt-out requests honored
- Provenance & consent documentation available under NDA
- Accommodations for research and safety use by accredited institutions
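Two of the principles above, PII removal and robots.txt compliance, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of our actual tooling: the regular expressions are deliberately naive, the placeholder tokens `[EMAIL]` and `[PHONE]` are invented for the example, and `ExampleCrawler` is a hypothetical user agent.

```python
import re
from urllib.robotparser import RobotFileParser

# Naive patterns for illustration; real PII scrubbing needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like number runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with a robots.txt that disallows `/private/`, `allowed_by_robots(robots, "ExampleCrawler", "https://example.com/private/page")` returns `False`, so that page would be skipped at collection time.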
Licensing Models
| Model | Best for | Notes |
|---|---|---|
| Flat License | One-time integration | Per-dataset fee; update packs optional |
| Subscription | Ongoing updates | Quarterly refresh; SLA for data health |
| Usage-Based | API/eval calls | Per-token / per-call metering for hosted access |
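For usage-based licensing, a charge can be thought of as a per-call component plus a per-token component. The Python sketch below shows the arithmetic only; the `Rates` type, the `usage_charge` function, and any rate values passed to it are hypothetical placeholders, not our price list. Setting either rate to zero gives pure per-call or pure per-token metering.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Rates:
    per_call: Decimal        # placeholder rate per API call
    per_1k_tokens: Decimal   # placeholder rate per 1,000 tokens


def usage_charge(calls: int, tokens: int, rates: Rates) -> Decimal:
    """Return the metered charge, rounded to two decimal places."""
    if calls < 0 or tokens < 0:
        raise ValueError("calls and tokens must be non-negative")
    token_cost = rates.per_1k_tokens * Decimal(tokens) / Decimal(1000)
    return (rates.per_call * calls + token_cost).quantize(Decimal("0.01"))
```

`Decimal` is used instead of `float` so that billing totals do not pick up binary rounding error.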
How to Engage
- Tell us your use case (pretrain, SFT, eval, RAG, safety).
- Choose licensing (flat / subscription / usage-based).
- We share specs and a sample; you review quality and fit.
- Sign terms; we then deliver via S3/GCS or a hosted endpoint.
Interested in working with us?
For data access, partnerships, or any other inquiries,
please reach out through our contact form.
FAQ
What license terms do you support?
Commercial licenses, research-only licenses, and custom terms with field-of-use restrictions.
Can you create safety evaluation sets for our policy?
Yes. We author adversarial prompts and rubric-scored outputs aligned to your safety policy and red-team goals.
Do you host the data?
We can deliver via S3/GCS or provide a hosted read-only API with access logs and metering.