Add Data Storage in Cloud Native AI whitepaper by xing-yang · Pull Request #2124 · cncf/toc

xing-yang · 2026-04-19T21:03:45Z

No description provided.

xing-yang · 2026-06-09T01:30:40Z

Hi @angellk, thanks for reviewing! I've addressed your comments.

angellk

Minor additional nits that I missed. Looks great, @xing-yang!

xing-yang · 2026-06-09T03:03:10Z

Thanks @angellk!

kevin-wangzefeng

Thanks @xing-yang and team. Read through the full set — LGTM, nice work carrying this across the line.


Added comprehensive reference links to both inference.md and training.md files in the AI storage documentation. The references include relevant tools, frameworks, and platforms for model inference and training workflows in cloud-native environments. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: alexagriffith <agriffith96@gmail.com> Signed-off-by: xing-yang <xingyang105@gmail.com>


xing-yang · 2026-06-23T14:51:21Z

/triage accepted

pmady · 2026-06-24T11:55:55Z

The inference section nails the cold start problem: "new replicas may simultaneously fetch large model artifacts from remote storage, increasing startup latency and storage contention." The common pitfalls list also calls out "insufficient bandwidth for concurrent model downloads during scale-out."

One thing missing from the data-cache-locality and inference sections is P2P distribution. When you scale from 2 to 20 replicas of a 30 GB model, every pod hitting the registry at the same time is the bottleneck. Dragonfly (CNCF graduated) handles this: the first pod pulls from the registry, and subsequent pods pull from peers. I have worked on Dragonfly's model distribution backends (Hugging Face, ModelScope), and this is the exact use case it was built for.

The data-cache-locality section covers Fluid/Alluxio well for the local and distributed caching layer. Dragonfly would fit alongside that as the distribution layer, since it solves a different problem: fan-out at scale rather than repeated reads from the same node.
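
To put rough numbers on the fan-out case, here is a back-of-the-envelope sketch in Python. It is an idealized model, not Dragonfly's actual scheduling: it assumes every new replica needs the full artifact, that with P2P only one copy leaves the registry, and that the two existing replicas do not seed anything. The function name `registry_egress_gb` is made up for this illustration.

```python
def registry_egress_gb(new_replicas: int, model_gb: float, p2p: bool) -> float:
    """Estimate data served by the origin registry during a scale-out.

    Idealized assumption: without P2P every new replica pulls the whole
    model from the registry; with P2P only the first replica does and the
    rest fetch from peers.
    """
    if new_replicas <= 0:
        return 0.0
    origin_pulls = 1 if p2p else new_replicas
    return origin_pulls * model_gb


# Scaling from 2 to 20 replicas of a 30 GB model adds 18 new replicas.
new_replicas = 20 - 2
direct = registry_egress_gb(new_replicas, 30.0, p2p=False)
peer = registry_egress_gb(new_replicas, 30.0, p2p=True)

print(f"direct pulls: {direct:.0f} GB from the registry")  # 540 GB
print(f"p2p fan-out:  {peer:.0f} GB from the registry")    # 30 GB
```

Under these assumptions the registry serves 540 GB for direct pulls versus 30 GB with peer fan-out. Real deployments land somewhere in between, depending on piece-level scheduling, peer bandwidth, and how quickly the first copy becomes available.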

Minor thing: it might also be worth mentioning OCI-based model packaging (ModelCar, KitOps) in the inference section, since that's how a lot of teams are starting to version and distribute model artifacts now.

xing-yang requested a review from a team as a code owner April 19, 2026 21:03

xing-yang mentioned this pull request Apr 19, 2026

[Initiative]: Data Storage in Cloud Native AI #1901 (Open)

xing-yang force-pushed the ai_storage branch from 520f39b to 3d1cf3f Compare April 19, 2026 21:06

riaankleinhans closed this Apr 21, 2026

riaankleinhans reopened this Apr 21, 2026

github-actions Bot added the needs-triage and needs-kind labels Apr 21, 2026

github-project-automation Bot added this to TAG Operational Resilience, TAG Infrastructure, TAG Security and Compliance, TAG Developer Experience, TAG Workloads Foundation and CNCF TOC Board Apr 21, 2026

github-project-automation Bot moved this to New in CNCF TOC Board Apr 21, 2026

github-actions Bot added the needs-group label Apr 21, 2026

xing-yang force-pushed the ai_storage branch from 3d1cf3f to 9309a9c Compare April 26, 2026 14:36

xing-yang changed the title ~~WIP: Add Data Storage in Cloud Native AI whitepaper~~ Add Data Storage in Cloud Native AI whitepaper Apr 26, 2026

github-actions Bot requested review from GenPage and kashifest April 30, 2026 14:35

alexagriffith force-pushed the ai_storage branch from 74fc3c1 to db686a7 Compare April 30, 2026 14:36

xing-yang force-pushed the ai_storage branch 2 times, most recently from 866338c to 80c854a Compare June 3, 2026 02:32

github-project-automation Bot added this to Initiatives Jun 3, 2026

github-project-automation Bot moved this to status/new in Initiatives Jun 3, 2026

GenPage moved this from status/new to status/in-progress in Initiatives Jun 3, 2026

GenPage moved this to In Progress in TAG Infrastructure Jun 3, 2026

xing-yang force-pushed the ai_storage branch from 6a9599d to 3d66390 Compare June 9, 2026 01:29

angellk requested changes Jun 9, 2026


xing-yang force-pushed the ai_storage branch 2 times, most recently from fb3718e to 30204bc Compare June 9, 2026 02:53

angellk approved these changes Jun 9, 2026


Comment thread tags/tag-infrastructure/data-storage-in-cloud-native-ai/topics/databases.md

salaboy reviewed Jun 11, 2026


Comment thread tags/tag-infrastructure/data-storage-in-cloud-native-ai/topics/training.md

salaboy reviewed Jun 11, 2026


Comment thread tags/tag-infrastructure/data-storage-in-cloud-native-ai/topics/inference.md

salaboy reviewed Jun 11, 2026


Comment thread tags/tag-infrastructure/data-storage-in-cloud-native-ai/topics/inference.md Outdated

xing-yang force-pushed the ai_storage branch 3 times, most recently from ecce3f2 to c9a8186 Compare June 14, 2026 14:40

kevin-wangzefeng approved these changes Jun 16, 2026


riaankleinhans added the kind/publication and pub/tech-paper labels Jun 16, 2026

riaankleinhans mentioned this pull request Jun 16, 2026

[TOC Meeting][Public] 2026-06-16 #2069 (Closed)

xing-yang and others added 9 commits June 17, 2026 20:59

Add Data Storage in Cloud Native AI whitepaper

8d47518

Signed-off-by: xing-yang <xingyang105@gmail.com>

Update data-pipelines topic

30ac67e

Signed-off-by: xing-yang <xingyang105@gmail.com>

Apply suggestions from code review

238a79d

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Karena Angell <kangell@redhat.com> Signed-off-by: xing-yang <xingyang105@gmail.com>

Fix references

98410b9

Signed-off-by: xing-yang <xingyang105@gmail.com>

Apply suggestions from code review

4aa846b

Co-authored-by: Karena Angell <kangell@redhat.com> Signed-off-by: Xing Yang <xingyang105@gmail.com> Signed-off-by: xing-yang <xingyang105@gmail.com>

Address review comments

8c55098

Signed-off-by: xing-yang <xingyang105@gmail.com>

Address review comments

ed9603e

Signed-off-by: xing-yang <xingyang105@gmail.com>

xing-yang force-pushed the ai_storage branch from c9a8186 to ed9603e Compare June 18, 2026 01:00

xing-yang removed the needs-triage label Jun 23, 2026

9 participants
