Add Data Storage in Cloud Native AI whitepaper #2124

Open
xing-yang wants to merge 9 commits into cncf:main from xing-yang:ai_storage

@xing-yang

Contributor

No description provided.

@xing-yang xing-yang requested a review from a team as a code owner April 19, 2026 21:03
@github-actions github-actions Bot added needs-triage Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied) needs-kind Indicates an issue or PR that is missing an issue type or kind (a kind/foo label) labels Apr 21, 2026
@github-actions github-actions Bot added the needs-group Indicates an issue or PR that has not been assigned a group (toc or tag/foo label applied) label Apr 21, 2026
@xing-yang xing-yang changed the title WIP: Add Data Storage in Cloud Native AI whitepaper Add Data Storage in Cloud Native AI whitepaper Apr 26, 2026
@github-actions github-actions Bot requested review from GenPage and kashifest April 30, 2026 14:35
@xing-yang xing-yang force-pushed the ai_storage branch 2 times, most recently from 866338c to 80c854a Compare June 3, 2026 02:32
@GenPage GenPage added kind/initiative An initiative or an item related to initiative processes tag/infrastructure TAG Infrastructure and removed needs-group Indicates an issue or PR that has not been assigned a group (toc or tag/foo label applied) needs-kind Indicates an issue or PR that is missing an issue type or kind (a kind/foo label) labels Jun 3, 2026
@github-project-automation github-project-automation Bot moved this to status/new in Initiatives Jun 3, 2026
@GenPage GenPage moved this from status/new to status/in-progress in Initiatives Jun 3, 2026
@GenPage GenPage moved this to In Progress in TAG Infrastructure Jun 3, 2026
@xing-yang

Contributor Author

Hi @angellk, thanks for reviewing! I've addressed your comments.

Comment thread tags/tag-infrastructure/data-storage-in-cloud-native-ai/topics/data-pipelines.md Outdated
Comment thread tags/tag-infrastructure/data-storage-in-cloud-native-ai/topics/data-pipelines.md Outdated
Comment thread tags/tag-infrastructure/data-storage-in-cloud-native-ai/topics/training.md Outdated
@xing-yang xing-yang force-pushed the ai_storage branch 2 times, most recently from fb3718e to 30204bc Compare June 9, 2026 02:53

@angellk angellk left a comment

Contributor

Minor additional nits that I missed. Looks great, @xing-yang!

@xing-yang

Contributor Author

Thanks @angellk!

Comment thread tags/tag-infrastructure/data-storage-in-cloud-native-ai/topics/inference.md Outdated
@xing-yang xing-yang force-pushed the ai_storage branch 3 times, most recently from ecce3f2 to c9a8186 Compare June 14, 2026 14:40

@kevin-wangzefeng kevin-wangzefeng left a comment

Member

Thanks @xing-yang and team. Read through the full set — LGTM, nice work carrying this across the line.

@riaankleinhans riaankleinhans added kind/publication Item related to a publication (blog, tech-paper, etc.) pub/tech-paper Technical paper / whitepaper publication labels Jun 16, 2026
xing-yang and others added 9 commits June 17, 2026 20:59
Signed-off-by: xing-yang <xingyang105@gmail.com>
Added comprehensive reference links to both inference.md and training.md files in the AI storage documentation. The references include relevant tools, frameworks, and platforms for model inference and training workflows in cloud-native environments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: alexagriffith <agriffith96@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Karena Angell <kangell@redhat.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Co-authored-by: Karena Angell <kangell@redhat.com>
Signed-off-by: Xing Yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
@xing-yang

Contributor Author

/triage accepted

@xing-yang xing-yang removed the needs-triage Indicates an issue or PR that has not been triaged yet (has a 'triage/foo' label applied) label Jun 23, 2026
@pmady

pmady commented Jun 24, 2026


The inference section nails the cold start problem: "new replicas may simultaneously fetch large model artifacts from remote storage, increasing startup latency and storage contention", and the common pitfalls list calls out "insufficient bandwidth for concurrent model downloads during scale-out."

One thing missing from the data-cache-locality and inference sections is P2P distribution. When you scale from 2 to 20 replicas of a 30 GB model, every pod hitting the registry at the same time is the bottleneck. Dragonfly (CNCF graduated) handles this: the first pod pulls from the registry, and subsequent pods pull from peers. I have worked on Dragonfly's model distribution backends (Hugging Face, ModelScope), and this is the exact use case it was built for.
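
For a rough sense of the fan-out cost, here is a back-of-the-envelope sketch. It is not Dragonfly's API; the function name `registry_egress_gb` is hypothetical, and it assumes a cold cache and an idealized P2P layer in which exactly one copy of the model leaves the registry and every other replica is served by peers. Real deployments will land somewhere between the two numbers.

```python
def registry_egress_gb(new_replicas: int, model_gb: float, p2p: bool) -> float:
    """Estimate the data (in GB) the registry serves during one scale-out.

    Assumptions (illustrative only):
      - no new replica has the model cached yet;
      - with p2p=True, a single seed pull comes from the registry and
        all remaining replicas fetch from peers.
    """
    if new_replicas <= 0:
        return 0.0
    if p2p:
        # Idealized P2P: one seed copy from the registry.
        return float(model_gb)
    # Direct pulls: every new replica downloads the full model.
    return float(new_replicas * model_gb)


# Scaling from 2 to 20 replicas of a 30 GB model adds 18 new replicas.
new_replicas = 20 - 2
direct = registry_egress_gb(new_replicas, 30, p2p=False)
peer = registry_egress_gb(new_replicas, 30, p2p=True)
print(f"direct pulls: {direct:.0f} GB from registry")
print(f"idealized P2P: {peer:.0f} GB from registry")
```

Under these assumptions the registry serves 540 GB with direct pulls versus 30 GB with peer distribution, which is the gap the fan-out argument rests on.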

The data-cache-locality section covers Fluid/Alluxio well for the local and distributed caching layer. Dragonfly would fit alongside that as the distribution layer, since it solves a different problem (fan-out at scale vs. repeated reads from the same node).

Minor thing: it might also be worth mentioning OCI-based model packaging (ModelCar, KitOps) in the inference section, since that's how a lot of teams are starting to version and distribute model artifacts now.

Labels

kind/initiative An initiative or an item related to initiative processes kind/publication Item related to a publication (blog, tech-paper, etc.) pub/tech-paper Technical paper / whitepaper publication tag/infrastructure TAG Infrastructure

9 participants