Add Data Storage in Cloud Native AI whitepaper #2124
Hi @angellk, thanks for reviewing! I've addressed your comments.
angellk
left a comment
Minor additional nits that I missed. Looks great, @xing-yang!
Thanks @angellk!
kevin-wangzefeng
left a comment
Thanks @xing-yang and team. Read through the full set — LGTM, nice work carrying this across the line.
Signed-off-by: xing-yang <xingyang105@gmail.com>
Added comprehensive reference links to both inference.md and training.md files in the AI storage documentation. The references include relevant tools, frameworks, and platforms for model inference and training workflows in cloud-native environments. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: alexagriffith <agriffith96@gmail.com> Signed-off-by: xing-yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Karena Angell <kangell@redhat.com> Signed-off-by: xing-yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Co-authored-by: Karena Angell <kangell@redhat.com> Signed-off-by: Xing Yang <xingyang105@gmail.com> Signed-off-by: xing-yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
Signed-off-by: xing-yang <xingyang105@gmail.com>
/triage accepted
The inference section nails the cold start problem: "new replicas may simultaneously fetch large model artifacts from remote storage, increasing startup latency and storage contention." The common pitfalls list also calls out "insufficient bandwidth for concurrent model downloads during scale-out."

One thing missing from the data-cache-locality and inference sections is P2P distribution. When you scale from 2 to 20 replicas of a 30 GB model, every pod hitting the registry at the same time is the bottleneck. Dragonfly (CNCF graduated) handles this: the first pod pulls from the registry, and subsequent pods pull from peers. I have worked on Dragonfly's model distribution backends (Hugging Face, ModelScope), and this is the exact use case it was built for.

The data-cache-locality section covers Fluid/Alluxio well for the local + distributed caching layer. Dragonfly would fit alongside that as the distribution layer, since it solves a different problem (fan-out at scale vs. repeated reads from the same node).

Minor thing: it might also be worth mentioning OCI-based model packaging (ModelCar, KitOps) in the inference section, since that's how a lot of teams are starting to version and distribute model artifacts now.
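
To put rough numbers on the fan-out case, here is a minimal back-of-the-envelope sketch. It is an idealized model and not Dragonfly's API: the function name `registry_egress_gb` is hypothetical, and it assumes that with P2P only the first new pod reads from the registry while every other pod is served entirely by peers. Real deployments will differ (chunked transfers, peer churn, nodes that already hold the artifact).

```python
def registry_egress_gb(new_replicas: int, model_gb: float, p2p: bool) -> float:
    """Idealized registry egress (GB) for a single scale-out event.

    Assumption (not from the whitepaper): with P2P, exactly one new pod
    pulls the full artifact from the registry and all others pull from peers.
    Without P2P, every new pod pulls the full artifact from the registry.
    """
    if new_replicas <= 0:
        return 0.0
    registry_pulls = 1 if p2p else new_replicas
    return registry_pulls * model_gb


# Scale-out from the example above: 2 -> 20 replicas of a 30 GB model.
new_pods = 20 - 2
direct = registry_egress_gb(new_pods, 30.0, p2p=False)
peer = registry_egress_gb(new_pods, 30.0, p2p=True)
print(f"direct pulls: {direct:.0f} GB from the registry")
print(f"P2P (idealized): {peer:.0f} GB from the registry")
```

Under these assumptions the registry serves 540 GB for direct pulls versus 30 GB with P2P, which is why the distribution layer matters separately from node-local caching.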