<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://thakicloud.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://thakicloud.github.io/" rel="alternate" type="text/html" /><updated>2026-07-02T07:49:11+09:00</updated><id>https://thakicloud.github.io/feed.xml</id><title type="html">Thaki Cloud Tech Blog | ThakiCloud | 다키클라우드 기술 블로그</title><subtitle>Thaki Cloud (ThakiCloud, 다키클라우드, thaki cloud, THAKI CLOUD, ثاكي كلاود)는 AI/ML Engineering, LLMOps, DevOps 분야의 최신 기술과 실무 경험을 공유하는 전문 기술 블로그입니다. 머신러닝 모델 운영, 쿠버네티스, 클라우드 인프라, AI 엔지니어링 커리어, 인공지능 기술 블로그, 다키클라우드 개발 팀의 깊이 있는 인사이트를 제공합니다. مدونة تقنية متخصصة في هندسة الذكاء الاصطناعي والحوسبة السحابية.</subtitle><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><entry xml:lang="ar"><title type="html">خمسة نماذج أصيلة تبقى حين تذوب حدود الوظائف: من المبتكر حتى الصائن</title><link href="https://thakicloud.github.io/ar/culture/five-product-role-archetypes/" rel="alternate" type="text/html" title="خمسة نماذج أصيلة تبقى حين تذوب حدود الوظائف: من المبتكر حتى الصائن" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/ar/culture/five-product-role-archetypes</id><content type="html" xml:base="https://thakicloud.github.io/ar/culture/five-product-role-archetypes/"><![CDATA[<p><img src="/assets/images/five-product-role-archetypes-hero.png" alt="تصور تجريدي يجسد تلاشي حدود الوظائف وبروز نماذج أصيلة جديدة للأدوار" /></p>

<h2 id="نظرة-عامة">نظرة عامة</h2>

<p>يتكرر مشهد بات مألوفا: المسمى الوظيفي لا يصف بعد الآن ما يفعله صاحبه فعلا. المصمم يكتب نماذج أولية بالكود، والمهندس يجري مقابلات مع المستخدمين، وعالم البيانات يحسم اتجاه المنتج. مع امتصاص أدوات الذكاء الاصطناعي للجانب الميكانيكي من كل وظيفة، تتداخل حدود الهندسة والمنتج والتصميم والتحليل وتذوب في كتلة واحدة.</p>

<p>أمام هذا التحول، رصد Boris Cherny صانع Claude Code ملاحظة لافتة: حين أمعن النظر في فريق Claude Code الذي ينتمي إليه، وجد خمسة نماذج أصيلة للأدوار تتشكل بمعزل عن الوظائف الرسمية. وأهمية هذه الملاحظة بسيطة: إنها تطرح فرضية مفادها أن منظمات المستقبل قد تبني فرقها على أساس هذه النماذج لا على أساس الوظائف التقليدية.</p>

<p>يتناول هذا المقال ماهية النماذج الخمسة، وسبب انفصالها عن الوظائف الرسمية، والتركيبة اللازمة منها في كل مرحلة من مراحل نضج المنتج. هذا ليس ملخصا تقنيا، بل مقال ثقافي يتساءل: كيف نبني الفرق وكيف ننظر إلى التوظيف؟ وهو سؤال مباشر بصفة خاصة لمنظمات كـ ThakiCloud حيث يعمل البشر والوكلاء الآليون جنبا إلى جنب.</p>

<h2 id="النماذج-الأصيلة-الخمسة">النماذج الأصيلة الخمسة</h2>

<p>النماذج التي صاغها Cherny هي كالتالي، مع توضيح كيف يظهر كل نموذج في الفرق الفعلية.</p>

<p><strong>المبتكر (Prototyper)</strong> هو من يتصور أفكارا جديدة كليا. يطرح أفكارا بكثافة، لكن معظمها لا يصل إلى الإطلاق. قيمة هذا النموذج ليست في معدل نجاحه، بل في كثافة الأفكار التي ينتجها. حتى لو رُفض تسعة من كل عشرة أفكار، فإن غياب من يفتح آفاقا جديدة يعني توقف المنظمة عن التقدم إلى أراض مجهولة.</p>

<p><strong>المنفذ (Builder)</strong> هو من يحول النماذج الأولية والأفكار بسرعة إلى منتجات أو بنية تحتية جاهزة للإنتاج. دوره تضييق المسافة بين الفكرة والإطلاق. إن كان المبتكر يرسم المخططات، فالمنفذ يحول تلك المخططات إلى مبانٍ قائمة.</p>

<p><strong>المنظف (Sweeper)</strong> هو المرتب بامتياز: يصقل الواجهات المبعثرة، ويبسط الكود والأنظمة، ويزيل الميزات غير المستخدمة، ويرفع الأداء. عمله الحذف لا الإضافة. قرار إلغاء ميزة (unship) يستدعي شجاعة لا تقل عن شجاعة بنائها.</p>

<p><strong>المنمي (Grower)</strong> يأخذ منتجا قائما ويحسنه باستمرار لرفع مستوى الملاءمة مع السوق (PMF). لا يعيد رسم اللوحة من الصفر، بل يعمل على الصورة الموجودة ليرفع معدلات التحويل ويخفض الاضطراب ويراكم تحسينات صغيرة.</p>

<p><strong>الصائن (Maintainer)</strong> هو من يتملك الأنظمة الناضجة. يحافظ على الأمن والاستقرار والسرعة والكفاءة مع تنامي الأنظمة. لا بريق في عمله، لكن من دونه ينهار المنتج الناجح تحت ثقله.</p>

<pre><code class="language-mermaid">flowchart TB
    P["المبتكر (Prototyper)&lt;br/&gt;يولد أفكارا جديدة"]
    B["المنفذ (Builder)&lt;br/&gt;يحول إلى منتج جاهز للإنتاج"]
    S["المنظف (Sweeper)&lt;br/&gt;التبسيط والترتيب والأداء"]
    G["المنمي (Grower)&lt;br/&gt;تحسين PMF بصفة مستمرة"]
    M["الصائن (Maintainer)&lt;br/&gt;الأمن والاستقرار والتوسع"]
    P --&gt; B
    B --&gt; S
    S --&gt; G
    G --&gt; M
    M -.الصيانة وإعادة الاختراع.-&gt; B
</code></pre>

<h2 id="النموذج-ليس-وظيفة">النموذج ليس وظيفة</h2>

<p>جوهر هذه الملاحظة ليس القائمة في حد ذاتها، بل حقيقة أن هذه النماذج لا ترتبط بالوظائف الرسمية. يقول Cherny إنه حين ينظر إلى Anthropic في مجملها يجد بعض المصممين ينتمون إلى النموذج الأول (المبتكر)، وآخرين إلى الثاني (المنفذ)، وغيرهم إلى الثالث (المنظف). والأمر ذاته ينطبق على المهندسين ومديري المنتجات وعلماء البيانات.</p>

<p>بمعنى آخر، تفقد عبارة “نوظف مصمما” من معناها يوما بعد يوم. فالمصمم المبتكر الذي يفتح آفاقا جديدة يختلف اختلافا جذريا في طريقة إسهامه عن المصمم المنظف الذي يصقل ويكمل. المسمى الوظيفي يخبرك بالأدوات التي تعلمها، لكنه لا يخبرك بالحظة التي يتألق فيها.</p>

<p>كثيرون يجمعون بين نموذجين، وأحيانا ثلاثة. من يجمع بين المبتكر والمنفذ نادر وثمين في الشركات الناشئة المبكرة. ومن يجمع بين المنظف والصائن يشكل عمود فقري فرق البنية التحتية الناضجة. بدلا من حشر كل شخص في صندوق واحد، الأدق أن ننظر إلى الطيف الذي يقع عليه في هذه النماذج.</p>

<h2 id="تشكيل-الفرق-وفق-دورة-حياة-المنتج">تشكيل الفرق وفق دورة حياة المنتج</h2>

<p>السبب الحقيقي لأهمية هذه النماذج هو أنها تصبح صيغة لتشكيل الفرق. يرى Cherny أن الفريق الصحي يحتاج إلى تركيبة مختلفة من النماذج وفق درجة نضج المنتج.</p>

<p>المنتج الجديد الذي لم يجد بعد ملاءمته مع السوق يحتاج إلى أشخاص أقوياء في النماذج الأول والثاني والثالث (المبتكر + المنفذ + المنظف). في هذه المرحلة لا أحد يعرف ما الصواب، لذا القدرة على البناء السريع والتخلي السريع وتغيير الاتجاه باستمرار هي ما يهم. تجميع أشخاص ذوي ميول صون قوية في هذه المرحلة يعني صون ما لم يُبن بعد.</p>

<p>المنتج في طور النمو بعد تحقيق الملاءمة مع السوق يحتاج إلى النماذج الثاني والثالث والرابع (المنفذ + المنظف + المنمي) مع جرعة من النموذج الخامس (الصائن). الاتجاه معروف الآن، فالمهمة رفع الجودة وتحسين التحويل مع قدر أدنى من الاستقرار لاستيعاب المستخدمين المتزايدين.</p>

<p>المنتج الناضج ذو الملاءمة القوية مع السوق يحتاج إلى النماذج الثالث والرابع والخامس (المنظف + المنمي + الصائن) مع جرعة من النموذج الثاني (المنفذ). المهمة إبقاء النظام بسيطا، والتحسين المستمر، والحفاظ على الأمن والسرعة في مستويات التوسع، مع البناء الجديد حين يلزم فحسب.</p>

<pre><code class="language-mermaid">flowchart TB
    PRE["قبل PMF&lt;br/&gt;منتج جديد"]
    GROW["مرحلة النمو&lt;br/&gt;تحقق PMF"]
    MATURE["مرحلة النضج&lt;br/&gt;PMF قوي"]
    PRE --&gt;|"المبتكر + المنفذ + المنظف"| GROW
    GROW --&gt;|"المنفذ + المنظف + المنمي (+ جرعة الصائن)"| MATURE
    MATURE --&gt;|"المنظف + المنمي + الصائن (+ جرعة المنفذ)"| MATURE
</code></pre>

<p>الدلالة العملية لهذه الصيغة واضحة: حين تضيف شخصا إلى الفريق، السؤال الأول ليس “هل يعاني الفريق من نقص في المهندسين؟” بل “أي نموذج يغيب عن فريقنا في هذه المرحلة؟” إشباع فريق منتج ناضج بالمبتكرين يعني فيضا من الأفكار الجديدة دون من يصون النظام. والعكس، جمع الصائنين في منتج لم يجد ملاءمته بعد يعني التحصن لحماية ما لا وجود له أصلا.</p>

<h2 id="منظور-thakicloud-إعادة-رسم-الأدوار-في-عصر-الوكلاء">منظور ThakiCloud: إعادة رسم الأدوار في عصر الوكلاء</h2>

<p>الملاحظة القائلة بأن حدود الوظائف تذوب تصبح أحد المشهدية في المنظمات التي يعمل فيها البشر والوكلاء جنبا إلى جنب. حين تستوعب وكلاء الذكاء الاصطناعي حصة وافرة من عمليات البناء الميكانيكية، ينجرف البشر تلقائيا نحو النماذج الأصيلة الأكثر أهمية في كل مرحلة من مراحل المنتج. العنق الزجاجي لن يكون الأيدي التي تكتب الكود، بل العقول التي تشخص أي نموذج تحتاجه اللحظة.</p>

<p>Paxis، الحوسبة السحابية Native للوكلاء التي تشغلها ThakiCloud، تجسد هذا التحول على مستوى طبقة النظام. تعامل Paxis المهارات والأدوات والسياسات وسجلات التدقيق بوصفها موارد من الدرجة الأولى، وتختار أكثر من 960 مهارة عبر BM25 وتنفذها في بيئات معزولة. كما قال Cherny إن الأشخاص تُعاد صياغتهم وفق لحظات المنتج لا وفق مسمياتهم الوظيفية، كذلك تُجمع Paxis قدرات الوكلاء ديناميكيا وفق متطلبات المهمة لا وفق خطوط أنابيب جامدة. المبتكر يطرح الأفكار، فيحولها وكيل بدور المنفذ إلى كود جاهز للإنتاج، ثم يرتب بوابة التحقق بدور المنظف المخرجات، وكل ذلك يتكرر داخل حزمة المهارات.</p>

<p>على صعيد البنية التحتية، يضطلع ai-platform من ThakiCloud بالعمل الكامل للنموذج الصائن. جدولة وحدات GPU عبر Kueue، وتقديم النماذج عبر vLLM، والوفاء بمتطلبات الخصوصية والسيادة في بيئات K8s متعددة المستأجرين: كل ذلك هو بالضبط عمل الصائن الذي يحفظ الأمن والاستقرار والكفاءة في الأنظمة الناضجة. تفويض منظمات العملاء لهذا الجانب إلى المنصة يتيح لفرقهم الانتشار أكثر في اتجاه المبتكرين والمنميين.</p>

<p>هذا المنظور مفيد للتوظيف أيضا. تنظر ThakiCloud إلى المتقدمين من زاوية أي نموذج يمثلون، لا من زاوية مسمياتهم في السيرة الذاتية. الشخص الذي يملأ النموذج الغائب عن مرحلتنا الحالية هو من يخلق أكبر قدر من الرافعة للفريق. السؤال ليس “ماذا تحسن؟” فحسب، بل “أي لحظة تتألق فيها؟”</p>

<h2 id="حدود-الإطار-والحجج-المضادة">حدود الإطار والحجج المضادة</h2>

<p>قبل قبول هذا الإطار دون نقد، تستحق الحجج المقابلة الاستماع. أشار Ben Vinegar في السياق ذاته إلى أن “الناس يكتشفون كيف تعمل منظمات البرمجيات للتو، ثم يخطئون في عزو ديناميكيات الفرق القديمة إلى الذكاء الاصطناعي.” اعتراض حاد ومشروع: التمييز بين المبتكر والصائن موجود منذ ما قبل الذكاء الاصطناعي، وأن درجة نضج المنتج تحدد نوع الموهبة المطلوبة ليست فكرة جديدة.</p>

<p>ثمة حدود للتصنيف في حد ذاته. كل محاولة لوضع الناس في خمسة صناديق تعاني من خطر تبسيط الأفراد تبسيطا مفرطا. في الواقع، يتنقل الشخص الواحد بين عدة نماذج من مشروع لآخر، بل خلال اليوم الواحد. الخطأ هو النظر إلى النماذج بوصفها هويات ثابتة، فيصدر حكم من قبيل “أنت منظف إذن لا تقترح أفكارا جديدة”، وهذا عكس الغرض تماما. لهذا شدد Cherny نفسه على أن كثيرين يجمعون بين نماذج متعددة.</p>

<p>ومع ذلك، تبقى قيمة هذا الإطار في اللغة التي يمنحها لا في قدرته التنبؤية. حين يصبح بإمكانك القول “يعاني فريقنا من نقص في المنمين” بدلا من “نحتاج مزيدا من المهندسين”، تنتقل محادثات التوظيف وتشكيل الفرق إلى مستوى أكثر دقة وجدوى. كلما جردت الذكاء الاصطناعي الوظائف من طبقتها الميكانيكية، كلما كان ما يبقى هو الأحكام على مستوى هذه النماذج. أدوار المنتج في المستقبل قد تشبه هذه النماذج أكثر مما تشبه المسميات الوظيفية اليوم.</p>

<h2 id="خاتمة">خاتمة</h2>

<p>ذوبان حدود الوظائف ليس أزمة، بل إعادة تشكيل. النماذج الخمسة: المبتكر والمنفذ والمنظف والمنمي والصائن تكشف ما يبقى حين تختفي المسميات الوظيفية. ما يبقى ليس الأدوات، بل جوهر السؤال: في أي لحظة وبأي طريقة يقدم الشخص إسهامه؟</p>

<p>تبني ThakiCloud منظمة يتقاسم فيها البشر والوكلاء هذه النماذج. كلما تولت الوكلاء قدرا أكبر من عمليات البناء والصون المتكررة، كلما تركزت قدرة البشر على قراءة أي نموذج تحتاجه مرحلة المنتج الراهنة. تلك القراءة ستكون أثمن القدرات في العقد القادم.</p>

<h2 id="المصادر">المصادر</h2>

<ul>
  <li>Boris Cherny, X(@bcherny), 2026-06-29: <a href="https://x.com/bcherny/status/2071379474277613732">التغريدة الأصلية</a></li>
  <li>Ben Vinegar, X(@bentlegen): <a href="https://x.com/bentlegen/status/2071576459538567463">تغريدة الرد والاعتراض</a></li>
</ul>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="culture" /><category term="مستقبل-العمل" /><category term="ثقافة-تنظيمية" /><category term="فريق-المنتج" /><category term="توظيف" /><category term="Boris Cherny" /><category term="Claude Code" /><summary type="html"><![CDATA[في عصر تتشابك فيه الهندسة والمنتج والتصميم والبيانات في كتلة واحدة، يستعرض هذا المقال النماذج الأصيلة الخمسة التي رصدها Boris Cherny صانع Claude Code، وصيغة تشكيل الفرق وفق مرحلة نضج المنتج.]]></summary></entry><entry xml:lang="ar"><title type="html">‏Qwen3.6-27B بدقة 4 بت: لماذا نزل تكميم NVFP4 إلى معمارية Hopper</title><link href="https://thakicloud.github.io/ar/owm/qwen3-6-27b-nvfp4-onprem-serving/" rel="alternate" type="text/html" title="‏Qwen3.6-27B بدقة 4 بت: لماذا نزل تكميم NVFP4 إلى معمارية Hopper" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/ar/owm/qwen3-6-27b-nvfp4-onprem-serving</id><content type="html" xml:base="https://thakicloud.github.io/ar/owm/qwen3-6-27b-nvfp4-onprem-serving/"><![CDATA[<p>⏱️ <strong>وقت القراءة المتوقع</strong>: 11 دقيقة</p>

<p><img src="/assets/images/qwen3-6-27b-nvfp4-onprem-serving-hero.png" alt="مخطط مفاهيمي لتكميم Qwen3.6-27B NVFP4 رباعي البت" /></p>

<h2 id="نظرة-عامة">نظرة عامة</h2>

<p>أصدرت NVIDIA النموذج <code class="language-plaintext highlighter-rouge">nvidia/Qwen3.6-27B-NVFP4</code>، وهو نسخة مكمّمة بدقة NVFP4 رباعية البت من نموذج Qwen3.6-27B الخاص بشركة Alibaba. يضغط نموذج استدلال بـ 27 مليار معامل ذا انتباه هجين إلى 4 بت، فيخفض ذاكرة الأوزان بنحو 2.5 ضعف مع إبقاء الفجوة عن خط أساس FP8 ضمن نقطة واحدة عبر المعايير التسعة كلها. والرخصة هي Apache 2.0.</p>

<p>هناك ثلاث نقاط جديرة بالتوضيح. أولاً، خلافاً لإصدار <code class="language-plaintext highlighter-rouge">Gemma-4-26B-A4B-NVFP4</code> السابق الذي لم يحصل على تسريع 4 بت عملياً إلا على Blackwell، تذكر بطاقة هذا الإصدار <strong>معماريتَي Hopper وBlackwell معاً ضمن العتاد المدعوم</strong>. أي أن الفريق الذي يشغّل H100 أو H200 يستطيع تجربته اليوم دون شراء عتاد جديد. ثانياً، هذا ليس نموذجاً لغوياً نصياً فقط بل <strong>نموذج استدلال متعدد الوسائط يستقبل مدخلات نصية وصورية وفيديو</strong>. ثالثاً، تتسع نافذة السياق حتى <strong>262 ألف رمز</strong>، فتستوعب المستندات الطويلة والمحادثات الممتدة دفعة واحدة.</p>

<p>تشغّل ThakiCloud منصة تدير حصص وحدات GPU عبر Kueue وتخدم النماذج بأسلوب متعدد المستأجرين عبر vLLM على Kubernetes. لذا فإن سؤال “كم نموذجاً أكبر، وكم مستأجراً إضافياً، يمكننا وضعه على وحدات GPU التي نملكها أصلاً؟” ليس خبراً طريفاً بل يغذّي نموذج التكلفة مباشرة. يستعرض هذا المقال حقائق النموذج، ويحلل سبب نزول NVFP4 إلى Hopper، ثم يقيّم بصراحة مسار الخدمة وفائدته على منصتنا.</p>

<h2 id="ما-هذا-النموذج">ما هذا النموذج</h2>

<p><code class="language-plaintext highlighter-rouge">nvidia/Qwen3.6-27B-NVFP4</code> هو نموذج <code class="language-plaintext highlighter-rouge">Qwen3.6-27B</code> من Alibaba مكمّماً بدقة NVFP4 عبر NVIDIA Model Optimizer (nvidia-modelopt v0.45.0). وفيما يلي المواصفات الأساسية حسب بطاقة النموذج.</p>

<table>
  <thead>
    <tr>
      <th>العنصر</th>
      <th>القيمة</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>النموذج الأساسي</td>
      <td>Alibaba Qwen3.6-27B</td>
    </tr>
    <tr>
      <td>المعمارية</td>
      <td>انتباه هجين (Gated DeltaNet + Gated Attention)</td>
    </tr>
    <tr>
      <td>إجمالي المعاملات</td>
      <td>27 مليار</td>
    </tr>
    <tr>
      <td>السياق</td>
      <td>262 ألف رمز</td>
    </tr>
    <tr>
      <td>وسائط الإدخال</td>
      <td>نص + صورة + فيديو</td>
    </tr>
    <tr>
      <td>الإخراج</td>
      <td>نص</td>
    </tr>
    <tr>
      <td>التكميم</td>
      <td>NVFP4 (Model Optimizer v0.45.0)</td>
    </tr>
    <tr>
      <td>العتاد المستهدف</td>
      <td>NVIDIA Hopper، Blackwell</td>
    </tr>
    <tr>
      <td>الرخصة</td>
      <td>Apache 2.0</td>
    </tr>
  </tbody>
</table>

<p>الجزء اللافت هو معمارية <strong>الانتباه الهجين</strong>. فـ Gated DeltaNet مسار من فئة الانتباه الخطي، مصمَّم لمعالجة المتتاليات الطويلة بكفاءة، خلافاً للانتباه المعتاد الذي تنمو كلفته مع طول المتتالية. ومزجه مع Gated Attention الذي يحمل القدرة التعبيرية يمنح توازناً يستوعب سياقاً بطول 262 ألف رمز مع الحفاظ على الجودة. كما أن اشتراط <code class="language-plaintext highlighter-rouge">--reasoning-parser qwen3</code> عند الخدمة يؤكد أن هذا <strong>نموذج استدلال</strong> يولّد أثر التفكير قبل الإجابة النهائية.</p>

<p>ونذكر بصراحة أمراً واحداً: تذكر بطاقة النموذج الانتباه الهجين لكنها لا تفصح عن عدد الطبقات الدقيق أو تكوين الخبراء أو المعاملات النشطة لكل رمز. لذا يقتصر هذا المقال على الحقائق المذكورة في البطاقة ولا يقدّر الأرقام غير المعلنة.</p>

<h2 id="تكميم-nvfp4-ماذا-يُضغط-وكيف">تكميم NVFP4: ماذا يُضغط وكيف</h2>

<p>‏NVFP4 هو صيغة الفاصلة العائمة رباعية البت التي تدفع بها NVIDIA. وخلافاً لـ INT4 الذي يقتطع الأوزان إلى أعداد صحيحة رباعية البت ببساطة، فهو أسلوب قياس مصغّر يضع مقياس FP8 لكل كتلة صغيرة، فينعم بتوفير الذاكرة على مستوى 4 بت مع إبقاء فقدان الدقة صغيراً.</p>

<p>في هذا الإصدار، أهداف التكميم هي <strong>أوزان وقيم تنشيط المعاملات الخطية داخل كتل المحوّل</strong>. أما الطبقات غير الخطية فتُترك دون مساس. وتذكر البطاقة أن خفض عدد البتات لكل معامل من 16 إلى 4 يقلّص متطلبات القرص وذاكرة GPU بنحو <strong>2.5 ضعف</strong>. فتحميل 27 مليار معامل بدقة BF16 يحتاج نحو 54 جيجابايت، وبتطبيق الخفض بنحو 2.5 ضعف تنزل نقطة التفتيش إلى نحو 20 جيجابايت. وهذا يفتح مجالاً لوضع أكثر من ضعف النموذج على وحدة GPU نفسها، أو لتحويل الذاكرة المحرَّرة إلى مخزن KV لرفع التزامن.</p>

<p>وهنا يفترق الأمر عن مراجعة Gemma NVFP4 السابقة. فقد كان لدى إصدار Gemma نواة NVFP4 لنماذج MoE معطّلة على Blackwell الاستهلاكي والاحترافي (SM120)، فكان المسار الاستهلاكي الوحيد الذي يعمل فعلاً هو DGX Spark. أما إصدار Qwen3.6 هذا فتذكر بطاقته <strong>معماريتَي Hopper وBlackwell معاً ضمن العتاد المدعوم</strong>، وتستخدم الخدمة مسار <code class="language-plaintext highlighter-rouge">--quantization modelopt</code> في vLLM. ومع تكميم قيم التنشيط إلى جانب الأوزان ووجود مسار خدمة modelopt، يمكن تشغيل هذا النموذج رباعي البت على وحدات H100 وH200 المثبتة أصلاً في مراكز البيانات. لقد تراخى هذه المرة بشكل ملموس قيد “يجب شراء Blackwell جديد لرؤية مكاسب 4 بت”.</p>

<pre><code class="language-mermaid">flowchart TB
    A["Qwen3.6-27B&lt;br/&gt;BF16 نحو 54GB"] --&gt; B["NVIDIA Model Optimizer&lt;br/&gt;v0.45.0"]
    B --&gt; C["تكميم NVFP4&lt;br/&gt;أوزان المعاملات الخطية + التنشيط&lt;br/&gt;من 16 بت إلى 4 بت"]
    C --&gt; D["نقطة تفتيش NVFP4&lt;br/&gt;نحو 20GB · خفض ~2.5 ضعف"]
    D --&gt; E["خدمة vLLM&lt;br/&gt;--quantization modelopt"]
    E --&gt; F["NVIDIA Hopper&lt;br/&gt;H100 / H200"]
    E --&gt; G["NVIDIA Blackwell&lt;br/&gt;B200 وغيرها"]
</code></pre>

<h2 id="المعايير-كم-تكلّف-الدقة-الرباعية">المعايير: كم تكلّف الدقة الرباعية</h2>

<p>تعرض بطاقة النموذج النسخة المكمّمة بـ NVFP4 جنباً إلى جنب مع خط أساس FP8 عبر تسعة معايير.</p>

<table>
  <thead>
    <tr>
      <th>المعيار</th>
      <th>FP8</th>
      <th>NVFP4</th>
      <th>مجال القياس</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MMLU Pro</td>
      <td>86.1</td>
      <td>86.3</td>
      <td>المعرفة العامة والاستدلال</td>
    </tr>
    <tr>
      <td>GPQA Diamond</td>
      <td>86.0</td>
      <td>85.5</td>
      <td>الاستدلال العلمي للدراسات العليا</td>
    </tr>
    <tr>
      <td>HLE</td>
      <td>21.7</td>
      <td>21.8</td>
      <td>الاستدلال العام الصعب</td>
    </tr>
    <tr>
      <td>τ²-Bench Telecom</td>
      <td>95.2</td>
      <td>95.4</td>
      <td>استخدام الوكيل للأدوات</td>
    </tr>
    <tr>
      <td>MMMU Pro</td>
      <td>74.6</td>
      <td>74.3</td>
      <td>الاستدلال متعدد الوسائط</td>
    </tr>
    <tr>
      <td>SciCode</td>
      <td>44.8</td>
      <td>44.5</td>
      <td>البرمجة العلمية</td>
    </tr>
    <tr>
      <td>AIME 2025</td>
      <td>93.1</td>
      <td>92.7</td>
      <td>مسابقة الرياضيات</td>
    </tr>
    <tr>
      <td>AA-LCR</td>
      <td>68.8</td>
      <td>68.3</td>
      <td>الاستدلال ذو السياق الطويل</td>
    </tr>
    <tr>
      <td>IFBench</td>
      <td>65.1</td>
      <td>65.5</td>
      <td>اتباع التعليمات</td>
    </tr>
  </tbody>
</table>

<p>جميع البنود التسعة ضمن نقطة واحدة من FP8. وفي MMLU Pro وHLE وτ²-Bench Telecom وIFBench يتفوق إصدار NVFP4 بفارق ضئيل، والأسلم قراءة ذلك ضمن تباين القياس. الاتجاه واضح: <strong>الجودة محفوظة عملياً تحت 4 بت</strong>، وهنا تظهر ميزة NVFP4 على INT4.</p>

<p>كما يشير تكوين المعايير نفسه إلى طابع النموذج. فـ τ²-Bench Telecom يقيس وكيلاً يستدعي الأدوات لإنجاز المهام، وAA-LCR يقيس الاستدلال ذا السياق الطويل، وMMMU Pro يقيس الفهم متعدد الوسائط. أي أن هذا النموذج يستهدف <strong>استخدام الأدوات لدى الوكلاء، والسياق الطويل، وتعدد الوسائط</strong>، لا مجرد أسئلة المعرفة. ومع ذلك، لا تظهر مهام النطاق الكوري في المعايير العامة، لذا نوصي بتحقق منفصل عبر مجموعة تقييم داخلية قبل التبني.</p>

<h2 id="دليل-الخدمة">دليل الخدمة</h2>

<p>المسار الموصى به في بطاقة النموذج هو vLLM. وأمر التشغيل كالآتي.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm serve nvidia/Qwen3.6-27B-NVFP4 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--quantization</span> modelopt <span class="se">\</span>
  <span class="nt">--max-model-len</span> 262144 <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3
</code></pre></div></div>

<p>ثلاث نقاط تشغيلية مهمة. أولاً، <code class="language-plaintext highlighter-rouge">--quantization modelopt</code> هو العلَم الأساسي الذي يحمّل نقطة تفتيش NVFP4. ثم <code class="language-plaintext highlighter-rouge">--reasoning-parser qwen3</code> لازم كي يُحلَّل أثر التفكير والإجابة النهائية تحليلاً صحيحاً. وأخيراً <code class="language-plaintext highlighter-rouge">--max-model-len 262144</code> يفتح سياق 262 ألف رمز كاملاً، وتنمو ميزانية مخزن KV تبعاً لذلك، فالأكفأ للذاكرة خفضه إلى الطول الذي تحتاجه فعلاً.</p>

<p>يفترض العتاد Hopper أو Blackwell، ونظام التشغيل Linux. وبفضل دعم Hopper، يمكنك التحقق من مسار الخدمة على عُقد H100 وH200 الموجودة أصلاً في مركز البيانات دون معدات إضافية.</p>

<h2 id="منظور-خدمة-thakicloud">منظور خدمة ThakiCloud</h2>

<p>تشغّل ThakiCloud منصة AI/ML قائمة على K8s تدير حصص GPU عبر Kueue وتخدم النماذج بأسلوب متعدد المستأجرين عبر vLLM. وتأتي دلالات هذا النموذج على نموذج تشغيلنا من اتجاهين: البنية التحتية والوكلاء.</p>

<p><strong>مضاعفة الكثافة على أصول Hopper القائمة.</strong> هذه أبرز قيمة عملية لهذا الإصدار. فدعم NVFP4 لـ Hopper يعني إمكان جني مكسب 4 بت على H100 وH200 التي تملكها أصلاً، دون استثمار جديد في Blackwell. وحين تنزل أوزان نموذج بـ 27 مليار معامل إلى نحو 20 جيجابايت، يمكنك وضع مزيد من نسخ النموذج على وحدة GPU نفسها، أو تحويل الذاكرة المحرَّرة إلى مخزن KV لضبط حدود تزامن سخية لكل مستأجر. ومن منظور حصص Kueue، تتحمل البطاقة نفسها عبئاً أكبر، فتنخفض تكلفة الوحدة ببساطة.</p>

<p><strong>مرشح على الخوادم الخاصة لعامل استدلال متعدد الوسائط.</strong> إن Paxis، مستوى التحكم بالوكلاء لدى ThakiCloud، سحابةٌ أصيلة الوكلاء تشغّل المهارات في صناديق رمل معزولة وتمرّر كل إجراء عبر بوابات السياسات وسجلات التدقيق. وفي هذه البنية يقرأ عدد من العمّال المستندات ويستدعون الأدوات وينجزون المهام. ويتميز Qwen3.6-27B-NVFP4 في معايير استخدام الأدوات لدى الوكلاء مثل τ²-Bench Telecom، ويستقبل الصورة والفيديو إلى جانب النص، ويستوعب سياق 262 ألف رمز. فهو مرشح مناسب للتشغيل على الخوادم الخاصة كعامل متعدد الوسائط يتعامل مع المستندات والشاشات والفيديو، وكعامل طرفي في حلقات استدعاء الأدوات. وبحسب انضباط التكلفة لدينا، شغّل العامل بثمن زهيد لكن أغلق التوسع بمرحلة تحقق على نموذج أعلى كي لا تتراكم هلوسات العامل.</p>

<p><strong>مرجع لعروض الخوادم الخاصة والامتثال.</strong> إن تكويناً برخصة Apache 2.0 وخدمة على عقدة واحدة هو تكوين يمكن اقتراحه مباشرة على عملاء القطاع العام والمالي حيث يُحظر تسريب البيانات. وفي البيئات المقيَّدة مثل متطلبات الأمن القومي أو الذكاء الاصطناعي السيادي، يصبح تشغيل نموذج استدلال كبير متعدد الوسائط على وحدات GPU خاصة دون واجهة برمجة تجارية مساراً حقيقياً للتبني.</p>

<h2 id="القيود-والاعتراضات">القيود والاعتراضات</h2>

<p>من باب التوازن، إليك التحفظات.</p>

<ul>
  <li><strong>تفاصيل المعمارية غير معلنة.</strong> الانتباه الهجين مذكور، لكن عدد الطبقات وتكوين الخبراء والمعاملات النشطة غائبة عن البطاقة. وحساب كفاءة الدفعة والذاكرة المقيمة بدقة يتطلب مزيداً من المعلومات.</li>
  <li><strong>لا توجد أرقام إنتاجية مقيسة.</strong> يستند هذا المقال إلى حقائق البطاقة مثل توفير الذاكرة والمعايير. وتتفاوت سرعة الرموز لكل تدفق وحدود التزامن كثيراً بحسب العتاد والإعدادات، فأعد القياس بحمل عملك قبل التبني.</li>
  <li><strong>تباين ناتج عن تكميم التنشيط.</strong> دفع قيم التنشيط، لا الأوزان فحسب، إلى 4 بت قد يُدخل تبايناً في الدقة على الأحمال ذات التوزيعات المائلة. وحتى مع بقاء المعايير العامة ضمن نقطة واحدة، تحقق من المهام الخاصة بالنطاق منفصلة.</li>
  <li><strong>نضج مسار الخدمة متعدد الوسائط.</strong> استقبال الصورة والفيديو بثبات في الإنتاج يتطلب التحقق من كل من خط المعالجة الأولية ونضج مسار vLLM متعدد الوسائط.</li>
  <li><strong>التحقق من الاستخدام الكوري الفعلي.</strong> المعايير العامة تتمحور حول الإنجليزية. ويجب التحقق من دقة RAG واستدعاء الأدوات بالكورية منفصلة عبر مجموعة تقييم داخلية.</li>
</ul>

<p>ومع ذلك، فإن مزيج Apache 2.0، وتسريع 4 بت الذي بات يصل إلى Hopper، والاستدلال متعدد الوسائط، وسياق 262 ألف رمز، خيارٌ جذاب للمؤسسات التي تدرس الخدمة على الخوادم الخاصة. ومجرد انخفاض جدار “اشترِ عتاداً جديداً لتنال مكاسب 4 بت” يجعله جديراً بالتحقق اليوم لأي فريق يملك أسطول Hopper.</p>

<h2 id="روابط-مرجعية">روابط مرجعية</h2>

<ul>
  <li><a href="https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4">بطاقة نموذج Qwen3.6-27B-NVFP4 (Hugging Face)</a></li>
  <li><a href="https://github.com/NVIDIA/TensorRT-Model-Optimizer">NVIDIA TensorRT Model Optimizer</a></li>
  <li><a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">التعريف بـ NVFP4 (NVIDIA Developer)</a></li>
  <li><a href="https://docs.vllm.ai/">توثيق vLLM</a></li>
  <li><a href="https://thakicloud.github.io/ar/owm/gemma-4-26b-nvfp4-dgx-spark/">مراجعة Gemma-4-26B-NVFP4 على DGX Spark (مدونة ThakiCloud)</a></li>
</ul>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="owm" /><category term="qwen3" /><category term="nvfp4" /><category term="quantization" /><category term="hopper" /><category term="blackwell" /><category term="hybrid-attention" /><category term="multimodal" /><category term="vllm" /><category term="on-premise" /><summary type="html"><![CDATA[نموذج Qwen3.6-27B-NVFP4 من NVIDIA يضغط نموذج استدلال بـ 27 مليار معامل ذا انتباه هجين إلى 4 بت، فيخفض الذاكرة بنحو 2.5 ضعف مع إبقاء فجوة المعايير ضمن نقطة واحدة عن FP8. وخلافاً لإصدار Gemma NVFP4 السابق الذي كان يتطلب Blackwell عملياً، يذكر هذا الإصدار معمارية Hopper ضمن العتاد المدعوم، فيستطيع أي فريق يشغّل H100/H200 تجربته على خوادمه الخاصة اليوم. نستعرض حقائق النموذج وآلية NVFP4 ومسار الخدمة ومنظور ThakiCloud.]]></summary></entry><entry xml:lang="ar"><title type="html">التوجيه المزدوج بميزانية الرموز — تقليص وقت GPU في استدلال vLLM بنسبة 31–42% عبر الجدولة ثنائية المجمّعات</title><link href="https://thakicloud.github.io/ar/technique/dual-pool-token-budget-routing-vllm-kueue/" rel="alternate" type="text/html" title="التوجيه المزدوج بميزانية الرموز — تقليص وقت GPU في استدلال vLLM بنسبة 31–42% عبر الجدولة ثنائية المجمّعات" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/ar/technique/dual-pool-token-budget-routing-vllm-kueue</id><content type="html" xml:base="https://thakicloud.github.io/ar/technique/dual-pool-token-budget-routing-vllm-kueue/"><![CDATA[<h2 id="المشكلة-hol-blocking-يُهدر-وقت-gpu-في-صمت">المشكلة: HoL Blocking يُهدر وقت GPU في صمت</h2>

<p>من يدير خدمة استدلال نماذج اللغة الكبيرة في بيئة إنتاجية يلاحظ حتمًا هذا النمط المتكرر: طلب يُنتج عشرات الرموز فحسب — كردّ بسيط في نظام محادثة — يجلس منتظرًا خلف طلب تلخيص وثيقة طويل أو مهمة توليد كود معقدة، مُهدرًا مئات الميلي ثانية. هذا ما يُعرف بـ Head-of-Line (HoL) Blocking.</p>

<p>يرفع نظام vLLM كفاءة المعالجة الدفعية بشكل ملحوظ عبر continuous batching، غير أن البنية ذات المجمّع الواحد تجعل الطلبات الطويلة تستأثر بذاكرة KV Cache لفترات مطوّلة، مما يُجبر الطلبات القصيرة على الانتظار أو إعادة الحساب عند الاستئناف، فيتراجع الاستخدام الفعلي لوقت GPU.</p>

<p>يعالج أسلوب <strong>Dual-Pool Token-Budget Routing</strong> الوارد في arXiv 2604.08075 هذه المشكلة من جذورها: عند وصول كل طلب، يُقدَّر عدد الرموز المتوقع، ويُوجَّه الطلب إما إلى مجمّع السياق القصير أو مجمّع السياق الطويل، فيتعايش النوعان دون تدخّل متبادل.</p>

<p>النتائج التي رصدها البحث:</p>

<table>
  <thead>
    <tr>
      <th>المقياس</th>
      <th>التأثير</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>توفير وقت GPU</td>
      <td><strong>31–42%</strong></td>
    </tr>
    <tr>
      <td>معدل الاستئناف القسري</td>
      <td><strong>تراجع 5.4 أضعاف</strong></td>
    </tr>
    <tr>
      <td>تحسين P99 TTFT</td>
      <td><strong>6%</strong></td>
    </tr>
  </tbody>
</table>

<h2 id="المبدأ-الأساسي-التوجيه-القائم-على-ميزانية-الرموز">المبدأ الأساسي: التوجيه القائم على ميزانية الرموز</h2>

<p>الفكرة وراء Dual-Pool بسيطة. يُقدَّر لكل طلب <strong>أقصى عدد رموز متوقع</strong>، ثم يُعيَّن الطلب لأحد المجمّعَين بناءً على عتبة محددة.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>الرموز المتوقعة = رموز الإدخال + رموز الإخراج المتوقعة
</code></pre></div></div>

<p>حين يكون عدد رموز الإخراج مجهولًا — وهو الحال في معظم سيناريوهات الإنتاج — تُستخدم إحدى التقريبتين التاليتين:</p>

<ol>
  <li><strong>معاملات الطلب</strong>: استخدام قيمة <code class="language-plaintext highlighter-rouge">max_tokens</code> حدًّا أعلى.</li>
  <li><strong>التصنيف القائم على السجل التاريخي</strong>: تتبّع توزيع أطوال الطلبات السابقة لكل مسار API أو بصمة نظام الرسالة، ثم التصنيف بناءً على قيمة P75 أو P90.</li>
</ol>

<p>تعتمد العتبة الفاصلة على طبيعة أعباء العمل؛ وفي تجارب البحث استُخدم 512 رمزًا في الإخراج حدًّا بين القصير والطويل.</p>

<h2 id="المعمارية-هيكل-المجمّعَين">المعمارية: هيكل المجمّعَين</h2>

<pre><code class="language-mermaid">flowchart TB
    A[Client Request] --&gt; B[Router&lt;br/&gt;Token-Budget Classifier]
    B --&gt;|Estimated tokens &lt; threshold| C[Short-Context Pool&lt;br/&gt;vLLM Instance A]
    B --&gt;|Estimated tokens &gt;= threshold| D[Long-Context Pool&lt;br/&gt;vLLM Instance B]
    C --&gt; E[Kueue LocalQueue&lt;br/&gt;short-pool]
    D --&gt; F[Kueue LocalQueue&lt;br/&gt;long-pool]
    E --&gt; G[GPU Worker Group A&lt;br/&gt;Small KV Cache Requests]
    F --&gt; H[GPU Worker Group B&lt;br/&gt;Large KV Cache Requests]
    G --&gt; I[Return Response]
    H --&gt; I
</code></pre>

<p>يُدير مجمّع السياق القصير دورة KV Cache بسرعة عالية للحفاظ على إنتاجية مرتفعة، فيما يحتجز مجمّع السياق الطويل حجمًا كافيًا من ذاكرة KV Cache لإتمام عمليات التوليد الطويلة دون انقطاع. لا يتدخّل المجمّعان في بعضهما البعض.</p>

<h2 id="التكامل-مع-kueue-localqueue">التكامل مع Kueue LocalQueue</h2>

<p>تعتمد منصة ai-platform الخاصة بـ ThakiCloud على Kueue لجدولة أعباء عمل GPU على Kubernetes. يتيح دمج Dual-Pool Routing مع Kueue LocalQueue إدارة تخصيص موارد كل مجمّع بأسلوب تصريحي على مستوى الكلاستر.</p>

<h3 id="الخطوة-1-تعريف-clusterqueue-و-resourceflavor">الخطوة 1: تعريف ClusterQueue و ResourceFlavor</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kueue.x-k8s.io/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ClusterQueue</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">llm-inference-cq</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">namespaceSelector</span><span class="pi">:</span> <span class="pi">{}</span>
  <span class="na">resourceGroups</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">coveredResources</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">nvidia.com/gpu"</span><span class="pi">]</span>
      <span class="na">flavors</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">gpu-a100</span>
          <span class="na">resources</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">nvidia.com/gpu</span>
              <span class="na">nominalQuota</span><span class="pi">:</span> <span class="m">8</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kueue.x-k8s.io/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ResourceFlavor</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">gpu-a100</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">nodeLabels</span><span class="pi">:</span>
    <span class="na">gpu.nvidia.com/model</span><span class="pi">:</span> <span class="s">A100</span>
</code></pre></div></div>

<h3 id="الخطوة-2-تعريف-localqueue-منفصل-لكل-مجمّع">الخطوة 2: تعريف LocalQueue منفصل لكل مجمّع</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kueue.x-k8s.io/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">LocalQueue</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">short-pool-queue</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">clusterQueue</span><span class="pi">:</span> <span class="s">llm-inference-cq</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kueue.x-k8s.io/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">LocalQueue</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">long-pool-queue</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">clusterQueue</span><span class="pi">:</span> <span class="s">llm-inference-cq</span>
</code></pre></div></div>

<h3 id="الخطوة-3-إضافة-التعليق-التوضيحي-للقائمة-في-vllm-deployment">الخطوة 3: إضافة التعليق التوضيحي للقائمة في vLLM Deployment</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Deployment</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">vllm-short-pool</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
  <span class="na">annotations</span><span class="pi">:</span>
    <span class="na">kueue.x-k8s.io/queue-name</span><span class="pi">:</span> <span class="s">short-pool-queue</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">replicas</span><span class="pi">:</span> <span class="m">2</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">vllm</span>
          <span class="na">image</span><span class="pi">:</span> <span class="s">vllm/vllm-openai:latest</span>
          <span class="na">args</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--model"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">meta-llama/Llama-3.1-8B-Instruct"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--max-model-len"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">4096"</span>       <span class="c1"># مجمّع قصير: حد سياق صغير</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--gpu-memory-utilization"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">0.7"</span>        <span class="c1"># دوران سريع لـ KV Cache</span>
          <span class="na">resources</span><span class="pi">:</span>
            <span class="na">limits</span><span class="pi">:</span>
              <span class="na">nvidia.com/gpu</span><span class="pi">:</span> <span class="s2">"</span><span class="s">1"</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Deployment</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">vllm-long-pool</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
  <span class="na">annotations</span><span class="pi">:</span>
    <span class="na">kueue.x-k8s.io/queue-name</span><span class="pi">:</span> <span class="s">long-pool-queue</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">replicas</span><span class="pi">:</span> <span class="m">2</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">vllm</span>
          <span class="na">image</span><span class="pi">:</span> <span class="s">vllm/vllm-openai:latest</span>
          <span class="na">args</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--model"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">meta-llama/Llama-3.1-8B-Instruct"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--max-model-len"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">32768"</span>      <span class="c1"># مجمّع طويل: نافذة سياق واسعة</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--gpu-memory-utilization"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">0.90"</span>       <span class="c1"># احتجاز كبير لـ KV Cache</span>
          <span class="na">resources</span><span class="pi">:</span>
            <span class="na">limits</span><span class="pi">:</span>
              <span class="na">nvidia.com/gpu</span><span class="pi">:</span> <span class="s2">"</span><span class="s">1"</span>
</code></pre></div></div>

<h3 id="الخطوة-4-تطبيق-الموجِّه-مثال-بـ-python">الخطوة 4: تطبيق الموجِّه (مثال بـ Python)</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span><span class="p">,</span> <span class="n">Request</span>
<span class="kn">import</span> <span class="n">httpx</span>

<span class="n">app</span> <span class="o">=</span> <span class="nc">FastAPI</span><span class="p">()</span>

<span class="n">SHORT_POOL_URL</span> <span class="o">=</span> <span class="sh">"</span><span class="s">http://vllm-short-pool-svc:8000/v1/chat/completions</span><span class="sh">"</span>
<span class="n">LONG_POOL_URL</span>  <span class="o">=</span> <span class="sh">"</span><span class="s">http://vllm-long-pool-svc:8000/v1/chat/completions</span><span class="sh">"</span>
<span class="n">TOKEN_THRESHOLD</span> <span class="o">=</span> <span class="mi">512</span>  <span class="c1"># اضبط هذه القيمة وفق السجل التاريخي لأعباء العمل
</span>
<span class="k">def</span> <span class="nf">estimate_output_tokens</span><span class="p">(</span><span class="n">payload</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">استخدام max_tokens حدًّا أعلى. الافتراضي 256 عند الغياب.</span><span class="sh">"""</span>
    <span class="k">return</span> <span class="n">payload</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">max_tokens</span><span class="sh">"</span><span class="p">)</span> <span class="ow">or</span> <span class="mi">256</span>

<span class="k">def</span> <span class="nf">route_request</span><span class="p">(</span><span class="n">payload</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">إعادة عنوان URL المستهدف بحسب عدد الرموز المقدَّر.</span><span class="sh">"""</span>
    <span class="n">estimated</span> <span class="o">=</span> <span class="nf">estimate_output_tokens</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">estimated</span> <span class="o">&lt;</span> <span class="n">TOKEN_THRESHOLD</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">SHORT_POOL_URL</span>
    <span class="k">return</span> <span class="n">LONG_POOL_URL</span>

<span class="nd">@app.post</span><span class="p">(</span><span class="sh">"</span><span class="s">/v1/chat/completions</span><span class="sh">"</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">proxy</span><span class="p">(</span><span class="n">request</span><span class="p">:</span> <span class="n">Request</span><span class="p">):</span>
    <span class="n">payload</span> <span class="o">=</span> <span class="k">await</span> <span class="n">request</span><span class="p">.</span><span class="nf">json</span><span class="p">()</span>
    <span class="n">target_url</span> <span class="o">=</span> <span class="nf">route_request</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="p">.</span><span class="nc">AsyncClient</span><span class="p">(</span><span class="n">timeout</span><span class="o">=</span><span class="mf">120.0</span><span class="p">)</span> <span class="k">as</span> <span class="n">client</span><span class="p">:</span>
        <span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="p">.</span><span class="nf">post</span><span class="p">(</span><span class="n">target_url</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">resp</span><span class="p">.</span><span class="nf">json</span><span class="p">()</span>
</code></pre></div></div>

<p>يُنشر هذا الموجِّه كـ Kubernetes Service ويُوضع أمام نقطة نهاية الاستدلال الحالية.</p>

<h2 id="اعتبارات-التشغيل">اعتبارات التشغيل</h2>

<h3 id="ضبط-العتبة-الفاصلة">ضبط العتبة الفاصلة</h3>

<p>قيمة 512 رمزًا نقطة بداية لا معيار ثابت. في بيئات الإنتاج الفعلية، يُنصح بجمع المقاييس التالية على مدار سبعة أيام على الأقل قبل التعديل:</p>

<ul>
  <li>توزيع رموز الإخراج الفعلية لكل طلب (P50، P90، P99)</li>
  <li>معدل الاستئناف القسري لكل مجمّع (<code class="language-plaintext highlighter-rouge">vllm:num_preemptions_total</code> في Prometheus)</li>
  <li>عمق قائمة الانتظار <code class="language-plaintext highlighter-rouge">vllm:num_requests_waiting</code> لكل مجمّع</li>
</ul>

<p>إن ظلّ عمق قائمة انتظار المجمّع القصير مرتفعًا باستمرار، فاخفض العتبة أو أضف نسخًا إضافية. وإن تدنّى معدل استخدام GPU في المجمّع الطويل، فارفع العتبة لتقليل الطلبات الموجَّهة إليه.</p>

<h3 id="التكامل-مع-keda-للتوسع-التلقائي">التكامل مع KEDA للتوسع التلقائي</h3>

<p>تُتيح إضافة ScaledObject من KEDA مستندًا إلى مقاييس Prometheus الخاصة بـ vLLM توسعًا تلقائيًا مستقلًا لكل مجمّع:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">keda.sh/v1alpha1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ScaledObject</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">vllm-short-pool-scaler</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">scaleTargetRef</span><span class="pi">:</span>
    <span class="na">name</span><span class="pi">:</span> <span class="s">vllm-short-pool</span>
  <span class="na">minReplicaCount</span><span class="pi">:</span> <span class="m">1</span>
  <span class="na">maxReplicaCount</span><span class="pi">:</span> <span class="m">8</span>
  <span class="na">triggers</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">type</span><span class="pi">:</span> <span class="s">prometheus</span>
      <span class="na">metadata</span><span class="pi">:</span>
        <span class="na">serverAddress</span><span class="pi">:</span> <span class="s">http://prometheus:9090</span>
        <span class="na">metricName</span><span class="pi">:</span> <span class="s">vllm_requests_waiting_short</span>
        <span class="na">query</span><span class="pi">:</span> <span class="s">vllm:num_requests_waiting{deployment="vllm-short-pool"}</span>
        <span class="na">threshold</span><span class="pi">:</span> <span class="s2">"</span><span class="s">5"</span>
</code></pre></div></div>

<p>التوسع القائم على المقاييس أكثر استجابةً لأحمال الاستدلال مقارنةً بالتوسع القائم على معدل HTTP. عتبة <code class="language-plaintext highlighter-rouge">5</code> تعني بدء التوسع حين تتجاوز الطلبات المنتظرة خمسة طلبات.</p>

<h3 id="مشاركة-النموذج-مقابل-فصل-النسخ">مشاركة النموذج مقابل فصل النسخ</h3>

<p>لا يشترط الأسلوب استخدام نسختين منفصلتين من vLLM. تشغيل النموذج ذاته بإعدادات <code class="language-plaintext highlighter-rouge">--max-model-len</code> مختلفة هو التكوين الافتراضي، لكن إن توفّرت ميزانية ذاكرة كافية، يمكن لنسخة واحدة من vLLM أن تعرض منفذَي اتصال خارجيَّين بأولوية داخلية مختلفة.</p>

<p>غير أن <strong>فصل النسخ هو الخيار الأوضح</strong> للقضاء الكامل على تداخل الاستئناف، إذ تتشارك ذاكرة KV Cache داخل عملية vLLM الواحدة.</p>

<h2 id="الأهمية-بالنسبة-لمنصة-thakicloud">الأهمية بالنسبة لمنصة ThakiCloud</h2>

<p>تُخدِّم منصة ai-platform الخاصة بـ ThakiCloud أعباء استدلال عدة مستأجرين على كلاستر GPU مشترك. يُضيف Dual-Pool Routing ميزتَين ملموستَين في هذا السياق.</p>

<p>أولًا، يُقلّص التداخل بين المستأجرين. حين تتأخر طلبات المستأجر أ — ذات الطابع القصير — خلف مهام التحليل الطويلة للمستأجر ب، ينتج عن ذلك انتهاك لمستويات الخدمة المتفق عليها. فصل المجمّعات يقطع هذا التداخل على مستوى البنية.</p>

<p>ثانيًا، يرفع كفاءة ميزانية GPU. توفير 31–42% من وقت GPU يعني إما استيعاب طلبات أكثر بالميزانية ذاتها، أو تحقيق الإنتاجية نفسها بعدد أقل من وحدات GPU. في بيئات الخوادم المحلية ذات الموارد الثابتة، ينعكس هذا التوفير مباشرةً على تكلفة الخدمة.</p>

<p>بالنسبة لكلاسترات ThakiCloud التي تستخدم Kueue LocalQueue بالفعل، يتطلب تطبيق هذا الأسلوب إعلان قائمتَي انتظار فحسب وتوزيع موجِّه خفيف الوزن. التوافق مع مواصفات vLLM Deployment الحالية مرتفع، مما يُوسّع نطاق التبنّي.</p>

<h2 id="خلاصة">خلاصة</h2>

<p>المشكلة التي يعالجها Dual-Pool Token-Budget Routing واضحة: حين تتشارك الطلبات القصيرة والطويلة قائمة انتظار واحدة، تخسر القصيرة. فصلها على مستوى قائمة الانتظار يُتيح لكل نوع المعالجة بسرعته الطبيعية.</p>

<p>النتائج التي رصدها arXiv 2604.08075 — توفير 31–42% من وقت GPU، وتراجع معدل الاستئناف بمقدار 5.4 أضعاف، وتحسن بنسبة 6% في P99 TTFT — تمثّل عائدًا كبيرًا قياسًا بتعقيد التطبيق. على Kubernetes، يكفي وجود قائمتَي Kueue LocalQueue واثنتَي نشر vLLM Deployment وموجِّه خفيف واحد لبناء هذه البنية.</p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="technique" /><category term="vllm" /><category term="llm-inference" /><category term="kueue" /><category term="gpu-scheduling" /><category term="llmops" /><category term="kubernetes" /><summary type="html"><![CDATA[الطلبات القصيرة التي تنتظر خلف الطلبات الطويلة في قائمة انتظار واحدة تُهدر وقت GPU في صمت — وهذا ما يُعرف بـ HoL Blocking. يقترح أسلوب Dual-Pool Token-Budget Routing الوارد في arXiv 2604.08075 تقسيم الطلبات بين مجمّع سياق قصير ومجمّع سياق طويل، محققًا توفيرًا في وقت GPU بنسبة 31–42% وتحسينًا بنسبة 6% في P99 TTFT. يشرح هذا المقال خطوات تطبيق هذه التقنية على Kubernetes باستخدام Kueue LocalQueue.]]></summary></entry><entry xml:lang="en"><title type="html">My Whole AI Stack Went Chinese</title><link href="https://thakicloud.github.io/en/comics/ai-stack-sovereignty/" rel="alternate" type="text/html" title="My Whole AI Stack Went Chinese" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/en/comics/ai-stack-sovereignty</id><content type="html" xml:base="https://thakicloud.github.io/en/comics/ai-stack-sovereignty/"><![CDATA[<p>You wake up and the whole stack belongs to someone else. The model, the inference engine, the vector DB, all rented from a company across the ocean. It runs fine. The catch is that you control none of it, so the day the terms change or an export rule lands, it’s over. Six panels of Paxis and Metis working out what to do.</p>

<p><img src="/assets/images/posts/comics/ai-stack-sovereignty/strip.png" alt="My Whole AI Stack Went Chinese" /></p>

<blockquote>
  <p>Source: <a href="https://x.com/hjguyhan/status/2071779159391793563">My entire AI stack is now Chinese</a> · twitter</p>
</blockquote>

<h2 id="what-this-means-for-thakicloud">What this means for ThakiCloud</h2>

<p>It’s a joke, but it happens to real teams. You adopted the stack for convenience, and control quietly walked out with it. ThakiCloud is built for exactly this. Train open models on your own GPU cluster with Kubeflow, serve them with vLLM, and keep your data behind your own firewall. Build agents with Paxis, run the platform with Metis, and every layer stays yours. If you would rather not pin your product to someone else’s terms of service, sovereign on-prem AI is the point.</p>

<hr />

<p><em>An auto-generated comic riffing on this week’s industry news.</em></p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="comics" /><category term="만화" /><category term="comic" /><category term="온프렘" /><category term="sovereignty" /><category term="AI" /><summary type="html"><![CDATA[The day the whole stack belonged to someone else, and Paxis and Metis cope.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://thakicloud.github.io/assets/images/posts/comics/ai-stack-sovereignty/strip.png" /><media:content medium="image" url="https://thakicloud.github.io/assets/images/posts/comics/ai-stack-sovereignty/strip.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry xml:lang="en"><title type="html">The Future of Work: Five Archetypes, Not Job Titles</title><link href="https://thakicloud.github.io/en/comics/five-work-archetypes/" rel="alternate" type="text/html" title="The Future of Work: Five Archetypes, Not Job Titles" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/en/comics/five-work-archetypes</id><content type="html" xml:base="https://thakicloud.github.io/en/comics/five-work-archetypes/"><![CDATA[<p>The future of work supposedly dissolves job titles into five archetypes: prototyper, builder, sweeper, grower, maintainer. Everyone wants the first four. Nobody raises a hand for the sixth, the unglamorous work of keeping the lights on. Paxis and Metis act out what happens to that empty chair.</p>

<p><img src="/assets/images/posts/comics/five-work-archetypes/strip.png" alt="The Future of Work: Five Archetypes, Not Job Titles" /></p>

<blockquote>
  <p>Source: <a href="https://x.com/bcherny">Boris Cherny on the five archetypes of future roles</a> · twitter</p>
</blockquote>

<h2 id="what-this-means-for-thakicloud">What this means for ThakiCloud</h2>

<p>For the five creative archetypes to shine, someone has to take the sixth: GPU lifecycle, scaling, security patches, the 3 a.m. incident nobody sees. That is what ThakiCloud does. The platform absorbs the repetitive toil so your team can build and grow. We even turned these five archetypes into orchestration agents of our own. The sixth role nobody wants is the one the platform is built to carry.</p>

<hr />

<p><em>An auto-generated comic riffing on this week’s industry news.</em></p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="comics" /><category term="future-of-work" /><category term="engineering-roles" /><category term="AI" /><category term="on-prem" /><category term="ThakiCloud" /><summary type="html"><![CDATA[Roles blur into prototyper, builder, sweeper, grower, maintainer, and nobody wants the sixth.]]></summary><media:thumbnail xmlns:media="http://search.yahoo.com/mrss/" url="https://thakicloud.github.io/assets/images/posts/comics/five-work-archetypes/strip.png" /><media:content medium="image" url="https://thakicloud.github.io/assets/images/posts/comics/five-work-archetypes/strip.png" xmlns:media="http://search.yahoo.com/mrss/" /></entry><entry xml:lang="en"><title type="html">Five Archetypes That Remain When Job Titles Dissolve: From Prototyper to Maintainer</title><link href="https://thakicloud.github.io/en/culture/five-product-role-archetypes/" rel="alternate" type="text/html" title="Five Archetypes That Remain When Job Titles Dissolve: From Prototyper to Maintainer" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/en/culture/five-product-role-archetypes</id><content type="html" xml:base="https://thakicloud.github.io/en/culture/five-product-role-archetypes/"><![CDATA[<p><img src="/assets/images/five-product-role-archetypes-hero.png" alt="Abstract visual depicting the blurring of role boundaries and the emergence of new archetypes" /></p>

<h2 id="overview">Overview</h2>

<p>Job titles are increasingly failing to describe the actual work. Designers shipping prototypes as code, engineers conducting user interviews, data scientists setting product direction – none of this feels strange anymore. As AI tools absorb the mechanical layers of each discipline, the boundaries between engineering, product, design, and data analysis are melting together.</p>

<p>Boris Cherny, creator of Claude Code, offered an interesting observation about this shift. Looking at his own Claude Code team, he noticed five role archetypes cutting across job functions. The reason this observation matters is simple: it raises a hypothesis that future organizations may staff teams around combinations of these archetypes rather than around functional job categories.</p>

<p>This post unpacks what those five archetypes are, why they detach from formal job titles, and what combinations a team needs at different stages of product maturity. This is not a technical summary – it is a culture essay asking how we should think about team composition and hiring. For organizations like ThakiCloud, where humans and agents work side by side, the question is particularly direct.</p>

<h2 id="the-five-role-archetypes">The Five Role Archetypes</h2>

<p>Cherny’s archetypes are as follows. For each one, some notes on how it shows up in real teams.</p>

<p><strong>The Prototyper</strong> generates entirely new ideas. This person produces a flood of concepts, most of which never ship. The value of this archetype lies not in a high success rate but in the density of imagination. Without someone who can open a new direction – even if nine out of ten ideas get discarded – an organization cannot push into new territory.</p>

<p><strong>The Builder</strong> converts prototypes and ideas into production-grade products or infrastructure, quickly. This is the role that closes the distance between conception and launch. If the Prototyper is the sketch, the Builder is the one who turns that sketch into a building that actually stands.</p>

<p><strong>The Sweeper</strong> tidies things up. Polishing messy UIs, simplifying code and systems, removing unused features, improving performance – the Sweeper’s job is subtraction, not addition. Deciding to unship a feature takes just as much courage as building one.</p>

<p><strong>The Grower</strong> takes an existing product and iterates relentlessly to improve PMF. Rather than redesigning the whole board, this person raises conversion, reduces churn, and accumulates small improvements on top of what already exists.</p>

<p><strong>The Maintainer</strong> owns mature systems. This archetype keeps security, stability, speed, and efficiency intact as a system scales. Not glamorous work, but without it a grown product collapses under its own weight.</p>

<pre><code class="language-mermaid">flowchart TB
    P["Prototyper&lt;br/&gt;churns out new ideas"]
    B["Builder&lt;br/&gt;converts to production-grade"]
    S["Sweeper&lt;br/&gt;simplify, tidy, performance"]
    G["Grower&lt;br/&gt;iterates toward PMF"]
    M["Maintainer&lt;br/&gt;keeps security, stability, scale"]
    P --&gt; B
    B --&gt; S
    S --&gt; G
    G --&gt; M
    M -.maintain / reinvent.-&gt; B
</code></pre>

<h2 id="these-are-roles-not-job-titles">These Are Roles, Not Job Titles</h2>

<p>The heart of this observation is not the list itself – it is that these archetypes do not map to job functions. Cherny notes that across Anthropic, some designers fall into archetype 1 (Prototyper), others into archetype 2 (Builder), and still others into archetype 3 (Sweeper). The same is true for engineers, product managers, and data scientists.</p>

<p>Put differently, “we’re hiring a designer” is a sentence that carries less and less information. Two designers can contribute to a team in completely different ways depending on whether they are the Prototyper type who opens new territory or the Sweeper type who refines and completes. A job title tells you what tools someone has learned; it does not tell you which moments they shine in.</p>

<p>Many people straddle two archetypes, and some span three. Someone who is both Prototyper and Builder is especially valuable in an early-stage startup. Someone who combines Sweeper and Maintainer becomes the backbone of a mature infrastructure team. Rather than fitting a person into a single box, it is more accurate to think about where they fall on the spectrum of these archetypes.</p>

<h2 id="team-composition-by-product-lifecycle">Team Composition by Product Lifecycle</h2>

<p>The real reason these archetypes are interesting is that they become a formula for staffing. Cherny argues that a healthy team needs a different archetype mix depending on product maturity.</p>

<p>A new product that has not yet found PMF needs people strong in Prototyper, Builder, and Sweeper (1 + 2 + 3). At this stage nobody knows what will work, so the capacity to build fast, discard fast, and keep changing direction is what counts. Filling this team with people who are primarily Maintainers means spending energy defending something that does not yet exist.</p>

<p>A growing product that has found PMF needs Builder, Sweeper, and Grower (2 + 3 + 4) plus a light Maintainer presence (5). Direction is established; the task now is to improve completeness and conversion while securing just enough stability to handle the expanding user base.</p>

<p>A mature product with strong PMF needs Sweeper, Grower, and Maintainer (3 + 4 + 5) with a sprinkling of Builder (2). The priority is keeping the system simple, improving it continuously, and protecting security and speed at scale – with new builds only when genuinely necessary.</p>

<pre><code class="language-mermaid">flowchart TB
    PRE["Pre-PMF&lt;br/&gt;new product"]
    GROW["Growth stage&lt;br/&gt;PMF found"]
    MATURE["Maturity&lt;br/&gt;strong PMF"]
    PRE --&gt;|"Prototyper + Builder + Sweeper"| GROW
    GROW --&gt;|"Builder + Sweeper + Grower (+ light Maintainer)"| MATURE
    MATURE --&gt;|"Sweeper + Grower + Maintainer (+ light Builder)"| MATURE
</code></pre>

<p>The practical implication of this formula is clear. When adding someone to the team, the first question should not be “do we need more engineers?” but “which archetype is missing for our current product stage?” Keep filling a mature product team with Prototypers and you will have no shortage of new ideas but nobody guarding the system. Fill a pre-PMF product team with only Maintainers and you will be in a defensive posture before there is anything worth defending.</p>

<h2 id="thakiclouds-take-role-realignment-in-the-age-of-agents">ThakiCloud’s Take: Role Realignment in the Age of Agents</h2>

<p>The observation that job roles are dissolving becomes sharper still in organizations where humans and agents work together. As AI agents absorb a significant share of mechanical build work, people naturally migrate toward the archetypes that are genuinely important at each product stage. The bottleneck shifts from the hands that type the code to the eye that judges which archetype is needed right now.</p>

<p>Paxis, ThakiCloud’s Agent-Native Cloud, implements this realignment at the systems layer. Paxis treats Skills, Tools, Policies, and Audit Logs as first-class resources, selecting from over 960 skills using BM25 and executing them in isolated sandboxes. Just as Cherny describes recombining human roles to fit the product moment rather than a fixed title, Paxis dynamically assembles agent capabilities to match the task at hand rather than locking them into a fixed pipeline. A Prototyper pours out ideas, a Builder-role agent converts them to production code, and a Sweeper-role validation gate cleans up the result – the same division of labor reproduced inside the skill harness.</p>

<p>On the infrastructure side, ThakiCloud’s ai-platform takes on the Maintainer archetype’s workload. Scheduling GPUs with Kueue, serving models with vLLM, and satisfying on-premises and sovereign requirements in a K8s-based multi-tenant environment – this is precisely the Maintainer’s job of protecting security, stability, and efficiency in a mature system. Customer organizations delegate this layer to the platform, which frees their own teams to concentrate more resources on the Prototyper and Grower ends of the spectrum.</p>

<p>This lens is also useful for hiring. ThakiCloud looks past the job title on a resume to ask where on the archetype spectrum a candidate actually sits. The person who fills the missing archetype for our current product stage creates the greatest leverage in the team. The question is not only “what can you do?” but “which moments are when you shine?”</p>

<h2 id="limitations-and-counterpoints">Limitations and Counterpoints</h2>

<p>Before accepting this framework uncritically, the other side deserves a hearing. Ben Vinegar pushed back on this conversation, arguing that “people are just learning how software organizations work and mistakenly attributing the dynamics of team roles – which have always existed – to AI.” That is a sharp counterpoint. The distinction between Prototyper and Maintainer existed long before AI, and the insight that different talent profiles are needed at different lifecycle stages is not a new one.</p>

<p>There are also limits to the classification itself. Like any attempt to sort people into five boxes, this framework risks oversimplifying individuals. In practice, one person moves across multiple archetypes from project to project, sometimes within a single day. Treating archetypes as fixed identities produces a harmful effect of the kind: “you’re a Sweeper, so don’t come to me with new ideas.” This is precisely why Cherny himself emphasizes that many people move fluidly across archetypes.</p>

<p>Even so, the framework earns its value not from predictive power but from the language it provides. When a team can say “we’re short on Growers right now” instead of the vague “we need another engineer,” the conversation around hiring and team composition becomes far more concrete. The more AI strips away the mechanical layer of each role, the more what remains is judgment at the archetype level. Future product roles may end up shaped closer to these archetypes than to today’s domain-specific job titles.</p>

<h2 id="closing-thoughts">Closing Thoughts</h2>

<p>Job titles dissolving is not a crisis – it is a restructuring. The five archetypes – Prototyper, Builder, Sweeper, Grower, Maintainer – show what remains when titles fall away. What remains is not a toolset but an essence: how a person contributes, and in which moments.</p>

<p>ThakiCloud is building an organization where humans and agents share these archetypes. The more agents take on repeatable build and maintenance work, the more humans focus on reading which archetype the product needs right now. That judgment is the rarest and most valuable capability in what comes next.</p>

<h2 id="sources">Sources</h2>

<ul>
  <li>Boris Cherny, X(@bcherny), 2026-06-29: <a href="https://x.com/bcherny/status/2071379474277613732">Original tweet</a></li>
  <li>Ben Vinegar, X(@bentlegen): <a href="https://x.com/bentlegen/status/2071576459538567463">Counterpoint</a></li>
</ul>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="culture" /><category term="future-of-work" /><category term="organizational-culture" /><category term="product-teams" /><category term="hiring" /><category term="Boris Cherny" /><category term="Claude Code" /><summary type="html"><![CDATA[As engineering, product, design, and data blur into a single mass, Boris Cherny, the creator of Claude Code, proposes five role archetypes and a team-composition formula tied to product lifecycle stage.]]></summary></entry><entry xml:lang="en"><title type="html">Qwen3.6-27B at 4-bit: Why NVFP4 Quantization Came Down to Hopper</title><link href="https://thakicloud.github.io/en/owm/qwen3-6-27b-nvfp4-onprem-serving/" rel="alternate" type="text/html" title="Qwen3.6-27B at 4-bit: Why NVFP4 Quantization Came Down to Hopper" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/en/owm/qwen3-6-27b-nvfp4-onprem-serving</id><content type="html" xml:base="https://thakicloud.github.io/en/owm/qwen3-6-27b-nvfp4-onprem-serving/"><![CDATA[<p>⏱️ <strong>Estimated reading time</strong>: 11 min</p>

<p><img src="/assets/images/qwen3-6-27b-nvfp4-onprem-serving-hero.png" alt="Qwen3.6-27B NVFP4 4-bit quantization concept diagram" /></p>

<h2 id="overview">Overview</h2>

<p>NVIDIA has released <code class="language-plaintext highlighter-rouge">nvidia/Qwen3.6-27B-NVFP4</code>, a NVFP4 4-bit quantization of Alibaba’s Qwen3.6-27B. It compresses a 27B-class hybrid-attention reasoning model to 4-bit, cutting weight memory by roughly 2.5x while keeping the gap to the FP8 baseline within 1 point across all nine benchmarks. The license is Apache 2.0.</p>

<p>Three points are worth highlighting. First, unlike the earlier <code class="language-plaintext highlighter-rouge">Gemma-4-26B-A4B-NVFP4</code> that effectively only got 4-bit acceleration on Blackwell, this build’s model card lists <strong>both Hopper and Blackwell as supported targets</strong>. That means a team already running H100 or H200 can try it today without buying new hardware. Second, this is not a text-only LLM but a <strong>multimodal reasoning model that accepts text, image, and video input</strong>. Third, the context window opens up to <strong>262K tokens</strong>, taking long documents and extended conversations in a single pass.</p>

<p>ThakiCloud operates a platform that manages GPU quotas with Kueue and serves models multi-tenant with vLLM on Kubernetes. So “how much larger a model, and how many more tenants, can we fit on the GPUs we already own?” is not a novelty item; it feeds directly into the cost model. This post reviews the model facts, examines why NVFP4 came down to Hopper, then honestly assesses the serving path and its usefulness on our platform.</p>

<h2 id="what-is-this-model">What Is This Model</h2>

<p><code class="language-plaintext highlighter-rouge">nvidia/Qwen3.6-27B-NVFP4</code> is Alibaba’s <code class="language-plaintext highlighter-rouge">Qwen3.6-27B</code> quantized to NVFP4 with NVIDIA Model Optimizer (nvidia-modelopt v0.45.0). The core spec from the model card is as follows.</p>

<table>
  <thead>
    <tr>
      <th>Item</th>
      <th>Value</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Base model</td>
      <td>Alibaba Qwen3.6-27B</td>
    </tr>
    <tr>
      <td>Architecture</td>
      <td>Hybrid attention (Gated DeltaNet + Gated Attention)</td>
    </tr>
    <tr>
      <td>Total parameters</td>
      <td>27B</td>
    </tr>
    <tr>
      <td>Context</td>
      <td>262K tokens</td>
    </tr>
    <tr>
      <td>Input modalities</td>
      <td>Text + image + video</td>
    </tr>
    <tr>
      <td>Output</td>
      <td>Text</td>
    </tr>
    <tr>
      <td>Quantization</td>
      <td>NVFP4 (Model Optimizer v0.45.0)</td>
    </tr>
    <tr>
      <td>Target hardware</td>
      <td>NVIDIA Hopper, Blackwell</td>
    </tr>
    <tr>
      <td>License</td>
      <td>Apache 2.0</td>
    </tr>
  </tbody>
</table>

<p>The notable part is the <strong>hybrid attention</strong> architecture. Gated DeltaNet is a linear-attention path, designed to process long sequences efficiently, unlike standard attention whose cost grows with sequence length. Blending it with Gated Attention, which carries expressiveness, gives a compromise that handles a 262K context while preserving quality. The fact that serving requires <code class="language-plaintext highlighter-rouge">--reasoning-parser qwen3</code> also confirms this is a <strong>reasoning model</strong> that generates a reasoning trace before the final answer.</p>

<p>One thing to state honestly: the model card names hybrid attention but does not disclose the exact layer count, expert configuration, or per-token active parameters. So this post covers only the facts in the card and does not estimate the undisclosed figures.</p>

<h2 id="nvfp4-quantization-what-gets-compressed-and-how">NVFP4 Quantization: What Gets Compressed, and How</h2>

<p>NVFP4 is the 4-bit floating-point format NVIDIA is pushing. Unlike INT4, which simply truncates weights to 4-bit integers, it is a micro-scaling scheme that places an FP8 scale per small block, enjoying 4-bit-level memory savings while keeping accuracy loss small.</p>

<p>In this build the quantization targets are the <strong>weights and activations of the linear operators within the transformer blocks</strong>. Non-linear layers are left untouched. The model card states that reducing bits per parameter from 16 to 4 cuts disk and GPU memory requirements by <strong>about 2.5x</strong>. Loading 27B parameters in BF16 needs roughly 54 GB; applying the ~2.5x reduction brings the checkpoint down to around 20 GB. That opens room to place more than twice the model on the same GPU, or to redirect the freed memory to the KV cache to raise concurrency.</p>

<p>This is where it diverges from the earlier Gemma NVFP4 review. The Gemma build had a broken NVFP4 MoE kernel on consumer and pro Blackwell (SM120), so the only consumer-grade path that actually ran was the DGX Spark. This Qwen3.6 build, by contrast, has a model card that <strong>lists both Hopper and Blackwell as supported targets</strong>, and serving uses vLLM’s <code class="language-plaintext highlighter-rouge">--quantization modelopt</code> path. With activations quantized alongside weights and the modelopt serving path in place, this 4-bit model can run on the H100 and H200 already installed in data centers. The constraint of “you must buy new Blackwell to see 4-bit gains” has been substantially relaxed this time.</p>

<pre><code class="language-mermaid">flowchart TB
    A["Qwen3.6-27B&lt;br/&gt;BF16 ~54GB"] --&gt; B["NVIDIA Model Optimizer&lt;br/&gt;v0.45.0"]
    B --&gt; C["NVFP4 quantization&lt;br/&gt;linear-operator weights + activations&lt;br/&gt;16-bit to 4-bit"]
    C --&gt; D["NVFP4 checkpoint&lt;br/&gt;~20GB range · ~2.5x reduction"]
    D --&gt; E["vLLM serving&lt;br/&gt;--quantization modelopt"]
    E --&gt; F["NVIDIA Hopper&lt;br/&gt;H100 / H200"]
    E --&gt; G["NVIDIA Blackwell&lt;br/&gt;B200 etc."]
</code></pre>

<h2 id="benchmarks-how-much-does-4-bit-cost">Benchmarks: How Much Does 4-bit Cost</h2>

<p>The model card presents the NVFP4 quantized model side by side with the FP8 baseline across nine benchmarks.</p>

<table>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>FP8</th>
      <th>NVFP4</th>
      <th>Measures</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MMLU Pro</td>
      <td>86.1</td>
      <td>86.3</td>
      <td>General knowledge and reasoning</td>
    </tr>
    <tr>
      <td>GPQA Diamond</td>
      <td>86.0</td>
      <td>85.5</td>
      <td>Graduate-level science reasoning</td>
    </tr>
    <tr>
      <td>HLE</td>
      <td>21.7</td>
      <td>21.8</td>
      <td>Hard general reasoning</td>
    </tr>
    <tr>
      <td>τ²-Bench Telecom</td>
      <td>95.2</td>
      <td>95.4</td>
      <td>Agent tool use</td>
    </tr>
    <tr>
      <td>MMMU Pro</td>
      <td>74.6</td>
      <td>74.3</td>
      <td>Multimodal reasoning</td>
    </tr>
    <tr>
      <td>SciCode</td>
      <td>44.8</td>
      <td>44.5</td>
      <td>Scientific coding</td>
    </tr>
    <tr>
      <td>AIME 2025</td>
      <td>93.1</td>
      <td>92.7</td>
      <td>Math competition</td>
    </tr>
    <tr>
      <td>AA-LCR</td>
      <td>68.8</td>
      <td>68.3</td>
      <td>Long-context reasoning</td>
    </tr>
    <tr>
      <td>IFBench</td>
      <td>65.1</td>
      <td>65.5</td>
      <td>Instruction following</td>
    </tr>
  </tbody>
</table>

<p>All nine are within 1 point of FP8. On MMLU Pro, HLE, τ²-Bench Telecom, and IFBench the NVFP4 build is even marginally higher, which is safer read as measurement variance. The direction is clear: <strong>quality is essentially preserved under 4-bit</strong>, and this is where NVFP4’s advantage over INT4 shows.</p>

<p>The benchmark mix itself signals the model’s character. τ²-Bench Telecom measures an agent calling tools to complete tasks, AA-LCR measures long-context reasoning, and MMMU Pro measures multimodal understanding. In other words, this model targets <strong>agent tool use, long context, and multimodality</strong>, not just plain knowledge QA. That said, Korean-domain tasks do not appear in the public benchmarks, so we recommend separate validation with an internal eval set before adoption.</p>

<h2 id="serving-guide">Serving Guide</h2>

<p>The recommended path in the model card is vLLM. The run command is as follows.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm serve nvidia/Qwen3.6-27B-NVFP4 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--quantization</span> modelopt <span class="se">\</span>
  <span class="nt">--max-model-len</span> 262144 <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3
</code></pre></div></div>

<p>Three operational points matter. First, <code class="language-plaintext highlighter-rouge">--quantization modelopt</code> is the key flag that loads the NVFP4 checkpoint. Next, <code class="language-plaintext highlighter-rouge">--reasoning-parser qwen3</code> is required for the reasoning trace and final answer to be parsed correctly. Finally, <code class="language-plaintext highlighter-rouge">--max-model-len 262144</code> opens the full 262K context; the KV cache budget grows accordingly, so it is more memory-efficient to lower it to the length you actually need.</p>

<p>The hardware assumption is Hopper or Blackwell, and the OS is Linux. Thanks to Hopper support, you can validate the serving path on H100 and H200 nodes already in the data center without additional equipment.</p>

<h2 id="thakicloud-serving-perspective">ThakiCloud Serving Perspective</h2>

<p>ThakiCloud runs a K8s-based AI/ML platform that manages GPU quotas with Kueue and serves models multi-tenant with vLLM. The implications for our operating model come in two directions, infrastructure and agents.</p>

<p><strong>Doubling density on existing Hopper assets.</strong> This is the most tangible value of this build. NVFP4 supporting Hopper means you can capture the 4-bit gain on H100 and H200 you already own, without new Blackwell investment. When a 27B model’s weights fall to around 20 GB, you can place more model instances on the same GPU, or redirect the freed memory to the KV cache to set generous per-tenant concurrency limits. From a Kueue quota view, the same card takes on more workload, so the unit cost simply comes down.</p>

<p><strong>An on-prem candidate for a multimodal reasoning worker.</strong> Paxis, ThakiCloud’s agent control plane, is an Agent-Native Cloud that runs skills in isolated sandboxes and passes every action through policy gates and audit logs. In that structure, many workers read documents, call tools, and complete tasks. Qwen3.6-27B-NVFP4 is strong on agent tool-use benchmarks like τ²-Bench Telecom, accepts image and video in addition to text, and handles a 262K context. It is a fit candidate to run on-prem as a multimodal worker handling documents, screens, and video, and as a terminal worker in tool-call loops. Per our cost discipline, run the worker cheaply but close the fan-out with a verification stage on a higher model so worker hallucinations do not accumulate.</p>

<p><strong>A reference for on-prem and compliance proposals.</strong> An Apache 2.0 license with single-node serving is a configuration you can propose directly to public-sector and financial customers where data exfiltration is prohibited. In constrained environments such as national-security requirements or sovereign AI, running a large multimodal reasoning model on your own GPUs without a commercial API becomes a real adoption path.</p>

<h2 id="limitations-and-counterpoints">Limitations and Counterpoints</h2>

<p>For balance, here are the caveats.</p>

<ul>
  <li><strong>Architecture details are undisclosed.</strong> Hybrid attention is stated, but the layer count, expert configuration, and active parameters are absent from the card. Precisely computing batch efficiency and resident memory requires more information.</li>
  <li><strong>No measured throughput numbers.</strong> This post rests on card facts such as memory savings and benchmarks. Per-stream token speed and concurrency limits vary greatly with hardware and settings, so re-measure with your own workload before adoption.</li>
  <li><strong>Variance from activation quantization.</strong> Pushing activations, not just weights, to 4-bit can introduce accuracy variance on workloads with skewed distributions. Even with public benchmarks within 1 point, verify domain-specific tasks separately.</li>
  <li><strong>Maturity of the multimodal serving path.</strong> Stably taking image and video input in production requires validating both the preprocessing pipeline and the maturity of vLLM’s multimodal path.</li>
  <li><strong>Korean real-world validation.</strong> Public benchmarks are English-centric. Korean RAG and tool-call accuracy must be checked separately with an internal eval set.</li>
</ul>

<p>Even so, the combination of Apache 2.0, 4-bit acceleration that now reaches Hopper, multimodal reasoning, and a 262K context is an attractive option for organizations weighing on-prem serving. The mere fact that the “buy new hardware to get 4-bit gains” wall has lowered makes it worth validating today for any team that already owns a Hopper fleet.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4">Qwen3.6-27B-NVFP4 model card (Hugging Face)</a></li>
  <li><a href="https://github.com/NVIDIA/TensorRT-Model-Optimizer">NVIDIA TensorRT Model Optimizer</a></li>
  <li><a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">Introducing NVFP4 (NVIDIA Developer)</a></li>
  <li><a href="https://docs.vllm.ai/">vLLM documentation</a></li>
  <li><a href="https://thakicloud.github.io/en/owm/gemma-4-26b-nvfp4-dgx-spark/">Gemma-4-26B-NVFP4 DGX Spark review (ThakiCloud blog)</a></li>
</ul>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="owm" /><category term="qwen3" /><category term="nvfp4" /><category term="quantization" /><category term="hopper" /><category term="blackwell" /><category term="hybrid-attention" /><category term="multimodal" /><category term="vllm" /><category term="on-premise" /><summary type="html"><![CDATA[NVIDIA's Qwen3.6-27B-NVFP4 compresses a 27B hybrid-attention reasoning model to 4-bit, cutting memory by roughly 2.5x while keeping benchmark gaps within 1 point of FP8. Unlike the earlier Gemma NVFP4 build that effectively required Blackwell, this one lists Hopper as a supported target, so any team already running H100/H200 can try it on-premises today. Model facts, the NVFP4 mechanism, the serving path, and ThakiCloud's serving perspective.]]></summary></entry><entry xml:lang="en"><title type="html">Dual-Pool Token-Budget Routing — Cutting vLLM Inference GPU Time by 31–42% with Two-Pool Scheduling</title><link href="https://thakicloud.github.io/en/technique/dual-pool-token-budget-routing-vllm-kueue/" rel="alternate" type="text/html" title="Dual-Pool Token-Budget Routing — Cutting vLLM Inference GPU Time by 31–42% with Two-Pool Scheduling" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/en/technique/dual-pool-token-budget-routing-vllm-kueue</id><content type="html" xml:base="https://thakicloud.github.io/en/technique/dual-pool-token-budget-routing-vllm-kueue/"><![CDATA[<h2 id="the-problem-hol-blocking-quietly-wastes-gpu-time">The Problem: HoL Blocking Quietly Wastes GPU Time</h2>

<p>Anyone who has run an LLM inference service in production has seen this: a request that generates a few dozen tokens — say, a one-liner chatbot reply — sits waiting behind a long document summarization or code generation job, wasting hundreds of milliseconds. This is Head-of-Line (HoL) blocking.</p>

<p>vLLM’s continuous batching dramatically improves batch efficiency, but in a single-pool setup, long requests hold onto the KV cache for extended periods, forcing shorter requests to be preempted. Preempted requests pay the cost of recomputation, and overall GPU time efficiency drops.</p>

<p>The <strong>Dual-Pool Token-Budget Routing</strong> approach from arXiv 2604.08075 addresses this at the root. At request intake, it estimates the expected response length and routes each request to either a short-context pool or a long-context pool, so the two types never interfere with each other.</p>

<p>The paper reports the following results:</p>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Effect</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>GPU time savings</td>
      <td><strong>31–42%</strong></td>
    </tr>
    <tr>
      <td>Preemption rate</td>
      <td><strong>5.4x reduction</strong></td>
    </tr>
    <tr>
      <td>P99 TTFT improvement</td>
      <td><strong>6%</strong></td>
    </tr>
  </tbody>
</table>

<h2 id="core-idea-token-budget-based-routing">Core Idea: Token-Budget-Based Routing</h2>

<p>The concept behind Dual-Pool is straightforward. For each request, the system estimates the <strong>maximum expected token count</strong> and assigns it to one of two pools based on a threshold.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Expected tokens = input tokens + estimated output tokens
</code></pre></div></div>

<p>When output token count is unknown — which is most of the time in production — two approximations work well:</p>

<ol>
  <li><strong>Request parameters</strong>: Use the <code class="language-plaintext highlighter-rouge">max_tokens</code> value as an upper bound.</li>
  <li><strong>History-based classification</strong>: Track the length distribution of previous requests by API path or system prompt hash, then classify using the P75 or P90 value.</li>
</ol>

<p>The threshold depends on workload characteristics. In the paper’s experiments, 512 output tokens was the boundary between short and long.</p>

<h2 id="architecture-the-two-pool-structure">Architecture: The Two-Pool Structure</h2>

<pre><code class="language-mermaid">flowchart TB
    A[Client Request] --&gt; B[Router&lt;br/&gt;Token-Budget Classifier]
    B --&gt;|Estimated tokens &lt; threshold| C[Short-Context Pool&lt;br/&gt;vLLM Instance A]
    B --&gt;|Estimated tokens &gt;= threshold| D[Long-Context Pool&lt;br/&gt;vLLM Instance B]
    C --&gt; E[Kueue LocalQueue&lt;br/&gt;short-pool]
    D --&gt; F[Kueue LocalQueue&lt;br/&gt;long-pool]
    E --&gt; G[GPU Worker Group A&lt;br/&gt;Small KV Cache Requests]
    F --&gt; H[GPU Worker Group B&lt;br/&gt;Large KV Cache Requests]
    G --&gt; I[Return Response]
    H --&gt; I
</code></pre>

<p>The short-context pool cycles through KV cache quickly, maintaining high throughput. The long-context pool reserves enough KV cache memory to complete long generations without interruption. Neither pool preempts the other.</p>

<h2 id="kueue-localqueue-integration">Kueue LocalQueue Integration</h2>

<p>ThakiCloud’s ai-platform schedules GPU workloads on Kubernetes using Kueue. Integrating Dual-Pool Routing with Kueue LocalQueue lets you manage resource allocation for each pool declaratively at the cluster level.</p>

<h3 id="step-1-define-clusterqueue-and-resourceflavor">Step 1: Define ClusterQueue and ResourceFlavor</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kueue.x-k8s.io/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ClusterQueue</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">llm-inference-cq</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">namespaceSelector</span><span class="pi">:</span> <span class="pi">{}</span>
  <span class="na">resourceGroups</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">coveredResources</span><span class="pi">:</span> <span class="pi">[</span><span class="s2">"</span><span class="s">nvidia.com/gpu"</span><span class="pi">]</span>
      <span class="na">flavors</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">gpu-a100</span>
          <span class="na">resources</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">nvidia.com/gpu</span>
              <span class="na">nominalQuota</span><span class="pi">:</span> <span class="m">8</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kueue.x-k8s.io/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ResourceFlavor</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">gpu-a100</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">nodeLabels</span><span class="pi">:</span>
    <span class="na">gpu.nvidia.com/model</span><span class="pi">:</span> <span class="s">A100</span>
</code></pre></div></div>

<h3 id="step-2-separate-localqueues-per-pool">Step 2: Separate LocalQueues per Pool</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kueue.x-k8s.io/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">LocalQueue</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">short-pool-queue</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">clusterQueue</span><span class="pi">:</span> <span class="s">llm-inference-cq</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">kueue.x-k8s.io/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">LocalQueue</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">long-pool-queue</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">clusterQueue</span><span class="pi">:</span> <span class="s">llm-inference-cq</span>
</code></pre></div></div>

<h3 id="step-3-annotate-vllm-deployments-with-queue-names">Step 3: Annotate vLLM Deployments with Queue Names</h3>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Deployment</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">vllm-short-pool</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
  <span class="na">annotations</span><span class="pi">:</span>
    <span class="na">kueue.x-k8s.io/queue-name</span><span class="pi">:</span> <span class="s">short-pool-queue</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">replicas</span><span class="pi">:</span> <span class="m">2</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">vllm</span>
          <span class="na">image</span><span class="pi">:</span> <span class="s">vllm/vllm-openai:latest</span>
          <span class="na">args</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--model"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">meta-llama/Llama-3.1-8B-Instruct"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--max-model-len"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">4096"</span>       <span class="c1"># short pool: small context limit</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--gpu-memory-utilization"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">0.7"</span>        <span class="c1"># fast KV cache turnover</span>
          <span class="na">resources</span><span class="pi">:</span>
            <span class="na">limits</span><span class="pi">:</span>
              <span class="na">nvidia.com/gpu</span><span class="pi">:</span> <span class="s2">"</span><span class="s">1"</span>
<span class="nn">---</span>
<span class="na">apiVersion</span><span class="pi">:</span> <span class="s">apps/v1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">Deployment</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">vllm-long-pool</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
  <span class="na">annotations</span><span class="pi">:</span>
    <span class="na">kueue.x-k8s.io/queue-name</span><span class="pi">:</span> <span class="s">long-pool-queue</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">replicas</span><span class="pi">:</span> <span class="m">2</span>
  <span class="na">template</span><span class="pi">:</span>
    <span class="na">spec</span><span class="pi">:</span>
      <span class="na">containers</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">vllm</span>
          <span class="na">image</span><span class="pi">:</span> <span class="s">vllm/vllm-openai:latest</span>
          <span class="na">args</span><span class="pi">:</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--model"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">meta-llama/Llama-3.1-8B-Instruct"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--max-model-len"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">32768"</span>      <span class="c1"># long pool: generous context window</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">--gpu-memory-utilization"</span>
            <span class="pi">-</span> <span class="s2">"</span><span class="s">0.90"</span>       <span class="c1"># large KV cache reservation</span>
          <span class="na">resources</span><span class="pi">:</span>
            <span class="na">limits</span><span class="pi">:</span>
              <span class="na">nvidia.com/gpu</span><span class="pi">:</span> <span class="s2">"</span><span class="s">1"</span>
</code></pre></div></div>

<h3 id="step-4-router-implementation-python-example">Step 4: Router Implementation (Python Example)</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">fastapi</span> <span class="kn">import</span> <span class="n">FastAPI</span><span class="p">,</span> <span class="n">Request</span>
<span class="kn">import</span> <span class="n">httpx</span>

<span class="n">app</span> <span class="o">=</span> <span class="nc">FastAPI</span><span class="p">()</span>

<span class="n">SHORT_POOL_URL</span> <span class="o">=</span> <span class="sh">"</span><span class="s">http://vllm-short-pool-svc:8000/v1/chat/completions</span><span class="sh">"</span>
<span class="n">LONG_POOL_URL</span>  <span class="o">=</span> <span class="sh">"</span><span class="s">http://vllm-long-pool-svc:8000/v1/chat/completions</span><span class="sh">"</span>
<span class="n">TOKEN_THRESHOLD</span> <span class="o">=</span> <span class="mi">512</span>  <span class="c1"># tune this against workload history
</span>
<span class="k">def</span> <span class="nf">estimate_output_tokens</span><span class="p">(</span><span class="n">payload</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Use max_tokens as upper bound. Default to 256 if absent.</span><span class="sh">"""</span>
    <span class="k">return</span> <span class="n">payload</span><span class="p">.</span><span class="nf">get</span><span class="p">(</span><span class="sh">"</span><span class="s">max_tokens</span><span class="sh">"</span><span class="p">)</span> <span class="ow">or</span> <span class="mi">256</span>

<span class="k">def</span> <span class="nf">route_request</span><span class="p">(</span><span class="n">payload</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="sh">"""</span><span class="s">Return the target URL based on estimated token count.</span><span class="sh">"""</span>
    <span class="n">estimated</span> <span class="o">=</span> <span class="nf">estimate_output_tokens</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">estimated</span> <span class="o">&lt;</span> <span class="n">TOKEN_THRESHOLD</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">SHORT_POOL_URL</span>
    <span class="k">return</span> <span class="n">LONG_POOL_URL</span>

<span class="nd">@app.post</span><span class="p">(</span><span class="sh">"</span><span class="s">/v1/chat/completions</span><span class="sh">"</span><span class="p">)</span>
<span class="k">async</span> <span class="k">def</span> <span class="nf">proxy</span><span class="p">(</span><span class="n">request</span><span class="p">:</span> <span class="n">Request</span><span class="p">):</span>
    <span class="n">payload</span> <span class="o">=</span> <span class="k">await</span> <span class="n">request</span><span class="p">.</span><span class="nf">json</span><span class="p">()</span>
    <span class="n">target_url</span> <span class="o">=</span> <span class="nf">route_request</span><span class="p">(</span><span class="n">payload</span><span class="p">)</span>
    <span class="k">async</span> <span class="k">with</span> <span class="n">httpx</span><span class="p">.</span><span class="nc">AsyncClient</span><span class="p">(</span><span class="n">timeout</span><span class="o">=</span><span class="mf">120.0</span><span class="p">)</span> <span class="k">as</span> <span class="n">client</span><span class="p">:</span>
        <span class="n">resp</span> <span class="o">=</span> <span class="k">await</span> <span class="n">client</span><span class="p">.</span><span class="nf">post</span><span class="p">(</span><span class="n">target_url</span><span class="p">,</span> <span class="n">json</span><span class="o">=</span><span class="n">payload</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">resp</span><span class="p">.</span><span class="nf">json</span><span class="p">()</span>
</code></pre></div></div>

<p>Deploy this router as a Kubernetes Service and place it in front of your existing inference endpoint.</p>

<h2 id="operational-considerations">Operational Considerations</h2>

<h3 id="tuning-the-threshold">Tuning the Threshold</h3>

<p>The 512-token boundary is a starting point, not a universal constant. In practice, collect the following metrics over at least seven days before adjusting:</p>

<ul>
  <li>Actual output token distribution per request (P50, P90, P99)</li>
  <li>Per-pool preemption rate (<code class="language-plaintext highlighter-rouge">vllm:num_preemptions_total</code> Prometheus metric)</li>
  <li>Per-pool <code class="language-plaintext highlighter-rouge">vllm:num_requests_waiting</code> queue depth</li>
</ul>

<p>If the short-pool queue grows persistently deep, lower the threshold or add more short-pool replicas. If long-pool GPU utilization stays low, raise the threshold to send fewer requests there.</p>

<h3 id="keda-autoscaling-integration">KEDA Autoscaling Integration</h3>

<p>Adding a KEDA ScaledObject backed by vLLM Prometheus metrics gives each pool its own independent autoscaling behavior:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">keda.sh/v1alpha1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">ScaledObject</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">vllm-short-pool-scaler</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">llm-serving</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">scaleTargetRef</span><span class="pi">:</span>
    <span class="na">name</span><span class="pi">:</span> <span class="s">vllm-short-pool</span>
  <span class="na">minReplicaCount</span><span class="pi">:</span> <span class="m">1</span>
  <span class="na">maxReplicaCount</span><span class="pi">:</span> <span class="m">8</span>
  <span class="na">triggers</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">type</span><span class="pi">:</span> <span class="s">prometheus</span>
      <span class="na">metadata</span><span class="pi">:</span>
        <span class="na">serverAddress</span><span class="pi">:</span> <span class="s">http://prometheus:9090</span>
        <span class="na">metricName</span><span class="pi">:</span> <span class="s">vllm_requests_waiting_short</span>
        <span class="na">query</span><span class="pi">:</span> <span class="s">vllm:num_requests_waiting{deployment="vllm-short-pool"}</span>
        <span class="na">threshold</span><span class="pi">:</span> <span class="s2">"</span><span class="s">5"</span>
</code></pre></div></div>

<p>Metric-based scaling responds more directly to inference load than simple HTTP RPS scaling. A threshold of <code class="language-plaintext highlighter-rouge">5</code> means scale-up begins when more than five requests are queued.</p>

<h3 id="model-sharing-vs-instance-separation">Model Sharing vs. Instance Separation</h3>

<p>The two pools do not strictly require separate vLLM instances. Running the same model with different <code class="language-plaintext highlighter-rouge">--max-model-len</code> settings is the baseline configuration, but if memory budget allows, a single vLLM instance can expose two external ports with different internal priority classes.</p>

<p>That said, <strong>instance separation is the cleaner choice</strong> for fully eliminating preemption interference, because KV cache memory is shared within a single vLLM process.</p>

<h2 id="relevance-to-thakiclouds-ai-platform">Relevance to ThakiCloud’s ai-platform</h2>

<p>ThakiCloud’s ai-platform serves multiple tenants’ inference workloads on a shared GPU cluster. Dual-Pool Routing adds two concrete benefits in this context.</p>

<p>First, it reduces cross-tenant interference. When Tenant A’s chatbot requests — short by nature — get queued behind Tenant B’s long document analysis batch jobs, the result is SLO violations. Pool separation cuts off this interference at the structural level.</p>

<p>Second, it improves GPU budget efficiency. A 31–42% GPU time reduction means either handling more requests with the same GPU budget, or achieving the same throughput with fewer GPUs. In an on-premises environment with a fixed resource ceiling, that savings translates directly into lower serving cost.</p>

<p>For ThakiCloud clusters already using Kueue LocalQueue, adding this architecture requires only two queue declarations and a lightweight router deployment. Compatibility with existing vLLM Deployment specs is high, so the adoption surface is broad.</p>

<h2 id="summary">Summary</h2>

<p>The problem Dual-Pool Token-Budget Routing solves is simple: when short and long requests share a queue, short requests lose. Separating them at the queue level lets each type be processed at its natural pace.</p>

<p>The results from arXiv 2604.08075 — 31–42% GPU time savings, a 5.4x reduction in preemption rate, and 6% improvement in P99 TTFT — represent a strong return for the implementation complexity involved. On Kubernetes, two Kueue LocalQueues, two vLLM Deployments, and one lightweight router are all it takes to build this structure.</p>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="technique" /><category term="vllm" /><category term="llm-inference" /><category term="kueue" /><category term="gpu-scheduling" /><category term="llmops" /><category term="kubernetes" /><summary type="html"><![CDATA[Short requests waiting behind long ones in a single queue silently waste GPU time — this is Head-of-Line (HoL) blocking. Dual-Pool Token-Budget Routing, proposed in arXiv 2604.08075, splits requests into a short-context pool and a long-context pool, achieving 31–42% GPU time savings and a 6% improvement in P99 TTFT. This post walks through implementing the technique on Kubernetes using Kueue LocalQueue.]]></summary></entry><entry xml:lang="ko"><title type="html">직무가 녹아내린 자리에 남는 다섯 가지 원형: 프로토타이퍼부터 메인테이너까지</title><link href="https://thakicloud.github.io/ko/culture/five-product-role-archetypes/" rel="alternate" type="text/html" title="직무가 녹아내린 자리에 남는 다섯 가지 원형: 프로토타이퍼부터 메인테이너까지" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/ko/culture/five-product-role-archetypes</id><content type="html" xml:base="https://thakicloud.github.io/ko/culture/five-product-role-archetypes/"><![CDATA[<p><img src="/assets/images/five-product-role-archetypes-hero.png" alt="직무의 경계가 흐려지고 새로운 역할 원형이 떠오르는 모습을 담은 추상 비주얼" /></p>

<h2 id="개요">개요</h2>

<p>직함이 하는 일을 설명하지 못하는 순간이 점점 늘고 있습니다. 디자이너가 프로토타입을 코드로 짜고, 엔지니어가 사용자 인터뷰를 하고, 데이터 과학자가 제품 방향을 정하는 장면이 이제 낯설지 않습니다. AI 도구가 각 직무의 기계적인 부분을 흡수하면서 엔지니어링, 제품, 디자인, 데이터 분석의 경계가 한 덩어리로 녹아내리고 있습니다.</p>

<p>이 흐름을 두고 Claude Code를 만든 보리스 체르니(Boris Cherny)가 흥미로운 관찰을 내놓았습니다. 자신이 속한 Claude Code 팀을 들여다보니 직무와 무관하게 다섯 가지 역할 원형이 보이더라는 것입니다. 이 관찰이 중요한 이유는 단순합니다. 앞으로의 조직이 직무 기능이 아니라 이런 원형의 조합으로 팀을 짜게 될지 모른다는 가설을 던지기 때문입니다.</p>

<p>이 글은 그 다섯 가지 원형이 무엇인지, 왜 직무와 분리되는지, 그리고 제품의 성숙도에 따라 어떤 조합이 필요한지를 정리합니다. 기술 요약이 아니라 팀을 어떻게 구성하고 채용을 어떻게 바라볼지를 묻는 문화 에세이입니다. ThakiCloud처럼 사람과 에이전트가 함께 일하는 조직에는 특히 직접적인 질문입니다.</p>

<h2 id="다섯-가지-역할-원형">다섯 가지 역할 원형</h2>

<p>체르니가 제시한 원형은 다음과 같습니다. 각각을 우리말로 옮기면서 실제 팀에서 어떻게 드러나는지 덧붙였습니다.</p>

<p><strong>프로토타이퍼(Prototyper)</strong>는 완전히 새로운 아이디어를 떠올리는 사람입니다. 수많은 아이디어를 쏟아내지만 대부분은 출시되지 못합니다. 이 원형의 가치는 성공률이 아니라 발상의 밀도에 있습니다. 열 개 중 아홉 개가 버려지더라도 하나의 방향을 여는 사람이 없으면 조직은 새 영토로 나아가지 못합니다.</p>

<p><strong>빌더(Builder)</strong>는 프로토타입과 아이디어를 빠르게 프로덕션급 제품이나 인프라로 전환하는 사람입니다. 발상과 출시 사이의 거리를 좁히는 역할입니다. 프로토타이퍼가 스케치라면 빌더는 그 스케치를 실제로 서 있는 건물로 바꿉니다.</p>

<p><strong>스위퍼(Sweeper)</strong>는 정리하는 사람입니다. 어수선한 UI를 다듬고, 코드와 시스템을 단순하게 만들고, 쓰이지 않는 기능을 걷어내고, 성능을 끌어올립니다. 무언가를 더하는 것이 아니라 덜어내는 것이 이 원형의 일입니다. 기능을 없애는 결정(unship)은 만드는 것만큼이나 용기가 필요합니다.</p>

<p><strong>그로어(Grower)</strong>는 이미 만들어진 제품을 가져와 제품-시장 적합성(PMF)을 높이기 위해 반복적으로 개선하는 사람입니다. 큰 판을 새로 짜기보다 이미 있는 판에서 전환율을 끌어올리고, 사용자 이탈을 막고, 작은 개선을 쌓아 올립니다.</p>

<p><strong>메인테이너(Maintainer)</strong>는 성숙한 시스템을 소유하는 사람입니다. 시스템이 커질 때 보안, 안정성, 속도, 효율을 유지합니다. 화려하지 않지만 이 원형이 없으면 성장한 제품은 자기 무게에 눌려 무너집니다.</p>

<pre><code class="language-mermaid">flowchart TB
    P["프로토타이퍼&lt;br/&gt;새 아이디어를 쏟아냄"]
    B["빌더&lt;br/&gt;프로덕션급으로 전환"]
    S["스위퍼&lt;br/&gt;단순화·정리·성능"]
    G["그로어&lt;br/&gt;PMF 반복 개선"]
    M["메인테이너&lt;br/&gt;보안·안정·확장 유지"]
    P --&gt; B
    B --&gt; S
    S --&gt; G
    G --&gt; M
    M -.보수·재발명.-&gt; B
</code></pre>

<h2 id="역할은-직무가-아닙니다">역할은 직무가 아닙니다</h2>

<p>이 관찰의 핵심은 목록 자체가 아니라, 이 원형들이 직무 기능과 연결되지 않는다는 점입니다. 체르니는 Anthropic 전체를 보면 어떤 디자이너는 프로토타이퍼(1번)에, 어떤 디자이너는 빌더(2번)에, 또 어떤 디자이너는 스위퍼(3번)에 해당한다고 말합니다. 엔지니어도, 제품 관리자도, 데이터 과학자도 마찬가지입니다.</p>

<p>바꿔 말하면 “디자이너를 뽑는다”는 문장이 점점 정보량을 잃고 있습니다. 같은 디자이너라도 새 영토를 여는 프로토타이퍼형인지, 다듬어 완성하는 스위퍼형인지에 따라 팀에 기여하는 방식이 완전히 다릅니다. 직함은 그가 배운 도구를 알려줄 뿐, 그가 어떤 순간에 빛나는지는 알려주지 않습니다.</p>

<p>많은 사람이 두 개의 원형을 넘나들고, 때로는 세 개까지 걸칩니다. 프로토타이퍼이면서 빌더인 사람이 초기 스타트업에서 특히 귀합니다. 스위퍼이면서 메인테이너인 사람은 성숙한 인프라 팀의 척추가 됩니다. 한 사람을 하나의 상자에 가두는 대신 그가 어떤 원형의 스펙트럼 위에 있는지를 보는 편이 실제에 더 가깝습니다.</p>

<h2 id="제품-생애주기별-팀-구성">제품 생애주기별 팀 구성</h2>

<p>원형이 흥미로운 진짜 이유는 이것이 팀 구성의 공식이 되기 때문입니다. 체르니는 건강한 팀이라면 제품의 성숙도에 따라 다른 원형 조합이 필요하다고 정리합니다.</p>

<p>새롭고 아직 PMF를 찾지 못한 제품은 프로토타이퍼, 빌더, 스위퍼(1+2+3)에 강한 사람들이 필요합니다. 아직 무엇이 맞는지 모르는 단계이므로 빠르게 만들고, 빠르게 버리고, 방향을 계속 바꾸는 힘이 중요합니다. 이 단계에서 메인테이너 성향이 강한 사람만 모으면 만들어지지도 않은 것을 지키느라 움직이지 못합니다.</p>

<p>성장 중이고 PMF를 찾은 제품은 빌더, 스위퍼, 그로어(2+3+4)에 약간의 메인테이너(5)가 필요합니다. 방향은 잡혔으니 이제 완성도를 높이고 전환을 개선하면서, 늘어나는 사용자를 감당할 최소한의 안정성을 확보해야 합니다.</p>

<p>강한 PMF를 가진 성숙한 제품은 스위퍼, 그로어, 메인테이너(3+4+5)에 약간의 빌더(2)가 필요합니다. 시스템을 단순하게 유지하고, 지속적으로 개선하고, 커지는 규모에서 보안과 속도를 지키되, 필요할 때만 새로운 것을 짓습니다.</p>

<pre><code class="language-mermaid">flowchart TB
    PRE["PMF 이전&lt;br/&gt;새 제품"]
    GROW["성장기&lt;br/&gt;PMF 확보"]
    MATURE["성숙기&lt;br/&gt;강한 PMF"]
    PRE --&gt;|"프로토타이퍼+빌더+스위퍼"| GROW
    GROW --&gt;|"빌더+스위퍼+그로어 (+약간의 메인테이너)"| MATURE
    MATURE --&gt;|"스위퍼+그로어+메인테이너 (+약간의 빌더)"| MATURE
</code></pre>

<p>이 공식이 알려주는 실무적 함의는 분명합니다. 팀에 사람을 더할 때 “엔지니어가 부족하다”가 아니라 “지금 우리 제품 단계에 어떤 원형이 비어 있는가”를 먼저 물어야 한다는 것입니다. 성숙한 제품 팀에 프로토타이퍼만 계속 채우면 새 아이디어는 넘치지만 아무도 시스템을 지키지 않습니다. 반대로 PMF 이전 제품에 메인테이너만 모으면 지킬 것이 생기기도 전에 방어 태세부터 갖춥니다.</p>

<h2 id="thakicloud-관점-에이전트-시대의-역할-재편">ThakiCloud 관점: 에이전트 시대의 역할 재편</h2>

<p>직무가 녹아내린다는 관찰은 사람과 에이전트가 함께 일하는 조직에서 한층 뾰족해집니다. AI 에이전트가 기계적인 빌드 작업의 상당 부분을 흡수하면, 사람은 자연스럽게 제품 단계마다 진짜로 중요한 원형 쪽으로 이동하게 됩니다. 코드를 타이핑하는 손이 아니라, 어떤 원형이 지금 필요한지를 판단하는 눈이 병목이 됩니다.</p>

<p>ThakiCloud가 운용하는 Agent-Native Cloud인 Paxis는 바로 이 재편을 시스템 층위에서 구현합니다. Paxis는 Skills, Tools, Policies, Audit Logs를 일급 리소스로 다루며, 960개가 넘는 스킬을 BM25로 선택해 격리된 샌드박스에서 실행합니다. 체르니가 사람의 역할을 직함이 아니라 제품 순간에 맞춰 재조합한다고 말했듯, Paxis는 에이전트의 역량을 고정된 파이프라인이 아니라 그때그때의 작업에 맞춰 동적으로 조합합니다. 프로토타이퍼가 발상을 쏟아내면 빌더 역할의 에이전트가 프로덕션 코드로 전환하고, 스위퍼 역할의 검증 게이트가 결과를 정리하는 식의 분업이 스킬 하네스 안에서 그대로 재현됩니다.</p>

<p>인프라 쪽에서는 ThakiCloud의 ai-platform이 메인테이너 원형의 일을 대신 짊어집니다. K8s 기반 멀티테넌트 환경에서 Kueue로 GPU를 스케줄링하고, vLLM으로 모델을 서빙하며, 온프렘과 소버린 요구를 만족시키는 것은 정확히 성숙한 시스템의 보안·안정·효율을 지키는 메인테이너의 일입니다. 고객 조직은 이 부분을 플랫폼에 위임함으로써 자기 팀을 프로토타이퍼와 그로어 쪽에 더 배치할 수 있습니다.</p>

<p>채용 관점에서도 이 렌즈는 유용합니다. ThakiCloud는 이력서의 직함보다 지원자가 어떤 원형의 스펙트럼 위에 있는지를 봅니다. 지금 우리 제품 단계에 비어 있는 원형을 채우는 사람이 팀에 가장 큰 레버리지를 만들기 때문입니다. “무엇을 할 줄 아는가”만큼 “어떤 순간에 빛나는가”를 묻는 것입니다.</p>

<h2 id="한계-및-반론">한계 및 반론</h2>

<p>이 프레임워크를 무비판적으로 받아들이기 전에 반대편의 목소리도 들어야 합니다. 벤 비네거(Ben Vinegar)는 같은 논의를 두고 “사람들이 소프트웨어 조직이 어떻게 돌아가는지를 이제야 배우면서, 원래부터 있던 팀 역학을 AI 탓으로 잘못 돌리고 있다”고 지적했습니다. 날카로운 반론입니다. 프로토타이퍼와 메인테이너의 구분은 AI가 없던 시절에도 존재했고, 제품 생애주기에 따라 필요한 인재가 달라진다는 것도 새로운 통찰이 아닙니다.</p>

<p>원형 분류 자체의 한계도 있습니다. 사람을 다섯 개의 상자로 나누는 모든 시도가 그렇듯, 이 틀도 개인을 지나치게 단순화할 위험이 있습니다. 실제로는 한 사람이 프로젝트마다, 심지어 하루 안에서도 여러 원형을 오갑니다. 원형을 고정된 정체성으로 오해하면 “너는 스위퍼니까 새 아이디어는 내지 마”라는 식의 역효과가 납니다. 체르니 본인도 많은 사람이 원형을 넘나든다고 강조한 이유가 여기에 있습니다.</p>

<p>그럼에도 이 프레임워크가 가치 있는 이유는 예측력이 아니라 언어를 준다는 데 있습니다. “엔지니어 한 명 더”라는 모호한 요청 대신 “지금 우리에게 그로어가 부족하다”고 말할 수 있게 되면, 채용과 팀 구성의 대화가 훨씬 구체적으로 바뀝니다. AI가 직무의 기계적 층위를 걷어낼수록, 남는 것은 이런 원형 수준의 판단입니다. 미래의 제품 역할은 오늘의 도메인별 직함보다 이 원형에 더 가깝게 형성될지 모릅니다.</p>

<h2 id="마치며">마치며</h2>

<p>직무가 녹아내리는 것은 위기가 아니라 재편입니다. 프로토타이퍼, 빌더, 스위퍼, 그로어, 메인테이너라는 다섯 원형은 직함이 사라진 자리에 무엇이 남는지를 보여줍니다. 남는 것은 도구가 아니라 어떤 순간에 어떤 방식으로 기여하는가라는 본질입니다.</p>

<p>ThakiCloud는 사람과 에이전트가 이 원형들을 나눠 지는 조직을 만들고 있습니다. 에이전트가 반복 가능한 빌드와 유지의 상당 부분을 맡을수록, 사람은 지금 이 제품 단계에 어떤 원형이 필요한지를 읽어내는 일에 집중하게 됩니다. 그 판단이 다음 시대의 가장 귀한 역량입니다.</p>

<h2 id="출처">출처</h2>

<ul>
  <li>Boris Cherny, X(@bcherny), 2026-06-29: <a href="https://x.com/bcherny/status/2071379474277613732">원문 트윗</a></li>
  <li>Ben Vinegar, X(@bentlegen): <a href="https://x.com/bentlegen/status/2071576459538567463">반론 트윗</a></li>
</ul>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="culture" /><category term="일의미래" /><category term="조직문화" /><category term="제품팀" /><category term="채용" /><category term="Boris Cherny" /><category term="Claude Code" /><summary type="html"><![CDATA[엔지니어링·제품·디자인·데이터가 한 덩어리로 녹아드는 지금, Claude Code를 만든 보리스 체르니가 제안한 다섯 가지 역할 원형과 제품 생애주기별 팀 구성 공식을 살펴봅니다.]]></summary></entry><entry xml:lang="ko"><title type="html">Qwen3.6-27B을 4비트로: NVFP4 양자화가 Hopper까지 내려온 이유</title><link href="https://thakicloud.github.io/ko/owm/qwen3-6-27b-nvfp4-onprem-serving/" rel="alternate" type="text/html" title="Qwen3.6-27B을 4비트로: NVFP4 양자화가 Hopper까지 내려온 이유" /><published>2026-07-01T00:00:00+09:00</published><updated>2026-07-01T00:00:00+09:00</updated><id>https://thakicloud.github.io/ko/owm/qwen3-6-27b-nvfp4-onprem-serving</id><content type="html" xml:base="https://thakicloud.github.io/ko/owm/qwen3-6-27b-nvfp4-onprem-serving/"><![CDATA[<p>⏱️ <strong>예상 읽기 시간</strong>: 11분</p>

<p><img src="/assets/images/qwen3-6-27b-nvfp4-onprem-serving-hero.png" alt="Qwen3.6-27B NVFP4 4비트 양자화 개념도" /></p>

<h2 id="개요">개요</h2>

<p>NVIDIA가 Alibaba의 Qwen3.6-27B을 NVFP4 4비트로 양자화한 <code class="language-plaintext highlighter-rouge">nvidia/Qwen3.6-27B-NVFP4</code>를 공개했습니다. 27B급 하이브리드 어텐션 추론 모델을 4비트로 눌러 가중치 메모리를 약 2.5배 줄이면서, FP8 기준선 대비 아홉 개 벤치마크 전부에서 차이를 1%p 이내로 유지합니다. 라이선스는 Apache 2.0입니다.</p>

<p>이 글에서 짚고 싶은 지점은 세 가지입니다. 첫째, 지난번 <code class="language-plaintext highlighter-rouge">Gemma-4-26B-A4B-NVFP4</code>가 사실상 Blackwell에서만 4비트 가속을 받았던 것과 달리, 이번 빌드는 모델카드에서 <strong>Hopper와 Blackwell을 함께 지원 대상</strong>으로 명시합니다. 이미 H100이나 H200을 굴리는 조직이 새 하드웨어를 사지 않고도 오늘 당장 시험해 볼 수 있다는 뜻입니다. 둘째, 이 모델은 텍스트만 다루는 순수 LLM이 아니라 <strong>텍스트와 이미지, 비디오를 입력받는 멀티모달 추론 모델</strong>입니다. 셋째, 컨텍스트가 <strong>262K 토큰</strong>까지 열려 있어 긴 문서와 장기 대화를 한 번에 받아냅니다.</p>

<p>ThakiCloud는 Kubernetes 위에서 Kueue로 GPU 쿼터를 관리하고 vLLM으로 모델을 멀티테넌트 서빙하는 플랫폼을 운영합니다. 그래서 “기존에 가진 GPU 위에서 더 큰 모델을, 더 많은 테넌트에게 얼마나 얹을 수 있는가”는 신기한 소식이 아니라 비용 모델과 직결되는 질문입니다. 이 글은 모델 팩트를 정리하고, NVFP4가 왜 Hopper까지 내려왔는지 따져 본 뒤, 서빙 경로와 우리 플랫폼에서의 쓸모를 솔직하게 리뷰합니다.</p>

<h2 id="이-모델은-무엇인가">이 모델은 무엇인가</h2>

<p><code class="language-plaintext highlighter-rouge">nvidia/Qwen3.6-27B-NVFP4</code>는 Alibaba의 <code class="language-plaintext highlighter-rouge">Qwen3.6-27B</code>을 NVIDIA Model Optimizer(nvidia-modelopt v0.45.0)로 NVFP4 양자화한 버전입니다. 모델카드 기준 핵심 스펙은 다음과 같습니다.</p>

<table>
  <thead>
    <tr>
      <th>항목</th>
      <th>값</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>베이스 모델</td>
      <td>Alibaba Qwen3.6-27B</td>
    </tr>
    <tr>
      <td>아키텍처</td>
      <td>하이브리드 어텐션 (Gated DeltaNet + Gated Attention)</td>
    </tr>
    <tr>
      <td>총 파라미터</td>
      <td>27B</td>
    </tr>
    <tr>
      <td>컨텍스트</td>
      <td>262K 토큰</td>
    </tr>
    <tr>
      <td>입력 모달리티</td>
      <td>텍스트 + 이미지 + 비디오</td>
    </tr>
    <tr>
      <td>출력</td>
      <td>텍스트</td>
    </tr>
    <tr>
      <td>양자화</td>
      <td>NVFP4 (Model Optimizer v0.45.0)</td>
    </tr>
    <tr>
      <td>타깃 하드웨어</td>
      <td>NVIDIA Hopper, Blackwell</td>
    </tr>
    <tr>
      <td>라이선스</td>
      <td>Apache 2.0</td>
    </tr>
  </tbody>
</table>

<p>주목할 부분은 아키텍처의 <strong>하이브리드 어텐션</strong>입니다. Gated DeltaNet은 선형 어텐션 계열로, 시퀀스 길이에 비례해 비용이 늘어나는 일반 어텐션과 달리 장문을 효율적으로 처리하도록 설계된 경로입니다. 여기에 표현력을 담당하는 Gated Attention을 섞어, 262K 같은 긴 컨텍스트를 감당하면서도 품질을 유지하는 절충을 취합니다. 서빙 시 <code class="language-plaintext highlighter-rouge">--reasoning-parser qwen3</code>를 요구한다는 점에서, 이 모델은 최종 답 이전에 추론 과정을 생성하는 <strong>리즈닝 모델</strong>이라는 것도 확인됩니다.</p>

<p>한 가지 정직하게 밝혀 둘 부분이 있습니다. 모델카드는 하이브리드 어텐션이라는 사실은 명시하지만, 정확한 레이어 수나 전문가(expert) 구성, 토큰당 활성 파라미터 같은 세부는 공개하지 않습니다. 따라서 이 글에서는 카드에 적힌 사실만 다루고, 미공개 수치는 추정하지 않습니다.</p>

<h2 id="nvfp4-양자화-무엇을-어떻게-누르는가">NVFP4 양자화: 무엇을 어떻게 누르는가</h2>

<p>NVFP4는 NVIDIA가 밀어붙이는 4비트 부동소수점 포맷입니다. 가중치를 4비트 정수로 단순 절단하는 INT4와 달리, 작은 블록 단위로 FP8 스케일을 두는 마이크로스케일링 방식이라 4비트 수준의 메모리 절감을 누리면서도 정확도 손실을 작게 억제합니다.</p>

<p>이번 빌드에서 양자화 대상은 <strong>트랜스포머 블록 안 선형 연산자의 가중치와 활성값(activation)</strong>입니다. 비선형 층은 건드리지 않습니다. 모델카드는 파라미터당 비트 수를 16에서 4로 줄여 디스크와 GPU 메모리 요구량을 <strong>약 2.5배 감소</strong>시킨다고 밝힙니다. 27B 파라미터를 BF16으로 올리면 약 54GB가 필요한데, 약 2.5배 감소를 적용하면 체크포인트가 20GB 안팎으로 내려옵니다. 같은 GPU에 모델을 2배 이상 얹거나, 남은 메모리를 KV 캐시로 돌려 동시 세션을 늘릴 여지가 생깁니다.</p>

<p>여기서 지난 Gemma NVFP4 리뷰와 갈리는 대목이 나옵니다. Gemma 빌드는 소비자·프로 Blackwell(SM120)에서 NVFP4 MoE 커널이 아직 깨져 있어, 실제로 도는 소비자급 경로가 DGX Spark에 한정됐습니다. 반면 이번 Qwen3.6 빌드는 모델카드가 <strong>Hopper와 Blackwell을 함께 지원 대상으로 명시</strong>하고, 서빙도 vLLM의 <code class="language-plaintext highlighter-rouge">--quantization modelopt</code> 경로를 씁니다. 가중치뿐 아니라 활성값까지 양자화한 구성과 modelopt 서빙 경로가 맞물리면서, 데이터센터에 이미 깔린 H100·H200 위에서도 이 4비트 모델을 돌릴 수 있게 된 것입니다. “새 Blackwell을 사야만 4비트 이득을 본다”는 제약이 이번에는 상당히 풀렸습니다.</p>

<pre><code class="language-mermaid">flowchart TB
    A["Qwen3.6-27B&lt;br/&gt;BF16 약 54GB"] --&gt; B["NVIDIA Model Optimizer&lt;br/&gt;v0.45.0"]
    B --&gt; C["NVFP4 양자화&lt;br/&gt;선형 연산자 가중치 + 활성값&lt;br/&gt;16비트 → 4비트"]
    C --&gt; D["NVFP4 체크포인트&lt;br/&gt;약 20GB 안팎 · 약 2.5배 감소"]
    D --&gt; E["vLLM 서빙&lt;br/&gt;--quantization modelopt"]
    E --&gt; F["NVIDIA Hopper&lt;br/&gt;H100 / H200"]
    E --&gt; G["NVIDIA Blackwell&lt;br/&gt;B200 등"]
</code></pre>

<h2 id="벤치마크-4비트-손실은-얼마인가">벤치마크: 4비트 손실은 얼마인가</h2>

<p>모델카드는 NVFP4 양자화본과 FP8 기준선을 아홉 개 벤치마크에서 나란히 제시합니다.</p>

<table>
  <thead>
    <tr>
      <th>벤치마크</th>
      <th>FP8</th>
      <th>NVFP4</th>
      <th>측정 영역</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>MMLU Pro</td>
      <td>86.1</td>
      <td>86.3</td>
      <td>일반 지식·추론</td>
    </tr>
    <tr>
      <td>GPQA Diamond</td>
      <td>86.0</td>
      <td>85.5</td>
      <td>대학원 과학 추론</td>
    </tr>
    <tr>
      <td>HLE</td>
      <td>21.7</td>
      <td>21.8</td>
      <td>고난도 종합</td>
    </tr>
    <tr>
      <td>τ²-Bench Telecom</td>
      <td>95.2</td>
      <td>95.4</td>
      <td>에이전트 툴 사용</td>
    </tr>
    <tr>
      <td>MMMU Pro</td>
      <td>74.6</td>
      <td>74.3</td>
      <td>멀티모달 추론</td>
    </tr>
    <tr>
      <td>SciCode</td>
      <td>44.8</td>
      <td>44.5</td>
      <td>과학 코딩</td>
    </tr>
    <tr>
      <td>AIME 2025</td>
      <td>93.1</td>
      <td>92.7</td>
      <td>수학 경시</td>
    </tr>
    <tr>
      <td>AA-LCR</td>
      <td>68.8</td>
      <td>68.3</td>
      <td>장문 추론</td>
    </tr>
    <tr>
      <td>IFBench</td>
      <td>65.1</td>
      <td>65.5</td>
      <td>지시 이행</td>
    </tr>
  </tbody>
</table>

<p>아홉 항목 모두 FP8 대비 1%p 안쪽 차이입니다. MMLU Pro, HLE, τ²-Bench Telecom, IFBench는 오히려 NVFP4가 근소하게 높은데, 이는 측정 분산 범위로 읽는 편이 안전합니다. 방향성은 분명합니다. <strong>4비트로 눌러도 품질이 사실상 유지된다</strong>는 것이고, NVFP4가 INT4 대비 갖는 강점이 여기서 드러납니다.</p>

<p>벤치 구성 자체도 이 모델의 성격을 보여 줍니다. τ²-Bench Telecom은 에이전트가 도구를 호출하며 과제를 수행하는 능력을, AA-LCR은 장문 컨텍스트 추론을, MMMU Pro는 멀티모달 이해를 측정합니다. 순수 지식 QA만이 아니라 <strong>에이전트 툴 사용과 장문, 멀티모달</strong>을 함께 겨냥한 모델이라는 뜻입니다. 다만 한국어 도메인 태스크는 공개 벤치에 드러나지 않으므로, 실제 도입 전에는 내부 평가셋으로 별도 검증을 권장합니다.</p>

<h2 id="서빙-가이드">서빙 가이드</h2>

<p>모델카드가 제시하는 권장 경로는 vLLM입니다. 실행 명령은 다음과 같습니다.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>vllm serve nvidia/Qwen3.6-27B-NVFP4 <span class="se">\</span>
  <span class="nt">--port</span> 8000 <span class="se">\</span>
  <span class="nt">--quantization</span> modelopt <span class="se">\</span>
  <span class="nt">--max-model-len</span> 262144 <span class="se">\</span>
  <span class="nt">--reasoning-parser</span> qwen3
</code></pre></div></div>

<p>운영에서 챙길 포인트는 세 가지입니다. 먼저 <code class="language-plaintext highlighter-rouge">--quantization modelopt</code>가 NVFP4 체크포인트를 로드하는 핵심 플래그입니다. 다음으로 <code class="language-plaintext highlighter-rouge">--reasoning-parser qwen3</code>가 있어야 추론 과정과 최종 답이 올바로 분리돼 파싱됩니다. 마지막으로 <code class="language-plaintext highlighter-rouge">--max-model-len 262144</code>는 262K 컨텍스트를 전부 여는 설정이며, KV 캐시 예산이 그만큼 커지므로 실제로 필요한 길이에 맞춰 낮춰 잡는 것이 메모리 효율에 유리합니다.</p>

<p>하드웨어는 Hopper 또는 Blackwell, OS는 Linux가 전제입니다. Hopper까지 지원한다는 점 덕분에, 데이터센터에 이미 있는 H100·H200 노드에서 별도 장비 없이 서빙 경로를 검증할 수 있습니다.</p>

<h2 id="thakicloud-서빙-관점">ThakiCloud 서빙 관점</h2>

<p>ThakiCloud는 Kueue로 GPU 쿼터를 관리하고 vLLM으로 모델을 멀티테넌트 서빙하는 K8s 기반 AI/ML 플랫폼을 운영합니다. 이 모델이 우리 운용 모델에 주는 시사점은 인프라와 에이전트 두 방향에서 나옵니다.</p>

<p><strong>기존 Hopper 자산 위에서 밀도를 2배 이상으로.</strong> 이 부분이 이번 빌드의 가장 실질적인 가치입니다. NVFP4가 Hopper까지 지원한다는 것은, 새 Blackwell 투자 없이 이미 보유한 H100·H200 위에서 4비트 이득을 볼 수 있다는 뜻입니다. 27B 모델의 가중치가 20GB 안팎으로 내려오면 같은 GPU에 더 많은 모델 인스턴스를 올리거나, 남는 메모리를 KV 캐시로 돌려 테넌트별 동시성 한도를 넉넉히 잡을 수 있습니다. Kueue 쿼터 관점에서는 같은 카드로 더 많은 워크로드를 받는 셈이라 단가가 그대로 내려갑니다.</p>

<p><strong>멀티모달 추론 워커의 온프렘 후보.</strong> ThakiCloud의 에이전트 제어 평면인 Paxis는 Agent-Native Cloud로, 스킬을 격리 샌드박스에서 실행하고 모든 행동을 정책 게이트와 감사 로그로 통과시킵니다. 이 구조에서 다수의 워커가 문서를 읽고 도구를 호출하며 과제를 처리합니다. Qwen3.6-27B-NVFP4는 τ²-Bench Telecom 같은 에이전트 툴 사용 벤치에서 강하고, 텍스트뿐 아니라 이미지와 비디오를 입력받으며, 262K 컨텍스트를 감당합니다. 문서·화면·영상을 함께 다루는 멀티모달 워커, 툴 호출 루프의 말단 워커로 온프렘에서 돌리기에 적합한 후보입니다. 다만 우리 비용 규율대로 워커는 싸게 돌리되, fan-out 결과는 상위 모델의 검증 단계로 닫아 워커 환각이 누적되지 않게 해야 합니다.</p>

<p><strong>온프렘·컴플라이언스 제안의 레퍼런스.</strong> Apache 2.0 라이선스에 단일 노드 서빙이 가능한 구성은, 데이터 외부 반출이 금지된 공공·금융 고객에게 그대로 제안할 수 있습니다. 국정원 요구 대응이나 소버린 AI 같은 제약 환경에서, 상용 API 없이 자체 GPU로 대형 멀티모달 추론 모델을 돌린다는 그림은 실질적인 도입 경로가 됩니다.</p>

<h2 id="한계-및-반론">한계 및 반론</h2>

<p>균형을 위해 짚을 부분입니다.</p>

<ul>
  <li><strong>아키텍처 세부가 공개되지 않았습니다.</strong> 하이브리드 어텐션이라는 사실은 있지만 레이어 수, 전문가 구성, 활성 파라미터가 카드에 없습니다. 배치 효율과 메모리 상주량을 정밀하게 계산하려면 추가 정보가 필요합니다.</li>
  <li><strong>실측 처리량 수치가 없습니다.</strong> 이 글은 메모리 절감과 벤치마크 같은 카드 팩트에 근거합니다. 스트림당 토큰 속도나 동시성 한도는 하드웨어와 설정에 따라 크게 달라지므로, 도입 전 자체 워크로드로 재측정해야 합니다.</li>
  <li><strong>활성값 양자화의 변동성.</strong> 가중치뿐 아니라 활성값까지 4비트로 누르는 구성은 일부 분포가 치우친 워크로드에서 정확도 변동을 낳을 수 있습니다. 공개 벤치가 1%p 이내라 해도, 도메인 특화 태스크는 별도로 확인하는 편이 안전합니다.</li>
  <li><strong>멀티모달 서빙 경로의 성숙도.</strong> 이미지·비디오 입력을 실제 프로덕션에서 안정적으로 받으려면 전처리 파이프라인과 vLLM 멀티모달 경로의 성숙도를 함께 검증해야 합니다.</li>
  <li><strong>한국어 실사용 검증.</strong> 공개 벤치는 영어권 중심입니다. 한국어 RAG·툴콜 정확도는 내부 평가셋으로 따로 봐야 합니다.</li>
</ul>

<p>그럼에도 Apache 2.0, Hopper까지 내려온 4비트 가속, 멀티모달 추론, 262K 컨텍스트라는 조합은 온프렘 서빙을 고민하는 조직에게 충분히 매력적인 선택지입니다. “새 하드웨어를 사야 4비트 이득을 본다”는 벽이 낮아졌다는 점만으로도, 이미 Hopper 플릿을 가진 팀에게는 오늘 검증해 볼 값어치가 있습니다.</p>

<h2 id="참고-링크">참고 링크</h2>

<ul>
  <li><a href="https://huggingface.co/nvidia/Qwen3.6-27B-NVFP4">Qwen3.6-27B-NVFP4 모델카드 (Hugging Face)</a></li>
  <li><a href="https://github.com/NVIDIA/TensorRT-Model-Optimizer">NVIDIA TensorRT Model Optimizer</a></li>
  <li><a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">NVFP4 소개 (NVIDIA Developer)</a></li>
  <li><a href="https://docs.vllm.ai/">vLLM 공식 문서</a></li>
  <li><a href="https://thakicloud.github.io/ko/owm/gemma-4-26b-nvfp4-dgx-spark/">Gemma-4-26B-NVFP4 DGX Spark 리뷰 (ThakiCloud 블로그)</a></li>
</ul>]]></content><author><name>{&quot;name&quot;=&gt;nil, &quot;avatar&quot;=&gt;nil, &quot;bio&quot;=&gt;nil, &quot;location&quot;=&gt;&quot;Seoul, Korea&quot;, &quot;email&quot;=&gt;&quot;info@thakicloud.co.kr&quot;, &quot;uri&quot;=&gt;nil, &quot;home&quot;=&gt;nil, &quot;links&quot;=&gt;[{&quot;label&quot;=&gt;&quot;Website&quot;, &quot;icon&quot;=&gt;&quot;fas fa-fw fa-link&quot;, &quot;url&quot;=&gt;&quot;https://thakicloud.co.kr&quot;}, {&quot;label&quot;=&gt;&quot;GitHub&quot;, &quot;icon&quot;=&gt;&quot;fab fa-fw fa-github&quot;, &quot;url&quot;=&gt;&quot;https://github.com/thakicloud&quot;}]}</name><email>info@thakicloud.co.kr</email></author><category term="owm" /><category term="qwen3" /><category term="nvfp4" /><category term="quantization" /><category term="hopper" /><category term="blackwell" /><category term="hybrid-attention" /><category term="multimodal" /><category term="vllm" /><category term="on-premise" /><summary type="html"><![CDATA[NVIDIA가 공개한 Qwen3.6-27B-NVFP4는 27B 하이브리드 어텐션 추론 모델을 4비트로 눌러 메모리를 약 2.5배 줄이면서도 FP8 대비 벤치마크 차이를 1%p 이내로 유지합니다. 지난 Gemma NVFP4가 Blackwell 전용이었던 것과 달리 이번 빌드는 Hopper까지 지원해, 이미 H100/H200을 가진 조직이 오늘 당장 온프렘에서 돌릴 수 있습니다. 모델 팩트와 NVFP4 원리, 서빙 경로, 그리고 ThakiCloud 서빙 관점을 정리했습니다.]]></summary></entry></feed>