Growth at ThakiCloud, and Your Career 🚀

“Velocity · Validation · Versioning” (the Three Vs). If these three words make your heart race, ThakiCloud is the stage for you. Here you can experience real full-stack MLOps with real traffic, on a stack vertically integrated from GPU/NPU infrastructure up to SaaS. Powered by our technology, culture, and colleagues, we’re moving faster, more safely, and further.


How ThakiCloud’s MLOps Is Different

1. Velocity — From Idea to Production, Before Your Coffee Gets Cold

  • IaaS-PaaS-SaaS Vertical Integration: GPUs and NPUs run mixed within Kubernetes node pools, so moving a workload from experiment to serving costs zero rescheduling (a minimal scheduling sketch follows this list).
  • JupyterHub Image Auto-build: push a branch and the Helm chart is deployed to the staging cluster right away.
  • Feature Store-based Experiment UI: combine data and feature versions with one click and launch a new experiment within 15 minutes.
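
To make the first bullet concrete, the sketch below uses the official Kubernetes Python client to place an experiment pod on a GPU node pool; the pool label, taint, image, and namespace are hypothetical examples, not our actual configuration.

```python
# A minimal sketch, assuming a GPU node pool labeled/tainted as below.
# The label key, taint, image, and namespace are hypothetical examples.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="exp-trainer", labels={"app": "experiment"}),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Target the GPU node pool; the same spec with an NPU selector and
        # resource limit would land on the NPU pool without any other change.
        node_selector={"accelerator-pool": "gpu"},
        tolerations=[
            client.V1Toleration(
                key="accelerator", operator="Equal", value="gpu", effect="NoSchedule"
            )
        ],
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/ml/trainer:latest",
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="experiments", body=pod)
```

The point being illustrated is that experiment and serving workloads share the same node pools, which is what keeps the switch from experiment to serving cheap.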

2. Validation — Fail Fast, Metrics in Product Language

  • Shadow Traffic Funnel: 10% of live traffic is mirrored so new models are evaluated without ever being exposed to users.
  • Click-through Rate · MAU ↔ ML Metrics Auto-integration: business KPIs and ML metrics are monitored side by side on a Prometheus + Grafana dashboard.
  • Heuristic Safety Layer: predictions with confidence < τ are filtered out automatically to protect the user experience (see the sketch after this list).
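
For intuition, the confidence gate behind that safety layer boils down to something like the sketch below; the predict_with_confidence interface and the threshold value are illustrative assumptions, not our production API.

```python
from typing import Any, Optional

CONFIDENCE_THRESHOLD = 0.85  # τ; an illustrative value, tuned per model in practice


def safe_predict(model: Any, features: dict) -> Optional[dict]:
    """Serve a prediction only when the model is confident enough.

    `predict_with_confidence` is a hypothetical interface; callers are expected
    to fall back to a rule-based default whenever None is returned.
    """
    label, confidence = model.predict_with_confidence(features)
    if confidence < CONFIDENCE_THRESHOLD:
        return None  # below τ: suppress the prediction to protect the user experience
    return {"label": label, "confidence": confidence}
```

A caller that gets None simply falls back to a heuristic default, which is how low-confidence predictions are kept away from users.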

3. Versioning — Time Travel with a Single Docker Tag

  • OCI Model Registry: models, features, and metadata are managed as image tags, so rollback is instant once you pin a sha.
  • Daily Auto-retraining: when data drift is detected, an Airflow DAG runs retraining, validation, and promotion automatically (a DAG sketch follows this list).
  • Fallback Model: a lightweight model takes over within 1 second of an SLO violation.
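
As a rough sketch of that retraining loop, the Airflow 2.x TaskFlow DAG below chains drift detection, retraining, validation, and promotion; every task body is a placeholder rather than our actual pipeline code.

```python
from datetime import datetime

from airflow.decorators import dag, task
from airflow.exceptions import AirflowSkipException

DRIFT_THRESHOLD = 0.2  # illustrative threshold


@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False, tags=["mlops"])
def retrain_on_drift():
    @task
    def check_drift() -> float:
        # Placeholder: in practice, compare live feature distributions against the
        # training snapshot (e.g., a PSI or KS statistic per feature).
        return 0.35

    @task
    def retrain(drift_score: float) -> str:
        if drift_score < DRIFT_THRESHOLD:
            raise AirflowSkipException("No significant drift; skipping retrain.")
        # Placeholder: launch the training job and push the model as an OCI image.
        return "sha256:abc123"  # digest of the freshly built model image

    @task
    def validate(model_sha: str) -> str:
        # Placeholder: run offline evaluation and raise to fail the task if metrics regress.
        return model_sha

    @task
    def promote(model_sha: str) -> None:
        # Placeholder: retag the validated image as "production" in the registry.
        print(f"Promoting {model_sha} to production")

    promote(validate(retrain(check_drift())))


retrain_on_drift()
```

On days without drift, retrain raises AirflowSkipException, so the downstream validate and promote tasks are skipped automatically and the run still finishes cleanly.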

Pain Points We Solved & Next Chapter

  • Dev ↔ Prod Inconsistency → unified under a single Helm Release.
    Next: standardize multi-region deployment.
  • Alert Flood → noise cut down with an alert-tuner bot.
    Next: automatic root-cause analysis of log levels with GPT.
  • Long-tail Bugs → reproduced with a Feature Slicing debugger.
    Next: fully automate reproduction through data synthesis.
  • Slow Deployment → lead time cut from 30 days to 5 with Canary + Progressive Delivery.
    Next: tighter alignment of model, ops, and business team OKRs.

What It Means to Work at ThakiCloud — Real Stories

“The experiment model crashed at 3 AM, but we rolled back in 5 minutes!”

Mr. B from the MLOps Platform team says: “Thanks to a culture where failures become assets too, I’m not afraid to experiment. Even discarded logs remain as team knowledge, and seeing your open-source PRs meet real traffic directly is ThakiCloud’s unique charm.”

Mr. C from the Cloud Infra team recalls the sight of GPUs running in the Saudi desert. “Designing and operating global-scale infrastructure first-hand, together with the collaboration of great colleagues, is what drives my growth every day.”


Open Positions

Team missions at a glance:

  • MLOps Platform: Feature Store redesign, automated Pydantic schema validation
  • LLMOps R&D: GPT-based log analysis, self-healing serving
  • Cloud Infra: GPU/NPU hybrid scheduling, multi-region HA
  • Data Engineering: real-time CDC + Iceberg Lakehouse construction

Application Method

  1. GitHub / Tech Blog Link — Commits are your cover letter.
  2. Any-format Project — Jupyter notebooks, Dockerfiles, and Helm charts are all welcome.
  3. One Line on Your Three Vs Experience — e.g., “The model crashed at 3 AM, but we rolled back in 5 minutes 🏃‍♂️”.

We’re Waiting to Grow Together with You

If Velocity · Validation · Versioning make your heart rate go up, let’s meet at git push origin thakicloud.
With ThakiCloud, add a new chapter to your career.