PyTorch on Kubernetes: Kubeflow Trainer Joins the PyTorch Ecosystem
https://pytorch.org/blog/pytorch-on-kubernetes-kubeflow-trainer-joins-the-pytorch-ecosystem/
Mon, 28 Jul 2025

Kubeflow Trainer Logo

We’re thrilled to announce that the Kubeflow Trainer project has been integrated into the PyTorch ecosystem! This integration ensures that Kubeflow Trainer aligns with PyTorch’s standards and practices, giving developers a reliable, scalable, and community-backed solution to run PyTorch on Kubernetes.

To view the PyTorch Ecosystem, see the PyTorch Landscape, and learn more about how projects can join the PyTorch Ecosystem.

About Kubeflow Trainer

Kubeflow Trainer is a Kubernetes-native project that enables scalable, distributed training of AI models and is purpose-built for fine-tuning large language models (LLMs). It simplifies scaling training workloads out to multiple nodes, manages large datasets efficiently, and ensures fault tolerance.

Kubeflow Trainer

The core features include:

  • Simplify Kubernetes complexity: Kubeflow Trainer APIs are designed for two primary user personas: AI practitioners (ML engineers and data scientists who develop AI models using the Kubeflow Python SDK and TrainJob APIs) and platform admins (administrators and DevOps engineers responsible for managing Kubernetes clusters and Kubeflow Trainer runtime APIs). AI practitioners can focus on their PyTorch application code without worrying about infrastructure details, while platform admins can flexibly schedule workload resources for maximum cluster utilization and cost efficiency. To support these roles, Kubeflow Trainer provides purpose-built Kubernetes Custom Resource Definitions (CRDs) that streamline model training and infrastructure management.

Kubeflow Trainer user personas

  • Python SDK: A Pythonic interface designed for AI practitioners that abstracts away the details of interacting directly with Kubernetes APIs. It enables users to focus on developing PyTorch models without worrying about Kubernetes YAML configurations.
  • Blueprints for LLM fine-tuning on Kubernetes: With built-in trainers, Kubeflow Trainer enables AI practitioners to seamlessly fine-tune their favorite LLMs using the desired configuration for datasets, LoRA parameters, learning rate, etc. The first release implements recipes supporting various fine-tuning strategies, including Supervised Fine-Tuning (SFT), Knowledge Distillation, DPO, PPO, GRPO, and Quantization-Aware Training. The community is working towards adding more built-in trainers powered by LLaMA-Factory, Unsloth, and HuggingFace TRL to enable efficient LLM fine-tuning (see the illustrative sketch after this list).
  • Optimized GPU utilization: Kubeflow Trainer maximizes GPU efficiency by streaming large-scale data directly to distributed GPUs using an in-memory distributed data cache powered by Apache Arrow and Apache DataFusion.
  • Advanced scheduling capabilities: Kubeflow Trainer supports gang scheduling through the PodGroupPolicy API, enabling coordinated scheduling of pods across nodes. It also integrates with Kubernetes schedulers such as Kueue, Coscheduling, Volcano, and KAI Scheduler to ensure all required resources are allocated before training jobs start.
  • Accelerate MPI workloads on Kubernetes: Kubeflow Trainer supports MPI-based runtimes such as DeepSpeed and MLX. It handles all necessary orchestration of MPI workloads with SSH-based optimization to boost MPI performance.
  • Improved resilience and fault-tolerance: By leveraging Kubernetes-native APIs like Jobs and JobSets, Kubeflow Trainer improves the reliability and efficiency of AI workloads. With support for the PodFailurePolicy API, users can reduce cost by avoiding unnecessary restarts. Additionally, the SuccessPolicy API allows training jobs to complete early once the target objective is achieved.
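
For the LLM fine-tuning blueprints mentioned above, the rough sketch below shows the intended shape of the SDK call. The BuiltinTrainer and TorchTuneConfig names, fields, and runtime name are assumptions made for illustration; check the Kubeflow SDK documentation for the exact API.

from kubeflow.trainer import TrainerClient, BuiltinTrainer, TorchTuneConfig

# Illustrative only: class names, config fields, and the runtime name below
# are assumptions and may differ in the released SDK.
job_id = TrainerClient().train(
    trainer=BuiltinTrainer(
        config=TorchTuneConfig(
            dtype="bf16",
            num_nodes=2,
        ),
    ),
    runtime=TrainerClient().get_runtime("torchtune-llama3.2-1b"),
)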

Background and Evolution

This project was originally started as a distributed training operator for TensorFlow (e.g. TFJob), and later we merged efforts from other Kubeflow Training Operators (e.g. PyTorchJob, MPIJob) to provide a unified and simplified experience for both users and developers. We are very grateful to all who filed issues or helped resolve them, asked and answered questions, and were part of inspiring discussions. We’d also like to thank everyone who’s contributed to and maintained the original operators.

By joining the PyTorch Ecosystem, we strive to apply best practices for deploying distributed PyTorch applications on Kubernetes and bring first-class PyTorch support to Kubeflow Trainer.

Integrations with PyTorch Ecosystem

Kubeflow Trainer is deeply integrated with the PyTorch ecosystem, supporting a broad range of tools and libraries—including torch, DeepSpeed, HuggingFace, Horovod, and more.

It empowers PyTorch users to implement advanced distributed training strategies such as Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP & FSDP2), and Tensor Parallelism, enabling efficient large-scale model training on Kubernetes.
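
To illustrate how little the training code changes between these strategies, here is a minimal sketch that swaps the DistributedDataParallel wrapper used in the Quick Start below for FSDP. It assumes a GPU-backed runtime and keeps the rest of the loop identical.

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torchvision import models

def train_pytorch_fsdp():
    # Same process-group setup as the DDP example in the Quick Start below.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Shard parameters, gradients, and optimizer state across all ranks
    # instead of replicating the full model on every device.
    model = FSDP(models.shufflenet_v2_x0_5(num_classes=10).cuda())
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # ... the data loading and training loop are unchanged ...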

Additionally, Kubeflow Trainer supports data parallelism using PyTorch IterableDatasets, streaming data directly from distributed in-memory data cache nodes. This allows scalable training even with massive datasets that exceed local memory capacity.
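
The snippet below is only a rough sketch of that pattern, not Kubeflow Trainer's actual cache client: the fetch_shard helper and CACHE_ENDPOINT variable are hypothetical stand-ins. The point is that an IterableDataset lets each rank stream just its own shard instead of materializing the full dataset locally.

import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, IterableDataset

def fetch_shard(endpoint, shard, num_shards):
    # Placeholder for a real cache client: here we just synthesize a few
    # samples so the sketch runs; a real client would stream record batches
    # from the cache node at `endpoint`.
    for i in range(shard, 1000, num_shards):
        yield torch.rand(1, 28, 28), i % 10

class StreamingShardDataset(IterableDataset):
    """Each rank iterates only over its own shard of the remote dataset."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def __iter__(self):
        # Assumes the default process group is already initialized,
        # as in the training function shown in the Quick Start below.
        rank, world_size = dist.get_rank(), dist.get_world_size()
        yield from fetch_shard(self.endpoint, rank, world_size)

dataset = StreamingShardDataset(os.environ.get("CACHE_ENDPOINT", "localhost:50051"))
train_loader = DataLoader(dataset, batch_size=100)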

Quick Start

Follow the steps below to quickly deploy Kubeflow Trainer and run your first training job.

Prerequisites

Before you begin, make sure you have kubectl, kind, and Python with pip installed locally.

Install Kubeflow Trainer

Deploy Kubeflow Trainer control plane on your local kind cluster:

$ kind create cluster

$ kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"


# Ensure that JobSet and Trainer controller manager are running.
$ kubectl get pods -n kubeflow-system

NAME                                                  READY   STATUS    RESTARTS   AGE
jobset-controller-manager-54968bd57b-88dk4            2/2     Running   0          65s
kubeflow-trainer-controller-manager-cc6468559-dblnw   1/1     Running   0          65s


# Deploy the Kubeflow Trainer runtimes.
$ kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"

# Install Kubeflow SDK
$ pip install git+https://github.com/kubeflow/sdk.git@64d74db2b6c9a0854e39450d8d1c0201e1e9b3f7#subdirectory=python

Define PyTorch Training Function

After installing Kubeflow Trainer, define your PyTorch training function that contains the end-to-end training script:

def train_pytorch():
    import os
    import torch
    import torch.distributed as dist
    from torch.utils.data import DataLoader, DistributedSampler
    from torchvision import datasets, transforms, models

    # [1] Configure CPU/GPU device and distributed backend.
    device, backend = ("cuda", "nccl") if torch.cuda.is_available() else ("cpu", "gloo")
    dist.init_process_group(backend=backend)
    local_rank = int(os.getenv("LOCAL_RANK", 0))
    device = torch.device(f"{device}:{local_rank}")
    
    # [2] Get the pre-defined model.
    model = models.shufflenet_v2_x0_5(num_classes=10)
    model.conv1 = torch.nn.Conv2d(1, 24, kernel_size=3, stride=2, padding=1, bias=False)
    model = torch.nn.parallel.DistributedDataParallel(model.to(device))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
   
    # [3] Get the FashionMNIST dataset and distribute it across all available devices.
    if local_rank == 0: # Download dataset only on local_rank=0 process.
        dataset = datasets.FashionMNIST("./data", train=True, download=True, transform=transforms.Compose([transforms.ToTensor()]))
    dist.barrier()
    dataset = datasets.FashionMNIST("./data", train=True, download=False, transform=transforms.Compose([transforms.ToTensor()]))
    train_loader = DataLoader(dataset, batch_size=100, sampler=DistributedSampler(dataset))

    # [4] Define the PyTorch training loop.
    for epoch in range(3):
        for batch_idx, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)
            # Forward and Backward pass
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0 and dist.get_rank() == 0:
                print(f"Epoch {epoch} [{batch_idx * len(inputs)}/{len(train_loader.dataset)}] "
                    f"Loss: {loss.item():.4f}"
                )

Run PyTorch on Kubernetes with TrainJob

After defining the training function, use the Kubeflow SDK to create a TrainJob:

from kubeflow.trainer import TrainerClient, CustomTrainer

job_id = TrainerClient().train(
    trainer=CustomTrainer(
        func=train_pytorch,
        num_nodes=2,
        resources_per_node={
            "cpu": 3,
            "memory": "3Gi",
            # "gpu": 2, # Uncomment this line if you have GPUs.
        },
    ),
    runtime=TrainerClient().get_runtime("torch-distributed"),
)

Get the TrainJob Results

After creating the TrainJob, you should be able to list it:

for job in TrainerClient().list_jobs():
    print(f"TrainJob: {job.name}, Status: {job.status}")

TrainJob: q33a18f65635, Status: Created 

It may take a few minutes for the TrainJob to pull the PyTorch image the first time. Once the image is pulled, the TrainJob's steps should transition to Running status. Each step represents a training node, and the number of devices per step corresponds to the number of devices on that node.

for s in TrainerClient().get_job(name=job_id).steps:
    print(f"Step: {s.name}, Status: {s.status}, Devices: {s.device} x {s.device_count}")

Step: node-0, Status: Running, Devices: cpu x 3
Step: node-1, Status: Running, Devices: cpu x 3 

After the steps are running, you can check the TrainJob logs. The dataset of 60,000 samples is evenly distributed across the 6 CPUs, so each device processes 10,000 samples (60,000 / 6 = 10,000):

print(TrainerClient().get_job_logs(name=job_id)["node-0"])

...
Epoch 0 [8000/60000] Loss: 0.4476
Epoch 0 [9000/60000] Loss: 0.4784
Epoch 1 [0/60000] Loss: 0.3909
Epoch 1 [1000/60000] Loss: 0.4888
Epoch 1 [2000/60000] Loss: 0.4100
... 

Congratulations, you created your first distributed training job with PyTorch and Kubeflow Trainer!

What’s next

Kubeflow Trainer has an exciting roadmap ahead.

Call to Action

We are excited to welcome Kubeflow Trainer to the PyTorch ecosystem! Kubeflow Trainer democratizes AI model training on Kubernetes and significantly improves the development experience for AI practitioners. We invite you to explore the Kubeflow Trainer documentation and community channels to learn more about the project.

We can’t wait to see what you’ll build with Kubeflow Trainer!

FlagGems Joins the PyTorch Ecosystem: Triton-Powered Operator Library for Universal AI Acceleration
https://pytorch.org/blog/flaggems-joins-the-pytorch-ecosystem-triton-powered-operator-library-for-universal-ai-acceleration/
Wed, 25 Jun 2025
FlagGems

In the race to accelerate large language models across diverse AI hardware, FlagGems delivers a high-performance, flexible, and scalable solution. Built on the Triton language, FlagGems is a plugin-based PyTorch operator and kernel library designed to democratize AI compute. Its mission: to enable a write-once, JIT-everywhere experience, so developers can deploy optimized kernels effortlessly across a wide spectrum of hardware backends. FlagGems recently joined the PyTorch Ecosystem upon acceptance by the PyTorch Ecosystem Working Group.

With over 180 operators already implemented—spanning native PyTorch ops and widely used custom ops for large models—FlagGems is evolving fast to keep pace with the generative AI frontier.

To view the PyTorch Ecosystem, see the PyTorch Landscape and learn more about how projects can join the PyTorch Ecosystem.

Key Features

  • Extensive Operator Library: 180+ PyTorch-compatible operators and growing
  • Performance Optimized: Select operators hand-tuned for speed
  • Torch.compile Independent: Fully functional in eager mode
  • Pointwise Operator Codegen: Auto-generates kernels for arbitrary input types and layouts
  • Fast Kernel Dispatching: Per-function runtime dispatch logic
  • C++ Triton Dispatcher: In development for even faster execution
  • Multi-Backend Ready: Works across 10+ hardware platforms with a backend-neutral runtime API

Architecture

FlagGems extends the PyTorch dispatch system out-of-tree through a multi-backend library powered by Triton. It intercepts ATen operator calls and provides backend-specific Triton implementations, making it easy to support alternative GPUs and domain-specific accelerators (DSAs).
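
The mechanism is easiest to see with a minimal, illustrative sketch (not FlagGems' actual registration code): a Triton kernel is registered as the CUDA implementation of an ATen op via torch.library, so existing eager-mode calls to torch.mul are routed to it transparently. The sketch assumes contiguous, same-shape inputs and skips broadcasting for brevity.

import torch
import triton
import triton.language as tl

@triton.jit
def mul_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x * y, mask=mask)

def triton_mul(x, y):
    # Simplified: assumes x and y are contiguous CUDA tensors of the same shape.
    out = torch.empty_like(x)
    n = out.numel()
    mul_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out

# Register the Triton kernel as the CUDA implementation of aten::mul.Tensor,
# replacing the built-in CUDA kernel for eager-mode calls.
aten_lib = torch.library.Library("aten", "IMPL")
aten_lib.impl("mul.Tensor", triton_mul, "CUDA")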

Plug-and-Play

  • Registers with PyTorch’s dispatch system
  • Intercepts ATen operator calls
  • Seamlessly replaces CUDA operator implementations

Write Once, Compile Anywhere

  • Unified operator library code
  • Compilable on any backend with Triton support
  • Supports GPUs and heterogeneous chips like DSAs

Getting Started in 3 Steps

1. Install dependencies

pip install "torch>=2.2.0"   # 2.6.0 preferred

pip install "triton>=2.2.0"  # 3.2.0 preferred

2. Install FlagGems

git clone https://github.com/FlagOpen/FlagGems.git

cd FlagGems

pip install --no-build-isolation .

or editable install:

pip install --no-build-isolation -e .

3. Enable FlagGems in your project

import flag_gems

flag_gems.enable()  # Replaces supported PyTorch ops globally

Prefer finer control? Use a managed context:

with flag_gems.use_gems():

    output = model.generate(**inputs)

Need explicit ops?

out = flag_gems.ops.slice_scatter(inp, dim=0, src=src, start=0, end=10, step=1)

Automatic Codegen for Pointwise Ops

With the @pointwise_dynamic decorator, FlagGems can auto-generate efficient kernels with broadcast, fusion, and memory layout support. Here’s an example implementing fused GeLU and element-wise multiplication:

import triton
import triton.language as tl
# pointwise_dynamic, tanh, and pow are provided by FlagGems' codegen utilities.

@pointwise_dynamic(promotion_methods=[(0, 1, "DEFAULT")])
@triton.jit
def gelu_tanh_and_mul_kernel(x, y):
    x_fp32 = x.to(tl.float32)
    x_gelu = 0.5 * x_fp32 * (1 + tanh(x_fp32 * 0.79788456 * (1 + 0.044715 * pow(x_fp32, 2))))
    return x_gelu * y
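
Assuming the decorator turns the kernel into a callable that accepts ordinary PyTorch tensors (an assumption about the generated interface, not a documented signature), invoking the fused op looks like plain eager code:

x = torch.randn(4, 1024, device="cuda")
y = torch.randn(4, 1024, device="cuda")
out = gelu_tanh_and_mul_kernel(x, y)  # fused GeLU(x) * y in a single Triton launch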

Performance Validation

FlagGems includes built-in testing and benchmarking:

  • Accuracy Testing

cd tests
pytest test_<op_name>_ops.py --ref cpu

  • Performance Benchmarks

cd benchmark
pytest test_<op_name>_perf.py -s             # CUDA microbenchmarks
pytest test_<op_name>_perf.py -s --mode cpu  # End-to-end comparison

Benchmark Results

FlagGems Speedup

Initial benchmark results of FlagGems showcase its performance against PyTorch's native operator implementations. The results represent the average measured speedups, where a value greater than 1 indicates that FlagGems is faster than the native PyTorch operator. For the vast majority of operators, FlagGems either matches or significantly surpasses the performance of PyTorch's native implementations.

For a significant portion of the 180+ operators, FlagGems achieves a speedup close to 1.0, indicating performance on par with the native PyTorch implementations.

Some of the core operations like LAYERNORM, CROSS_ENTROPY_LOSS, ADDMM, and SOFTMAX also show impressive speedups.

Multi-Backend Support

FlagGems is vendor-flexible and backend-aware:

Set the desired backend with:

export GEMS_VENDOR=<vendor>

Check active backend in Python:

import flag_gems

print(flag_gems.vendor_name)


Summary

FlagGems delivers a unified kernel library for large model acceleration that bridges software portability and hardware performance. With broad backend support, a growing op set, and advanced codegen features, it's your go-to Triton playground for pushing the limits of AI compute.
