CMS Workflow Management DevOps

Developing, operating, and extending WMCore + Unified so that CMS Monte Carlo and reconstruction workloads stay healthy across WLCG sites.

CMS Workflow Operations

I maintain the workflow management stack (WMCore + Unified) for the CMS experiment. This system handles scheduling for Monte Carlo production and data reconstruction across the Worldwide LHC Computing Grid (WLCG).

Key Contributions

  • Operational Console: Built a React and FastAPI interface for operators. It replaces manual shell scripts, making it easier to triage requests, find bottlenecks, and update policies.
  • Modern Deployment: Moved critical services to Kubernetes with Argo CD, which makes updates safer and rollbacks quick.
  • Monitoring: Created dashboards in OpenSearch and Grafana to track telemetry from WMCore and our databases (Oracle, MongoDB, MySQL). This gives us visibility into dataset delays before they become a problem.
  • Automation: Added automation for priority and site policies, reducing the need for manual intervention.

Outcomes

  • Faster Turnarounds: Reduced dataset processing time by automating routine tasks.
  • Recognition: Received the 2024 CMS Award for modernizing the software base and improving tooling for the production teams.
  • Stability: Established clear operational policies that align physicists and engineers, keeping the pipeline predictable.
