Ingero, an open-source tool for GPU observability and incident response, has released version 0.9.1, which adds features for multi-node training environments. Here is a summary of what the update brings:
Key Features
Cluster-Level Tracing
- Fan-Out Queries: The ability to run queries across multiple nodes simultaneously, providing insights into the entire cluster.
- Offline Merge Path: Allows merging databases from individual nodes offline for investigation in air-gapped or disconnected environments.
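To make the offline merge idea concrete, here is a minimal sketch of combining per-node databases into one file for investigation. It is not Ingero's actual merge path: it assumes each node's database is a SQLite file, and the `events(ts_ns, kind, detail)` table, the `node` column, and the function name `merge_node_databases` are all hypothetical names chosen for illustration.

```python
import sqlite3


def merge_node_databases(node_dbs, merged_path=":memory:"):
    """Merge per-node SQLite files into one database, tagging rows by node.

    node_dbs: mapping of node name -> path to that node's SQLite file.
    Assumes a hypothetical table `events(ts_ns INTEGER, kind TEXT, detail TEXT)`
    in every node database; this schema is an illustration, not Ingero's.
    """
    merged = sqlite3.connect(merged_path)
    merged.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "node TEXT, ts_ns INTEGER, kind TEXT, detail TEXT)"
    )
    for node, path in node_dbs.items():
        # Attach the node's file so its rows can be copied with plain SQL.
        merged.execute("ATTACH DATABASE ? AS src", (path,))
        merged.execute(
            "INSERT INTO events (node, ts_ns, kind, detail) "
            "SELECT ?, ts_ns, kind, detail FROM src.events",
            (node,),
        )
        # Commit before detaching: SQLite refuses to detach inside a transaction.
        merged.commit()
        merged.execute("DETACH DATABASE src")
    return merged
```

Once merged, a single query such as `SELECT node, ts_ns, kind FROM events ORDER BY ts_ns` gives one timeline across nodes, which is the kind of view an air-gapped investigation would need. Timestamps from different nodes are only comparable after accounting for clock skew.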
AI-Assisted Investigation (MCP)
- Query Fleet Tool: An interface that allows an AI assistant to query the fleet directly without SSH access. Supports actions like chains (causal analysis), sql, ops, and overview.
Technical Details
- Clock Skew Measurement: Fan-out queries measure clock skew between nodes so that cross-node correlation can account for timing differences.
- Partial Failure Handling: Results from reachable nodes are returned even if some nodes fail to respond, giving partial insight and identifying the problematic nodes.
- Cross-Node Correlation: Work is in progress to correlate events across multiple nodes, enabling more sophisticated
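
The fan-out behavior described in the technical details above (per-node skew measurement plus partial-failure handling) can be sketched roughly as follows. This is an illustration under stated assumptions, not Ingero's implementation: `fetch` is a hypothetical caller-supplied transport that returns the remote node's clock reading together with its rows, and the skew estimate assumes symmetric network delay, as in a single NTP-style exchange.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def estimate_skew(local_send, remote_time, local_recv):
    """Estimate the remote clock offset in seconds.

    Compares the remote timestamp with the midpoint of the local send and
    receive times, which is exact only if network delay is symmetric.
    """
    return remote_time - (local_send + local_recv) / 2.0


def query_node(node, query, fetch):
    """Run one query against one node and measure its clock skew."""
    sent = time.time()
    remote_time, rows = fetch(node, query)  # hypothetical transport callable
    received = time.time()
    return {"rows": rows, "skew_s": estimate_skew(sent, remote_time, received)}


def fan_out(nodes, query, fetch, timeout_s=5.0):
    """Query all nodes concurrently and keep partial results on failure.

    Returns (results, failures): results maps reachable nodes to their rows
    and skew; failures maps the remaining nodes to an error description.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        futures = {n: pool.submit(query_node, n, query, fetch) for n in nodes}
        for node, future in futures.items():
            try:
                results[node] = future.result(timeout=timeout_s)
            except Exception as exc:  # unreachable node, timeout, bad reply
                failures[node] = repr(exc)
    # Note: leaving the `with` block still joins worker threads, so a real
    # transport should enforce its own network timeout as well.
    return results, failures
```

Returning failures alongside results, rather than raising on the first error, is what lets a caller see both the partial cluster view and which nodes need attention.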
Read the full article at DEV Community