Monitoring Kubernetes with Grafana Cloud: Integrating Grafana Agent Operator for Enhanced Insights
Improve Grafana Cloud integration with OKE/Kubernetes. Enhance robustness, scalability, and customization. Gain deeper visibility and control over your containerized applications and infrastructure.
Integrating Grafana Cloud with Oracle Kubernetes Engine (OKE) or any other Kubernetes cluster gives you powerful monitoring and observability for your containerized applications. This post walks through the integration process, the resources involved, and why they matter.
I also highlight the key changes you can make to harden the integration compared to the setup Grafana Cloud provides, and I address common issues that may arise during the integration process. Let's begin.
Setting up Kubernetes Monitoring with Grafana Agent Operator
In this section, you will learn how to deploy Grafana Agent Operator in a Kubernetes cluster for Kubernetes Monitoring. The Agent Operator automatically sets up and configures Grafana Agent using Kubernetes custom resource objects.
I'll be following the Agent Operator Configuration instructions in the Grafana Cloud documentation, which explain how to configure the Agent Operator for monitoring Kubernetes clusters. Here's an overview of the instructions:
- Installing the Operator: The documentation explains how to install the Kubernetes Agent Operator, which is responsible for deploying and managing the Grafana Agent in your cluster.
- Creating a Secret: To authenticate the Grafana Agent with Grafana Cloud, you need to create a Kubernetes Secret containing the necessary credentials. The documentation describes the required fields and how to create the Secret.
- Configuring the Operator: You need to configure the Agent Operator to specify details such as the organization and namespace to associate the Agent with. This configuration enables the Operator to deploy and manage the Agent effectively.
- Deploying the Agent: The documentation provides instructions for deploying the Grafana Agent using the Agent Operator. This includes installing the Operator's custom resource definitions (CRDs) and applying a YAML configuration file with custom resources that specify the desired Agent settings.
- Customizing Agent Configuration: You can customize the configuration of the Grafana Agent based on your monitoring requirements. The documentation explains how to modify the Agent's YAML configuration file to adjust various settings such as scraping intervals, enabling specific metrics, or configuring service discovery.
- Verifying Agent Deployment: After deploying the Agent, you can verify its status and check for potential errors. The documentation outlines the steps to confirm that the Agent is running and successfully communicating with Grafana Cloud.
- Updating Agent Configuration: If you need to modify the Agent's configuration after deployment, the documentation provides instructions on how to update the YAML configuration file and apply the changes using the Agent Operator.
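To make the "Deploying the Agent" step more concrete, here is a minimal sketch of a GrafanaAgent custom resource of the kind the Operator reconciles. It assumes the `monitoring.grafana.com/v1alpha1` CRDs are already installed. The resource name, the `agent: grafana-agent` label, and the service account name are illustrative assumptions, so compare it against the configuration Grafana Cloud generates for you rather than applying it as-is.

```yaml
# Sketch only: resource name, labels, and service account are assumptions.
apiVersion: monitoring.grafana.com/v1alpha1
kind: GrafanaAgent
metadata:
  name: grafana-agent
  namespace: default              # namespace for the monitoring resources
spec:
  logLevel: info
  serviceAccountName: grafana-agent   # assumed to exist with suitable RBAC
  metrics:
    # Pick up MetricsInstance resources carrying this label.
    instanceSelector:
      matchLabels:
        agent: grafana-agent
    # Label attached to every metric this Agent sends.
    externalLabels:
      cluster: my-cluster
  logs:
    # Pick up LogsInstance resources carrying this label.
    instanceSelector:
      matchLabels:
        agent: grafana-agent
```

The Operator watches resources like this one and generates the actual Agent workloads from them, which is why later configuration changes are made by editing and re-applying the YAML rather than touching the pods directly.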
If you have followed the instructions, you will end up with a YAML file to apply to your cluster.
What To Update
1. `cluster: my-cluster`: update this to reflect your cluster name.
2. `namespace: default`: update this to the namespace where you want to install all monitoring resources.
3. Update the `metrics-secret`. I suggest removing this Secret from the configuration YAML, creating it as a separate resource, and managing it with external-secret. If you are not familiar with external-secret, review my post "Simplifying Kubernetes Secrets Management with External Secrets".
To collect the API token for monitoring purposes, follow these steps:
- Log in to the monitoring service or API provider's website.
- Navigate to the account settings or API access section.
- Generate a new API token by clicking on the "Generate Token" or similar button.
- Copy the generated API token to your clipboard.
- Replace the `REPLACE_WITH_API_TOKEN` placeholder in the configuration file with the copied API token.
- Save the updated configuration file and proceed with the installation or update process.
By following these steps, you will have the API token needed to configure `metrics-secret` in the Kubernetes configuration.
4. Update the `logs-secret`. I suggest removing this Secret from the configuration YAML, creating it as a separate resource, and managing it with external-secret. If you are not familiar with external-secret, review my post "Simplifying Kubernetes Secrets Management with External Secrets".
apiVersion: v1
kind: Secret
metadata:
  name: logs-secret
  namespace: default
type: Opaque
stringData:
  password: "REPLACE_WITH_API_TOKEN"
  username: "131313"
Follow the process below to collect the value for `REPLACE_WITH_API_TOKEN`:
- Identify the service or API provider associated with the API token.
- Log in to the provider's website or access the platform.
- Generate a new API token with the required permissions.
- Copy the generated API token.
- Open the configuration file.
- Replace `"REPLACE_WITH_API_TOKEN"` in the `password` field with the copied API token.
- If needed, update the `username` field with the appropriate identifier.
- Save the updated configuration file.
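As a sketch of the external-secret suggestion above, the inline `metrics-secret` could be replaced by an ExternalSecret that the External Secrets Operator turns into a regular Kubernetes Secret. This assumes the External Secrets Operator is installed and serves the `external-secrets.io/v1beta1` API (adjust the `apiVersion` to whatever your installed version provides). The store name `my-secret-store` and the remote key `grafana-cloud/metrics` are hypothetical and depend entirely on your secret backend.

```yaml
# Sketch only: the store name and remote keys are hypothetical placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: metrics-secret
  namespace: default
spec:
  refreshInterval: 1h               # how often to re-sync from the backend
  secretStoreRef:
    kind: ClusterSecretStore
    name: my-secret-store           # hypothetical store, configured separately
  target:
    name: metrics-secret            # Kubernetes Secret the operator creates
    creationPolicy: Owner
  data:
    - secretKey: username           # key in the resulting Secret
      remoteRef:
        key: grafana-cloud/metrics  # hypothetical entry in the backend
        property: username
    - secretKey: password
      remoteRef:
        key: grafana-cloud/metrics
        property: password
```

The same pattern applies to `logs-secret`: only the metadata name, the target name, and the remote key change. The benefit is that the API token never appears in the monitoring YAML you keep in version control.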
What Does the Data Include
The collected telemetry data and the components that collect it include:
- Grafana Agent, a logs instance, a metrics instance, and pod logs.
- An installation of kube-state-metrics for Kubernetes metrics (make sure it is not already installed).
- A KSM monitor, a ServiceMonitor that collects metrics from kube-state-metrics.
- The Agent event handler integration, which collects Kubernetes events.
- The Node Exporter integration, which collects metrics from Kubernetes nodes.
- The "kubelet" and "cadvisor" ServiceMonitors.
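To illustrate how one of these pieces is wired up, here is a sketch of a LogsInstance that ships pod logs using the credentials stored in `logs-secret`. The resource name, labels, and selectors are assumptions modeled on typical Agent Operator configurations, and the push URL is a placeholder rather than a real endpoint; use the Loki URL shown in your own Grafana Cloud account.

```yaml
# Sketch only: name, labels, selectors, and URL are illustrative assumptions.
apiVersion: monitoring.grafana.com/v1alpha1
kind: LogsInstance
metadata:
  name: grafana-agent-logs
  namespace: default
  labels:
    agent: grafana-agent            # must match the GrafanaAgent's logs selector
spec:
  clients:
    - url: https://logs.example.com/loki/api/v1/push   # placeholder endpoint
      externalLabels:
        cluster: my-cluster
      basicAuth:
        username:
          name: logs-secret         # Secret holding the credentials
          key: username
        password:
          name: logs-secret
          key: password
  # Discover PodLogs resources in all namespaces that carry this label.
  podLogsNamespaceSelector: {}
  podLogsSelector:
    matchLabels:
      instance: primary
```

The metrics instance follows the same shape, with a remote-write URL and `metrics-secret` in place of the Loki client and `logs-secret`.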
Problems Identified
I ran into several issues while setting up and configuring the monitoring system, and identified them by examining the configuration and observing how the system behaved. They range from a malfunctioning persistent volume dashboard in Grafana Cloud to high cardinality of the labels attached to metrics. I also recognized the need to separate the `metrics-secret` and `logs-secret` from the main configuration and manage them with external-secret for improved security and flexibility.
- The persistent volume dashboard provided by Kubernetes Monitoring in Grafana Cloud does not work. This could be due to misconfiguration or to compatibility problems between the Kubernetes cluster and Grafana Cloud. Further investigation and troubleshooting are required to identify the exact cause and resolve the issue.
- The labels across all whitelisted metrics have high cardinality. High cardinality means there is a large number of unique label values, which can hurt the performance and efficiency of the monitoring system. To reduce it, you can drop the labels below from all integrations:
- action: labeldrop
  regex: (id|uid|service|endpoint|metrics_path|name|pod_ip|owner_name|created_by_name)
The "Top labels by value count" widget provides an overview of the labels and their corresponding value counts.
Consider the following steps to review the widget:
- Access your Grafana Cloud account or the Grafana instance where the dashboard is hosted.
- Locate the 'GrafanaCloud/Cardinality management - 1 - an overview' dashboard.
- Navigate to the dashboard and locate the specific widget titled "Top labels by value count".
- Analyze the labels displayed in the widget and their corresponding value counts.
- Identify labels that have a large number of unique values, indicating high cardinality.
- Compare the identified labels with the regex `(id|uid|service|endpoint|metrics_path|name|pod_ip|owner_name|created_by_name)` to determine which of them match the criteria for dropping.
- Based on the analysis, finalize the list of labels to drop from the provided regex.
Reviewing this widget helps you make an informed decision about which labels to drop. Dropping labels with a large number of unique values reduces the cardinality of your metrics, which can improve performance and resource utilization.
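As a rough sketch of where the `labeldrop` rule could live, the ServiceMonitor below attaches it under `metricRelabelings`, so the labels are removed from scraped samples before they are sent on. The monitor name, the selector labels, and the port name are assumptions for a typical kube-state-metrics setup; the regex is the one from this post.

```yaml
# Sketch only: monitor name, selector labels, and port name are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ksm-monitor
  namespace: default
  labels:
    instance: primary               # assumed label the metrics instance selects on
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  endpoints:
    - port: http-metrics            # assumed name of the Service port
      metricRelabelings:
        # Remove high-cardinality labels from every scraped sample.
        - action: labeldrop
          regex: (id|uid|service|endpoint|metrics_path|name|pod_ip|owner_name|created_by_name)
```

Keep in mind that dropping a label can make previously distinct series identical, so check that the remaining labels still uniquely identify each series before applying the rule to every integration.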
List Of Resources
Here's a table listing the shared resources that can be created on Kubernetes as part of an integration:
Resource | Description |
---|---|
Deployment | Manages the deployment and scaling of containerized applications, ensuring the desired number of replicas are running. |
Service | Exposes applications running on pods and provides a stable network endpoint for accessing them within the cluster. |
Ingress | Routes external traffic to services based on defined rules, enabling HTTP and HTTPS access to applications. |
ConfigMap | Stores configuration data as key-value pairs that can be accessed by containers running in pods. |
Secret | Stores sensitive information like credentials, API keys, or TLS certificates. Values are only base64-encoded by default, so encryption at rest must be enabled on the cluster. |
PersistentVolume | Provides persistent storage to applications by abstracting underlying storage infrastructure, allowing data to persist across pod restarts. |
PersistentVolumeClaim | Requests storage on behalf of a pod and binds to a matching PersistentVolume, which can be provisioned dynamically. |
Job | Runs batch processes or one-time tasks to completion, ensuring a certain number of successful completions before terminating. |
CronJob | Schedules jobs to run periodically based on a defined cron-like syntax, automating recurring tasks. |
StatefulSet | Manages the deployment and scaling of stateful applications, providing stable network identities and persistent storage. |
DaemonSet | Ensures a pod runs on each node in the cluster, typically used for running agents or cluster-level services. |
ServiceAccount | Provides an identity for pods and controls the permissions and access to resources within the cluster. |
Role and RoleBinding | Defines granular permissions for accessing and modifying resources within namespaces. |
Namespace | Creates logical partitions within a cluster, allowing multiple teams or applications to run independently. |
PodDisruptionBudget | Ensures the availability of applications during disruptions by specifying the minimum number of available pods. |
NetworkPolicy | Defines network rules to control the traffic flow between pods, enhancing security and isolation. |
These resources enable seamless integration, scalability, and efficient management of applications on Kubernetes, providing a solid foundation for building robust and reliable systems. The specific resources required and their configurations will depend on the integration requirements and the architecture of the applications being integrated.
Want To Make Integration More Robust?
To make the integration more robust compared to the Grafana Cloud-provided setup, here are some changes that can be implemented:
- Self-Hosted Grafana: Instead of relying on Grafana Cloud, you can deploy and manage a self-hosted Grafana instance. This provides more control and flexibility over the configuration, maintenance, and scalability of your monitoring solution.
- High Availability and Scalability: Implement a high availability and scalable setup for Grafana and related components. Utilize multiple Grafana instances in a cluster with load balancing to ensure redundancy and handle increased traffic and workload.
- Multi-Cluster Support: Extend the integration to monitor multiple Kubernetes clusters. This can involve deploying Grafana instances and Grafana Agents in each cluster and setting up cross-cluster visibility and monitoring.
- Backup and Disaster Recovery: Implement regular backup and disaster recovery procedures for Grafana configurations, dashboards, and data sources. Ensure that you have a well-defined backup strategy and mechanisms in place to recover from any data loss or system failures.
- Custom Dashboards and Alerts: Create custom dashboards tailored to your specific integration requirements. Design informative visualizations and configure alerting rules to proactively detect and notify about any anomalies or critical issues in your integrated system.
- Advanced Data Sources: Explore and leverage additional data sources beyond the default ones provided by Grafana. Integrate with specialized databases, logging systems, or external APIs to capture and visualize more comprehensive data related to your integrated system.
- Security Hardening: Implement robust security measures to protect the monitoring infrastructure. Secure communication channels using encryption, enforce strong access controls and authentication mechanisms, regularly update and patch all components, and conduct regular security audits and vulnerability assessments.
- Integration Testing: Develop and execute integration tests to validate the functionality and performance of the integrated system. This helps identify any compatibility issues, performance bottlenecks, or inconsistencies between the integrated components.
- Automated Configuration Management: Manage the configuration of Grafana and related components as code, for example with Helm charts or Kubernetes ConfigMaps kept in version control. This enables automated deployment, scaling, and rollbacks, reducing the chances of misconfigurations and easing the management process.
- Monitoring and Logging for the Integration: Set up dedicated monitoring and logging for the integration components themselves. Monitor the health, performance, and logs of the Grafana instances, Grafana Agents, and any additional components or services added to the integration.
Conclusion
Integrating Grafana Cloud with OKE/Kubernetes unlocks powerful monitoring capabilities for your containerized applications. By understanding the resources involved, making necessary changes to enhance robustness, and addressing identified issues, you can create a reliable and efficient monitoring solution. Leverage the flexibility and scalability of Kubernetes alongside Grafana Cloud to gain valuable insights and observability into your applications and infrastructure.
Resources
If you're just getting started with Grafana, make sure to check out my blog post: How to Use Grafana Cloud for Advanced Kubernetes Monitoring and Analysis
Hi! I am Safoor Safdar, a Senior SRE. Don't hesitate to reach out! You can find me on LinkedIn, or simply drop me an email at me@safoorsafdar.com