Longhorn, Enterprise-Grade Cloud-Native Distributed Container Storage: Monitoring
Published: 2021-11-03 13:46
Source: 黑客下午茶
Author: 為少
This document provides an example setup for monitoring Longhorn. The monitoring system uses Prometheus for collecting data and alerting, and Grafana for visualizing/dashboarding the collected data.
Contents

- Setting up Prometheus and Grafana to monitor Longhorn
- Integrating Longhorn metrics into the Rancher monitoring system
- Longhorn monitoring metrics
- Support for Kubelet volume metrics
- Longhorn alert rule examples
Setting up Prometheus and Grafana to monitor Longhorn

Overview
Longhorn natively exposes metrics in Prometheus text format on the REST endpoint http://LONGHORN_MANAGER_IP:PORT/metrics. See Longhorn's metrics for a description of all available metrics. You can use any collecting tool, such as Prometheus, Graphite, or Telegraf, to scrape these metrics, and then visualize the collected data with a tool such as Grafana.
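Before wiring up any tooling, you can sanity-check the endpoint directly. A minimal sketch, assuming a default Longhorn install where the backend service is named longhorn-backend in the longhorn-system namespace and the manager listens on port 9500:

```bash
# Forward the Longhorn backend service locally (runs until interrupted).
kubectl -n longhorn-system port-forward svc/longhorn-backend 9500:9500 &

# Fetch a sample of the metrics in Prometheus text format.
curl -s http://localhost:9500/metrics | head -n 20
```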
This document provides an example setup for monitoring Longhorn. The monitoring system uses Prometheus for collecting data and alerting, and Grafana for visualizing/dashboarding the collected data. At a high level, the monitoring system consists of:
- The Prometheus server, which scrapes and stores time-series data from the Longhorn metrics endpoint. Prometheus is also responsible for generating alerts based on the configured rules and the collected data. The Prometheus server then sends alerts to Alertmanager.
- Alertmanager, which manages those alerts, including silencing, inhibition, aggregation, and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
- Grafana, which queries the Prometheus server for data and draws dashboards for visualization.
The following figure describes the detailed architecture of the monitoring system.

There are two components in the figure above that have not been mentioned yet:
- The Longhorn backend service is the service pointing to the set of Longhorn manager pods. Longhorn's metrics are exposed by the Longhorn manager pods at the endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
- The Prometheus Operator makes running Prometheus on top of Kubernetes very easy. The operator watches three custom resources: ServiceMonitor, Prometheus, and AlertManager. When users create those custom resources, the Prometheus Operator deploys and manages the Prometheus server and AlertManager with the user-specified configuration.
Installation

Following these instructions will install all components into the monitoring namespace. To install them into a different namespace, change the field namespace: OTHER_NAMESPACE.

Create the monitoring namespace:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```
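Each manifest in this walkthrough is applied the same way. A usage sketch, assuming you save the snippet above as monitoring-namespace.yaml (the file name is arbitrary):

```bash
# Create the namespace, then confirm it exists.
kubectl apply -f monitoring-namespace.yaml
kubectl get namespace monitoring
```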
Install the Prometheus Operator

Deploy the Prometheus Operator along with the ClusterRole, ClusterRoleBinding, and ServiceAccount it requires:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
rules:
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - create
- apiGroups:
  - apiextensions.k8s.io
  resourceNames:
  - alertmanagers.monitoring.coreos.com
  - podmonitors.monitoring.coreos.com
  - prometheuses.monitoring.coreos.com
  - prometheusrules.monitoring.coreos.com
  - servicemonitors.monitoring.coreos.com
  - thanosrulers.monitoring.coreos.com
  resources:
  - customresourcedefinitions
  verbs:
  - get
  - update
- apiGroups:
  - monitoring.coreos.com
  resources:
  - alertmanagers
  - alertmanagers/finalizers
  - prometheuses
  - prometheuses/finalizers
  - thanosrulers
  - thanosrulers/finalizers
  - servicemonitors
  - podmonitors
  - prometheusrules
  verbs:
  - '*'
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - configmaps
  - secrets
  verbs:
  - '*'
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - list
  - delete
- apiGroups:
  - ""
  resources:
  - services
  - services/finalizers
  - endpoints
  verbs:
  - get
  - create
  - update
  - delete
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - namespaces
  verbs:
  - get
  - list
  - watch
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
  template:
    metadata:
      labels:
        app.kubernetes.io/component: controller
        app.kubernetes.io/name: prometheus-operator
        app.kubernetes.io/version: v0.38.3
    spec:
      containers:
      # NOTE: the container args were lost when this article was extracted.
      # The flags below are an assumption, restored from the upstream
      # prometheus-operator v0.38.3 bundle; verify them against your copy.
      - args:
        - --kubelet-service=kube-system/kubelet
        - --logtostderr=true
        - --config-reloader-image=jimmidyson/configmap-reload:v0.3.0
        - --prometheus-config-reloader=quay.io/prometheus-operator/prometheus-config-reloader:v0.38.3
        image: quay.io/prometheus-operator/prometheus-operator:v0.38.3
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 200Mi
          requests:
            cpu: 100m
            memory: 100Mi
        securityContext:
          allowPrivilegeEscalation: false
      nodeSelector:
        beta.kubernetes.io/os: linux
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
    app.kubernetes.io/version: v0.38.3
  name: prometheus-operator
  namespace: monitoring
spec:
  clusterIP: None
  ports:
  - name: http
    port: 8080
    targetPort: http
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: prometheus-operator
```
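Once the operator is running, it registers the monitoring.coreos.com CRDs it watches. A quick verification sketch, using the resource names deployed above:

```bash
# The operator pod should reach Running state.
kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus-operator

# The CRDs watched by the operator should appear shortly after startup.
kubectl get crd | grep monitoring.coreos.com
```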
Install the Longhorn ServiceMonitor

The Longhorn ServiceMonitor has a label selector app: longhorn-manager to select the Longhorn backend service. Later on, the Prometheus CRD can include the Longhorn ServiceMonitor so that the Prometheus server can discover all Longhorn manager pods and their endpoints.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: monitoring
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager
```
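The selector above only works if the Longhorn backend service actually carries the app: longhorn-manager label and exposes a port named manager, which is the case for a default Longhorn install. You can confirm this before moving on:

```bash
# Show the Longhorn backend service with its labels, and its named ports.
kubectl -n longhorn-system get svc longhorn-backend --show-labels
kubectl -n longhorn-system get svc longhorn-backend -o jsonpath='{.spec.ports}'
```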
Install and configure Prometheus Alertmanager

Create a highly available Alertmanager deployment with 3 instances:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: longhorn
  namespace: monitoring
spec:
  replicas: 3
```
The Alertmanager instances will not start unless a valid configuration is provided. See here for more explanation of Alertmanager configuration. The following code gives an example configuration:
```yaml
global:
  resolve_timeout: 5m
route:
  group_by: [alertname]
  receiver: email_and_slack
receivers:
- name: email_and_slack
  email_configs:
  - to: <the email address to send notifications to>
    from: <the sender address>
    smarthost: <the SMTP host through which emails are sent>
    # SMTP authentication information.
    auth_username: <the username>
    auth_identity: <the identity>
    auth_password: <the password>
    headers:
      subject: 'Longhorn-Alert'
    text: |-
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
  slack_configs:
  - api_url: <the Slack webhook URL>
    channel: <the channel or user to send notifications to>
    text: |-
      {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
      *Description:* {{ .Annotations.description }}
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
      {{ end }}
```
Save the above Alertmanager configuration in a file called alertmanager.yaml and create a secret from it with kubectl.

Alertmanager instances require the secret resource naming to follow the format alertmanager-{ALERTMANAGER_NAME}. In the previous step, the name of the Alertmanager is longhorn, so the secret name must be alertmanager-longhorn:
```bash
# Run in the directory containing the alertmanager.yaml saved above.
$ kubectl create secret generic alertmanager-longhorn --from-file=alertmanager.yaml -n monitoring
```
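If the configuration is valid, the operator starts the Alertmanager pods shortly after the secret exists. A quick check, using the names chosen above (the operator labels the pods with alertmanager: longhorn, which is also what the Service below selects on):

```bash
# The three Alertmanager replicas should appear and reach Running state.
kubectl -n monitoring get pods -l alertmanager=longhorn
```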
To be able to view the web UI of Alertmanager, expose it through a Service. A simple way to do this is to use a Service of type NodePort:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: alertmanager-longhorn
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30903
    port: 9093
    protocol: TCP
    targetPort: web
  selector:
    alertmanager: longhorn
```
After creating the above service, you can access the web UI of Alertmanager via a node's IP on port 30903.

Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of Alertmanager over a TLS connection.
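If you would rather not open a NodePort at all during verification, a port-forward gives equivalent access from your workstation, using the service name created above:

```bash
# Forward the Alertmanager web port locally, then browse http://localhost:9093.
kubectl -n monitoring port-forward svc/alertmanager-longhorn 9093:9093
```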
Install and configure the Prometheus server

Create a PrometheusRule custom resource to define the alert conditions:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeUsageCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% used for
          more than 5 minutes.
        summary: Longhorn volume capacity is over 90% used.
      expr: 100 * (longhorn_volume_usage_bytes / longhorn_volume_capacity_bytes) > 90
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} usage on {{$labels.node}} is critical.
        severity: critical
```
For more information on how to define alert rules, see https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules

If RBAC authorization is activated, create a ClusterRole and ClusterRoleBinding for the Prometheus pods:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
  namespace: monitoring
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
```
Create a Prometheus custom resource. Notice that we select the Longhorn service monitor and the Longhorn rules in the spec:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2
  serviceAccountName: prometheus
  alerting:
    alertmanagers:
    - namespace: monitoring
      name: alertmanager-longhorn
      port: web
  serviceMonitorSelector:
    matchLabels:
      name: longhorn-prometheus-servicemonitor
  ruleSelector:
    matchLabels:
      prometheus: longhorn
      role: alert-rules
```
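As with Alertmanager, the operator turns this resource into a StatefulSet. A verification sketch, using the names chosen above (the operator labels the pods with prometheus: prometheus, which the Service below also selects on):

```bash
# The Prometheus custom resource and the two replica pods it produces.
kubectl -n monitoring get prometheus
kubectl -n monitoring get pods -l prometheus=prometheus
```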
To be able to view the web UI of the Prometheus server, expose it through a Service. A simple way to do this is to use a Service of type NodePort:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: monitoring
spec:
  type: NodePort
  ports:
  - name: web
    nodePort: 30904
    port: 9090
    protocol: TCP
    targetPort: web
  selector:
    prometheus: prometheus
```
After creating the above service, you can access the web UI of the Prometheus server via a node's IP on port 30904.

At this point, you should be able to see all Longhorn manager targets as well as the Longhorn rules in the Targets and Rules sections of the Prometheus server UI.

Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Prometheus server over a TLS connection.
Install Grafana

Create the Grafana datasource configuration:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  prometheus.yaml: |-
    {
      "apiVersion": 1,
      "datasources": [
        {
          "access": "proxy",
          "editable": true,
          "name": "prometheus",
          "orgId": 1,
          "type": "prometheus",
          "url": "http://prometheus:9090",
          "version": 1
        }
      ]
    }
```
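The datasource url above resolves through the prometheus Service created earlier, since Grafana runs in the same monitoring namespace. If you want to confirm reachability before deploying Grafana, one sketch is a throwaway curl pod against Prometheus's standard /-/healthy endpoint (the curlimages/curl image is an example choice):

```bash
# Run a one-off pod in the monitoring namespace and hit the Prometheus health endpoint.
kubectl -n monitoring run curl-test --rm -it --restart=Never \
  --image=curlimages/curl -- curl -s http://prometheus:9090/-/healthy
```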
Create the Grafana deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
  labels:
    app: grafana
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      name: grafana
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:7.1.5
        ports:
        - name: grafana
          containerPort: 3000
        resources:
          limits:
            memory: "500Mi"
            cpu: "300m"
          requests:
            memory: "500Mi"
            cpu: "200m"
        volumeMounts:
        - mountPath: /var/lib/grafana
          name: grafana-storage
        - mountPath: /etc/grafana/provisioning/datasources
          name: grafana-datasources
          readOnly: false
      volumes:
      - name: grafana-storage
        emptyDir: {}
      - name: grafana-datasources
        configMap:
          defaultMode: 420
          name: grafana-datasources
```
Expose Grafana on NodePort 32000:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  type: NodePort
  ports:
  - port: 3000
    targetPort: 3000
    nodePort: 32000
```
Use the above NodePort service for quick verification only, because it doesn't communicate over a TLS connection. You may want to change the service type to ClusterIP and set up an Ingress controller to expose Grafana over a TLS connection.

Access the Grafana dashboard using any node IP on port 32000. The default credentials are:

- User: admin
- Pass: admin
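Since these defaults are well known, change them for anything beyond a throwaway test. One option is to preset the admin password through Grafana's standard GF_SECURITY_ADMIN_PASSWORD environment variable in the deployment above; a sketch, where the grafana-admin secret is a hypothetical one you would create yourself:

```yaml
# Add to the grafana container spec in the Deployment above.
env:
- name: GF_SECURITY_ADMIN_PASSWORD
  valueFrom:
    secretKeyRef:
      # Hypothetical secret, e.g. created with:
      #   kubectl -n monitoring create secret generic grafana-admin --from-literal=admin-password=<password>
      name: grafana-admin
      key: admin-password
```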
Install the Longhorn dashboard

Once inside Grafana, import the prebuilt Longhorn dashboard: https://grafana.com/grafana/dashboards/13032

See https://grafana.com/docs/grafana/latest/reference/export_import/ for instructions on how to import a Grafana dashboard.

If successful, you should see the following dashboard:
Integrating Longhorn metrics into the Rancher monitoring system

About the Rancher monitoring system

Using Rancher, you can monitor the state and processes of your cluster nodes, Kubernetes components, and software deployments through integration with Prometheus, a leading open-source monitoring solution.

See https://rancher.com/docs/rancher/v2.x/en/monitoring-alerting/ for instructions on how to deploy/enable the Rancher monitoring system.
Add Longhorn metrics to the Rancher monitoring system

If you use Rancher to manage your Kubernetes cluster and have already enabled Rancher monitoring, you can add Longhorn metrics to Rancher monitoring by simply deploying the following ServiceMonitor:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: longhorn-prometheus-servicemonitor
  namespace: longhorn-system
  labels:
    name: longhorn-prometheus-servicemonitor
spec:
  selector:
    matchLabels:
      app: longhorn-manager
  namespaceSelector:
    matchNames:
    - longhorn-system
  endpoints:
  - port: manager
```
Once the ServiceMonitor is created, Rancher will automatically discover all Longhorn metrics. You can then set up a Grafana dashboard for visualization.
Longhorn monitoring metrics

The available metrics are grouped by component (see Longhorn's metrics reference for the full tables):

- Volume
- Node
- Disk
- Instance Manager
- Manager
Support for Kubelet volume metrics

About Kubelet volume metrics

The kubelet exposes the following metrics:
- kubelet_volume_stats_capacity_bytes
- kubelet_volume_stats_available_bytes
- kubelet_volume_stats_used_bytes
- kubelet_volume_stats_inodes
- kubelet_volume_stats_inodes_free
- kubelet_volume_stats_inodes_used
These metrics measure information related to the PVC's filesystem inside a Longhorn block device. They are different from the longhorn_volume_* metrics, which measure information specific to the Longhorn block device itself.

You can set up a monitoring system that scrapes the kubelet metrics endpoint to get a PVC's status and set up alerts for abnormal events, such as a PVC being about to run out of storage space (a sketch of such a rule is shown below).

A popular monitoring setup is prometheus-operator/kube-prometheus-stack, which scrapes the kubelet_volume_stats_* metrics and provides dashboards and alert rules for them.
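As an illustration only (not from the original article), a minimal PrometheusRule built on these kubelet metrics might look like the following; the 10% threshold, the rule name, and the alert name are all example choices:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-pvc-rules   # example name
  namespace: monitoring
spec:
  groups:
  - name: pvc.rules
    rules:
    - alert: PersistentVolumeClaimAlmostFull
      annotations:
        description: PVC {{$labels.persistentvolumeclaim}} in {{$labels.namespace}} has less than 10% space left.
        summary: PVC is almost out of storage space.
      # Fires when less than 10% of the filesystem capacity remains available.
      expr: kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes < 0.10
      for: 5m
      labels:
        severity: warning
```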
Longhorn CSI plugin support

In v1.1.0, the Longhorn CSI plugin supports the NodeGetVolumeStats RPC according to the CSI spec. This allows the kubelet to query the Longhorn CSI plugin for a PVC's status. The kubelet then exposes that information in the kubelet_volume_stats_* metrics.
Longhorn alert rule examples

We provide a couple of example Longhorn alert rules below for your reference. See here for a list of all available Longhorn metrics, from which you can build your own alert rules.
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: longhorn
    role: alert-rules
  name: prometheus-longhorn-rules
  namespace: monitoring
spec:
  groups:
  - name: longhorn.rules
    rules:
    - alert: LonghornVolumeActualSpaceUsedWarning
      annotations:
        description: The actual space used by Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The actual used space of Longhorn volume is over 90% of the capacity.
      expr: (longhorn_volume_actual_size_bytes / longhorn_volume_capacity_bytes) * 100 > 90
      for: 5m
      labels:
        issue: The actual used space of Longhorn volume {{$labels.volume}} on {{$labels.node}} is high.
        severity: warning
    - alert: LonghornVolumeStatusCritical
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Fault for
          more than 2 minutes.
        summary: Longhorn volume {{$labels.volume}} is Fault
      expr: longhorn_volume_robustness == 3
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Fault.
        severity: critical
    - alert: LonghornVolumeStatusWarning
      annotations:
        description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is Degraded for
          more than 5 minutes.
        summary: Longhorn volume {{$labels.volume}} is Degraded
      expr: longhorn_volume_robustness == 2
      for: 5m
      labels:
        issue: Longhorn volume {{$labels.volume}} is Degraded.
        severity: warning
    - alert: LonghornNodeStorageWarning
      annotations:
        description: The used storage of node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The used storage of node is over 70% of the capacity.
      expr: (longhorn_node_storage_usage_bytes / longhorn_node_storage_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornDiskStorageWarning
      annotations:
        description: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is at {{$value}}% capacity for
          more than 5 minutes.
        summary: The used storage of disk is over 70% of the capacity.
      expr: (longhorn_disk_usage_bytes / longhorn_disk_capacity_bytes) * 100 > 70
      for: 5m
      labels:
        issue: The used storage of disk {{$labels.disk}} on node {{$labels.node}} is high.
        severity: warning
    - alert: LonghornNodeDown
      annotations:
        description: There are {{$value}} Longhorn nodes which have been offline for more than 5 minutes.
        summary: Longhorn nodes are offline
      expr: longhorn_node_total - (count(longhorn_node_status{condition="ready"}==1) OR on() vector(0))
      for: 5m
      labels:
        issue: There are {{$value}} Longhorn nodes offline.
        severity: critical
    - alert: LonghornInstanceManagerCPUUsageWarning
      annotations:
        description: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request of {{$value}}% for
          more than 5 minutes.
        summary: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} has CPU usage / CPU request over 300%.
      expr: (longhorn_instance_manager_cpu_usage_millicpu/longhorn_instance_manager_cpu_requests_millicpu) * 100 > 300
      for: 5m
      labels:
        issue: Longhorn instance manager {{$labels.instance_manager}} on {{$labels.node}} consumes 3 times the CPU request.
        severity: warning
    - alert: LonghornNodeCPUUsageWarning
      annotations:
        description: Longhorn node {{$labels.node}} has CPU usage / CPU capacity of {{$value}}% for
          more than 5 minutes.
        summary: Longhorn node {{$labels.node}} experiences high CPU pressure for more than 5m.
      expr: (longhorn_node_cpu_usage_millicpu / longhorn_node_cpu_capacity_millicpu) * 100 > 90
      for: 5m
      labels:
        issue: Longhorn node {{$labels.node}} experiences high CPU pressure.
        severity: warning
```
See https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/#alerting-rules for more information on how to define alert rules.
Original article: https://mp.weixin.qq.com/s/znaf4v3OBdGrLp0j23BcaQ