
Prometheus Operator: A More Elegant Monitoring Tool for Kubernetes

Posted: 2023-02-11 22:10:00


[TOC]

1. Introduction to Kubernetes Operators

With Kubernetes, managing and scaling web applications, mobile backends, and API services has become fairly straightforward. The reason is that these applications are generally stateless, so basic Kubernetes API objects such as Deployment can scale them and recover from failures without any additional intervention.

Managing stateful applications such as databases, caches, or monitoring systems is a different challenge. These systems require domain-specific knowledge to scale and upgrade correctly, and to reconfigure themselves effectively when data is lost or becomes unavailable. We want this application-specific operational expertise encoded into software, so that complex applications can be run and managed correctly on top of Kubernetes' capabilities.

An Operator is software that extends the Kubernetes API through the TPR (Third Party Resource, since superseded by CRD) mechanism, embedding application-specific knowledge so that users can create, configure, and manage applications. Like Kubernetes' built-in resources, an Operator manages not a single application instance but multiple instances across the cluster.
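As a sketch of this extension mechanism, the following is roughly what a CRD looks like in the modern `apiextensions.k8s.io/v1` API (the successor to the TPR mechanism mentioned above). The group and kind shown are the ones the Prometheus Operator registers, but the schema here is a deliberately minimal assumption, not the operator's real one:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # CRD names follow the pattern <plural>.<group>
  name: prometheuses.monitoring.coreos.com
spec:
  group: monitoring.coreos.com
  scope: Namespaced
  names:
    kind: Prometheus
    plural: prometheuses
    singular: prometheus
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        # accept arbitrary spec fields; a real CRD declares a full schema
        x-kubernetes-preserve-unknown-fields: true
```

Once such a CRD is registered, `kubectl get prometheus` works like any built-in resource, and the Operator watches those objects and reconciles the cluster toward the state they declare.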

2. Introduction to the Prometheus Operator

The Prometheus Operator for Kubernetes provides simple monitoring definitions for Kubernetes services and for the deployment and management of Prometheus instances.

Once installed, the Prometheus Operator provides the following features:

- Create/Destroy: easily launch a Prometheus instance in a Kubernetes namespace for a specific application or team through the Operator.
- Simple configuration: configure the fundamentals of Prometheus, such as versions, persistence, retention policies, and replicas, as native Kubernetes resources.
- Target services via labels: automatically generate monitoring target configuration from familiar Kubernetes label queries; there is no need to learn a Prometheus-specific configuration language.

The architecture of the Prometheus Operator is shown in the diagram below:

The components of this architecture run in the Kubernetes cluster as different kinds of resources, each playing its own role:

Operator: the Operator deploys and manages the Prometheus Server according to custom resources (Custom Resource Definitions / CRDs), and watches for change events on those custom resources, reacting accordingly. It is the control center of the whole system.

Prometheus: the Prometheus resource declaratively describes the desired state of a Prometheus deployment.

Prometheus Server: the Prometheus Server cluster the Operator deploys based on the contents of the Prometheus custom resource; those custom resources can be seen as handles on the StatefulSets that run the Prometheus Server cluster.

ServiceMonitor: ServiceMonitor is also a custom resource; it describes the list of targets Prometheus should monitor. It selects the corresponding Service endpoints via labels, so that the Prometheus Server scrapes metrics through the selected Services.

Service: the Service resource fronts the metrics-exposing Pods in the Kubernetes cluster and is what a ServiceMonitor selects for the Prometheus Server to scrape. Put simply, these are the objects Prometheus monitors, such as a Node Exporter Service, a MySQL Exporter Service, and so on.

Alertmanager: Alertmanager is also a custom resource type; the Operator deploys an Alertmanager cluster according to the resource definition.
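To make the chain between these resources concrete, here is a minimal, hypothetical sketch of a Prometheus resource selecting ServiceMonitors by label, with a ServiceMonitor in turn selecting Services by label. The names and labels are illustrative, not taken from the chart used below:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
  namespace: monitoring
spec:
  replicas: 2
  # pick up every ServiceMonitor carrying this label
  serviceMonitorSelector:
    matchLabels:
      release: prometheus-operator
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    release: prometheus-operator   # matched by the Prometheus resource above
spec:
  # pick up every Service carrying this label
  selector:
    matchLabels:
      app: example-app
  endpoints:
  - port: http      # named port on the selected Service
    interval: 30s
```

This two-step label selection (Prometheus → ServiceMonitor → Service) is the mechanism behind the label-matching requirement discussed in the deployment section below.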

3. Deploying the Prometheus Operator

Environment:

- Kubernetes version: 1.12, installed with kubeadm
- Helm version: v2.11.0

We install with Helm, modifying the prometheus-operator chart as needed. The chart bundles Grafana and the exporters for monitoring Kubernetes. Note that I configured Grafana to store its data in MySQL; the details are in another article, "Deploying Prometheus and Grafana with Helm to Monitor Kubernetes".

```shell
cd helm/prometheus-operator/
helm install --name prometheus-operator --namespace monitoring -f values.yaml ./
```

To use the Prometheus Operator more flexibly, adding custom monitoring targets is essential. Here we use ceph-exporter as the example.

The following section of values.yaml is where a servicemonitor is used to add monitoring:

```yaml
serviceMonitor:
  enabled: true  # turn monitoring on
  # on what port are the metrics exposed by etcd
  exporterPort: 9128
  # for apps that have deployed outside of the cluster, list their adresses here
  endpoints: []
  # Are we talking http or https?
  scheme: http
  # service selector label key to target ceph exporter pods
  serviceSelectorLabelKey: app
  # default rules are in templates/ceph-exporter.rules.yaml
  prometheusRules: {}
  # Custom Labels to be added to ServiceMonitor
  # In testing, adding the prometheus operator's release label to the
  # servicemonitor was enough for monitoring to work
  additionalServiceMonitorLabels:
    release: prometheus-operator
  # Custom Labels to be added to Prometheus Rules CRD
  additionalRulesLabels: {}
```

The key parameter is additionalServiceMonitorLabels: testing shows that the servicemonitor must carry a label the prometheus operator already selects on, or the target will not be picked up.

```shell
[root@lab1 prometheus-operator]# kubectl get servicemonitor -n monitoring ceph-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  creationTimestamp: -10-30T06:51:12Z
  generation: 1
  labels:
    app: ceph-exporter
    chart: ceph-exporter-0.1.0
    heritage: Tiller
    prometheus: ceph-exporter
    release: prometheus-operator
  name: ceph-exporter
  namespace: monitoring
  resourceVersion: "13937459"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/servicemonitors/ceph-exporter
  uid: 30569173-dc10-11e8-bcf3-000c293d66a5
spec:
  endpoints:
  - interval: 30s
    port: http
  namespaceSelector:
    matchNames:
    - monitoring
  selector:
    matchLabels:
      app: ceph-exporter
      release: ceph-exporter
```

```shell
[root@lab1 prometheus-operator]# kubectl get pod -n monitoring prometheus-operator-operator-7459848949-8dddt -o yaml | more
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: -10-30T00:39:37Z
  generateName: prometheus-operator-operator-7459848949-
  labels:
    app: prometheus-operator-operator
    chart: prometheus-operator-0.1.6
    heritage: Tiller
    pod-template-hash: "745984894"
    release: prometheus-operator
```

Key points:

- The ServiceMonitor's labels must include at least one label that matches those on the prometheus-operator Pod.
- The service referenced in the ServiceMonitor's spec must be reachable by Prometheus, with all of its endpoints healthy.
- When you run into problems, enable debug logging for both the prometheus operator and prometheus. The logs are otherwise not very informative, but the operator's debug log shows which servicemonitors it currently sees, which lets you confirm whether an installed servicemonitor has been matched.
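As an illustration of the last point, debug logging can usually be switched on through the chart values. The exact keys below are assumptions that vary between chart versions, so check your chart's values.yaml for the equivalents:

```yaml
# hypothetical values.yaml fragment; key names differ across chart versions
prometheusOperator:
  logLevel: debug        # operator logs which servicemonitors it matched
prometheus:
  prometheusSpec:
    logLevel: debug      # passed through to prometheus's own --log.level flag
```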

After a successful installation, check the related resources:

```shell
[root@lab1 prometheus-operator]# kubectl get service,servicemonitor,ep -n monitoring
NAME                                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/alertmanager-operated                          ClusterIP   None             <none>        9093/TCP,6783/TCP   12d
service/ceph-exporter                                  ClusterIP   10.100.57.62     <none>        9128/TCP            46h
service/monitoring-mysql-mysql                         ClusterIP   10.108.93.155    <none>        3306/TCP            42d
service/prometheus-operated                            ClusterIP   None             <none>        9090/TCP            12d
service/prometheus-operator-alertmanager               ClusterIP   10.98.42.209     <none>        9093/TCP            6d19h
service/prometheus-operator-grafana                    ClusterIP   10.103.100.150   <none>        80/TCP              6d19h
service/prometheus-operator-kube-state-metrics         ClusterIP   10.110.76.250    <none>        8080/TCP            6d19h
service/prometheus-operator-operator                   ClusterIP   None             <none>        8080/TCP            6d19h
service/prometheus-operator-prometheus                 ClusterIP   10.111.24.83     <none>        9090/TCP            6d19h
service/prometheus-operator-prometheus-node-exporter   ClusterIP   10.97.126.74     <none>        9100/TCP            6d19h

NAME                                                                               AGE
servicemonitor.monitoring.coreos.com/ceph-exporter                                 1d
servicemonitor.monitoring.coreos.com/prometheus-operator                           8d
servicemonitor.monitoring.coreos.com/prometheus-operator-alertmanager              6d
servicemonitor.monitoring.coreos.com/prometheus-operator-apiserver                 6d
servicemonitor.monitoring.coreos.com/prometheus-operator-coredns                   6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-controller-manager   6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-etcd                 6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-scheduler            6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kube-state-metrics        6d
servicemonitor.monitoring.coreos.com/prometheus-operator-kubelet                   6d
servicemonitor.monitoring.coreos.com/prometheus-operator-node-exporter             6d
servicemonitor.monitoring.coreos.com/prometheus-operator-operator                  6d
servicemonitor.monitoring.coreos.com/prometheus-operator-prometheus                6d

NAME                                                     ENDPOINTS                                                                 AGE
endpoints/alertmanager-operated                          10.244.6.174:9093,10.244.6.174:6783                                       12d
endpoints/ceph-exporter                                  10.244.2.59:9128                                                          46h
endpoints/monitoring-mysql-mysql                         10.244.6.171:3306                                                         42d
endpoints/prometheus-operated                            10.244.2.60:9090,10.244.6.175:9090                                        12d
endpoints/prometheus-operator-alertmanager               10.244.6.174:9093                                                         6d19h
endpoints/prometheus-operator-grafana                    10.244.6.106:3000                                                         6d19h
endpoints/prometheus-operator-kube-state-metrics         10.244.2.163:8080                                                         6d19h
endpoints/prometheus-operator-operator                   10.244.6.113:8080                                                         6d19h
endpoints/prometheus-operator-prometheus                 10.244.2.60:9090,10.244.6.175:9090                                        6d19h
endpoints/prometheus-operator-prometheus-node-exporter   192.168.105.92:9100,192.168.105.93:9100,192.168.105.94:9100 + 4 more...   6d19h
```

4. Adding Grafana dashboards

The _dashboards directory in the prometheus-operator chart above contains dashboards I have modified, and they are fairly comprehensive. Import them manually through the Grafana UI, and you can edit them freely afterwards, which is very convenient in practice. If you instead place the dashboard JSON files in the dashboards directory and install them via Helm, the installed dashboards cannot be edited directly in Grafana, which makes them awkward to work with.

5. Adding alerts to Alertmanager

Add a prometheusrule; here is an example:

```shell
[root@lab1 ceph-exporter]# kubectl get prometheusrule -n monitoring ceph-exporter -o yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: -10-30T06:51:12Z
  generation: 1
  labels:
    app: prometheus
    chart: ceph-exporter-0.1.0
    heritage: Tiller
    prometheus: ceph-exporter
    release: ceph-exporter
  name: ceph-exporter
  namespace: monitoring
  resourceVersion: "13965150"
  selfLink: /apis/monitoring.coreos.com/v1/namespaces/monitoring/prometheusrules/ceph-exporter
  uid: 30543ec9-dc10-11e8-bcf3-000c293d66a5
spec:
  groups:
  - name: ceph-exporter.rules
    rules:
    - alert: Ceph
      annotations:
        description: There is no running ceph exporter.
        summary: Ceph exporter is down
      expr: absent(up{job="ceph-exporter"} == 1)
      for: 5m
      labels:
        severity: critical
```

The default rules for monitoring Kubernetes are already numerous and comprehensive; adjust them as needed in prometheus-operator/templates/all-prometheus-rules.yaml.

Alert routing can be changed in the alertmanager: section of values.yaml, shown below:

```yaml
config:
  global:
    resolve_timeout: 5m
    # The smarthost and SMTP sender used for mail notifications.
    smtp_smarthost: ':25'
    smtp_from: 'xxxxxx@'
    smtp_auth_username: 'xxxxxx@'
    smtp_auth_password: 'xxxxxx'
    # The API URL to use for Slack notifications.
    slack_api_url: '/services/some/api/token'
  route:
    group_by: ["job", "alertname"]
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'noemail'
    routes:
    - match:
        severity: critical
      receiver: critical_email_alert
    - match_re:
        alertname: "^KubeJob*"
      receiver: default_email
  receivers:
  - name: 'default_email'
    email_configs:
    - to: 'xxxxxx@'
      send_resolved: true
  - name: 'critical_email_alert'
    email_configs:
    - to: 'xxxxxx@'
      send_resolved: true
  - name: 'noemail'
    email_configs:
    - to: 'null@'
      send_resolved: false

## Alertmanager template files to format alerts
## ref: https://prometheus.io/docs/alerting/notifications/
##      https://prometheus.io/docs/alerting/notification_examples/
##
templateFiles:
  template_1.tmpl: |-
    {{ define "cluster" }}{{ .ExternalURL | reReplaceAll ".*alertmanager\\.(.*)" "$1" }}{{ end }}
    {{ define "slack.k8s.text" }}
    {{- $root := . -}}
    {{ range .Alerts }}
      *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
      *Cluster:* {{ template "cluster" $root }}
      *Description:* {{ .Annotations.description }}
      *Graph:* <{{ .GeneratorURL }}|:chart_with_upwards_trend:>
      *Runbook:* <{{ .Annotations.runbook }}|:spiral_note_pad:>
      *Details:*
      {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
      {{ end }}
```

6. Summary

By defining servicemonitor and prometheusrule resources, the Prometheus Operator can adjust the prometheus and alertmanager configuration dynamically. This fits Kubernetes operational habits far better and makes Kubernetes monitoring more elegant.

