
Deploying and Using Prometheus on Kubernetes


This guide assumes a Kubernetes cluster is already installed.

Versions used: Kubernetes 1.25 with kube-prometheus release-0.12.

The deployment below uses kube-prometheus.

1. Architecture overview

Prometheus Server: scrapes and stores time-series data, and provides querying plus alerting-rule configuration and management.

Alertmanager: Prometheus Server sends alerts to Alertmanager, which routes them according to its routing configuration to the specified receivers or groups; email, webhook, WeChat, DingTalk, SMS, and more are supported.

Grafana: visualizes the data.

Push Gateway: Prometheus normally pulls metrics, but some jobs are short-lived and their metrics could be missed between scrapes; the Push Gateway accepts pushed metrics and holds them for Prometheus to scrape.

ServiceMonitor: monitoring configuration that selects Services and scrapes their /metrics endpoints.

Exporter: collects metrics from systems that are not cloud-native; host metrics come from node_exporter, MySQL metrics from mysql_exporter.

PromQL: the language used to query the stored data (see the query sketch after this list).

Service Discovery: automatic discovery of scrape targets; common mechanisms include Kubernetes, Consul, Eureka, and file-based discovery.
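As a quick illustration of PromQL, once Prometheus is reachable (the port-forward below is an assumption; adjust the host and port to your setup), a query can be run through the HTTP API. The node_cpu_seconds_total metric comes from node_exporter, which kube-prometheus deploys by default:

# assumes: kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
# PromQL: per-instance CPU usage rate over the last 5 minutes
curl -s 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)'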

2. Installation

# Find the kube-prometheus release matching your Kubernetes version on GitHub; 1.25 maps to release-0.12
git clone -b release-0.12 https://github.com/prometheus-operator/kube-prometheus.git
cd kube-prometheus
# Some images cannot be pulled directly; edit the image addresses in the corresponding files
vi manifests/kubeStateMetrics-deployment.yaml
vi manifests/prometheusAdapter-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: exporter
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.7.0
  name: kube-state-metrics
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: kube-state-metrics
      app.kubernetes.io/part-of: kube-prometheus
  template:
    metadata:
      annotations:
        kubectl.kubernetes.io/default-container: kube-state-metrics
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: kube-state-metrics
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 2.7.0
    spec:
      automountServiceAccountToken: true
      containers:
      - args:
        - --host=127.0.0.1
        - --port=8081
        - --telemetry-host=127.0.0.1
        - --telemetry-port=8082
        # image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.7.0
        image: -/ialso/kube-state-metrics:v2.7.0  # replacement mirror image
        name: kube-state-metrics
        resources:
          limits:
            cpu: 100m
            memory: 250Mi
          requests:
            cpu: 10m
            memory: 190Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsUser: 65534
      - args:
        - --logtostderr
        - --secure-listen-address=:8443
        - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
        - --upstream=http://127.0.0.1:8081/
        image: quay.io/brancz/kube-rbac-proxy:v0.14.0
        name: kube-rbac-proxy-main
        ports:
        - containerPort: 8443
          name: https-main
        resources:
          limits:
            cpu: 40m
            memory: 40Mi
          requests:
            cpu: 20m
            memory: 20Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65532
      - args:
        - --logtostderr
        - --secure-listen-address=:9443
        - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305
        - --upstream=http://127.0.0.1:8082/
        image: quay.io/brancz/kube-rbac-proxy:v0.14.0
        name: kube-rbac-proxy-self
        ports:
        - containerPort: 9443
          name: https-self
        resources:
          limits:
            cpu: 20m
            memory: 40Mi
          requests:
            cpu: 10m
            memory: 20Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
          runAsGroup: 65532
          runAsNonRoot: true
          runAsUser: 65532
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: kube-state-metrics

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/component: metrics-adapter
    app.kubernetes.io/name: prometheus-adapter
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.10.0
  name: prometheus-adapter
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/component: metrics-adapter
      app.kubernetes.io/name: prometheus-adapter
      app.kubernetes.io/part-of: kube-prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app.kubernetes.io/component: metrics-adapter
        app.kubernetes.io/name: prometheus-adapter
        app.kubernetes.io/part-of: kube-prometheus
        app.kubernetes.io/version: 0.10.0
    spec:
      automountServiceAccountToken: true
      containers:
      - args:
        - --cert-dir=/var/run/serving-cert
        - --config=/etc/adapter/config.yaml
        - --logtostderr=true
        - --metrics-relist-interval=1m
        - --prometheus-url=http://prometheus-k8s.monitoring.svc:9090/
        - --secure-port=6443
        - --tls-cipher-suites=TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA,TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA,TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA,TLS_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA
        # image: registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.10.0
        image: -/ialso/prometheus-adapter:v0.10.0  # replacement mirror image
        livenessProbe:
          failureThreshold: 5
          httpGet:
            path: /livez
            port: https
            scheme: HTTPS
          initialDelaySeconds: 30
          periodSeconds: 5
        name: prometheus-adapter
        ports:
        - containerPort: 6443
          name: https
        readinessProbe:
          failureThreshold: 5
          httpGet:
            path: /readyz
            port: https
            scheme: HTTPS
          initialDelaySeconds: 30
          periodSeconds: 5
        resources:
          limits:
            cpu: 250m
            memory: 180Mi
          requests:
            cpu: 102m
            memory: 180Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
        volumeMounts:
        - mountPath: /tmp
          name: tmpfs
          readOnly: false
        - mountPath: /var/run/serving-cert
          name: volume-serving-cert
          readOnly: false
        - mountPath: /etc/adapter
          name: config
          readOnly: false
      nodeSelector:
        kubernetes.io/os: linux
      serviceAccountName: prometheus-adapter
      volumes:
      - emptyDir: {}
        name: tmpfs
      - emptyDir: {}
        name: volume-serving-cert
      - configMap:
          name: adapter-config
        name: config

kubectl apply --server-side -f manifests/setup
kubectl wait \
  --for condition=Established \
  --all CustomResourceDefinition \
  --namespace=monitoring
kubectl apply -f manifests/
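To confirm the stack came up, it is worth checking that every pod in the monitoring namespace reaches Running; a typical check looks like:

# Watch the monitoring namespace until all components are Running
kubectl get pods -n monitoring -w
# List the Services the stack exposes (grafana, prometheus-k8s, alertmanager-main, ...)
kubectl get svc -n monitoring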

3. Configuring external access

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-grafana
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  # Forwarding rules
  rules:
  - host:  # your Grafana domain
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana
            port:
              number: 3000
  - host:  # your Alertmanager domain
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: alertmanager-main
            port:
              number: 9093
  - host:  # your Prometheus domain
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus-k8s
            port:
              number: 9090
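If no ingress controller or DNS is in place yet, a port-forward is a quick way to reach the UIs; the service names and ports below are the same ones the Ingress targets:

kubectl -n monitoring port-forward svc/grafana 3000:3000
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090
kubectl -n monitoring port-forward svc/alertmanager-main 9093:9093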

4. Configuring Grafana

Dashboards can be browsed at https://grafana.com/grafana/dashboards/.

In Grafana, choose Import dashboard, enter the dashboard (template) ID, and complete the configuration.

5. Handling the ControllerManager alert

# Check whether the target serviceMonitor exists
kubectl get serviceMonitor -n monitoring kube-controller-manager
# Inspect the service labels the serviceMonitor selects
kubectl get serviceMonitor -n monitoring kube-controller-manager -o yaml
# Check whether the matching service exists (it does not in my cluster)
kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-controller-manager
# Find the kube-controller-manager listen port
netstat -lntp | grep "kube-controll"
# Create a Service & Endpoints for the target
vi cm-prometheus.yaml

apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: cm-prometheus
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.10.0.15
  ports:
  - name: https-metrics
    # secure metrics port of kube-controller-manager on 1.25 (confirm with the netstat check above)
    port: 10257
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-controller-manager
  name: cm-prometheus
  namespace: kube-system
spec:
  type: ClusterIP
  ports:
  - name: https-metrics
    port: 10257
    protocol: TCP
    targetPort: 10257

# Check again that the target service now exists
kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-controller-manager

At this point kube-controller-manager appears as a Prometheus target, but its state is DOWN, likely because the Controller Manager listens on 127.0.0.1 and cannot be reached from outside the host.

# Edit the kube-controller-manager static pod manifest and set: --bind-address=0.0.0.0
vi /etc/kubernetes/manifests/kube-controller-manager.yaml

After the change, the kube-controller-manager target in Prometheus should turn UP.
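To confirm this without clicking through the UI, the Prometheus HTTP API reports target health. A minimal sketch, assuming Prometheus is reachable on 127.0.0.1:9090 (e.g. via the port-forward from section 3) and jq is installed; the same check works for kube-scheduler in the next section:

# List active targets with their job label and health (up/down)
curl -s 'http://127.0.0.1:9090/api/v1/targets?state=active' \
  | jq -r '.data.activeTargets[] | "\(.labels.job) \(.health)"' \
  | grep kube-controller-manager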

6. Handling the Scheduler alert

# Check whether the target serviceMonitor exists
kubectl get serviceMonitor -n monitoring kube-scheduler
# Inspect the service labels the serviceMonitor selects
kubectl get serviceMonitor -n monitoring kube-scheduler -o yaml
# Check whether the matching service exists (it does not in my cluster)
kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-scheduler
# Find the kube-scheduler listen port
netstat -lntp | grep "kube-scheduler"
# Create a Service & Endpoints for the target
vi scheduler-prometheus.yaml

apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: scheduler-prometheus
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.10.0.15
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/name: kube-scheduler
  name: scheduler-prometheus
  namespace: kube-system
spec:
  type: ClusterIP
  ports:
  - name: https-metrics
    port: 10259
    protocol: TCP
    targetPort: 10259

# Check again that the target service now exists
kubectl get svc -n kube-system -l app.kubernetes.io/name=kube-scheduler

At this point kube-scheduler appears as a Prometheus target, but its state is DOWN, likely because the Scheduler listens on 127.0.0.1 and cannot be reached from outside the host.

# Edit the kube-scheduler static pod manifest and set: --bind-address=0.0.0.0
vi /etc/kubernetes/manifests/kube-scheduler.yaml

After the change, the kube-scheduler target in Prometheus should turn UP.

7. Monitoring etcd

Configure the Service & Endpoints

# Locate the certificate files
cat /etc/kubernetes/manifests/etcd.yaml
# Look for the --cert-file and --key-file flags:
#   --cert-file=/etc/kubernetes/pki/etcd/server.crt
#   --key-file=/etc/kubernetes/pki/etcd/server.key
# Try the metrics endpoint
curl --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://10.10.0.15:2379/metrics -k | tail -1
# List the current ServiceMonitors
kubectl get ServiceMonitor -n monitoring
# Create the etcd Service & Endpoints
vi etcd-prometheus.yaml
kubectl apply -f etcd-prometheus.yaml
# Test again; 10.96.27.17 is the ClusterIP of the etcd-prometheus service
curl --cert /etc/kubernetes/pki/etcd/server.crt --key /etc/kubernetes/pki/etcd/server.key https://10.96.27.17:2379/metrics -k | tail -1

apiVersion: v1
kind: Endpoints
metadata:
  labels:
    app: etcd-prometheus
  name: etcd-prometheus
  namespace: kube-system
subsets:
- addresses:
  - ip: 10.10.0.15
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: etcd-prometheus
  name: etcd-prometheus
  namespace: kube-system
spec:
  type: ClusterIP
  ports:
  - name: https-metrics
    port: 2379
    protocol: TCP
    targetPort: 2379

Create a Secret with the certificates for Prometheus

# Create the secret from the etcd certificates
kubectl create secret generic etcd-ssl \
  --from-file=/etc/kubernetes/pki/etcd/ca.crt \
  --from-file=/etc/kubernetes/pki/etcd/server.crt \
  --from-file=/etc/kubernetes/pki/etcd/server.key \
  -n monitoring
# Mount it into Prometheus (the prometheus-k8s-N pods restart automatically after the edit)
kubectl edit prometheus k8s -n monitoring

# kubectl edit prometheus k8s -n monitoring (server-managed metadata trimmed; the change is the "secrets" entry under spec)
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  labels:
    app.kubernetes.io/component: prometheus
    app.kubernetes.io/instance: k8s
    app.kubernetes.io/name: prometheus
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 2.41.0
  name: k8s
  namespace: monitoring
spec:
  alerting:
    alertmanagers:
    - apiVersion: v2
      name: alertmanager-main
      namespace: monitoring
      port: web
  enableFeatures: []
  evaluationInterval: 30s
  externalLabels: {}
  image: quay.io/prometheus/prometheus:v2.41.0
  nodeSelector:
    kubernetes.io/os: linux
  podMetadata:
    labels:
      app.kubernetes.io/component: prometheus
      app.kubernetes.io/instance: k8s
      app.kubernetes.io/name: prometheus
      app.kubernetes.io/part-of: kube-prometheus
      app.kubernetes.io/version: 2.41.0
  podMonitorNamespaceSelector: {}
  podMonitorSelector: {}
  probeNamespaceSelector: {}
  probeSelector: {}
  replicas: 2
  # Add the certificate secret
  secrets:
  - etcd-ssl
  resources:
    requests:
      memory: 400Mi
  ruleNamespaceSelector: {}
  ruleSelector: {}
  scrapeInterval: 30s
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: prometheus-k8s
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector: {}
  version: 2.41.0
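After the pods restart, it is worth confirming the secret actually landed inside the pod. The Prometheus Operator mounts listed secrets under /etc/prometheus/secrets/<name>; a quick check (the pod name is an example, list yours first):

kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- ls /etc/prometheus/secrets/etcd-ssl/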

Create the ServiceMonitor

# Create the etcd ServiceMonitor
vi etcd-servermonitor.yaml
kubectl apply -f etcd-servermonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: etcd
  namespace: monitoring
  labels:
    app: etcd
spec:
  jobLabel: k8s-app
  endpoints:
  - interval: 30s
    port: https-metrics  # must match Service.spec.ports.name
    scheme: https
    tlsConfig:
      # note: these paths are where the etcd-ssl secret is mounted in the Prometheus pod
      caFile: /etc/prometheus/secrets/etcd-ssl/ca.crt
      certFile: /etc/prometheus/secrets/etcd-ssl/server.crt
      keyFile: /etc/prometheus/secrets/etcd-ssl/server.key
      insecureSkipVerify: true  # skip certificate verification
  selector:
    matchLabels:
      app: etcd-prometheus  # must match the Service labels
  namespaceSelector:
    matchNames:
    - kube-system

The etcd target should now appear in Prometheus.

Grafana dashboard (dashboard ID: 3070)

8. Monitoring MySQL

This assumes MySQL is already installed in the cluster; MySQL 5.7 is used here.

Note that only a single node is monitored here; I am not yet sure how to monitor multiple nodes, presumably with one mysql_exporter per instance?

Create a monitoring user

create user 'exporter'@'%' identified by '123456';
grant process, replication client, select on *.* to 'exporter'@'%';
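To confirm the grants took effect before wiring up the exporter, log in as the new user (credentials as created above) and list its privileges:

# Should show PROCESS, REPLICATION CLIENT, SELECT on *.*
mysql -u exporter -p123456 -e "SHOW GRANTS FOR CURRENT_USER();"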

Configure mysql_exporter

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mysql-exporter
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      k8s-app: mysql-exporter
  template:
    metadata:
      labels:
        k8s-app: mysql-exporter
    spec:
      containers:
      - name: mysql-exporter
        image: -/dotbalo/mysqld-exporter  # mirror image
        env:
        - name: DATA_SOURCE_NAME
          # format: username:password@(ip or service.namespace:3306)/
          value: "exporter:123456@(mysql-master.mysql:3306)/"
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9104
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: mysql-exporter
  name: mysql-exporter
  namespace: monitoring
spec:
  type: ClusterIP
  ports:
  # the port name "api" refers to port 9104
  - name: api
    protocol: TCP
    port: 9104
  selector:
    k8s-app: mysql-exporter

# 10.96.136.65 is the ClusterIP of the mysql-exporter service
curl 10.96.136.65:9104/metrics

Configure the serviceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: mysql-exporter
  namespace: monitoring
  labels:
    k8s-app: mysql-exporter
spec:
  jobLabel: k8s-app
  endpoints:
  - port: api
    interval: 30s
    scheme: http
  selector:
    matchLabels:
      k8s-app: mysql-exporter
  namespaceSelector:
    matchNames:
    - monitoring

The mysql_exporter target should now appear in Prometheus.

Grafana dashboard (dashboard ID: 7362)

If anything goes wrong, troubleshoot as in the ControllerManager and Scheduler sections above; if those all check out, inspect the exporter logs for errors:

kubectl logs -n monitoring mysql-exporter-6559759477-m8tqc

9. Alerting (email)

https://prometheus.io/docs/alerting/latest/alertmanager/

https://github.com/prometheus/alertmanager/blob/main/doc/examples/simple.yml

Global: shared settings, such as the email account, password, SMTP server, WeChat alerting, and other common options

Templates: where custom notification templates are placed

Route: alert routing; groups alerts and routes different groups to different receivers

Inhibit_rules: alert inhibition, used to cut down the number of notifications and prevent alert storms

Receivers: the alert receiver definitions

# Edit the alerting configuration
vi kube-prometheus/manifests/alertmanager-secret.yaml
# Apply the new configuration
kubectl replace -f kube-prometheus/manifests/alertmanager-secret.yaml
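A malformed config is only rejected when Alertmanager reloads, so it can save a round-trip to validate first. If amtool (bundled with Alertmanager) is available, copy the alertmanager.yaml block out of the manifest into a standalone file and check it; the same trick also works against the live Secret:

# Pull the currently applied config out of the Secret and validate it
kubectl -n monitoring get secret alertmanager-main \
  -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > /tmp/alertmanager.yaml
amtool check-config /tmp/alertmanager.yaml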

apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.25.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    # global: shared settings, mainly the notification channels (email, webhook, ...)
    global:
      # timeout, defaults to 5m
      resolve_timeout: 5m
      # email settings
      smtp_smarthost: ':465'
      smtp_from: '2750955630@'
      smtp_auth_username: '2750955630@'
      smtp_auth_password: 'puwluaqcmkrdddge'
      smtp_require_tls: false
    # custom template locations
    templates:
    - '/usr/local/alertmanager/*.tmp'
    # alert routing
    route:
      # grouping rules
      group_by: [ 'namespace', 'job', 'alertname' ]
      # wait this long after a new alert group is created before the initial notification, to avoid alert storms
      group_wait: 30s
      # if further alerts for the group arrive after the first notification, wait this long before sending them
      group_interval: 2m
      # if the problem is still unresolved after a successful notification, repeat at this interval
      repeat_interval: 10m
      # default receiver
      receiver: 'Default'
      # sub-routes
      routes:
      - receiver: 'email'
        match:
          alertname: "Watchdog"
    # receiver definitions
    receivers:
    - name: 'Default'
      email_configs:
      # address that receives the alerts
      - to: 'xumeng03@'
        # also notify when the problem is resolved
        send_resolved: true
    - name: 'email'
      email_configs:
      - to: '2750955630@'
        send_resolved: true
    # inhibition rules
    inhibit_rules:
    - source_matchers:
      - severity="critical"
      target_matchers:
      - severity=~"warning|info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - severity="warning"
      target_matchers:
      - severity="info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - alertname="InfoInhibitor"
      target_matchers:
      - severity="info"
      equal:
      - namespace
type: Opaque

10. Alerting (WeChat Work)

You need the following information:

# Corp ID: wwe86504f797d306ce
# Department ID: 4
# Application AgentId: 1000002
# Application Secret: FrAuzVnZvkmJdQcRiESKtBHsX8Xmq5LHEc-cn-xxxx

Also configure the application's "Web authorization & JS-SDK" and "Trusted corporate IPs" settings.
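A quick way to sanity-check the corp ID and secret is to request a token directly (replace <app-secret> with the real secret; the call must originate from one of the trusted IPs configured above):

curl -s 'https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid=wwe86504f797d306ce&corpsecret=<app-secret>'
# expected on success: {"errcode":0,"errmsg":"ok","access_token":"...","expires_in":7200}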

If nginx fronts the trusted domain, the following setup serves the verification file without disrupting normal traffic:

server {
    # listen port
    listen 443 ssl;
    # your trusted domain
    server_name ;
    # certificates
    ssl_certificate     /etc/nginx/ssl/_bundle.pem;
    ssl_certificate_key /etc/nginx/ssl/.key;
    ssl_session_cache shared:SSL:1m;
    ssl_session_timeout 5m;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # the WW_verify_*.txt verification file is placed in /etc/nginx/wechat
    root /etc/nginx/wechat;

    location / {
        # a request for /WW_verify_xxx.txt: try_files $uri finds /etc/nginx/wechat/WW_verify_xxx.txt and serves it
        # any other URI: try_files $uri misses and falls through to @gateway, which proxies to the upstream
        try_files $uri @gateway;
    }

    # forward everything else to the Kubernetes ingress upstream
    location @gateway {
        proxy_pass http://ialso_index;
        # preserve the original request host and client address
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Real-IP $remote_addr;
    }
}

With the above in place, test sending a message:

python3 wechat.py Warning "warning message"

#!/usr/bin/env python3
# wechat.py
import json
import sys
import urllib.error
import urllib.request


def gettoken(corpid, corpsecret):
    """Fetch an access token from the WeChat Work API."""
    gettoken_url = ('https://qyapi.weixin.qq.com/cgi-bin/gettoken?corpid='
                    + corpid + '&corpsecret=' + corpsecret)
    print(gettoken_url)
    try:
        token_file = urllib.request.urlopen(gettoken_url)
    except urllib.error.HTTPError as e:
        print(e.code)
        print(e.read().decode('utf8'))
        sys.exit()
    token_json = json.loads(token_file.read().decode('utf-8'))
    return token_json['access_token']


def senddata(access_token, subject, content):
    """Send a text message through the application."""
    send_url = ('https://qyapi.weixin.qq.com/cgi-bin/message/send?access_token='
                + access_token)
    # toparty: department ID, agentid: application ID
    send_values = {
        "toparty": "4",
        "msgtype": "text",
        "agentid": "1000002",
        "text": {"content": subject + '\n' + content},
        "safe": "0",
    }
    send_data = json.dumps(send_values, ensure_ascii=False).encode('utf-8')
    send_request = urllib.request.Request(send_url, send_data)
    response = json.loads(urllib.request.urlopen(send_request).read())
    print(str(response))


if __name__ == '__main__':
    # message subject
    subject = str(sys.argv[1])
    # message body
    content = str(sys.argv[2])
    # corp ID
    corpid = 'wwe86504f797d306ce'
    # application secret
    corpsecret = 'FrAuzVnZvkmJdQcRiESKtBHsX8Xmq5LHEc-cn-wl3UY'
    accesstoken = gettoken(corpid, corpsecret)
    senddata(accesstoken, subject, content)

Once the test message arrives, update the alertmanager rules (if it fails, the API response indicates what still needs to be configured).

apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.25.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  alertmanager.yaml: |-
    # global: shared settings, mainly the notification channels (email, webhook, ...)
    global:
      # timeout, defaults to 5m
      resolve_timeout: 5m
      # email settings
      smtp_smarthost: ':465'
      smtp_from: '2750955630@'
      smtp_auth_username: '2750955630@'
      smtp_auth_password: 'puwluaqcmkrdddge'
      smtp_require_tls: false
      # WeChat Work settings
      wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      # corp ID
      wechat_api_corp_id: 'wwe86504f797d306ce'
      # application secret
      wechat_api_secret: 'FrAuzVnZvkmJdQcRiESKtBHsX8Xmq5LHEc-cn-wl3UY'
    # custom template locations
    templates:
    - '/usr/local/alertmanager/*.tmp'
    # alert routing
    route:
      # grouping rules
      group_by: [ 'namespace', 'job', 'alertname' ]
      # wait this long after a new alert group is created before the initial notification, to avoid alert storms
      group_wait: 30s
      # if further alerts for the group arrive after the first notification, wait this long before sending them
      group_interval: 2m
      # if the problem is still unresolved after a successful notification, repeat at this interval
      repeat_interval: 10m
      # default receiver
      receiver: 'Default'
      # sub-routes
      routes:
      - receiver: 'wechat'
        match:
          alertname: "Watchdog"
    # receiver definitions
    receivers:
    - name: 'Default'
      email_configs:
      # address that receives the alerts
      - to: 'xumeng03@'
        # also notify when the problem is resolved
        send_resolved: true
    - name: 'email'
      email_configs:
      - to: '2750955630@'
        send_resolved: true
    - name: 'wechat'
      wechat_configs:
      # department ID that receives the alerts
      - to_party: 4
        # AgentId of the alerting application
        agent_id: 1000002
        send_resolved: true
    # inhibition rules
    inhibit_rules:
    - source_matchers:
      - severity="critical"
      target_matchers:
      - severity=~"warning|info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - severity="warning"
      target_matchers:
      - severity="info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - alertname="InfoInhibitor"
      target_matchers:
      - severity="info"
      equal:
      - namespace
type: Opaque
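After replacing the Secret it can take a minute for Alertmanager to reload. One way to confirm the new config is active (pod and service names per kube-prometheus defaults; jq is assumed for the API check):

# Look for the configuration-reload log line
kubectl -n monitoring logs alertmanager-main-0 -c alertmanager | grep -i 'configuration file'
# Or inspect the running configuration through the v2 API
kubectl -n monitoring port-forward svc/alertmanager-main 9093:9093 &
curl -s http://127.0.0.1:9093/api/v2/status | jq -r .config.original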

11. Custom alert templates

Add a custom template

# Add the custom template
vi kube-prometheus/manifests/alertmanager-secret.yaml
# Apply the new configuration
kubectl replace -f kube-prometheus/manifests/alertmanager-secret.yaml

apiVersion: v1
kind: Secret
metadata:
  labels:
    app.kubernetes.io/component: alert-router
    app.kubernetes.io/instance: main
    app.kubernetes.io/name: alertmanager
    app.kubernetes.io/part-of: kube-prometheus
    app.kubernetes.io/version: 0.25.0
  name: alertmanager-main
  namespace: monitoring
stringData:
  # extra keys in this Secret are mounted under /etc/alertmanager/config/, hence the templates path below
  wechat.tmpl: |-
    {{ define "wechat.default.message" }}
    {{- if gt (len .Alerts.Firing) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 -}}
    # Alerts firing!!!
    {{- end }}
    Status: {{ .Status }}
    Severity: {{ .Labels.severity }}
    Alert name: {{ $alert.Labels.alertname }}
    Instance: {{ $alert.Labels.instance }}
    Summary: {{ $alert.Annotations.summary }}
    Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }}
    Value: {{ .Annotations.value }}
    {{/* .Add 28800e9 shifts the UTC timestamp by +8h (28800s in ns) for CST */}}
    Started: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- end }}
    {{- end }}
    {{- if gt (len .Alerts.Resolved) 0 -}}
    {{- range $index, $alert := .Alerts -}}
    {{- if eq $index 0 -}}
    # Alerts resolved!!!
    {{- end }}
    Alert name: {{ .Labels.alertname }}
    Status: {{ .Status }}
    Summary: {{ $alert.Annotations.summary }}
    Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }}
    Started: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    Resolved: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
    {{- if gt (len $alert.Labels.instance) 0 }}
    Instance: {{ $alert.Labels.instance }}
    {{- end }}
    {{- end }}
    {{- end }}
    {{- end }}
  alertmanager.yaml: |-
    # global: shared settings, mainly the notification channels (email, webhook, ...)
    global:
      # timeout, defaults to 5m
      resolve_timeout: 5m
      # email settings
      smtp_smarthost: ':465'
      smtp_from: '2750955630@'
      smtp_auth_username: '2750955630@'
      smtp_auth_password: 'puwluaqcmkrdddge'
      smtp_require_tls: false
      # WeChat Work settings
      wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
      wechat_api_corp_id: 'wwe86504f797d306ce'
      wechat_api_secret: 'FrAuzVnZvkmJdQcRiESKtBHsX8Xmq5LHEc-cn-wl3UY'
    # custom template locations
    templates:
    - '/etc/alertmanager/config/*.tmpl'
    # alert routing
    route:
      # grouping rules
      group_by: [ 'namespace', 'job', 'alertname' ]
      # wait this long after a new alert group is created before the initial notification, to avoid alert storms
      group_wait: 30s
      # if further alerts for the group arrive after the first notification, wait this long before sending them
      group_interval: 2m
      # if the problem is still unresolved after a successful notification, repeat at this interval
      repeat_interval: 5m
      # default receiver
      receiver: 'Default'
      # sub-routes
      routes:
      - receiver: 'wechat'
        match:
          alertname: "Watchdog"
    # receiver definitions
    receivers:
    - name: 'Default'
      email_configs:
      # address that receives the alerts
      - to: 'xumeng03@'
        # also notify when the problem is resolved
        send_resolved: true
    - name: 'email'
      email_configs:
      - to: '2750955630@'
        send_resolved: true
    - name: 'wechat'
      wechat_configs:
      # department ID that receives the alerts
      - to_party: 4
        # AgentId of the alerting application
        agent_id: 1000002
        send_resolved: true
        # use the custom template
        message: '{{ template "wechat.default.message" . }}'
    # inhibition rules
    inhibit_rules:
    - source_matchers:
      - severity="critical"
      target_matchers:
      - severity=~"warning|info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - severity="warning"
      target_matchers:
      - severity="info"
      equal:
      - namespace
      - alertname
    - source_matchers:
      - alertname="InfoInhibitor"
      target_matchers:
      - severity="info"
      equal:
      - namespace
type: Opaque
