# Kubernetes Log Management and Analysis
## Introduction

Logs are a key data source for troubleshooting, performance monitoring, and security auditing in a Kubernetes cluster. An effective log management strategy helps operations teams locate problems quickly and analyze system behavior. This article takes a deep look at best practices and analysis techniques for Kubernetes log management.

## 1. Logging Architecture Overview

### 1.1 Log Hierarchy

```
┌──────────────────────────────────────────────────┐
│         Kubernetes Logging Architecture          │
├──────────────────────────────────────────────────┤
│  ┌────────────────────────────────────────────┐  │
│  │ Application-layer logs                     │  │
│  │  - Container application logs              │  │
│  │  - Application program logs                │  │
│  └────────────────────────────────────────────┘  │
│                        │                         │
│                        ▼                         │
│  ┌────────────────────────────────────────────┐  │
│  │ Container runtime logs                     │  │
│  │  - Docker/containerd logs                  │  │
│  │  - Container start/stop logs               │  │
│  └────────────────────────────────────────────┘  │
│                        │                         │
│                        ▼                         │
│  ┌────────────────────────────────────────────┐  │
│  │ Node system logs                           │  │
│  │  - kubelet/kube-proxy logs                 │  │
│  │  - Operating system logs                   │  │
│  └────────────────────────────────────────────┘  │
│                        │                         │
│                        ▼                         │
│  ┌────────────────────────────────────────────┐  │
│  │ Control plane logs                         │  │
│  │  - API Server/etcd logs                    │  │
│  │  - Scheduler/Controller Manager logs       │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘
```

### 1.2 Log Type Comparison

| Log type | Source | Content | Importance |
|---|---|---|---|
| Application logs | Application inside the container | Business-logic logs | High |
| Container logs | Container runtime | Container lifecycle events | Medium |
| Node logs | kubelet / operating system | Node state | High |
| Control plane logs | Cluster components | Cluster management | High |

## 2. Log Collection

### 2.1 Log Collection Architecture

A node-level agent deployed as a DaemonSet is the most common collection pattern. The example below ships container logs to Elasticsearch with Fluentd:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      serviceAccountName: fluentd
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: fluentd
          image: fluent/fluentd-kubernetes-daemonset:v1.15-debian-elasticsearch7
          env:
            - name: FLUENT_ELASTICSEARCH_HOST
              value: elasticsearch.logging.svc.cluster.local
            - name: FLUENT_ELASTICSEARCH_PORT
              value: "9200"   # env values must be strings
          resources:
            limits:
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
```
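The DaemonSet mounts `/var/log` because kubelet symlinks every container's output under `/var/log/containers`, using the naming convention `<pod>_<namespace>_<container>-<container-id>.log`. This convention is how node agents such as Fluentd attach Kubernetes metadata to each log line. A minimal Python sketch of parsing it (the sample filename is illustrative):

```python
import re

# Kubelet's symlink naming convention for container logs:
# <pod>_<namespace>_<container>-<64-hex-container-id>.log
LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_(?P<container>.+)-(?P<id>[0-9a-f]{64})\.log$"
)

def parse_log_filename(name: str) -> dict:
    """Extract pod, namespace, and container name from a log file name."""
    m = LOG_NAME.match(name)
    if m is None:
        raise ValueError(f"not a kubelet container log name: {name}")
    return m.groupdict()

meta = parse_log_filename(
    "web-6d4cf56db6-abcde_default_nginx-" + "0" * 64 + ".log"
)
print(meta["pod"], meta["namespace"], meta["container"])
```

Pod and namespace names are DNS-1123 labels and therefore never contain underscores, which is why splitting on `_` is safe here.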
### 2.2 Log Collection with Loki

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: promtail
  namespace: logging
spec:
  selector:
    matchLabels:
      app: promtail
  template:
    metadata:
      labels:
        app: promtail
    spec:
      serviceAccountName: promtail
      containers:
        - name: promtail
          image: grafana/promtail:latest
          args:
            - -config.file=/etc/promtail/config.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/promtail
            - name: varlog
              mountPath: /var/log
            - name: varlibdockercontainers
              mountPath: /var/lib/docker/containers
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: promtail-config
        - name: varlog
          hostPath:
            path: /var/log
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
```

Promtail configuration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
data:
  config.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0
    positions:
      filename: /tmp/positions.yaml
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
```

### 2.3 EFK Stack Configuration

```yaml
# Elasticsearch StatefulSet
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
  namespace: logging
spec:
  serviceName: elasticsearch
  replicas: 3
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch:8.5.0
          resources:
            requests:
              memory: 4Gi
              cpu: 2
          env:
            - name: discovery.type
              value: single-node
            - name: ES_JAVA_OPTS
              value: -Xms2g -Xmx2g
          ports:
            - containerPort: 9200
              name: http
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ReadWriteOnce]
        resources:
          requests:
            storage: 100Gi
```

Note: `discovery.type: single-node` disables cluster formation and is only suitable for testing; with `replicas: 3`, configure `discovery.seed_hosts` and `cluster.initial_master_nodes` instead.
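Collection pipelines like the ones above are far easier to configure when applications emit structured JSON to stdout, one object per line, rather than free-form text. A minimal Python sketch using only the standard library (the extra field names are illustrative, not a fixed schema):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, which collectors
    such as Fluentd or Promtail can parse without multiline rules."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra fields attached via logging's `extra=` keyword argument.
        for key in ("request_id", "user_id", "status_code"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)  # containers log to stdout
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User login successful",
            extra={"request_id": "abc123", "user_id": "user-456"})
```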
## 3. Log Management Best Practices

### 3.1 Log Format Standardization

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <match **>
      @type elasticsearch
      host elasticsearch
      port 9200
      logstash_format true
      logstash_prefix kubernetes
      include_tag_key true
      tag_key log_name
    </match>
```

Structured log output:

```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "INFO",
  "logger": "app",
  "message": "User login successful",
  "request_id": "abc123",
  "user_id": "user-456",
  "response_time": 125,
  "status_code": 200
}
```

### 3.2 Log Retention Policy

Curator prunes old indices on a schedule, while a PodDisruptionBudget keeps Elasticsearch available during maintenance:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: elasticsearch-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: elasticsearch
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: curator
  namespace: logging
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: curator
              image: bobrik/curator:5.8
              command:
                - curator
                - --config
                - /config/config.yml
                - /config/action_file.yml
              volumeMounts:
                - name: config
                  mountPath: /config
          volumes:
            - name: config
              configMap:
                name: curator-config
          restartPolicy: OnFailure
```

Curator configuration (both files are mounted from the `curator-config` ConfigMap):

```yaml
# config.yml
client:
  hosts:
    - elasticsearch
  port: 9200
  url_prefix:
  use_ssl: False
  certificate:
  client_cert:
  client_key:
  ssl_no_validate: False
  http_auth:
  timeout: 30
  master_only: False
logging:
  loglevel: INFO

# action_file.yml
actions:
  1:
    action: delete_indices
    description: Delete indices older than 30 days
    options:
      ignore_empty_list: True
      timeout_override:
      continue_if_exception: False
    filters:
      - filtertype: pattern
        kind: prefix
        value: kubernetes-
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 30
```

### 3.3 Log Access Control

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: log-reader
  namespace: logging
rules:
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: log-reader-binding
  namespace: logging
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: log-reader
subjects:
  - kind: User
    name: developer@example.com
    apiGroup: rbac.authorization.k8s.io
```

## 4. Log Analysis and Querying

### 4.1 Kibana Query Examples

```
# Query error logs
level: ERROR AND timestamp:[now-1h TO now]
```
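Kibana's query bar and Elasticsearch's `_search` endpoint accept the same query DSL, so ad-hoc analyses can also be scripted. A stdlib-only Python sketch; the cluster URL and index pattern are assumptions, and the actual request is left commented out so the snippet runs without a cluster:

```python
import json
from urllib import request

ES_URL = "http://elasticsearch.logging.svc.cluster.local:9200"  # assumed address

# Count ERROR-level entries per application over the last hour --
# the same aggregation the Kibana examples express.
query = {
    "size": 0,
    "query": {"range": {"timestamp": {"gte": "now-1h", "lte": "now"}}},
    "aggs": {
        "errors_by_app": {
            "terms": {"field": "app.keyword", "size": 10},
            "aggs": {"error_count": {"filter": {"term": {"level": "ERROR"}}}},
        }
    },
}

req = request.Request(
    f"{ES_URL}/kubernetes-*/_search",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)   # requires a reachable cluster
# buckets = json.load(resp)["aggregations"]["errors_by_app"]["buckets"]
print(json.dumps(query, indent=2))
```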
```
# Query logs for a specific application
app: my-app AND response_time:>500

# Query authentication failures
message: *authentication failed*
```

Aggregation analysis via the `_search` API:

```
GET /_search
{
  "aggs": {
    "errors_by_app": {
      "terms": { "field": "app.keyword", "size": 10 },
      "aggs": {
        "error_count": {
          "filter": { "term": { "level": "ERROR" } }
        }
      }
    }
  }
}
```

### 4.2 Loki Query Examples

```
# Query Pod logs
{app="my-app", namespace="default"} |= "error"

# Count errors over a 5-minute window
count_over_time({app="my-app"} |= "ERROR" [5m])

# Parse logfmt and filter by level
{namespace="kube-system"} | logfmt | level="error"

# Regex match
{app="my-app"} |~ "authentication.*failed"
```

### 4.3 Grafana Log Dashboard

```json
{
  "dashboard": {
    "title": "Kubernetes Log Analysis",
    "panels": [
      {
        "type": "logs",
        "title": "Live logs",
        "targets": [
          { "expr": "{namespace=~\"$namespace\", app=~\"$app\"}", "refId": "A" }
        ]
      },
      {
        "type": "graph",
        "title": "Error rate",
        "targets": [
          { "expr": "count_over_time({namespace=~\"$namespace\"} |= \"ERROR\" [5m])", "refId": "A" }
        ]
      },
      {
        "type": "stat",
        "title": "Total log volume",
        "targets": [
          { "expr": "sum(count_over_time({namespace=~\"$namespace\"}[5m]))", "refId": "A" }
        ]
      }
    ],
    "templating": {
      "list": [
        { "name": "namespace", "type": "query", "query": "label_values(namespace)" },
        { "name": "app", "type": "query", "query": "label_values({namespace=~\"$namespace\"}, app)" }
      ]
    }
  }
}
```

## 5. Log Monitoring and Alerting

### 5.1 Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: log-alerts
  namespace: monitoring
spec:
  groups:
    - name: log_rules
      rules:
        - alert: HighErrorRate
          expr: sum by (app) (count_over_time({app=~".+"} |= "ERROR" [5m])) > 10
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: High error rate in application {{ $labels.app }}
            description: More than 10 error log lines in the last 5 minutes
        - alert: LogVolumeHigh
          expr: sum by (namespace) (kube_pod_container_resource_requests_storage_bytes) > 100e9  # 100 GB
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Log storage too high in namespace {{ $labels.namespace }}
            description: Log storage has exceeded 100 GB
        - alert: LogCollectionFailed
          expr: absent_over_time(promtail_scrape_samples_scraped_total[5m])
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Log collection failed
            description: Promtail has not collected any logs
```
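The anomaly-detection rule below reduces to simple arithmetic: compare the most recent minute's log volume against the per-minute average over the last hour, and alert when the ratio exceeds a threshold. The same logic in plain Python (the threshold value is illustrative):

```python
def anomaly_score(recent_per_min: float, hourly_total: float) -> float:
    """Ratio of the last minute's log volume to the per-minute average
    over the last hour; a score near 1.0 means normal volume."""
    if hourly_total == 0:
        return 0.0
    hourly_avg_per_min = hourly_total / 60
    return recent_per_min / hourly_avg_per_min

# A spike alert fires when volume exceeds 3x the hourly baseline.
SPIKE_THRESHOLD = 3.0

def is_spike(recent_per_min: float, hourly_total: float) -> bool:
    return anomaly_score(recent_per_min, hourly_total) > SPIKE_THRESHOLD

print(anomaly_score(600, 6000))  # 600 lines/min vs. a 100 lines/min baseline -> 6.0
```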
### 5.2 Anomaly Detection

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: log-anomaly-detection
spec:
  groups:
    - name: anomaly_rules
      rules:
        - record: log_anomaly_score
          expr: |
            (sum(count_over_time({app="my-app"}[1m])) * 60)
              /
            sum(count_over_time({app="my-app"}[1h]))
        - alert: LogVolumeSpike
          expr: log_anomaly_score > 3
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Sudden log volume spike in {{ $labels.app }}
            description: Log volume exceeds 3x the historical average
```

## 6. Log Security and Compliance

### 6.1 Log Encryption

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: elasticsearch-certificates
type: Opaque
data:
  tls.crt: base64-encoded-cert
  tls.key: base64-encoded-key
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kibana
  namespace: logging
spec:
  selector:
    matchLabels:
      app: kibana
  template:
    metadata:
      labels:
        app: kibana
    spec:
      containers:
        - name: kibana
          image: docker.elastic.co/kibana/kibana:8.5.0
          env:
            - name: ELASTICSEARCH_HOSTS
              value: https://elasticsearch:9200
            - name: ELASTICSEARCH_USERNAME
              valueFrom:
                secretKeyRef:
                  name: elasticsearch-credentials
                  key: username
            - name: ELASTICSEARCH_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: elasticsearch-credentials
                  key: password
```

### 6.2 Access Log Auditing

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  nginx.conf: |
    http {
      log_format main '$remote_addr - $remote_user [$time_local] '
                      '"$request" $status $body_bytes_sent '
                      '"$http_referer" "$http_user_agent" '
                      '"$http_x_forwarded_for" '
                      '$request_time $upstream_response_time';
      access_log /var/log/nginx/access.log main;
    }
```

## 7. Common Problems and Solutions

### 7.1 Log Loss

Causes:

- Pod restarts discard container logs
- Misconfigured log collector
- Insufficient storage capacity

Solution:

```yaml
# Back log directories with persistent storage
volumes:
  - name: varlog
    persistentVolumeClaim:
      claimName: log-storage
```

### 7.2 Slow Log Queries

Causes:

- Too many indices
- Poorly chosen query conditions
- Insufficient storage performance

Solution:

```yaml
# Configure index lifecycle management (ECK)
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  nodeSets:
    - name: default
      count: 3
      config:
        node.store.allow_mmap: false
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes:
              - ReadWriteOnce
            resources:
              requests:
                storage: 100Gi
            storageClassName: fast
```

### 7.3 Sensitive Data Leakage in Logs

Causes:

- Logs contain passwords, tokens, and other secrets
- Logs are not masked before shipping

Solution:

```
# Fluentd masking configuration
<filter **>
  @type record_transformer
  enable_ruby true
  <record>
    message ${record["message"].gsub(/(password|token)=[^ ]+/, '\1=***')}
  </record>
</filter>
```

## Conclusion

Log management is a core part of operating a Kubernetes cluster. A well-designed collection architecture, standardized log formats, sound retention policies, and capable analysis tools together form an efficient, reliable log management system. Combined with security and compliance requirements and continuous tuning, such a system provides strong support for both troubleshooting and business analysis.