Introduction
Hello, nice to meet you. I'm ngerukatakata.
I moved into engineering from a sales role with no prior experience, and I've been working in the field for a fair amount of time now.
Recently I started working with an AWS EKS environment.
It's my first time touching Kubernetes, and I don't have much AWS experience either, so I've been struggling quite a bit.
Some parts turn out to be easy, while others leave me wondering how on earth to get them working...
I imagine many of you felt the same kind of pain when you first touched Kubernetes.
This post is about one of the things I struggled with: metrics monitoring on AWS EKS.
The metrics monitoring tool covered here is ADOT (AWS Distro for OpenTelemetry).
Background
We recently took on the challenge of building a new operations platform, which meant starting to work with Kubernetes.
And starting on a new operations platform naturally means rethinking monitoring as well.
Until now we monitored physical or virtual servers by installing the Zabbix agent on them.
But after some research, it became clear that installing the Zabbix agent as before is difficult in an EKS environment.
Simply carrying over our existing approach was not going to be the easy win we hoped for.
So I dug through blog posts by various experts and came across something called ADOT.
ADOT (AWS Distro for OpenTelemetry) uses the OpenTelemetry framework to extract telemetry data in a convenient way.
Just when I thought our metrics were sorted, it turned out that the existing ADOT configuration is Fargate-specific and does not work properly in a cluster with EC2 nodes added.
So after studying cAdvisor and the structure of the OpenTelemetry (OTel) Collector, I managed to build a metrics monitoring setup with ADOT that works on an EKS cluster combining Fargate and EC2.
In this post I'll introduce ADOT as a monitoring tool and explain how I got it working in a mixed Fargate + EC2 configuration.
Let's start monitoring EKS!
First, let me explain what ADOT is.
What is ADOT?
AWS Distro for OpenTelemetry is a distribution of the OpenTelemetry project supported by AWS.
A Pod called the ADOT Collector gathers metrics and handles everything up to sending them to destinations such as CloudWatch.
What is OpenTelemetry?
OpenTelemetry standardizes the collection and export of monitoring data such as metrics, so that it can be collected and sent in a simple, vendor-neutral way.
You can think of ADOT as building on OpenTelemetry to handle collection through export in a way tailored to AWS.
Installing ADOT on EKS on Fargate
Now let me walk through how I actually added metrics monitoring with ADOT, based on a real example.
The example follows the steps below, so it should be easy to follow along:
- Creating a Fargate profile
- Creating the IAM role
- Creating the ADOT Collector
- Checking the results in Container Insights
Creating a Fargate profile
The ADOT Collector runs on Fargate, so you first need to create a Fargate profile so that EKS knows to schedule it there.
We manage our AWS infrastructure with Terraform, so we created the Fargate profile with a definition like the following:
module "eks" { source = "terraform-aws-modules/eks/aws" version = "~> 18.30.2" 中略 fargate_profiles = { default = { name = "default" selectors = [ { namespace = "default" }, { namespace = "kube-system" } ] subnet_ids = var.private_subnets }, fargate-container-insights = { name = "fargate-container-insights" selectors = [ { namespace = "fargate-container-insights" } ] subnet_ids = var.private_subnets iam_role_additional_policies = ["arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"] } } }
Creating the IAM role
The ADOT Collector needs IAM permissions to send metrics data to CloudWatch.
We created the IAM role with Terraform like this:
module "eks-fargate-adot_irsa" { source = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc" version = "3.5.0" create_role = true role_name = "${var.cluster.name}-EKS-Fargate-ADOT-ServiceAccount-Role" role_policy_arns = ["arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"] provider_url = module.eks.cluster_oidc_issuer_url oidc_fully_qualified_subjects = ["system:serviceaccount:fargate-container-insights:adot-collector"] }
This creates a role with the name given in role_name and the CloudWatchAgentServerPolicy attached.
It allows the pod adot-collector in the namespace fargate-container-insights on this EKS cluster to assume the role when it runs.
Apply YAML like the following to the cluster beforehand to create the namespace and ServiceAccount:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: fargate-container-insights
  labels:
    name: fargate-container-insights
```
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: adot-collector
  namespace: fargate-container-insights
  annotations:
    eks.amazonaws.com/role-arn: [IRSAARN]
```
Replace [IRSAARN] with the ARN of the role created by the Terraform code above.
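For reference, here is one way to look up that ARN and apply the two manifests. This is only a sketch: the role name follows the naming pattern from the Terraform example above, and the file names are placeholders.

```sh
# Look up the IRSA role ARN created by Terraform
aws iam get-role \
  --role-name YOUR-CLUSTER-EKS-Fargate-ADOT-ServiceAccount-Role \
  --query 'Role.Arn' --output text

# Create the namespace and the annotated ServiceAccount
kubectl apply -f namespace.yaml
kubectl apply -f serviceaccount.yaml
```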
Creating the ADOT Collector
Next, apply the following YAML to create the ADOT Collector as a StatefulSet.
First, create a ClusterRole like this,
```yaml
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - nodes/metrics
      - services
      - endpoints
      - pods
      - pods/proxy
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics/cadvisor"]
    verbs: ["get", "list", "watch"]
```
and bind it to the ServiceAccount created earlier:
```yaml
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role-binding
subjects:
  - kind: ServiceAccount
    name: adot-collector
    namespace: fargate-container-insights
roleRef:
  kind: ClusterRole
  name: adotcol-admin-role
  apiGroup: rbac.authorization.k8s.io
```
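To verify the RBAC wiring, assuming the manifests above are already applied, you can impersonate the ServiceAccount with kubectl auth can-i:

```sh
# Should print "yes" once the ClusterRoleBinding is in place
kubectl auth can-i list nodes \
  --as=system:serviceaccount:fargate-container-insights:adot-collector
```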
Then comes the key piece: the ConfigMap that holds the ADOT configuration, which is the OpenTelemetry Collector configuration itself.
It is quite long, so here is a summary before the full YAML:
- "receivers" defines how data is received,
- "processors" defines how that data is handled,
- "exporters" defines where the results are sent.
For this config, all you need to keep in mind is that "receivers" collects Kubernetes metrics from cAdvisor (scraped through the Prometheus receiver), "processors" shapes them into the metrics data we need, and "exporters" sends the results to CloudWatch.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-collector-config
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector-config
data:
  adot-collector-config: |
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 1m
            scrape_timeout: 40s
          scrape_configs:
            - job_name: 'kubelets-cadvisor-metrics'
              sample_limit: 10000
              scheme: https
              kubernetes_sd_configs:
                - role: node
              tls_config:
                ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              relabel_configs:
                - action: labelmap
                  regex: __meta_kubernetes_node_label_(.+)
                # Only for Kubernetes ^1.7.3.
                # See: https://github.com/prometheus/prometheus/issues/2916
                - target_label: __address__
                  # Changes the address to Kube API server's default address and port
                  replacement: kubernetes.default.svc:443
                - source_labels: [__meta_kubernetes_node_name]
                  regex: (.+)
                  target_label: __metrics_path__
                  # Changes the default metrics path to kubelet's proxy cadvisor metrics endpoint
                  replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
              metric_relabel_configs:
                # extract readable container/pod name from id field
                - action: replace
                  source_labels: [id]
                  regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
                  target_label: rkt_container_name
                  replacement: '$${2}-$${1}'
                - action: replace
                  source_labels: [id]
                  regex: '^/system\.slice/(.+)\.service$'
                  target_label: systemd_service_name
                  replacement: '$${1}'

    processors:
      # rename labels which apply to all metrics and are used in metricstransform/rename processor
      metricstransform/label_1:
        transforms:
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: name
                new_label: container_id
              - action: update_label
                label: kubernetes_io_hostname
                new_label: NodeName
              - action: update_label
                label: eks_amazonaws_com_compute_type
                new_label: LaunchType

      # rename container and pod metrics which we care about.
      # container metrics are renamed to `new_container_*` to differentiate them with unused container metrics
      metricstransform/rename:
        transforms:
          - include: container_spec_cpu_quota
            new_name: new_container_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_spec_cpu_shares
            new_name: new_container_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_cpu_usage_seconds_total
            new_name: new_container_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_spec_memory_limit_bytes
            new_name: new_container_memory_limit
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_cache
            new_name: new_container_memory_cache
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_max_usage_bytes
            new_name: new_container_memory_max_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_usage_bytes
            new_name: new_container_memory_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_working_set_bytes
            new_name: new_container_memory_working_set
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_rss
            new_name: new_container_memory_rss
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_swap
            new_name: new_container_memory_swap
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failcnt
            new_name: new_container_memory_failcnt
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failures_total
            new_name: new_container_memory_hierarchical_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: new_container_memory_hierarchical_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: new_container_memory_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
          - include: container_memory_failures_total
            new_name: new_container_memory_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
          - include: container_fs_limit_bytes
            new_name: new_container_filesystem_capacity
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_fs_usage_bytes
            new_name: new_container_filesystem_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          # POD LEVEL METRICS
          - include: container_spec_cpu_quota
            new_name: pod_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_spec_cpu_shares
            new_name: pod_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_cpu_usage_seconds_total
            new_name: pod_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_spec_memory_limit_bytes
            new_name: pod_memory_limit
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_cache
            new_name: pod_memory_cache
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_max_usage_bytes
            new_name: pod_memory_max_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_usage_bytes
            new_name: pod_memory_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_working_set_bytes
            new_name: pod_memory_working_set
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_rss
            new_name: pod_memory_rss
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_swap
            new_name: pod_memory_swap
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failcnt
            new_name: pod_memory_failcnt
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failures_total
            new_name: pod_memory_hierarchical_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: pod_memory_hierarchical_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: pod_memory_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
          - include: container_memory_failures_total
            new_name: pod_memory_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
          - include: container_network_receive_bytes_total
            new_name: pod_network_rx_bytes
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_packets_dropped_total
            new_name: pod_network_rx_dropped
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_errors_total
            new_name: pod_network_rx_errors
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_packets_total
            new_name: pod_network_rx_packets
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_bytes_total
            new_name: pod_network_tx_bytes
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_packets_dropped_total
            new_name: pod_network_tx_dropped
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_errors_total
            new_name: pod_network_tx_errors
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_packets_total
            new_name: pod_network_tx_packets
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}

      # filter out only renamed metrics which we care about
      filter:
        metrics:
          include:
            match_type: regexp
            metric_names:
              - new_container_.*
              - pod_.*

      # convert cumulative sum datapoints to delta
      cumulativetodelta:
        metrics:
          - new_container_cpu_usage_seconds_total
          - pod_cpu_usage_seconds_total
          - pod_memory_pgfault
          - pod_memory_pgmajfault
          - pod_memory_hierarchical_pgfault
          - pod_memory_hierarchical_pgmajfault
          - pod_network_rx_bytes
          - pod_network_rx_dropped
          - pod_network_rx_errors
          - pod_network_rx_packets
          - pod_network_tx_bytes
          - pod_network_tx_dropped
          - pod_network_tx_errors
          - pod_network_tx_packets
          - new_container_memory_pgfault
          - new_container_memory_pgmajfault
          - new_container_memory_hierarchical_pgfault
          - new_container_memory_hierarchical_pgmajfault

      # convert delta to rate
      deltatorate:
        metrics:
          - new_container_cpu_usage_seconds_total
          - pod_cpu_usage_seconds_total
          - pod_memory_pgfault
          - pod_memory_pgmajfault
          - pod_memory_hierarchical_pgfault
          - pod_memory_hierarchical_pgmajfault
          - pod_network_rx_bytes
          - pod_network_rx_dropped
          - pod_network_rx_errors
          - pod_network_rx_packets
          - pod_network_tx_bytes
          - pod_network_tx_dropped
          - pod_network_tx_errors
          - pod_network_tx_packets
          - new_container_memory_pgfault
          - new_container_memory_pgmajfault
          - new_container_memory_hierarchical_pgfault
          - new_container_memory_hierarchical_pgmajfault

      experimental_metricsgeneration/1:
        rules:
          - name: pod_network_total_bytes
            unit: Bytes/Second
            type: calculate
            metric1: pod_network_rx_bytes
            metric2: pod_network_tx_bytes
            operation: add
          - name: pod_memory_utilization_over_pod_limit
            unit: Percent
            type: calculate
            metric1: pod_memory_working_set
            metric2: pod_memory_limit
            operation: percent
          - name: pod_cpu_usage_total
            unit: Millicore
            type: scale
            metric1: pod_cpu_usage_seconds_total
            operation: multiply
            # core to millicore: multiply by 1000
            # millicore seconds to millicore nanoseconds: multiply by 10^9
            scale_by: 1000
          - name: pod_cpu_limit
            unit: Millicore
            type: scale
            metric1: pod_cpu_limit_raw
            operation: divide
            scale_by: 100

      experimental_metricsgeneration/2:
        rules:
          - name: pod_cpu_utilization_over_pod_limit
            type: calculate
            unit: Percent
            metric1: pod_cpu_usage_total
            metric2: pod_cpu_limit
            operation: percent

      # add `Type` and rename metrics and labels
      metricstransform/label_2:
        transforms:
          - include: pod_.*
            match_type: regexp
            action: update
            operations:
              - action: add_label
                new_label: Type
                new_value: "Pod"
          - include: new_container_.*
            match_type: regexp
            action: update
            operations:
              - action: add_label
                new_label: Type
                new_value: Container
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: namespace
                new_label: Namespace
              - action: update_label
                label: pod
                new_label: PodName
          - include: ^new_container_(.*)$$
            match_type: regexp
            action: update
            new_name: container_$$1

      # add cluster name from env variable and EKS metadata
      resourcedetection:
        detectors: [env, eks]

      batch:
        timeout: 60s

    # only pod level metrics in metrics format, details in https://aws-otel.github.io/docs/getting-started/container-insights/eks-fargate
    exporters:
      awsemf:
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{PodName}'
        namespace: 'ContainerInsights'
        region: YOUR-AWS-REGION
        resource_to_telemetry_conversion:
          enabled: true
        eks_fargate_container_insights_enabled: true
        parse_json_encoded_attr_values: ["kubernetes"]
        dimension_rollup_option: NoDimensionRollup
        metric_declarations:
          - dimensions: [ [ClusterName, LaunchType], [ClusterName, Namespace, LaunchType], [ClusterName, Namespace, PodName, LaunchType]]
            metric_name_selectors:
              - pod_cpu_utilization_over_pod_limit
              - pod_cpu_usage_total
              - pod_cpu_limit
              - pod_memory_utilization_over_pod_limit
              - pod_memory_working_set
              - pod_memory_limit
              - pod_network_rx_bytes
              - pod_network_tx_bytes

    extensions:
      health_check:

    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [metricstransform/label_1, resourcedetection, metricstransform/rename, filter, cumulativetodelta, deltatorate, experimental_metricsgeneration/1, experimental_metricsgeneration/2, metricstransform/label_2, batch]
          exporters: [awsemf]
      extensions: [health_check]
```
Next, create a ClusterIP Service for connecting to the ADOT Collector:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: adot-collector-service
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector
spec:
  ports:
    - name: metrics # default endpoint for querying metrics.
      port: 8888
  selector:
    component: adot-collector
  type: ClusterIP
```
Finally, using the ConfigMap defined above, create the ADOT Collector as a StatefulSet:
```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: adot-collector
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector
spec:
  selector:
    matchLabels:
      app: aws-adot
      component: adot-collector
  serviceName: adot-collector-service
  template:
    metadata:
      labels:
        app: aws-adot
        component: adot-collector
    spec:
      serviceAccountName: adot-collector
      securityContext:
        fsGroup: 65534
      containers:
        - image: amazon/aws-otel-collector:v0.15.1
          name: adot-collector
          imagePullPolicy: Always
          command:
            - "/awscollector"
            - "--config=/conf/adot-collector-config.yaml"
          env:
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "ClusterName=YOUR-EKS-CLUSTER-NAME"
          resources:
            limits:
              cpu: 2
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 400Mi
          volumeMounts:
            - name: adot-collector-config-volume
              mountPath: /conf
      volumes:
        - configMap:
            name: adot-collector-config
            items:
              - key: adot-collector-config
                path: adot-collector-config.yaml
          name: adot-collector-config-volume
```
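Once everything above is applied, a couple of quick checks (the resource names follow the manifests above) confirm that the collector pod was scheduled onto Fargate and is scraping and exporting without errors:

```sh
# The collector pod should be scheduled onto a Fargate node and reach Running
kubectl -n fargate-container-insights get pods -o wide

# Watch the collector logs for scrape or export errors
kubectl -n fargate-container-insights logs statefulset/adot-collector -f
```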
Checking the results in Container Insights
With all of the above in place, once you create pods in the cluster their metrics automatically become visible in Container Insights.
The metrics behind those dashboards can also be viewed directly as CloudWatch metrics.
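If you prefer the CLI, something like the following lists the metrics the collector is publishing. It assumes the ContainerInsights namespace from the exporter config above; the region is a placeholder.

```sh
# List the Container Insights metrics published by the ADOT Collector
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --region YOUR-AWS-REGION
```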
Installing ADOT on EKS on Fargate + EC2
The setup so far lets us collect metrics for Fargate.
However, if the cluster also contains EC2 nodes, metrics for pods running on EC2 are not collected by the method above.
So let's make two additional changes so that metrics from EC2 are collected as well.
Adding a label to the node group
Add a label to the EC2 node group: set LaunchType: EC2 (see the example below).
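How the label is attached depends on how you manage the node group. As a purely illustrative sketch (the node name is a placeholder), you can label an existing node directly with kubectl; for a managed node group you would normally define the label in the node group configuration itself, for example in Terraform, so it survives node replacement:

```sh
# Label an existing EC2 node directly (node name is a placeholder)
kubectl label nodes ip-10-0-1-23.ap-northeast-1.compute.internal LaunchType=EC2

# Confirm the label; ADOT picks it up through the node labelmap relabel rule
kubectl get nodes -L LaunchType
```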
Modifying the ConfigMap
Next, modify the ConfigMap as follows.
As it stands, "processors" only matches metrics whose LaunchType is fargate,
so change it so that LaunchType: EC2 is matched as well.
In addition, memory utilization is only generated as the Fargate-oriented pod_memory_* metrics, which don't come out correctly for pods on EC2, so we add a new rule that produces a metric named container_memory_utilization_over_pod_limit.
Finally, in "exporters", add the newly created container_memory_utilization_over_pod_limit together with container_memory_working_set and container_memory_limit.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-collector-config
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector-config
data:
  adot-collector-config: |
    receivers:
      # (snip)

    processors:
      # rename labels which apply to all metrics and are used in metricstransform/rename processor
      metricstransform/label_1:
        transforms:
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: name
                new_label: container_id
              - action: update_label
                label: kubernetes_io_hostname
                new_label: NodeName
              - action: update_label
                label: eks_amazonaws_com_compute_type
                new_label: LaunchType

      # rename container and pod metrics which we care about.
      # container metrics are renamed to `new_container_*` to differentiate them with unused container metrics
      metricstransform/rename:
        transforms:
          - include: container_spec_cpu_quota
            new_name: new_container_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate|EC2"}
          - include: container_spec_cpu_shares
            new_name: new_container_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate|EC2"}
          - include: container_cpu_usage_seconds_total
            new_name: new_container_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate|EC2"}
          # (snip)

      experimental_metricsgeneration/1:
        rules:
          - name: pod_network_total_bytes
            unit: Bytes/Second
            type: calculate
            metric1: pod_network_rx_bytes
            metric2: pod_network_tx_bytes
            operation: add
          - name: pod_memory_utilization_over_pod_limit
            unit: Percent
            type: calculate
            metric1: pod_memory_working_set
            metric2: pod_memory_limit
            operation: percent
          - name: container_memory_utilization_over_pod_limit   # added
            unit: Percent
            type: calculate
            metric1: new_container_memory_working_set
            metric2: new_container_memory_limit
            operation: percent
      # (snip)

    exporters:
      awsemf:
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{PodName}'
        namespace: 'ContainerInsights'
        region: ap-northeast-1
        resource_to_telemetry_conversion:
          enabled: true
        eks_fargate_container_insights_enabled: true
        parse_json_encoded_attr_values: ["kubernetes"]
        dimension_rollup_option: NoDimensionRollup
        metric_declarations:
          - dimensions: [ [ClusterName, LaunchType], [ClusterName, Namespace, LaunchType], [ClusterName, Namespace, PodName], [ClusterName, Namespace, PodName, LaunchType]]
            metric_name_selectors:
              - pod_cpu_utilization_over_pod_limit
              - pod_cpu_usage_total
              - pod_cpu_limit
              - pod_memory_utilization_over_pod_limit
              - container_memory_utilization_over_pod_limit   # added
              - pod_memory_working_set
              - container_memory_working_set                  # added
              - pod_memory_limit
              - container_memory_limit                        # added
              - pod_network_rx_bytes
              - pod_network_tx_bytes
    # (the rest of the file is unchanged)
```
Apply this updated ConfigMap and deploy the ADOT Collector again (see the commands below).
The metrics can then also be checked in CloudWatch metrics.
Note that the newly added metrics do not show up on the Container Insights dashboards, only in CloudWatch metrics, so keep that in mind.
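A minimal sketch of that rollout, plus one way to confirm the new metric is arriving (the file name and region are placeholders):

```sh
# Apply the updated ConfigMap and restart the collector so it picks up the new config
kubectl apply -f adot-collector-configmap.yaml
kubectl -n fargate-container-insights rollout restart statefulset adot-collector

# The new container-level metric should appear after a few scrape intervals
aws cloudwatch list-metrics \
  --namespace ContainerInsights \
  --metric-name container_memory_utilization_over_pod_limit \
  --region YOUR-AWS-REGION
```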
Summary
I've walked through a real example of one way to use ADOT: collecting metrics for pods running on EC2 as well as Fargate.
Because I kept the existing ADOT configuration largely intact, there are redundant settings and plenty of room for improvement.
I hope this post helps anyone who has tried ADOT and is wondering how to collect metrics for both Fargate and EC2.
References
https://aws.amazon.com/jp/blogs/news/introducing-amazon-cloudwatch-container-insights-for-amazon-eks-fargate-using-aws-distro-for-opentelemetry/
https://opentelemetry.io/docs/
https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/