
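Introduction
Hello, and nice to meet you. I'm ngerukatakata.
I moved into engineering from sales with no prior experience, and I have now been working as an engineer for a fair while.
Recently I started working with AWS EKS.
It is my first time with Kubernetes (k8s), and I have not used AWS much either, so it has been a struggle.
Some parts are easy, while others leave me wondering how on earth to get them working.
I imagine many of you felt the same pain when you first touched k8s.
In this post I would like to talk about one of the things I struggled with: metrics monitoring on AWS EKS.
The metrics monitoring tool covered here is ADOT (AWS Distro for OpenTelemetry).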
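Background
As part of a move to a new operations platform, we started working with k8s.
A new operations platform naturally means
that monitoring has to be reconsidered as well.
Until now our environments were physical or virtual servers, and we monitored them by installing the Zabbix agent.
When I looked into it, though, installing the Zabbix agent on EKS the way we always had seemed difficult.
Simply carrying over the old approach was not going to work.
So I read through a lot of blog posts by people wiser than me, and that is how I came across ADOT.
ADOT (AWS Distro for OpenTelemetry) uses the OpenTelemetry machinery
to pull out the data you need. I thought that settled metrics,
but it turned out that the existing ADOT configuration is tailored to Fargate and does not work well once EC2 nodes are added to the cluster.
After studying cAdvisor and the structure of the OpenTelemetry (OTel) Collector,
I eventually managed to build an ADOT-based metrics monitoring setup that works on an EKS cluster with both Fargate and EC2.
This post explains ADOT as a monitoring tool
and how I made it usable in a cluster where EC2 nodes run alongside Fargate.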
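Let's start monitoring EKS!
First, a quick explanation of ADOT.
What is ADOT?
AWS Distro for OpenTelemetry is an AWS-supported distribution of the OpenTelemetry project.
A Pod called the ADOT Collector gathers metrics and takes care of sending them on to destinations such as CloudWatch.
What is OpenTelemetry?
OpenTelemetry standardizes how telemetry such as metrics is collected and sent for system monitoring,
so that collection and export stay simple and do not depend on a particular vendor.
You can think of ADOT as a packaging of OpenTelemetry that handles collection and export conveniently for AWS.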
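Installing ADOT on EKS on Fargate
Now let me walk through how I added metrics monitoring with ADOT,
using what I actually did as the example.
The example follows these steps:
- Create the Fargate profile
- Create the IAM role
- Create the ADOT Collector
- Check the results in Container Insights
so it should be easy to picture the overall flow.
Creating the Fargate profile
The ADOT Collector runs on Fargate,
so a Fargate profile has to be created in advance so that EKS knows to schedule it there.
We manage our AWS configuration with Terraform, so I created the Fargate profile with a definition like the following.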
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.30.2"
  # (omitted)
  fargate_profiles = {
    default = {
      name = "default"
      selectors = [
        {
          namespace = "default"
        },
        {
          namespace = "kube-system"
        }
      ]
      subnet_ids = var.private_subnets
    },
    fargate-container-insights = {
      name = "fargate-container-insights"
      selectors = [
        {
          namespace = "fargate-container-insights"
        }
      ]
      subnet_ids                   = var.private_subnets
      iam_role_additional_policies = ["arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"]
    }
  }
}
Creating the IAM role
The ADOT Collector needs IAM permissions to send metrics data to CloudWatch.
I created the IAM role with Terraform using a definition like the following.
module "eks-fargate-adot_irsa" {
  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version                       = "3.5.0"
  create_role                   = true
  role_name                     = "${var.cluster.name}-EKS-Fargate-ADOT-ServiceAccount-Role"
  role_policy_arns              = ["arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy"]
  provider_url                  = module.eks.cluster_oidc_issuer_url
  oidc_fully_qualified_subjects = ["system:serviceaccount:fargate-container-insights:adot-collector"]
}
This creates a role, named by role_name, with the CloudWatchAgentServerPolicy attached.
Pods that run under the ServiceAccount "adot-collector" in the "fargate-container-insights" namespace on EKS
are allowed to assume this role.
On the EKS cluster, apply YAML like the following to create the namespace and the ServiceAccount.
apiVersion: v1
kind: Namespace
metadata:
  name: fargate-container-insights
  labels:
    name: fargate-container-insights
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: adot-collector
  namespace: fargate-container-insights
  annotations:
    eks.amazonaws.com/role-arn: [IRSAARN]
Replace [IRSAARN] with the ARN of the role (role_name) that Terraform created above.
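The link between the IAM role and the ServiceAccount comes down to one string: the subject of the ServiceAccount's OIDC token. As a rough illustration only (the helper names are hypothetical, and this is not how IAM itself is implemented), the sketch below builds that subject from a namespace and a ServiceAccount name and compares it with the value passed to oidc_fully_qualified_subjects, assuming an exact string match.

```python
def service_account_subject(namespace, name):
    """Build the token subject Kubernetes uses for a ServiceAccount."""
    return f"system:serviceaccount:{namespace}:{name}"


def may_assume(allowed_subjects, namespace, name):
    """Hypothetical check: exact match against the allowed subjects."""
    return service_account_subject(namespace, name) in allowed_subjects


# Value taken from oidc_fully_qualified_subjects in the Terraform definition.
ALLOWED = ["system:serviceaccount:fargate-container-insights:adot-collector"]

if __name__ == "__main__":
    print(may_assume(ALLOWED, "fargate-container-insights", "adot-collector"))
    print(may_assume(ALLOWED, "default", "adot-collector"))
```

If the namespace or the ServiceAccount name in the YAML differs even slightly from the Terraform value, the subject no longer matches and the Collector would not get the role.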
Creating the ADOT Collector
Next, apply the following YAML to create the ADOT Collector as a StatefulSet.
First, create a ClusterRole like this:
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/proxy
      - nodes/metrics
      - services
      - endpoints
      - pods
      - pods/proxy
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: [ "/metrics/cadvisor"]
    verbs: ["get", "list", "watch"]
Then bind it to the ServiceAccount created earlier.
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: adotcol-admin-role-binding
subjects:
  - kind: ServiceAccount
    name: adot-collector
    namespace: fargate-container-insights
roleRef:
  kind: ClusterRole
  name: adotcol-admin-role
  apiGroup: rbac.authorization.k8s.io
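Next comes the important part: the ConfigMap that holds the ADOT configuration.
This is the OpenTelemetry Collector configuration.
The full configuration is long, but its structure is simple:
"receivers" defines how data is received,
"processors" defines how the data is handled,
and "exporters" defines where it is sent.
For this setup, all you need to keep in mind is that
"receivers" uses the prometheus receiver to scrape the kubelet's cAdvisor endpoint and collect information about the k8s environment,
"processors" shapes the parts we need into metrics data,
and "exporters" sends the result to CloudWatch.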
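To make the receivers, processors, exporters flow a little more tangible, here is a toy Python model of it. Everything in it is made up for illustration (the function names, the sample metrics, the output format); it is not the Collector's actual code, only a sketch of the idea that data is received, passed through processors in the listed order, and then exported.

```python
import re


def receive():
    """Stand-in for a receiver: made-up samples such as a cAdvisor scrape might return."""
    return [
        {"name": "container_memory_rss", "labels": {"pod": "web-0"}, "value": 1024.0},
        {"name": "container_tasks_state", "labels": {"pod": "web-0"}, "value": 3.0},
    ]


def insert_renamed(metrics):
    """Processor: add a renamed copy of the metric we care about (like action: insert)."""
    out = list(metrics)
    for m in metrics:
        if m["name"] == "container_memory_rss":
            out.append({**m, "name": "pod_memory_rss"})
    return out


def keep_pod_metrics(metrics):
    """Processor: keep only names matching pod_.* (like the filter processor)."""
    return [m for m in metrics if re.match(r"pod_.*", m["name"])]


def export(metrics):
    """Stand-in for an exporter: format lines instead of calling CloudWatch."""
    return [f'{m["name"]}{{pod="{m["labels"]["pod"]}"}} {m["value"]}' for m in metrics]


def run_pipeline(processors):
    metrics = receive()
    for process in processors:  # processors run in the order they are listed
        metrics = process(metrics)
    return export(metrics)


if __name__ == "__main__":
    for line in run_pipeline([insert_renamed, keep_pod_metrics]):
        print(line)
```

Swapping the two processors yields no output at all, because the filter would run before the renamed metric exists. The same reasoning explains why the order of the processors list in the pipeline definition matters.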
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-collector-config
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector-config
data:
  adot-collector-config: |
    receivers:
      prometheus:
        config:
          global:
            scrape_interval: 1m
            scrape_timeout: 40s
          scrape_configs:
          - job_name: 'kubelets-cadvisor-metrics'
            sample_limit: 10000
            scheme: https
            kubernetes_sd_configs:
            - role: node
            tls_config:
              ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
            relabel_configs:
              - action: labelmap
                regex: __meta_kubernetes_node_label_(.+)
                # Only for Kubernetes ^1.7.3.
                # See: https://github.com/prometheus/prometheus/issues/2916
              - target_label: __address__
                # Changes the address to Kube API server's default address and port
                replacement: kubernetes.default.svc:443
              - source_labels: [__meta_kubernetes_node_name]
                regex: (.+)
                target_label: __metrics_path__
                # Changes the default metrics path to kubelet's proxy cadvisor metrics endpoint
                replacement: /api/v1/nodes/$${1}/proxy/metrics/cadvisor
            metric_relabel_configs:
              # extract readable container/pod name from id field
              - action: replace
                source_labels: [id]
                regex: '^/machine\.slice/machine-rkt\\x2d([^\\]+)\\.+/([^/]+)\.service$'
                target_label: rkt_container_name
                replacement: '$${2}-$${1}'
              - action: replace
                source_labels: [id]
                regex: '^/system\.slice/(.+)\.service$'
                target_label: systemd_service_name
                replacement: '$${1}'
    processors:
      # rename labels which apply to all metrics and are used in metricstransform/rename processor
      metricstransform/label_1:
        transforms:
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: name
                new_label: container_id
              - action: update_label
                label: kubernetes_io_hostname
                new_label: NodeName
              - action: update_label
                label: eks_amazonaws_com_compute_type
                new_label: LaunchType
      # rename container and pod metrics which we care about.
      # container metrics are renamed to `new_container_*` to differentiate them with unused container metrics
      metricstransform/rename:
        transforms:
          - include: container_spec_cpu_quota
            new_name: new_container_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_spec_cpu_shares
            new_name: new_container_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_cpu_usage_seconds_total
            new_name: new_container_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_spec_memory_limit_bytes
            new_name: new_container_memory_limit
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_cache
            new_name: new_container_memory_cache
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_max_usage_bytes
            new_name: new_container_memory_max_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_usage_bytes
            new_name: new_container_memory_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_working_set_bytes
            new_name: new_container_memory_working_set
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_rss
            new_name: new_container_memory_rss
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_swap
            new_name: new_container_memory_swap
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failcnt
            new_name: new_container_memory_failcnt
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failures_total
            new_name: new_container_memory_hierarchical_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: new_container_memory_hierarchical_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: new_container_memory_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
          - include: container_memory_failures_total
            new_name: new_container_memory_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
          - include: container_fs_limit_bytes
            new_name: new_container_filesystem_capacity
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          - include: container_fs_usage_bytes
            new_name: new_container_filesystem_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate"}
          # POD LEVEL METRICS
          - include: container_spec_cpu_quota
            new_name: pod_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_spec_cpu_shares
            new_name: pod_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_cpu_usage_seconds_total
            new_name: pod_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_spec_memory_limit_bytes
            new_name: pod_memory_limit
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_cache
            new_name: pod_memory_cache
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_max_usage_bytes
            new_name: pod_memory_max_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_usage_bytes
            new_name: pod_memory_usage
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_working_set_bytes
            new_name: pod_memory_working_set
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_rss
            new_name: pod_memory_rss
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_swap
            new_name: pod_memory_swap
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failcnt
            new_name: pod_memory_failcnt
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate"}
          - include: container_memory_failures_total
            new_name: pod_memory_hierarchical_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: pod_memory_hierarchical_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "hierarchy"}
          - include: container_memory_failures_total
            new_name: pod_memory_pgfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgfault", "scope": "container"}
          - include: container_memory_failures_total
            new_name: pod_memory_pgmajfault
            action: insert
            match_type: regexp
            experimental_match_labels: {"image": "^$", "container": "^$", "pod": "\\S", "LaunchType": "fargate", "failure_type": "pgmajfault", "scope": "container"}
          - include: container_network_receive_bytes_total
            new_name: pod_network_rx_bytes
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_packets_dropped_total
            new_name: pod_network_rx_dropped
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_errors_total
            new_name: pod_network_rx_errors
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_receive_packets_total
            new_name: pod_network_rx_packets
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_bytes_total
            new_name: pod_network_tx_bytes
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_packets_dropped_total
            new_name: pod_network_tx_dropped
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_errors_total
            new_name: pod_network_tx_errors
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
          - include: container_network_transmit_packets_total
            new_name: pod_network_tx_packets
            action: insert
            match_type: regexp
            experimental_match_labels: {"pod": "\\S", "LaunchType": "fargate"}
      # filter out only renamed metrics which we care about
      filter:
        metrics:
          include:
            match_type: regexp
            metric_names:
              - new_container_.*
              - pod_.*
      # convert cumulative sum datapoints to delta
      cumulativetodelta:
        metrics:
          - new_container_cpu_usage_seconds_total
          - pod_cpu_usage_seconds_total
          - pod_memory_pgfault
          - pod_memory_pgmajfault
          - pod_memory_hierarchical_pgfault
          - pod_memory_hierarchical_pgmajfault
          - pod_network_rx_bytes
          - pod_network_rx_dropped
          - pod_network_rx_errors
          - pod_network_rx_packets
          - pod_network_tx_bytes
          - pod_network_tx_dropped
          - pod_network_tx_errors
          - pod_network_tx_packets
          - new_container_memory_pgfault
          - new_container_memory_pgmajfault
          - new_container_memory_hierarchical_pgfault
          - new_container_memory_hierarchical_pgmajfault
      # convert delta to rate
      deltatorate:
        metrics:
          - new_container_cpu_usage_seconds_total
          - pod_cpu_usage_seconds_total
          - pod_memory_pgfault
          - pod_memory_pgmajfault
          - pod_memory_hierarchical_pgfault
          - pod_memory_hierarchical_pgmajfault
          - pod_network_rx_bytes
          - pod_network_rx_dropped
          - pod_network_rx_errors
          - pod_network_rx_packets
          - pod_network_tx_bytes
          - pod_network_tx_dropped
          - pod_network_tx_errors
          - pod_network_tx_packets
          - new_container_memory_pgfault
          - new_container_memory_pgmajfault
          - new_container_memory_hierarchical_pgfault
          - new_container_memory_hierarchical_pgmajfault
      experimental_metricsgeneration/1:
        rules:
          - name: pod_network_total_bytes
            unit: Bytes/Second
            type: calculate
            metric1: pod_network_rx_bytes
            metric2: pod_network_tx_bytes
            operation: add
          - name: pod_memory_utilization_over_pod_limit
            unit: Percent
            type: calculate
            metric1: pod_memory_working_set
            metric2: pod_memory_limit
            operation: percent
          - name: pod_cpu_usage_total
            unit: Millicore
            type: scale
            metric1: pod_cpu_usage_seconds_total
            operation: multiply
            # core to millicore: multiply by 1000
            # millicore seconds to millicore nanoseconds: multiply by 10^9
            scale_by: 1000
          - name: pod_cpu_limit
            unit: Millicore
            type: scale
            metric1: pod_cpu_limit_raw
            operation: divide
            scale_by: 100
      experimental_metricsgeneration/2:
        rules:
          - name: pod_cpu_utilization_over_pod_limit
            type: calculate
            unit: Percent
            metric1: pod_cpu_usage_total
            metric2: pod_cpu_limit
            operation: percent
      # add `Type` and rename metrics and labels
      metricstransform/label_2:
        transforms:
          - include: pod_.*
            match_type: regexp
            action: update
            operations:
              - action: add_label
                new_label: Type
                new_value: "Pod"
          - include: new_container_.*
            match_type: regexp
            action: update
            operations:
              - action: add_label
                new_label: Type
                new_value: Container
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: namespace
                new_label: Namespace
              - action: update_label
                label: pod
                new_label: PodName
          - include: ^new_container_(.*)$$
            match_type: regexp
            action: update
            new_name: container_$$1
      # add cluster name from env variable and EKS metadata
      resourcedetection:
        detectors: [env, eks]
      batch:
        timeout: 60s
    # only pod level metrics in metrics format, details in https://aws-otel.github.io/docs/getting-started/container-insights/eks-fargate
    exporters:
      awsemf:
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{PodName}'
        namespace: 'ContainerInsights'
        region: YOUR-AWS-REGION
        resource_to_telemetry_conversion:
          enabled: true
        eks_fargate_container_insights_enabled: true
        parse_json_encoded_attr_values: ["kubernetes"]
        dimension_rollup_option: NoDimensionRollup
        metric_declarations:
          - dimensions: [ [ClusterName, LaunchType], [ClusterName, Namespace, LaunchType], [ClusterName, Namespace, PodName, LaunchType]]
            metric_name_selectors:
              - pod_cpu_utilization_over_pod_limit
              - pod_cpu_usage_total
              - pod_cpu_limit
              - pod_memory_utilization_over_pod_limit
              - pod_memory_working_set
              - pod_memory_limit
              - pod_network_rx_bytes
              - pod_network_tx_bytes
    extensions:
      health_check:
    service:
      pipelines:
        metrics:
          receivers: [prometheus]
          processors: [metricstransform/label_1, resourcedetection, metricstransform/rename, filter, cumulativetodelta, deltatorate, experimental_metricsgeneration/1, experimental_metricsgeneration/2, metricstransform/label_2, batch]
          exporters: [awsemf]
      extensions: [health_check]
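To make the CPU part of the processor chain concrete, here is a small Python sketch of the arithmetic that cumulativetodelta, deltatorate and the two experimental_metricsgeneration steps appear to perform to arrive at pod_cpu_utilization_over_pod_limit. The function names and sample numbers are made up for illustration, and it assumes container_spec_cpu_quota is expressed in microseconds per 100 ms CFS period (so 100000 would mean one full CPU).

```python
def cumulative_to_delta(prev_total, curr_total):
    """cumulativetodelta: difference between two cumulative samples (CPU seconds)."""
    return curr_total - prev_total


def delta_to_rate(delta, interval_seconds):
    """deltatorate: delta divided by the elapsed time (CPU seconds per second = cores)."""
    return delta / interval_seconds


def cpu_utilization_over_limit(prev_total, curr_total, interval_seconds, cpu_quota):
    cores = delta_to_rate(cumulative_to_delta(prev_total, curr_total), interval_seconds)
    usage_millicores = cores * 1000      # pod_cpu_usage_total: multiply, scale_by 1000
    limit_millicores = cpu_quota / 100   # pod_cpu_limit: divide, scale_by 100 (assumed quota unit)
    return usage_millicores / limit_millicores * 100  # operation: percent


if __name__ == "__main__":
    # Two scrapes 60 s apart (scrape_interval: 1m), pod limited to 2 CPUs.
    print(cpu_utilization_over_limit(120.0, 150.0, 60.0, 200000))
```

With these sample values the pod used 30 CPU seconds in 60 seconds, which is 500 millicores against a 2000 millicore limit.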
Define a Service of type ClusterIP for connecting to ADOT.
apiVersion: v1
kind: Service
metadata:
  name: adot-collector-service
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector
spec:
  ports:
    - name: metrics # default endpoint for querying metrics.
      port: 8888
  selector:
    component: adot-collector
  type: ClusterIP
Finally, create the ADOT Collector as a StatefulSet that uses the ConfigMap defined above.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: adot-collector
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector
spec:
  selector:
    matchLabels:
      app: aws-adot
      component: adot-collector
  serviceName: adot-collector-service
  template:
    metadata:
      labels:
        app: aws-adot
        component: adot-collector
    spec:
      serviceAccountName: adot-collector
      securityContext:
        fsGroup: 65534
      containers:
        - image: amazon/aws-otel-collector:v0.15.1
          name: adot-collector
          imagePullPolicy: Always
          command:
            - "/awscollector"
            - "--config=/conf/adot-collector-config.yaml"
          env:
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "ClusterName=YOUR-EKS-CLUSTER-NAME"
          resources:
            limits:
              cpu: 2
              memory: 2Gi
            requests:
              cpu: 200m
              memory: 400Mi
          volumeMounts:
            - name: adot-collector-config-volume
              mountPath: /conf
      volumes:
        - configMap:
            name: adot-collector-config
            items:
              - key: adot-collector-config
                path: adot-collector-config.yaml
          name: adot-collector-config-volume
Checking in Container Insights
Once the environment has been set up as described so far,
creating a Pod in the k8s environment makes its metrics data show up in Container Insights automatically.

The metrics behind this dashboard can also be viewed under CloudWatch Metrics.

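Installing ADOT on EKS on Fargate + EC2
With the settings so far, we can collect information from Fargate.
However, if the cluster also contains EC2 nodes, the metrics of Pods running on EC2 cannot be collected this way.
Two additional changes make it possible to collect the metrics on EC2 as well.
Adding a label to the node group
Add a label to the EC2 node group.
Use the key LaunchType with the value EC2. The labelmap rule in the receiver copies node labels onto the scraped metrics, so this gives metrics from EC2 nodes a LaunchType label, just as the eks.amazonaws.com/compute-type node label does for Fargate.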
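Why a node label is enough can be sketched in a few lines of Python. This is only my reading of the configuration, not the Collector's code: the labelmap relabel rule copies every discovered node label (exposed as `__meta_kubernetes_node_label_<name>`) onto the scraped metrics, and metricstransform/label_1 then renames eks_amazonaws_com_compute_type to LaunchType. The function names and the sample label sets are hypothetical.

```python
import re

# Same pattern as the labelmap rule in relabel_configs; Prometheus anchors relabel regexes.
LABELMAP = re.compile(r"__meta_kubernetes_node_label_(.+)")


def labelmap(discovered):
    """Copy each matching discovery label to a label named by the captured group."""
    out = {}
    for key, value in discovered.items():
        m = LABELMAP.fullmatch(key)
        if m:
            out[m.group(1)] = value
    return out


def rename_compute_type(labels):
    """Mimic metricstransform/label_1: eks_amazonaws_com_compute_type -> LaunchType."""
    labels = dict(labels)
    if "eks_amazonaws_com_compute_type" in labels:
        labels["LaunchType"] = labels.pop("eks_amazonaws_com_compute_type")
    return labels


# A Fargate node carries the compute-type label; the EC2 node carries the label we added.
FARGATE_NODE = {"__meta_kubernetes_node_label_eks_amazonaws_com_compute_type": "fargate"}
EC2_NODE = {"__meta_kubernetes_node_label_LaunchType": "EC2"}

if __name__ == "__main__":
    print(rename_compute_type(labelmap(FARGATE_NODE)))
    print(rename_compute_type(labelmap(EC2_NODE)))
```

Both kinds of node end up with a LaunchType label, which is what the match rules in the processors key on.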

Modifying the ConfigMap
Next, modify the ConfigMap as follows.
As it stands, "processors" only picks up data whose LaunchType is fargate.
Change it so that LaunchType EC2 is matched as well.
In addition, the memory utilization metric produced for Fargate is a pod_memory metric,
which does not capture Pods on EC2 well, so we also create one named "container_memory_utilization_over_pod_limit".
Finally, add the new "container_memory_utilization_over_pod_limit", along with "container_memory_working_set" and "container_memory_limit", to "exporters".
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-collector-config
  namespace: fargate-container-insights
  labels:
    app: aws-adot
    component: adot-collector-config
data:
  adot-collector-config: |
    receivers:
      # (omitted)
    processors:
      # rename labels which apply to all metrics and are used in metricstransform/rename processor
      metricstransform/label_1:
        transforms:
          - include: .*
            match_type: regexp
            action: update
            operations:
              - action: update_label
                label: name
                new_label: container_id
              - action: update_label
                label: kubernetes_io_hostname
                new_label: NodeName
              - action: update_label
                label: eks_amazonaws_com_compute_type
                new_label: LaunchType
      # rename container and pod metrics which we care about.
      # container metrics are renamed to `new_container_*` to differentiate them with unused container metrics
      metricstransform/rename:
        transforms:
          - include: container_spec_cpu_quota
            new_name: new_container_cpu_limit_raw
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate|EC2"}
          - include: container_spec_cpu_shares
            new_name: new_container_cpu_request
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate|EC2"}
          - include: container_cpu_usage_seconds_total
            new_name: new_container_cpu_usage_seconds_total
            action: insert
            match_type: regexp
            experimental_match_labels: {"container": "\\S", "LaunchType": "fargate|EC2"}
          # (omitted)
      experimental_metricsgeneration/1:
        rules:
          - name: pod_network_total_bytes
            unit: Bytes/Second
            type: calculate
            metric1: pod_network_rx_bytes
            metric2: pod_network_tx_bytes
            operation: add
          - name: pod_memory_utilization_over_pod_limit
            unit: Percent
            type: calculate
            metric1: pod_memory_working_set
            metric2: pod_memory_limit
            operation: percent
          - name: container_memory_utilization_over_pod_limit # added
            unit: Percent
            type: calculate
            metric1: new_container_memory_working_set
            metric2: new_container_memory_limit
            operation: percent
          # (omitted)
    exporters:
      awsemf:
        log_group_name: '/aws/containerinsights/{ClusterName}/performance'
        log_stream_name: '{PodName}'
        namespace: 'ContainerInsights'
        region: ap-northeast-1
        resource_to_telemetry_conversion:
          enabled: true
        eks_fargate_container_insights_enabled: true
        parse_json_encoded_attr_values: ["kubernetes"]
        dimension_rollup_option: NoDimensionRollup
        metric_declarations:
          - dimensions: [ [ClusterName, LaunchType], [ClusterName, Namespace, LaunchType], [ClusterName, Namespace, PodName], [ClusterName, Namespace, PodName, LaunchType]]
            metric_name_selectors:
              - pod_cpu_utilization_over_pod_limit
              - pod_cpu_usage_total
              - pod_cpu_limit
              - pod_memory_utilization_over_pod_limit
              - container_memory_utilization_over_pod_limit # added
              - pod_memory_working_set
              - container_memory_working_set # added
              - pod_memory_limit
              - container_memory_limit # added
              - pod_network_rx_bytes
              - pod_network_tx_bytes
    # (rest omitted)
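The effect of changing "fargate" to "fargate|EC2" can be checked with a small Python sketch of the label matching. This is a simplified model, not the processor's implementation: the `matches` helper is hypothetical, and it assumes each pattern is searched within the label value without anchoring (which would be consistent with "\\S" being used to mean "non-empty") and that a missing label counts as an empty string.

```python
import re


def matches(match_labels, labels):
    """True when every pattern finds a match in the corresponding label value."""
    return all(
        re.search(pattern, labels.get(key, "")) is not None
        for key, pattern in match_labels.items()
    )


OLD_RULE = {"container": r"\S", "LaunchType": "fargate"}
NEW_RULE = {"container": r"\S", "LaunchType": "fargate|EC2"}

# Made-up samples: one container on Fargate, one on EC2, one pod-level series.
FARGATE_SAMPLE = {"container": "app", "LaunchType": "fargate"}
EC2_SAMPLE = {"container": "app", "LaunchType": "EC2"}
POD_LEVEL_SAMPLE = {"container": "", "LaunchType": "EC2"}

if __name__ == "__main__":
    for sample in (FARGATE_SAMPLE, EC2_SAMPLE, POD_LEVEL_SAMPLE):
        print(sample, matches(OLD_RULE, sample), matches(NEW_RULE, sample))
```

Under these assumptions the old rule skips the EC2 sample, the new rule accepts both launch types, and a series with an empty container label is still excluded from the container-level metrics.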
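Apply this ConfigMap again and redeploy the ADOT Collector.
The metrics can then be viewed under CloudWatch Metrics.
Note: the newly added data cannot be viewed in Container Insights.
Summary
This post walked through an actual setup to show one way of using ADOT: collecting information about Pods running on EC2.
Because I stayed close to the existing ADOT configuration, some settings are unnecessary and there is certainly room for improvement.
I hope this helps anyone who has tried ADOT and wondered how to collect metrics from both Fargate and EC2.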
References
https://aws.amazon.com/jp/blogs/news/introducing-amazon-cloudwatch-container-insights-for-amazon-eks-fargate-using-aws-distro-for-opentelemetry/
https://opentelemetry.io/docs/
https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/