GCP | Debugging the Cloud Monitoring Agent

Introduction

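The errors covered here first showed up in the GCP APIs & Services Dashboard, where the Cloud Monitoring API was reporting an error rate of about 10%.

Digging further into Metrics showed that the failing API method was mainly google.monitoring.v3.MetricService.CreateTimeSeries.

A method name on its own does not reveal what is going wrong, so the next step is to connect to the VM and look there.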

Once connected to the VM, there are three ways to check the stackdriver-agent logs:

sudo grep collectd /var/log/{syslog,messages} | tail
sudo service stackdriver-agent status
sudo cat /var/log/syslog

The first and second commands print only the most recent log entries, while the third prints the complete log. I relied mainly on the third this time and found two main error messages:

write_gcm: can not take infinite value
Unsuccessful HTTP request 400 … The start time must be before the end time
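
Since the full syslog can be long, it may help to count how often each of the two messages appears rather than reading everything. The following is a minimal sketch; the function name count_agent_errors is hypothetical, and the default path /var/log/syslog is an assumption (some distributions log to /var/log/messages instead).

```shell
#!/bin/sh
# Hypothetical helper: count occurrences of the two agent errors in a log file.
# The default path is an assumption; pass /var/log/messages where that applies.
count_agent_errors() {
  log_file="${1:-/var/log/syslog}"
  # -F matches the text literally, -c prints the number of matching lines.
  # "|| true" keeps the function usable under "set -e" when there are no matches.
  infinite=$(grep -cF 'write_gcm: can not take infinite value' "$log_file" || true)
  start_end=$(grep -cF 'The start time must be before the end time' "$log_file" || true)
  echo "infinite_value=$infinite"
  echo "start_before_end=$start_end"
}
```

Call it as count_agent_errors /var/log/syslog, running the shell as root if the log is not readable by your user.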

write_gcm: can not take infinite value

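The main cause of this error is that /etc/stackdriver/collectd.conf, the configuration file used by stackdriver-agent, loads the following plugin block, which performs a calculation on the swap values (ValuesPercentage reports swap usage as a percentage):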

LoadPlugin swap
<Plugin "swap">
  ValuesPercentage true
</Plugin>

However, the Compute Engine VM has no swap, so the calculation divides by zero and produces the infinite value that the agent refuses to send.

Running free -m shows the VM's swap information:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3697         733        2090          45         873        2698
Swap:             0           0           0
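
The same check can be scripted, for example to run across several VMs. This is only a sketch: the function names swap_total_kb and has_no_swap are hypothetical, and it assumes a Linux host where /proc/meminfo contains a SwapTotal line.

```shell
#!/bin/sh
# Hypothetical helper: print the total swap size in kB from a meminfo-style file.
# Assumes the Linux /proc/meminfo format, e.g. "SwapTotal:       0 kB".
swap_total_kb() {
  meminfo="${1:-/proc/meminfo}"
  awk '$1 == "SwapTotal:" { print $2 }' "$meminfo"
}

# Hypothetical helper: exit status 0 when the host reports no swap at all.
has_no_swap() {
  [ "$(swap_total_kb "$1")" = "0" ]
}
```

For example, has_no_swap && echo "no swap: the swap plugin cannot compute a percentage" prints the message on a VM like the one shown above.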

To fix it, delete or comment out the swap section in /etc/stackdriver/collectd.conf.
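
One way to comment out the block without editing by hand is sketched below. The function name disable_swap_plugin is hypothetical, and the sed patterns assume the swap lines start at the beginning of the line exactly as in the snippet above. Review the result before relying on it.

```shell
#!/bin/sh
# Hypothetical helper: comment out the swap plugin in a collectd config file.
# Assumes the lines are unindented at the top level, as in the snippet above.
disable_swap_plugin() {
  conf="${1:-/etc/stackdriver/collectd.conf}"
  # Keep a backup copy next to the original before rewriting it.
  cp "$conf" "$conf.bak"
  # Prefix "#" to the LoadPlugin line and to every line of the <Plugin "swap"> block.
  sed -e '/^LoadPlugin swap/ s/^/#/' \
      -e '/^<Plugin "swap">/,/^<\/Plugin>/ s/^/#/' \
      "$conf.bak" > "$conf"
}
```

The function has to run as root to write under /etc/stackdriver. The agent presumably needs a restart afterwards to pick up the change; sudo service stackdriver-agent restart is the likely counterpart of the status command used earlier.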

Reference: GCP Stackdriver Agent: “write_gcm: can not take infinite value” Error

Unsuccessful HTTP request 400 … The start time must be before the end time

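This error occurs because start_time and end_time hold the same value. According to the reference, this is a known issue, but the Google engineering team responsible for the product has not scheduled a fix yet. For now, the only option is to subscribe to the related issues so that you are notified when their status changes.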

Reference: GCP stackdriver-agent installed on VM send strange logs every minute
