[GKE] IP planning details to watch for when building a VPC-native cluster in a Custom VPC

Jul 12, 2019

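Notes on the IP planning issues to watch for when running GKE inside a Custom VPC on GCP.

I recently hit a cluster that could not scale up, which made me dig into the requirements and limits of VPC-native clusters, so I'm writing them down here.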

Background

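Our company has been using GCP heavily for almost a year now. As usage grew, dependencies between services and networks across AWS and GCP started to appear. Recently a colleague and I have been tidying up the VPCs on both sides, thinking we might want to peer them in the future (probably over VPN?). So we gave up the lazy Auto Mode, burned some brain cells working out a lot of subnets and secondary ranges, and wrote it all as terraform to provision.

When creating the VPCs we also followed GCP best practices and used the Shared VPC architecture: each environment gets its own infra project that holds the VPC as the host project, and the subnets are then shared to the other projects in that environment. GKE clusters are also created on the shared subnets. The benefits are a cleaner architecture, centralized management, and billing that is easy to split. See the Shared VPC documentation for details on how to set it up.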

VPC-native clusters

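Simply put, a VPC-native cluster uses the Alias IP feature so that the GCP VPC handles the cluster's internal connections directly. In a VPC-native cluster, each node gets its own pod IP block as an alias IP range, so pod IPs are natively routable in the VPC: pod traffic leaves the node and is handled by the VPC, instead of relying on a custom route per node as in the older routes-based mode (kube-proxy still handles Service virtual IPs on the node). The upside is less routing machinery to maintain, plus direct integration with VPC-native features such as firewall rules and Cloud NAT. VPC-native is now basically the networking mode GKE recommends turning on.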

To create a VPC-native cluster in a custom VPC, you specify a subnet plus two secondary ranges under that subnet, so a complete VPC-native cluster needs three planned IP blocks:

Nodes - the primary IP block of the GKE subnet
Pods - a secondary range of the GKE subnet
Services - another secondary range of the GKE subnet

IP planning details to watch for

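The first restriction is that none of the Nodes, Pods, or Services ranges may overlap with 172.17.0.0/16 [2]. My guess is that this range is reserved inside the node. When we accidentally overlapped with it, the nodes had all kinds of connectivity problems, including not being able to reach the master. Ouch XD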
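A check like this is easy to run before provisioning. Below is a minimal sketch using Python's standard `ipaddress` module; the function name `find_conflicts` and the CIDRs in `plan` and `bad_plan` are made up for illustration and are not the ranges we actually use.

```python
import ipaddress

# The range the post says must not be overlapped (see [2]).
RESERVED = ipaddress.ip_network("172.17.0.0/16")


def find_conflicts(ranges):
    """Return a list of readable conflicts among the named CIDR ranges.

    `ranges` maps a label (nodes / pods / services) to a CIDR string.
    """
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in ranges.items()}
    names = sorted(nets)
    conflicts = []
    # Check each range against the reserved block first.
    for name in names:
        if nets[name].overlaps(RESERVED):
            conflicts.append(f"{name} overlaps reserved {RESERVED}")
    # Then check every pair of planned ranges against each other.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if nets[a].overlaps(nets[b]):
                conflicts.append(f"{a} overlaps {b}")
    return conflicts


# Hypothetical plans, for illustration only.
plan = {"nodes": "10.0.0.0/24", "pods": "10.4.0.0/16", "services": "10.8.0.0/20"}
bad_plan = {"nodes": "172.17.4.0/24", "pods": "10.4.0.0/16", "services": "10.4.128.0/20"}

print(find_conflicts(plan))
print(find_conflicts(bad_plan))
```

The pairwise check is an extra beyond what the restriction above states, but overlapping secondary ranges are worth catching at the same time.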

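The other thing to watch is IP range sizes: you have to plan node count and pod count together. I suggest first deciding the default maximum number of pods per node (a new node pool can override this setting individually). When GKE creates a node, it uses max pods per node to decide how large a block to carve out of the Pods range and alias onto that node. By design, this block holds at least twice as many IPs as the maximum number of pods [3].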

Maximum Pods per Node	CIDR Range per Node
8	/28
9 to 16	/27
17 to 32	/26
33 to 64	/25
65 to 110	/24

The default is 110, so each node takes a /24 out of the Pods IP range as its alias IP range.
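The table follows directly from the "at least twice as many IPs" rule: double the max pods value, round up to the next power of two, and convert to a prefix length. Here is a small sketch of that arithmetic; `per_node_prefix` is a hypothetical helper name, and it only covers the 8 to 110 range shown in the table.

```python
def per_node_prefix(max_pods_per_node):
    """Prefix length of the pod block aliased onto each node.

    Rule from the post: the block must hold at least twice as many
    IPs as max pods per node, rounded up to a power of two.
    """
    if not 8 <= max_pods_per_node <= 110:
        raise ValueError("this sketch only covers the 8..110 range in the table")
    needed = 2 * max_pods_per_node
    # Number of host bits needed to hold `needed` addresses.
    host_bits = (needed - 1).bit_length()
    return 32 - host_bits


for pods in (8, 16, 32, 64, 110):
    print(pods, f"/{per_node_prefix(pods)}")
```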

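Once you know the number of pod IPs per node, multiply it by the maximum number of nodes you plan for to get the size of the Pod IP range. So the Pod IP range has to be planned together with the Node IP range. Ideally: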

Number of node IPs (Node IP range) * pod IPs reserved per node ≈ number of pod IPs (Pod IP range)

If you keep max pods per node at the default of 110, you can use the recommendation table from [1] directly:

Subnet size for nodes	Maximum nodes	Maximum Pod IP addresses needed	Recommended Pod address range
/29	4	1,024	/21
/28	12	3,072	/20
/27	28	7,168	/19
/26	60	15,360	/18
/25	124	31,744	/17
/24	252	64,512	/16
/23	508	130,048	/15
/22	1,020	261,120	/14
/21	2,044	523,264	/13
/20	4,092	1,047,552	/12
/19	8,188	2,096,128	/11 (max Pod address range)
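The rows above can be reproduced with a few lines of arithmetic, which also makes it easy to redo the table for a different max pods per node. This is a sketch under two assumptions: the 4 addresses subtracted per subnet are the ones GCP reserves in every primary range, and the recommended pod range is sized for the full node subnet (a power of two) rather than for the usable node count, which is what the table's numbers imply. `plan_pod_range` is a hypothetical name.

```python
RESERVED_PER_SUBNET = 4  # addresses GCP reserves in each subnet's primary range


def plan_pod_range(node_prefix, max_pods_per_node=110):
    """Return (max_nodes, pod_ips_needed, recommended_pod_prefix).

    node_prefix is the prefix length of the node subnet, e.g. 24 for a /24.
    """
    # Host bits of the per-node pod block: twice max pods, rounded up
    # to a power of two (8 bits, i.e. a /24, for the default of 110).
    node_block_bits = (2 * max_pods_per_node - 1).bit_length()
    max_nodes = 2 ** (32 - node_prefix) - RESERVED_PER_SUBNET
    pod_ips_needed = max_nodes * 2 ** node_block_bits
    # One per-node block for every address in the node subnet.
    pod_prefix = node_prefix - node_block_bits
    return max_nodes, pod_ips_needed, pod_prefix


for prefix in (29, 25, 24, 19):
    nodes, pod_ips, pod_prefix = plan_pod_range(prefix)
    print(f"/{prefix}: {nodes} nodes, {pod_ips} pod IPs, /{pod_prefix}")
```

Passing `max_pods_per_node=64` gives the same node counts with half the pod IPs, since each node then takes a /25.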

The scale-up problem I ran into earlier was exactly this: there were still enough Node IPs, but the Pod IPs had run out.

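In the clusters we rebuilt afterwards, our nodes are all fairly small, so we set max pods per node to 64 everywhere. If you need to count every IP in the VPC, this really does ease the planning problem, because each node then takes a /25 instead of a /24. (PS: we have to count every IP because AWS and GCP share the whole private IP space, and we are deployed in many regions around the world...)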
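To see why the pod range can become the bottleneck before the node range does, it helps to turn the question around: given a pod range, how many nodes can it feed? A rough sketch follows; `nodes_supported` is a hypothetical helper, and the /20 pod range is an example, not our real one.

```python
import ipaddress


def nodes_supported(pod_range_cidr, max_pods_per_node):
    """How many per-node pod blocks fit in the given Pods secondary range."""
    pod_net = ipaddress.ip_network(pod_range_cidr)
    # Per-node block: twice max pods, rounded up to a power of two.
    node_block_bits = (2 * max_pods_per_node - 1).bit_length()
    return pod_net.num_addresses // 2 ** node_block_bits


# Hypothetical /20 pod range: the same range supports twice as many
# nodes once max pods per node drops from 110 to 64.
print(nodes_supported("10.4.0.0/20", 110))
print(nodes_supported("10.4.0.0/20", 64))
```

If the node subnet allows more nodes than this number, the cluster stops scaling when the pod range runs out, even though node IPs are still free.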

Wrap-up

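Understanding and planning a VPC takes quite a bit of effort, and setting up a VPC-native cluster is even more complicated. GCP's documentation is fairly approachable, but connecting the pieces across several documents still took a good while (sweat). I hope this post helps anyone else building GKE on a custom VPC~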

References

[1] Creating a VPC-native cluster https://cloud.google.com/kubernetes-engine/docs/how-to/alias-ips
[2] Setting up clusters with shared VPC https://cloud.google.com/kubernetes-engine/docs/how-to/cluster-shared-vpc
[3] Optimizing IP address allocation https://cloud.google.com/kubernetes-engine/docs/how-to/flexible-pod-cidr