Kakashi's Notes

Werner Vogels on How Amazon Moved from a Monolith to a Service-based Architecture

2021-02-16T13:39:46.000Z

Preface

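This post covers a 2006 interview with Amazon CTO Werner Vogels, published in ACM Queue. Reading it almost 15 years later, it resonates strongly with me, and I admire the foresight of Werner Vogels, one of my idols.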

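What surprised me most is that this interview took place before terms like DevOps and microservices even existed, yet it revolves around exactly those ideas: how Amazon solved the problem of scaling independently, in both team structure and technical architecture.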

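Problems brought by the growth of the Amazon store

The growth of the Amazon store brought many technical challenges. With different kinds of customers and sellers, and different ways of accessing Amazon services, Vogels lists a long series of concerns below; keeping availability and performance at ultra-scale became an urgent need.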

The impact has been on many areas: larger data sets, faster update rates, more requests, more services, tighter SLAs (service-level agreements), more failures, more latency challenges, more service interdependencies, more developers, more documentation, more programs, more servers, more networks, more data centers.

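The shift in Amazon's architecture

Amazon also started out as a monolithic application, so early efforts targeted the system bottlenecks, namely the backend and the database, in order to support more items, customers, and orders. In 2001 the frontend became the bottleneck. They soon found that scaling independently was hindered by shared resources: there was no clear isolation, and the lack of ownership slowed development down.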

there were many complex pieces of software combined into a single system. It couldn’t evolve anymore. The parts that needed to scale independently were tied into sharing resources with other unknown code paths. There was no isolation and, as a result, no clear ownership.

After a period of introspection, they concluded that a service-oriented architecture was the answer. It would provide the needed level of isolation and also let different departments at Amazon build components quickly and independently.

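At Amazon, service orientation means that data and business logic are encapsulated together and can only be reached through a published service interface; direct database access from outside the service is not allowed. It took Amazon nearly five years to replace the two-tier monolith with fully distributed, decentralized services.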

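Each service has a team directly associated with it, and that team is fully responsible for whether the service lives or dies. Having developers operate their own services directly improved service quality from both the customer's and the technical point of view. This also echoes the DevOps philosophy as practiced at AWS.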

that team is completely responsible for the service—from scoping out the functionality, to architecting it, to building it, and operating it.

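AWS technology choices

Later in the interview, the interviewer asks Vogels how he views buzzwords such as SOA, WSDL, SOAP, and WS-Security.

Vogels notes that Amazon was already reworking its internal services before the term SOA took off. Internally they mainly used WSDL/SOAP at the time, with many optimizations in transport and marshalling. Externally, they had started offering REST-like interfaces, because many small and mid-sized customers used libraries from the LAMP stack (PHP, Perl), while the SOAP interface mainly served the Java and .NET platforms. From Amazon's perspective, the choice between REST and SOAP is not the point; what matters is what customers use, because customers just want the simplest toolkit for building their applications.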

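Amazon's business philosophy

Amazon places great weight on customer input; when conceiving a new product, customer feedback is always brought into the loop. On how Amazon measures whether a product succeeds, Dr. Vogels offers a customer-centric view: did the change alter customer behavior, for example by reducing the number of steps needed to find the items they want? Such behavior-based measures are harder to detect, though, and changing people's behavior is hard as well.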

We measure whether or not a new feature is successful in terms of customer satisfaction: Do people find things more easily? If we can improve the convenience of shopping on Amazon, then we have booked a major success.

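On engineering teams and hiring

Every two years, Amazonians also have to spend some time working in customer service, in order to understand customers' needs and thinking.

Amazon's hiring philosophy is to look at how a candidate thinks about customers and technology. The point made here is that technology is useless if not used for the greater good of serving the customer. Working backward from the customer is another core Amazon philosophy.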

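Takeaways

What surprised me about this interview is that Amazon was already trying these new approaches internally well before DevOps (2009) and microservices (2015) became popular buzzwords, and while SOA was only starting to catch on. Along the way it distilled many good practices: service design has to account for shared resources, a service's functionality should be exposed through a consistent interface, and operational responsibility should sit with the service team to improve ownership. The other lesson is the product mindset of always starting from the customer. Having used Amazon's services over the past few years, I have learned a lot from them too. Enabling a team to build products faster and respond to customer needs sooner really requires improvement in both process and technology. This interview showed me how Vogels thinks about product and technical problems at large scale, and I highly recommend reading the original.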

Reference

A Conversation with Werner Vogels

Photo by Bryan Angelo on Unsplash

Notes on Integer Encoding Algorithms

2021-01-30T14:13:14.000Z

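On modern computers there is a huge gap between CPU and memory speed, and an even larger gap between memory and disk. Compression algorithms are therefore applied in many areas, such as IoT, big data, and databases: partly to save storage space, and partly because small batches of data are faster to load into memory and process. These notes look at integer data: which algorithms apply and how they differ. The underlying problem is: given a sequence of integers, how do we improve the compression ratio and the encode/decode speed?

These are personal notes, so corrections are welcome. Note that all algorithms below are lossless.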

History

Here is a short timeline of the algorithms I found most worth noting, mostly taken from this paper: https://arxiv.org/pdf/1908.10598.pdf

1972: variable-byte (https://en.wikipedia.org/wiki/Variable-length_quantity)
1998: frame of reference (FOR)
2006: PForDelta
2009: Varint-GB
2010: simple8b
2011: Varint-G8IU
2014: Roaring
2015: SIMD-BP128, SIMD-FastPFor
2018: Stream-VByte

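Some common algorithms

A discussion of compression ratio should really start from bit-aligned, byte-aligned, and word-aligned encodings. Bit-aligned algorithms such as Golomb coding or Rice coding compress very well, but because their encoding and decoding do not match how the hardware operates [Note 1], compression and decompression are slow, so they work poorly in databases.

Note 1: 你所不知道的 C 語言：記憶體管理、對齊及硬體特性 ("The C Language You Don't Know: Memory Management, Alignment, and Hardware Characteristics")

How does the CPU fetch data? It does not fetch 1 byte at a time, because that would be too slow: for an int (4 bytes), fetching 1 byte at a time would take 4 fetches. So the CPU usually fetches 4 bytes at once (depending on the machine: a 32-bit CPU reads 32 bits at a time, a 64-bit CPU reads 64 bits), and it fetches them in order.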

Variable Byte

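Byte-aligned encodings, as the name suggests, store each number in as few whole bytes as possible. A number normally takes 4 bytes (32 bits); if it fits in a single byte, a lot of space is saved. Variable Byte (VB) works in a simple way: each byte carries 7 bits of data, and the leading bit indicates whether another byte follows. Take 128 as an example: its lowest byte in binary is 10000000, but only 7 bits per byte are available for data, so two bytes are needed. The first byte has its leading bit set to 1, meaning another byte follows and the two must be decoded together to recover the original binary value.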

Value	Binary (32-bit)	Variable-byte encoding
0	00000000 00000000 00000000 00000000	00000000
1	00000000 00000000 00000000 00000001	00000001
127	00000000 00000000 00000000 01111111	01111111
128	00000000 00000000 00000000 10000000	10000001 00000000
16383	00000000 00000000 00111111 11111111	11111111 01111111

All of the examples above come from Wikipedia.
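The table can be reproduced with a few lines of Go. This is only a minimal sketch of the Wikipedia-style variable-length quantity layout (most significant group first, high bit set on every byte except the last); `encodeVB` and `decodeVB` are names made up for this illustration, and other variable-byte variants order the groups differently.

```go
package main

import "fmt"

// encodeVB encodes v with 7 data bits per byte, most significant group
// first. The high bit is set on every byte except the last one.
func encodeVB(v uint32) []byte {
	out := []byte{byte(v & 0x7f)} // last byte: continuation bit clear
	for v >>= 7; v > 0; v >>= 7 {
		out = append([]byte{byte(v&0x7f) | 0x80}, out...)
	}
	return out
}

// decodeVB reads one value from the front of b and returns it together
// with the number of bytes consumed.
func decodeVB(b []byte) (uint32, int) {
	var v uint32
	for i, c := range b {
		v = v<<7 | uint32(c&0x7f)
		if c&0x80 == 0 {
			return v, i + 1
		}
	}
	return v, len(b) // truncated input: return what was read so far
}

func main() {
	for _, v := range []uint32{0, 1, 127, 128, 16383} {
		enc := encodeVB(v)
		dec, n := decodeVB(enc)
		fmt.Printf("%5d -> %08b -> %d (%d bytes)\n", v, enc, dec, n)
	}
}
```

The decoder has to test the continuation bit of every byte, which is the data-dependent branch that the grouped variants try to avoid.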

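VB is widely used and easy to implement, and it has several improved variants. For example, the slides of Jeff Dean's (Google) talk Challenges in Building Large-Scale Information Retrieval Systems present Group Varint Encoding, which moves the length information for a group of four integers into a single leading tag byte (2 bits per integer). This effectively reduces the branch misprediction penalty and speeds up encoding and decoding.

In 2018 another improved version, Stream VByte, was published (I have not studied it yet).

Reference: https://en.wikipedia.org/wiki/Variable-length_quantity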

Delta of Delta + Variable length coding

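Delta encoding is another very common technique. Time-series databases such as InfluxDB and Prometheus use deltas to shrink values when compressing timestamps. For example, [100, 101, 105, 108] can be stored as the first value followed by the differences between neighbors, [100, 1, 4, 3]. The deltas [1, 4, 3] can be differenced once more to get [1, 3, -1], the delta-of-delta (DoD) form.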
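As a small illustration of the two transforms, here is a hedged Go sketch. The function names are hypothetical, and it uses neighbor-to-neighbor differences, keeping the first value (and, for delta of delta, the first delta) unchanged so the sequence can be rebuilt.

```go
package main

import "fmt"

// deltas keeps the first value and replaces every later value with its
// difference from the previous one.
func deltas(xs []int64) []int64 {
	out := make([]int64, len(xs))
	var prev int64
	for i, x := range xs {
		out[i] = x - prev
		prev = x
	}
	return out
}

// undeltas reverses deltas with a running sum.
func undeltas(ds []int64) []int64 {
	out := make([]int64, len(ds))
	var sum int64
	for i, d := range ds {
		sum += d
		out[i] = sum
	}
	return out
}

// deltaOfDelta keeps the first value, then applies deltas again to the
// remaining differences.
func deltaOfDelta(xs []int64) []int64 {
	d := deltas(xs)
	if len(d) < 2 {
		return d
	}
	return append([]int64{d[0]}, deltas(d[1:])...)
}

func main() {
	ts := []int64{100, 101, 105, 108}
	fmt.Println(deltas(ts))           // [100 1 4 3]
	fmt.Println(deltaOfDelta(ts))     // [100 1 3 -1]
	fmt.Println(undeltas(deltas(ts))) // [100 101 105 108]
}
```

For timestamps sampled at a fixed interval the delta of delta is mostly 0, which is what makes the short tag-bit codes worthwhile.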

The Facebook Gorilla paper also describes compressing timestamps with DoD. Their stream is a concatenation of (DoD of time, value), (DoD of time, value), ... pairs. Each DoD is emitted with variable-length coding (VLC): tag bits tell the decoder how many bits the DoD occupies, as in a table like this:

DoD	tag bits	following bits
0	0	0
[-63, 64]	10	7 bits
[-512,511]	110	10 bits
[-4096,4095]	1110	13 bits
[-32768,32767]	11110	16 bits
else	11111	64 bits

So the time field looks like the following; thanks to the tag bits it can be encoded together with the value:

| time value | time value | time value |
| 100 1 | '10': 2 1 | '110': 100 10 |

Run Length encoding (RLE)

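RLE is another common and simple compression format. The examples here are taken from another article. The sequence

5 5 5 5 8 8 8 2 2 2 2 2

can be encoded as (count, value) pairs:

4 5 3 8 5 2

which halves the size. There is a bad case, though, such as

1 2 3 4 5 6

If it is encoded the same way,

1 1 1 2 1 3 1 4 1 5 1 6

the result is twice as large as the input. One remedy is the following form:

-6 1 2 3 4 5 6

The leading -6 is an indicator: on seeing it, the decoder knows to copy the next six numbers verbatim.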
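Both forms can be combined in one encoder. The following Go sketch is only one possible reading of the scheme: repeated runs become (count, value), and stretches without repeats become a negative count followed by the literal values. `rleEncode` and `rleDecode` are hypothetical names, and the decoder assumes well-formed input.

```go
package main

import "fmt"

// rleEncode emits (count, value) for runs of two or more equal values and
// (-n, v1..vn) for stretches of n values that do not repeat.
func rleEncode(xs []int) []int {
	var out []int
	for i := 0; i < len(xs); {
		// Measure the run of equal values starting at i.
		j := i + 1
		for j < len(xs) && xs[j] == xs[i] {
			j++
		}
		if j-i > 1 {
			out = append(out, j-i, xs[i])
			i = j
			continue
		}
		// Collect literals until the next repeated run (or the end).
		k := i + 1
		for k < len(xs) && (k+1 == len(xs) || xs[k] != xs[k+1]) {
			k++
		}
		out = append(out, -(k - i))
		out = append(out, xs[i:k]...)
		i = k
	}
	return out
}

// rleDecode reverses rleEncode.
func rleDecode(enc []int) []int {
	var out []int
	for i := 0; i < len(enc); {
		n := enc[i]
		if n < 0 { // literal stretch: copy the next -n values
			out = append(out, enc[i+1:i+1-n]...)
			i += 1 - n
		} else { // repeated run: n copies of the next value
			for c := 0; c < n; c++ {
				out = append(out, enc[i+1])
			}
			i += 2
		}
	}
	return out
}

func main() {
	fmt.Println(rleEncode([]int{5, 5, 5, 5, 8, 8, 8, 2, 2, 2, 2, 2})) // [4 5 3 8 5 2]
	fmt.Println(rleEncode([]int{1, 2, 3, 4, 5, 6}))                   // [-6 1 2 3 4 5 6]
	fmt.Println(rleDecode([]int{-6, 1, 2, 3, 4, 5, 6}))               // [1 2 3 4 5 6]
}
```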

Simple family

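The Simple family is another well-known group of algorithms, so called because every name starts with "simple". In order of appearance they are simple9, simple16, and finally simple8b. The simple8b paper appeared in 2010; it is widely used in time-series databases today and is a relatively new algorithm. The core idea is to pack integers into a 64-bit word, and measurements show good encoding and decoding speed. Because the Simple algorithms use as few bits as possible per integer (for example, 0 or 1 takes a single bit), they beat Variable Byte on compression ratio, and they are not too hard to implement. That is why so many mainstream databases choose simple8b.

Of simple8b's 64 bits, the low 60 bits hold data and the top 4 bits serve as a selector that determines how many integers are packed into the word. The encoding modes are listed below: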

selector	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
integers coded	240	120	60	30	20	15	12	10	8	7	6	5	4	3	2	1
bits per integer	0	0	1	2	3	4	5	6	7	8	10	12	15	20	30	60

For example, suppose we want to store the array [1, 2, 1000, 3, 4, 5, 6, ...]. The largest value among the leading elements is 1000, which needs 10 bits (it is below 2^10), so at most 6 values fit into one 64-bit word: [1, 2, 1000, 3, 4, 5]. The 6 and the values after it go into the following words.

In general, simple8b encodes more slowly than Variable Byte, but it should decode faster, and its compression ratio is far better.
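The selector choice in this example can be sketched in Go as follows. This is a simplified illustration and not the reference simple8b encoder: `packWord` and `unpackWord` are hypothetical names, selectors 0 and 1 (long runs) are left out, and a layout is only considered when enough values remain to fill it.

```go
package main

import "fmt"

// modes lists selectors 2..15 from the table: n integers of `bits` bits each.
var modes = []struct{ sel, n, bits uint }{
	{2, 60, 1}, {3, 30, 2}, {4, 20, 3}, {5, 15, 4}, {6, 12, 5}, {7, 10, 6},
	{8, 8, 7}, {9, 7, 8}, {10, 6, 10}, {11, 5, 12}, {12, 4, 15}, {13, 3, 20},
	{14, 2, 30}, {15, 1, 60},
}

// packWord packs as many leading values as possible into one 64-bit word.
// It returns the word and the number of values consumed (0 if nothing fits).
func packWord(vals []uint64) (uint64, int) {
	for _, m := range modes {
		if uint(len(vals)) < m.n {
			continue // not enough values left to fill this layout
		}
		fits := true
		for _, v := range vals[:m.n] {
			if v >= uint64(1)<<m.bits {
				fits = false
				break
			}
		}
		if !fits {
			continue
		}
		word := uint64(m.sel) << 60 // selector lives in the top 4 bits
		for i, v := range vals[:m.n] {
			word |= v << (uint(i) * m.bits)
		}
		return word, int(m.n)
	}
	return 0, 0
}

// unpackWord reads the selector and extracts the packed values.
func unpackWord(word uint64) []uint64 {
	sel := uint(word >> 60)
	for _, m := range modes {
		if m.sel != sel {
			continue
		}
		out := make([]uint64, m.n)
		mask := uint64(1)<<m.bits - 1
		for i := range out {
			out[i] = word >> (uint(i) * m.bits) & mask
		}
		return out
	}
	return nil
}

func main() {
	vals := []uint64{1, 2, 1000, 3, 4, 5, 6, 7}
	word, n := packWord(vals)
	fmt.Println("selector:", word>>60, "values packed:", n) // selector: 10 values packed: 6
	fmt.Println(unpackWord(word))                           // [1 2 1000 3 4 5]
}
```

Every layout with more than 6 slots is rejected because 1000 sits among the first three values, so the encoder settles on 6 values of 10 bits each.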

Binary Packing (BP128)

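Binary packing finds the largest number in a given array, determines how many bits it needs, and stores every other number with that same bit width. For example:

2, 3, 5, 6, 8, 7, 7, 7

The largest number, 8, needs 4 bits, so every other number is also encoded with 4 bits. An integer that originally took 32 bits now takes only 4. Binary packing also compresses numbers in groups: the well-known BP128 packs 128 integers at a time with fast bit packing. If each integer needs b bits, the block needs 128*b bits in total. If the total count is not a multiple of 128, the last block can be padded with zeros.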
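A scalar version of the idea fits in a short Go sketch. It is an illustration under simplifying assumptions (any block length, 64-bit output words), not the BP128 layout from the paper; `bitWidth`, `pack`, and `unpack` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitWidth returns the number of bits needed for the largest value.
func bitWidth(xs []uint32) int {
	var hi uint32
	for _, x := range xs {
		if x > hi {
			hi = x
		}
	}
	return bits.Len32(hi)
}

// pack stores every value with exactly w bits, back to back, in 64-bit words.
func pack(xs []uint32, w int) []uint64 {
	if w == 0 {
		return nil // all values are zero; nothing to store
	}
	out := make([]uint64, (len(xs)*w+63)/64)
	for i, x := range xs {
		pos := i * w
		word, off := pos/64, uint(pos%64)
		out[word] |= uint64(x) << off
		if off+uint(w) > 64 { // value straddles two words
			out[word+1] |= uint64(x) >> (64 - off)
		}
	}
	return out
}

// unpack extracts n values of w bits each using shifts and a mask.
func unpack(packed []uint64, n, w int) []uint32 {
	out := make([]uint32, n)
	if w == 0 {
		return out
	}
	mask := uint64(1)<<uint(w) - 1
	for i := range out {
		pos := i * w
		word, off := pos/64, uint(pos%64)
		v := packed[word] >> off
		if off+uint(w) > 64 {
			v |= packed[word+1] << (64 - off)
		}
		out[i] = uint32(v & mask)
	}
	return out
}

func main() {
	xs := []uint32{2, 3, 5, 6, 8, 7, 7, 7}
	w := bitWidth(xs)
	p := pack(xs, w)
	fmt.Println("width:", w, "words:", len(p)) // width: 4 words: 1
	fmt.Println(unpack(p, len(xs), w))         // [2 3 5 6 8 7 7 7]
}
```

The eight values take 32 bits packed, compared with 256 bits as plain 32-bit integers.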

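Binary packing depends on how bit packing is implemented. This is discussed in the paper Decoding billions of integers per second through vectorization and on the author's blog, and the author also provides a C++ implementation. Shifts and masks are enough to pack values of different bit widths across bytes, so compression and decompression are very fast. A SIMD version, SIMD-BP128, can also be implemented; as far as I recall, the paper reports it compressing and decompressing about twice as fast as the scalar version.

Overall BP128 does well on both compression ratio and encode/decode speed, and it is a frequent baseline in benchmarks.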

Frame of reference (FOR)

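All of the algorithms above are usually preceded by differential coding, that is, computing deltas so the data to be compressed becomes smaller. The downside is that values can no longer be searched quickly: the data must be decompressed first. Frame of reference addresses this problem.

Frame of reference comes from this 1998 paper: Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1998. Compressing Relations and Indexes. In Proceedings of the 14th International Conference on Data Engineering. 370–379.

The name refers to picking a reference point. A frame, much like the blocks in BP above, is a sequence of integers with the same bit width.

My explanation of frame of reference mainly follows an article by Lemire, who has published many papers on this topic. His articles are very well written and well worth reading.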

An example makes this clearer. Given the sequence

107, 108, 110, 115, 120, 125, 132, 132, 131, 135

each number fits in 8 bits, so storing all of them takes 80 bits. Frame of reference finds the range of the array and its smallest value, here 107 to 135. Subtracting 107 from every element gives:

0, 1, 3, 8, 13, 18, 25, 25, 24, 28

Now each number fits in 5 bits. We still need 8 bits for the reference value 107, plus 3 bits of metadata recording that the current bit width is 5, for a total of 8 + 3 + 10*5 = 61 bits, a clear saving over the original 80 bits. It also helps searching: when looking for 1000, comparing against the reference 107 and the 5-bit width immediately shows that this block cannot contain 1000.
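The bookkeeping in this example can be written down as a small Go sketch. `forBlock`, `encodeFOR`, `mayContain`, and `sizeBits` are hypothetical names; the offsets are kept unpacked for readability, and the size accounting simply assumes an 8-bit reference and 3 bits of width metadata as in the example.

```go
package main

import (
	"fmt"
	"math/bits"
)

// forBlock is a hypothetical in-memory form of one frame-of-reference block.
// A real encoder would bit-pack the offsets at `width` bits each.
type forBlock struct {
	ref     uint32   // smallest value in the block
	width   int      // bits needed for the largest offset
	offsets []uint32 // value - ref
}

// encodeFOR builds a block from a non-empty slice.
func encodeFOR(xs []uint32) forBlock {
	lo, hi := xs[0], xs[0]
	for _, x := range xs {
		if x < lo {
			lo = x
		}
		if x > hi {
			hi = x
		}
	}
	offs := make([]uint32, len(xs))
	for i, x := range xs {
		offs[i] = x - lo
	}
	return forBlock{ref: lo, width: bits.Len32(hi - lo), offsets: offs}
}

// mayContain reports whether v lies in the range the block can represent.
// Only ref and width are needed, so nothing has to be decoded.
func (b forBlock) mayContain(v uint32) bool {
	return v >= b.ref && uint64(v-b.ref) < uint64(1)<<uint(b.width)
}

// sizeBits assumes 8 bits for the reference and 3 bits for the width,
// plus width bits per value.
func (b forBlock) sizeBits() int {
	return 8 + 3 + len(b.offsets)*b.width
}

func main() {
	b := encodeFOR([]uint32{107, 108, 110, 115, 120, 125, 132, 132, 131, 135})
	fmt.Println(b.ref, b.width, b.offsets) // 107 5 [0 1 3 8 13 18 25 25 24 28]
	fmt.Println(b.sizeBits())              // 61
	fmt.Println(b.mayContain(1000))        // false
}
```

`mayContain` can only rule a block out; a value inside the range still requires looking at the offsets.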

FOR variant

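FOR has several variants. The sequence above can be compressed further with the delta method introduced earlier, giving

1, 2, 5, 5, 5, 7, 0, -1, 4

Apart from the -1, everything now fits in 3 bits. The -1 is a nuisance, but fortunately there is a well-known encoding called zigzag that maps negative numbers to positive ones; Google's Protocol Buffers uses it as well.

Besides deltas, XOR is another way to make the numbers smaller, and it never produces negative values.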
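Zigzag itself is two one-liners. The sketch below shows the usual 32-bit formulation in Go; the function names are chosen for this note.

```go
package main

import "fmt"

// zigzag maps signed to unsigned so that small magnitudes stay small:
// 0, -1, 1, -2, 2, ... become 0, 1, 2, 3, 4, ...
func zigzag(v int32) uint32 {
	return uint32(v<<1) ^ uint32(v>>31)
}

// unzigzag reverses zigzag.
func unzigzag(u uint32) int32 {
	return int32(u>>1) ^ -int32(u&1)
}

func main() {
	deltas := []int32{1, 2, 5, 5, 5, 7, 0, -1, 4}
	for _, d := range deltas {
		z := zigzag(d)
		fmt.Printf("%2d -> %2d -> %2d\n", d, z, unzigzag(z))
	}
}
```

Note that zigzag doubles the magnitude of non-negative values: 7 becomes 14, so the zigzagged deltas of this example need 4 bits each rather than 3.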

Patched coding

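One problem that both FOR and BP can run into is an integer array like this, which they handle poorly:

1, 4, 255, 4, 3, 12, 4294967295

Because of the one huge number, every value has to go back to 32 bits. Is there a way to do better? Patched coding was proposed for this, and it is the PFor that often shows up in papers: choose a bit width b for compression, and store the numbers that do not fit in b bits (values of 2^b or more) as exceptions in a separate area. There are many ways to implement this part; the details depend on the paper or implementation, and different implementations behave differently on different data sets.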
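To make the exception idea concrete, here is a deliberately naive Go sketch. It is not PForDelta or FastPFor: the width b is passed in by the caller, exceptions are stored as plain (index, value) pairs, and `split` and `patch` are hypothetical names.

```go
package main

import "fmt"

// exception records a value that did not fit in b bits and where it belongs.
type exception struct {
	index int
	value uint32
}

// split keeps values below 2^b in place and moves the rest to an exception
// list, leaving a zero placeholder in the main array.
func split(xs []uint32, b uint) ([]uint32, []exception) {
	limit := uint64(1) << b
	low := make([]uint32, len(xs))
	var exc []exception
	for i, x := range xs {
		if uint64(x) >= limit {
			exc = append(exc, exception{index: i, value: x})
			continue
		}
		low[i] = x
	}
	return low, exc
}

// patch restores the original values by writing the exceptions back.
func patch(low []uint32, exc []exception) []uint32 {
	out := append([]uint32(nil), low...)
	for _, e := range exc {
		out[e.index] = e.value
	}
	return out
}

func main() {
	xs := []uint32{1, 4, 255, 4, 3, 12, 4294967295}
	low, exc := split(xs, 8)
	fmt.Println(low)             // [1 4 255 4 3 12 0]
	fmt.Println(exc)             // [{6 4294967295}]
	fmt.Println(patch(low, exc)) // [1 4 255 4 3 12 4294967295]
}
```

With b = 8 the main array can be bit-packed at 8 bits per value, and only the single outlier pays the full 32 bits plus its index.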

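Conclusion

After reading a few references, I found that integer compression basically comes down to the approaches above. As SIMD has become more common, these algorithms have been reworked to take advantage of it, which is a welcome development. I also saw that InfluxDB wants to move its timestamp compression from simple8b to SIMD-PFor, again because SIMD is now becoming standard equipment. GitHub hosts plenty of open-source implementations, such as TurboPFor-Integer-Compression, simple8b, BP128, and FastPFor; anyone who needs one can try them directly and compare the results.

A few articles online are also worth reading.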

Reference

Photo by JJ Ying on Unsplash

String interning in Go

2020-12-14T13:49:23.000Z

String Interning

I recently saw this tweet on Twitter:

Hacked string interning profiler for #golang:https://t.co/EB2uJwzvtx
Allows to understand where to use interning & exact savings for heap size/garbage rate. May be useful for larger projects.
Yay or nay? pic.twitter.com/71h6tNJHaX
— Dmitry Vyukov (@dvyukov) December 12, 2020

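The tweet discusses a CL that adds a string interning profile, which can be used to measure whether string interning is worth adding. I did not know what this was for at first; after reading some explanations and articles in the comments, I learned that string interning is a technique that can effectively reduce memory usage. The principle is quite simple, and other languages have it too: Python, for instance, points small numbers and some strings at the same memory to cut allocation time and memory use. String interning is the name for this kind of technique (https://en.wikipedia.org/wiki/String_interning).

Usage in Go

For where it is used in Go, see the explanations in the following articles, which I find very clear.

To experiment, the function below prints the data pointer of a string, so we can check whether two strings share the same bytes: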

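package main

import (
    "fmt"
    "reflect"
    "unsafe"
)

// pointer returns the address of the bytes backing s.
func pointer(s string) uintptr {
    p := unsafe.Pointer(&s)
    h := *(*reflect.StringHeader)(p)
    return h.Data
}

func main() {
    b := []byte("hello")
    s := string(b) // each conversion copies b into new memory
    t := string(b)
    fmt.Println(pointer(s), pointer(t))
}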

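The related playground link is here. The experiment quickly shows that the two strings point to different memory. When comparing strings in Go, if both have the same pointer and the same Len, the comparison can finish early instead of going byte by byte. Interning means exactly this: avoid allocating the same content repeatedly and reuse a single string.

Another interesting point: change hello to h in the playground and both pointers are the same. This is because single-byte strings come from a table fixed at compile time rather than from a new allocation. The CL explains the details, and its benchmark shows that preallocating single-character strings improves efficiency considerably.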

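Rolling your own string interning

Go itself does not intern in many places, but the technique can be useful when processing data. As the first article describes, handling a large amount of text without interning may require allocating a great deal of memory. Likewise, when reading from a database, some values appear over and over, and the same technique applies.

In general, string interning can be built by hand with a cache-like structure, as in the code from the second article: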

func intern(m map[string]string, b []byte) string {
    // look for an existing string to re-use
    c, ok := m[string(b)]
    if ok {
        // found an existing string
        return c
    }
    // didn't find one, so make one and store it
    s := string(b)
    m[s] = s
    return s
}

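The trouble is the same as maintaining your own LRU cache: strings that are no longer needed must be evicted, or they keep occupying memory, and building an LRU cache that supports concurrent access is not easy either. Left unhandled, interning can backfire. A project that often comes up in these discussions is https://github.com/josharian/intern, which implements interning on top of sync.Pool; still, think carefully about whether it is really worth using.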
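To illustrate the concurrency and eviction concerns, here is a minimal sketch of a mutex-guarded interner with a crude size cap. It is not how josharian/intern works; `Interner`, `NewInterner`, and `InternBytes` are hypothetical names, and dropping the whole map when the cap is hit is a stand-in for real LRU eviction.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
	"unsafe"
)

// Interner is a hypothetical string pool that is safe for concurrent use.
type Interner struct {
	mu    sync.Mutex
	m     map[string]string
	limit int // maximum number of distinct strings kept
}

func NewInterner(limit int) *Interner {
	return &Interner{m: make(map[string]string), limit: limit}
}

// InternBytes returns a canonical string for b, allocating only on a miss.
func (in *Interner) InternBytes(b []byte) string {
	in.mu.Lock()
	defer in.mu.Unlock()
	if s, ok := in.m[string(b)]; ok {
		return s
	}
	if len(in.m) >= in.limit {
		// Crude eviction: start over instead of tracking LRU order.
		in.m = make(map[string]string)
	}
	s := string(b)
	in.m[s] = s
	return s
}

// sameData reports whether two strings share the same backing bytes.
func sameData(a, b string) bool {
	pa := (*reflect.StringHeader)(unsafe.Pointer(&a)).Data
	pb := (*reflect.StringHeader)(unsafe.Pointer(&b)).Data
	return pa == pb
}

func main() {
	in := NewInterner(1024)
	raw := []byte("hello")
	s := in.InternBytes(raw)
	t := in.InternBytes(raw)
	fmt.Println(sameData(s, t)) // true: both come from the pool
}
```

A single mutex is the simplest choice and may itself become a bottleneck under heavy contention, which is one more reason to benchmark before adopting interning.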

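Incidentally, this CL led me to another CL whose discussion is also very good and taught me about some considerations in system design. I recommend reading dsnet's views and experiments on memoizing strings during decode. Many points are worth borrowing, for example:

The longer the string, the less cache-friendly it is: the cache hit rate may drop quickly, and long strings take more time to compare and hash. So only cache small strings (e.g. < 16B).
The Go memory allocator is actually fast: allocating a 16B string takes only about 35ns, so the cache lookup and comparison must be faster than 35ns to pay off.
JSON object names have a high degree of locality, so a very small cache for them is worthwhile.
JSON values are probably less worth caching, because their locality is poor.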

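Conclusion

Overall, string interning may not be a technique you reach for often, but in truly extreme cases it can be a good way to reduce memory usage and GC pressure, provided careful benchmarks come first.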

Reference

Photo by jesse orrico on Unsplash

Learning to write SIMD programs with compiler vector extensions

2020-11-01T12:09:10.000Z

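Recently my very capable tech lead, Champ Yen, gave an excellent internal experience-sharing session on using compiler vector extensions to write SIMD programs, improving performance while keeping the program portable.

What is SIMD

SIMD stands for single instruction, multiple data: as the name suggests, one instruction operates on multiple pieces of data.

Flynn's taxonomy classifies computers by splitting the information stream into instructions and data. The ordinary case we are used to, one instruction operating on one piece of data (a register), is called SISD. SIMD matters because single-core clock frequency stopped climbing long ago.

Improving program efficiency therefore became a matter of finding parallelism: making good use of multiple cores, understanding NUMA, and adopting techniques such as SIMD.

Why SIMD is faster

A slide from Prof. Chih-Wei Liu's course at National Chiao Tung University compares the number of instructions needed by scalar code and by vector code. The scalar code also runs inside a loop, so overall it takes even longer.

Types of SIMD instructions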

Load/Store
Per-Lane
- Arithmetic
- Bitwise, Logical
Cross/Inter Lane
- Permute, Select, Shuffle(LUT)
- Alignment
- Pack & UnPack
Reduction (e.g Average of vector)
- Minimum
- Maximum
- Average
Special (special-purpose instructions)
- NN specific ISA
- inter-lane + per lane attributes

Why compiler vector extensions

Vectors can improve program performance
Easier to use than platform-specific intrinsics/ASM
Makes it easier to modify existing C/C++ programs
Portability (a big plus)
Can be used together with OpenMP (I do not fully understand this part, since I have never written OpenMP)

How to use them

GCC vector type declaration

First, how to declare a vector type. The syntax is:

typedef SCALAR_TYPE TYPE_NAME __attribute__((vector_size(SIZE), aligned(1)));

e.g:

typedef int v4si __attribute__ ((vector_size (16), aligned(1)));
typedef float v4sf __attribute__ ((vector_size (16)));
typedef double v4df __attribute__ ((vector_size (32)));
typedef unsigned long long v4di __attribute__ ((vector_size (32)));

These declarations are simple. Taking v4si as an example, we declare a vector with a vector_size of 16 bytes, divided into 4 int-sized units. They can be initialized like this:

v4si a = {1,-2,3,-4};
v4sf b = {1.5f,-2.5f,3.f,7.f};
v4di c = {1ULL,5ULL,0ULL,10ULL};

Using SIMD the same way as scalars

typedef int v4si __attribute__ ((vector_size (16)));

int main() {
    v4si a = {1,2,3,4};
    v4si b = {3,2,1,4};
    v4si c;

    c = a + b;      /* The result would be {4, 4, 4, 8}  */
    c = a > b;     /* The result would be {0, 0,-1, 0}  */
    c = a == b;     /* The result would be {0,-1, 0,-1}  */
}

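Other compiler built-in functions worth knowing

__builtin_shuffle
__builtin_convertvector
__builtin_prefetch
__builtin___clear_cache

An example

This is a question I put to Champ: how to convert a std::vector<int> into a std::vector<float> efficiently. The answer uses __builtin_convertvector. Because a std::vector stores its elements contiguously, we can point a vector-typed pointer at the data and operate on it with SIMD.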

// aligned(1): std::vector storage is not guaranteed to be 32-byte aligned
typedef float v8sf __attribute__ ((vector_size (32), aligned(1)));
typedef int v8si __attribute__ ((vector_size (32), aligned(1)));

std::vector<int> vint(TEST_LEN);
std::vector<float> vfp(TEST_LEN);

srand(time(NULL));
for(int i=0; i < TEST_LEN; i++) {
    vint[i] = rand();
}

int *intp = vint.data();
float *fpp = vfp.data();
struct timeval stime, etime;
gettimeofday(&stime, NULL);
// convert 8 ints per iteration; a tail shorter than 8 is left to scalar code
for(int i=0; i+8 <= TEST_LEN; i+=8) {
    *((v8sf*)(fpp+i)) = __builtin_convertvector(*(v8si*)(intp + i), v8sf);
}
gettimeofday(&etime, NULL);

Architecture Dependent Compiler Intrinsics

A union can be used to pull individual values out of a vector register, which makes debugging, or genuine type conversion, less painful.

gcc:

#include <immintrin.h> /* header name was lost in extraction; assumed here because it provides __m128i */

typedef unsigned char u8x16 __attribute__ ((vector_size (16)));
typedef unsigned int u32x4 __attribute__ ((vector_size (16)));

typedef union {
    __m128i mm;
    u8x16   u8;
    u32x4   u32;
} v128;

LLVM/clang:

use vector extension variables directly

Ref: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

Other tips

Porting & troubleshooting approaches

You need to work out by hand how to cut the original loop into fixed-size chunks (e.g. 8 for int32_t), then replace the loop body with vector operations.

Deployment - Function Multi-Versioning (a.k.a FMV)

Pros:
The compiled binary can run on different platforms
Cons:
The binary gets bigger

__attribute__((target_clones("avx2", "avx", "sse4.2", "sse3", "sse2", "default")))
int main(void) {
    v8si v0 = {0, 1, 2, 3, 4, 5, 6, 7};
    v8si v1 = {8, 9, 10, 11, 12, 13, 14, 15};
    
    v8si v2 = v0 + v1;
    return v2[3];
}

If you are interested, https://godbolt.org/z/of5d6v shows that adding this line makes the compiler emit several versions of the assembly, one for each target's vector operations.

Ref: https://lwn.net/Articles/691932/

Things to watch out for with SIMD/vectors

Need an algorithm that exposes parallelism
SIMD can be used in several ways, but portability has to be considered.
Unsupported operations
- division, high-level functions (math functions)
Floating point
- cross-device compatibility
Boundary handling
- requires padding, predication, or a scalar fallback
Divergence
Register splitting
Non-regular access/process patterns and dependencies have to be considered
- e.g. LUT, AoS (array of structures)

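Some takeaways

Thanks to Champ's talk I finally know how to use SIMD safely. Until now I had only watched projects happily write x86 assembly, or relied on the compiler's automatic vectorization; now I know it can also be done through compiler intrinsics. Champ also pointed out that SIMD has been around for many years and that compilers such as GCC have supported this since 3.1, so it can be used with confidence, and the portability worries are largely taken care of. From a quick search, library intrinsics are an alternative if you truly need to be compiler-agnostic, though each approach has its trade-offs. The advantage of compiler intrinsics is that the code can still be written as if it handled scalars, which I personally find pleasant to read.

While searching for related material I read many good articles. The Stack Overflow blog covers some applications of SIMD, and the 快快樂樂 SIMD article points out quite a few pitfalls. The x86 and ARM machines offered by AWS also document which SIMD instructions they support. If we really want to squeeze out performance, these basics need to be picked up.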

Reference

photo credit from https://unsplash.com/photos/Uf-c4u1usFQ

Notes on Google's Swiss table hash map

2020-10-17T04:25:14.000Z

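This is about the talk Matt Kulukundis gave at CppCon 2017, which I think is excellent. It explains how Swiss Tables yield a hash map (flat hash map) that better fits modern hardware. Google uses it heavily internally as a replacement for std::unordered_map. Watching it overturned some of my assumptions, which was quite a shock.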

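The implementation is designed to be cache-friendly and to use SIMD for speed. What I mainly learned: many hash map implementations traditionally use chaining, because under a high load factor it limits the cost of key collisions and keeps insertion time down. The flat hash map instead uses open addressing and puts all elements into one flat memory array, which is cache-friendly. With SIMD, a whole group of candidate slots can be checked in one step instead of spending several CPU cycles per probe, which gives a large speedup on lookups.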
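The "check a whole group in one step" idea can be sketched in scalar Go. This is an illustration under assumptions, not Abseil's code: it assumes a group of 16 one-byte control slots, each holding either a 7-bit hash fragment or an empty marker (0x80 here), and the loop in `matchGroup` stands in for what a SIMD build would do with a single byte-wise compare and a movemask.

```go
package main

import (
	"fmt"
	"math/bits"
)

const (
	groupSize = 16
	ctrlEmpty = 0x80 // assumed marker for an empty slot
)

// matchGroup returns a bitmask with bit i set when ctrl[i] == h2.
// With SIMD this is one compare over 16 bytes plus one movemask.
func matchGroup(ctrl [groupSize]byte, h2 byte) uint16 {
	var mask uint16
	for i, c := range ctrl {
		if c == h2 {
			mask |= 1 << uint(i)
		}
	}
	return mask
}

func main() {
	var ctrl [groupSize]byte
	for i := range ctrl {
		ctrl[i] = ctrlEmpty
	}
	ctrl[3], ctrl[9] = 0x2a, 0x2a // two slots share the same 7-bit fragment
	ctrl[5] = 0x11

	// Visit only the candidate slots; a real table would now compare full keys.
	for m := matchGroup(ctrl, 0x2a); m != 0; m &= m - 1 {
		fmt.Println("candidate slot:", bits.TrailingZeros16(m))
	}
}
```

Only the slots whose bit is set need a full key comparison, so most probes touch a single 16-byte control group and very few keys.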

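A few articles, such as Swisstable, a Quick and Dirty Description, explain the concept clearly, and the table has also been ported to Rust. Thinking about it briefly, an approach like this seems harder to implement in Go, which has to account for its runtime and has no generics. Corrections are welcome if I got this wrong.

Overall, both Google's open-source Abseil [1] and the Rust stdlib [2] have adopted this data structure. Still, pay attention to what your own data looks like and to the distribution of keys after hashing when choosing a hash map that suits you.

I am only now learning how many SIMD algorithms exist, which is a real eye-opener. It seems there is still a lot more that can be accelerated this way, so let's keep looking.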

[1] https://github.com/abseil/abseil-cpp
[2] https://github.com/rust-lang/hashbrown
[3] https://rcoh.me/posts/hash-map-analysis/
[4] https://www.youtube.com/watch?v=ncHmEUmJZf4

Coscup sharing - How I contribute golang OSS

2020-09-21T13:09:40.000Z

Coscup is an annual conference held by Taiwan's open source communities. This year I participated as a speaker and gave a talk about how I contribute to golang OSS. Giving a talk at Coscup was a really good experience. It is not only the biggest open source gathering in Taiwan, but also a place to meet and learn from many passionate people. In recent years you hear a lot of questions such as what FOSS stands for, and some people feel that different groups approach OSS for different purposes. More and more companies and individuals join and contribute to OSS; some do it for business reasons and some just want to build a reputation. However, I still believe most of us simply want to share good things with others. That is also why I like to join these communities.

Unlike last year, when I gave a fairly technical talk about a CNCF project called Thanos, this year I talked about how I participated in a golang OSS project called YouTube. One of the most interesting things was that at the end of my talk many people came to ask for more details, such as how to create a PR properly and how to join an OSS project without writing code. That felt good, because I may have helped some people get involved in contributing.

Photo taken by a friend

Here are all the golang track videos and slides. I highly recommend taking a look at them.

Notes on Linux file descriptors

2020-08-22T14:29:10.000Z

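Preface

I am embarrassed to admit that, despite working with Linux all along and knowing the idea that everything in Unix is a file, I never properly understood the basic structure behind file descriptors. Then I came across an article on Zhihu about the history of the Linux file descriptor, which gave me a better sense of why it looks the way it does. (I originally wanted to find English material, but that article explains the history quite clearly.)

This post records the current file descriptor structures and some other material I came across. I am no Linux kernel expert, so corrections are welcome.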

file descriptor

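A file descriptor (fd) is essentially an interface layer that lets us operate on files and other input/output interfaces (such as pipes and sockets).

Basic structure inside the kernel

Each process has its own file descriptor table.
A file descriptor is really just a reference that points to an entry in the system-wide open file table; POSIX calls such an entry an open file description.
The inode_ptr in an open file table entry in turn points to an entry in the i-node table.

The relationship between file descriptors and files is not one-to-one.

The figure comes from the slides lusp_fileio_slides.pdf. I also strongly recommend the author's book, The Linux Programming Interface; it is well worth owning.

The corresponding data structures in the source code

The process's task_struct has a files_struct member, and the file descriptor is looked up through it. The contents of files_struct used to live directly inside task_struct; they were later split out and accessed through a pointer. The main reason is that once Linux supported threads, with one task_struct per thread, threads could share resources such as files_struct through that pointer.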
struct task_struct {
...
struct files_struct *files;
...
}

Inside files_struct is the per-process fdtable (file descriptor table). It uses RCU, an impressive technique that improves fdtable access performance for read-mostly workloads.
(struct fdtable in include/linux/fdtable.h)

struct files_struct {
  /*
   * read mostly part
   */
atomic_t count;
bool resize_in_progress;
wait_queue_head_t resize_wait;

struct fdtable __rcu *fdt;
struct fdtable fdtab;
  /*
   * written part on a separate cache line in SMP
   */
spinlock_t file_lock ____cacheline_aligned_in_smp;
unsigned int next_fd;
unsigned long close_on_exec_init[1];
unsigned long open_fds_init[1];
unsigned long full_fds_bits_init[1];
struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};

struct fdtable {
unsigned int max_fds;
struct file __rcu **fd;      /* current fd array */
unsigned long *close_on_exec;
unsigned long *open_fds;
unsigned long *full_fds_bits;
struct rcu_head rcu;
};

The open file table, whose entries are also called open file descriptions, is a system-level table (https://github.com/torvalds/linux/blob/master/include/linux/fs.h#L921). This struct defines important data such as file_offset and file_status, and most importantly the inode_ptr that points to the corresponding inode.

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;
	...
};

An open file table entry then points into the system-wide inode table (https://github.com/torvalds/linux/blob/master/include/linux/fs.h#L615); its i_mode field records which type of file it is.

struct inode {
	umode_t			i_mode;
	unsigned short		i_opflags;
	kuid_t			i_uid;
	kgid_t			i_gid;
	unsigned int		i_flags;
	...
};

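Some common fd operations

Within one process, dup() or dup2() duplicates a file descriptor, so two fds point to the same open file entry (and therefore the same file).
After fork(), parent and child each hold their own file descriptors, which point to the same open file entries.
When different processes open the same file, each uses its own file descriptor pointing to a different open file entry, but those entries end up pointing to the same inode.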
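The difference between sharing an open file entry and sharing only the inode shows up in the file offset. The Go sketch below (Linux or another Unix is assumed, and `offsets` is a hypothetical helper) writes 5 bytes through one descriptor and then asks for the current offset through a dup'ed descriptor and through an independent open of the same path. A second open inside one process behaves, for this purpose, like an open from a different process.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"syscall"
)

// offsets writes 5 bytes through one descriptor and reports the current
// offset seen through (a) a dup'ed descriptor and (b) a separate open.
func offsets() (dupOff, openOff int64, err error) {
	f, err := os.CreateTemp("", "fd-demo")
	if err != nil {
		return
	}
	defer os.Remove(f.Name())
	defer f.Close()

	// dup: a new descriptor pointing at the same open file entry.
	dupFd, err := syscall.Dup(int(f.Fd()))
	if err != nil {
		return
	}
	defer syscall.Close(dupFd)

	// open: a new open file entry pointing at the same inode.
	openFd, err := syscall.Open(f.Name(), syscall.O_RDONLY, 0)
	if err != nil {
		return
	}
	defer syscall.Close(openFd)

	if _, err = f.Write([]byte("hello")); err != nil {
		return
	}
	if dupOff, err = syscall.Seek(dupFd, 0, io.SeekCurrent); err != nil {
		return
	}
	openOff, err = syscall.Seek(openFd, 0, io.SeekCurrent)
	return
}

func main() {
	dupOff, openOff, err := offsets()
	if err != nil {
		fmt.Println("demo failed:", err)
		return
	}
	fmt.Println("offset via dup:", dupOff)            // 5: shared open file entry
	fmt.Println("offset via separate open:", openOff) // 0: its own open file entry
}
```

Descriptors inherited across fork() behave like the dup case, which is why a child can keep a socket alive after the parent has closed its own descriptor.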

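Other lessons learned

Before I understood fds I made quite a few mistakes in my code. Once, in a socket program, the main process forked a child process without using the close-on-exec flag, so the fds opened by the main process were carried over to the child. Even after the main process closed a socket, the copy held by the child kept it from being released. The load balancer in front therefore reported that the number of backend connections was not dropping, and rate limiting then blocked incoming connections, even though the backend was mostly idle. That was a trap laid by not knowing how fds behave. Now that I understand them, I plan to write more notes on how epoll, SCM_RIGHTS, and similar mechanisms work. Understanding fds really matters when writing programs!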

Reference:

https://man7.org/training/download/lusp_fileio_slides.pdf

Image from https://unsplash.com/photos/o6GEPQXnqMY

io_uring in Linux 5.1

2020-08-19T15:11:06.000Z

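I have shared my thoughts on quite a few technical articles on Facebook, and readers suggested putting them on the blog. I originally meant to keep the blog for longer, more polished pieces, but on reflection it is worthwhile if more people can see these notes and discuss them. I will gradually copy my earlier notes over.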

https://www.facebook.com/kkcliu/posts/10157179358206129

io_uring

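While writing an article about epoll a while ago, I came across a discussion thread mentioning io_uring. I had not heard of it, so I looked it up: it is being added in the new Linux kernel 5.1, mainly to fix the problems of the existing Linux native AIO. Generally speaking, AIO should perform better than epoll; a simple comparison can be found on Stack Overflow: https://stackoverflow.com/questions/5844955/whats-the-difference-between-event-driven-and-asynchronous-between-epoll-and-a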

epoll is a blocking operation (epoll_wait()) - you block the thread until some event happens and then you dispatch the event to different procedures/functions/branches in your code.
In AIO, you pass the address of your callback function (completion routine) to the system and the system calls your function when something happens.

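In short, epoll waits for an event to happen and then does the work, so epoll_wait is a blocking operation. AIO hands the corresponding callback function to the system, which makes it truly asynchronous. MySQL's InnoDB also uses native Linux AIO. However, native Linux AIO has a fair number of problems, large and small, so it never became very popular. I recommend this Cloudflare article, https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/ , which shows how to use AIO and discusses some of its problems. One interesting part is Linus's opinion of AIO: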

AIO is a horrible ad-hoc design, with the main excuse being “other, less gifted people, made that design, and we are implementing it for compatibility because database people - who seldom have any shred of taste - actually use it”. But AIO was always really really ugly.

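Then I read the slides shared by Facebook, https://www.slideshare.net/ennael/kernel-recipes-2019-faster-io-through-iouring , and the discussion on Hacker News, https://news.ycombinator.com/item?id=19843464 . Most importantly, the performance really is much better: the epoll vs io_uring benchmark at https://github.com/frevib/io_uring-echo-server/blob/io-uring-feat-fast-poll/benchmarks/benchmarks.md shows io_uring being more than 40% faster.

Many projects, such as libuv, Rust, Ceph, and RocksDB, are discussing or working on io_uring integration. This will have a major impact on database and cloud-related industries; the potential cost savings are striking just to think about. It will take a while for everyone to upgrade to 5.1, but I am looking forward to how this develops.

Postscript: my colleague Champ pointed out that the problem with Linux AIO is that it only works with O_DIRECT, so many programs cannot benefit from the system page cache. That is why AIO is hard to use.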

Reference

Header photo is from https://unsplash.com/photos/1XgFFEG_RGA

Improve CPU Utilization of Go app By Using BPF

2020-06-05T13:43:24.000Z

In this post, I'd like to share how I used BPF to fix a CPU saturation issue in a Go app. The issue occurred at the cgo level, which made the root cause really hard to detect.

Problem

A few months ago, while working on cloud cost reduction, I found that the CPU utilization of our streaming services stayed at 80% to 90% no matter how many clients were connected. This was odd, because CPU utilization should normally decrease as the number of connections goes down. It also kept us from making further improvements, such as autoscaling instances according to workload.

Triage

Usually we can use Go's built-in profiling tool, go tool pprof, to track down CPU or memory issues. However, because this Go program depends heavily on a C library, I suspected the bottleneck was in the C code.

I had recently enjoyed watching Brendan Gregg's BPF talks and wondered whether BPF could solve this issue. If you are not familiar with BPF, there are plenty of great talks on YouTube.

BCC is a toolkit for BPF-based Linux analysis. It includes many good tools that help us analyze a running system without writing BPF programs ourselves. I highly recommend trying some BCC tools, because they are simple to use and very useful.

In our case, I used profile.py to profile CPU usage. This tool samples stack traces at timed intervals and can produce output in the popular flame graph format. Most interestingly, it can profile a running process by attaching to a given PID, which I find really powerful when investigating issues on a live system. I used the following commands to generate the flame graph.

# ./profile -df -p <PID> 5 > out.profile
# git clone https://github.com/brendangregg/FlameGraph
# ./FlameGraph/flamegraph.pl < out.profile > out.svg

Here is what I got:

Before we dive into this chart, I want to note how to read a flame graph. According to the book Linux Observability with BPF, there are two important things to keep in mind:

The x-axis is ordered alphabetically, not by time. The width of each frame represents how frequently that code was consuming CPU in your system.
The y-axis shows the stack traces ordered as the profiler reads them, preserving the trace hierarchy.

With this flame graph, it became clear that most CPU time was spent in epoll_wait. Since this is a streaming service, it naturally uses epoll to handle many client connections efficiently. However, I did not expect epoll_wait to be invoked so many times. I also tried another BCC tool, syscount, to count syscalls.

syscount -p <PID>

I saw the result like this.

Then I used strace -p <PID> to observe what was happening in the process. I did see a lot of epoll_wait invocations while the process was running, as follows.

epoll_wait(4, [], 1, 0)                  = 0
epoll_wait(10, [], 1, 0)                 = 0
epoll_wait(20, [], 1, 0)                 = 0
epoll_wait(4, [], 1, 20)                 = 0
epoll_wait(10, [{EPOLLIN, {u32=5, u64=140200617443333}}], 1, 50) = 1
...

I was curious about the meaning of the epoll_wait arguments, so I checked the man page for the function signature.

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

It looked to me like the timeout field could be the key to this problem, so I needed to check its definition.

Specifying a timeout of -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero cause epoll_wait() to return immediately, even if no events are available.

This is really interesting. I went back to check our code. It turned out that the value of timeout was not a fixed number but was generated by another function, which acted too aggressively and set the timeout to 0 in most cases. We also double-checked that epoll_wait mostly returned nothing when the timeout was 0. When you write code using epoll, the pseudo-code might look like this:

while (1) {
    n = epoll_wait(efd, events, MAXEVENTS, timeout);
    if (n > 0) {
        /* process activity */
    } else {
        /* process inactivity */
    }
}

That means that when the timeout value is 0, the loop becomes a busy loop and CPU time is spent here for nothing.
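The effect is easy to reproduce in isolation. The following is a minimal Go sketch (Linux only), not code from our service: `countWaits` is a hypothetical helper that calls epoll_wait on an epoll instance with no registered descriptors and counts how many calls complete within a fixed period.

```go
package main

import (
	"fmt"
	"syscall"
	"time"
)

// countWaits calls epoll_wait with the given timeout (in ms) on an empty
// epoll instance for durationMs milliseconds and returns the call count.
func countWaits(timeoutMs, durationMs int) (int, error) {
	epfd, err := syscall.EpollCreate1(0)
	if err != nil {
		return 0, err
	}
	defer syscall.Close(epfd)

	events := make([]syscall.EpollEvent, 1)
	deadline := time.Now().Add(time.Duration(durationMs) * time.Millisecond)
	calls := 0
	for time.Now().Before(deadline) {
		_, err := syscall.EpollWait(epfd, events, timeoutMs)
		if err == syscall.EINTR {
			continue // interrupted by a signal; retry
		}
		if err != nil {
			return calls, err
		}
		calls++
	}
	return calls, nil
}

func main() {
	busy, err := countWaits(0, 50)
	if err != nil {
		fmt.Println("epoll failed:", err)
		return
	}
	polite, err := countWaits(1, 50)
	if err != nil {
		fmt.Println("epoll failed:", err)
		return
	}
	fmt.Printf("timeout=0: %d calls, timeout=1ms: %d calls in 50ms\n", busy, polite)
}
```

With a timeout of 0 every call returns immediately, so the loop spins through system calls; with 1ms each call sleeps in the kernel, so far fewer calls fit in the same period.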

Fix

We made a quick, small change: when the computed timeout is 0, use 1ms instead. This reduces the total number of epoll_wait invocations and of the function blocks that follow it. We deployed the fix on both the staging and production systems to make sure video streaming still worked smoothly. It was a remarkable moment when we saw the impact. CPU usage on the staging server dropped from 6% to 1%. Production services showed the same trend, with improvements of around 30% to 70% depending on the workload.

After applying the fix, we ran syscount again to verify the system. As you can see, the total number of epoll_wait calls is now close to that of recvfrom, which means we save a lot of CPU time.

Conclusion

BPF is a really good tool for quickly identifying the potential root cause of a problem. A further benefit is that it adds very little overhead to your system.
strace is a very powerful tool too, and every programmer should learn how to use it. It lets us observe a program's system calls on the fly, which tells us a lot about how the program works.
I encourage everyone to try other BCC tools for observing live systems. It's really fun!

After I fixed this issue with BPF, my feeling was like

AWS SSM Session Manager Notes

2020-04-11T13:09:31.000Z

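Logging in to an AWS EC2 instance has long been a headache. The traditional approach is to assign a key pair when creating the instance and then keep the private key safe. For anyone who manages many machines this quickly becomes tedious: how many people actually rotate the keys on their machines? Yet with many services and machines, managing the lifecycle of those keys is very important.

On top of that, you still see EC2 instances turned into so-called pets, with a pile of people's keys added to .ssh/authorized_keys so that everyone can log in. This makes the machines harder to keep secure, and after someone leaves, nobody knows whether their key was actually removed.

For security we usually set up a bastion host (jump box) together with an IP whitelist and a VPN, so that only authorized people can reach the machines. Whichever approach we pick, though, it adds management overhead.

AWS Session Manager solves this problem nicely. Last year AWS also added new features that let us scp files to and from EC2 instances, or use port forwarding to test services inside a private VPC from a local machine. This note covers how to use Session Manager and the related IAM settings.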

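Installation

Local machine requirements

Follow the official documentation:

Install the latest AWS CLI; version 1.16.12 or later is required
Install the Session Manager plugin

EC2 requirements

By default, Session Manager has no permission to access EC2 instances. You need to update the instance profile and install the SSM agent.

Create an IAM instance profile for Systems Manager (https://docs.aws.amazon.com/systems-manager/latest/userguide/setup-instance-profile.html)

The instance profile needs the AmazonSSMManagedInstanceCore policy.

You can also refer to the minimal S3 bucket permissions: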

{
     "Effect": "Allow",
     "Action": [
         "s3:GetObject"
     ],
     "Resource": [
         "arn:aws:s3:::aws-ssm-us-east-1/*",
         "arn:aws:s3:::aws-windows-downloads-us-east-1/*",
         "arn:aws:s3:::amazon-ssm-us-east-1/*",
         "arn:aws:s3:::amazon-ssm-packages-us-east-1/*",
         "arn:aws:s3:::us-east-1-birdwatcher-prod/*",
         "arn:aws:s3:::patch-baseline-snapshot-us-east-1/*"
     ]
 }

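Make sure the SSM agent is installed on every instance. Recent Ubuntu and Amazon Linux 2 AMIs on AWS ship with it preinstalled, but on older AMIs you have to install it yourself.

User configuration

Following the documentation, set up the IAM policy for each user. Here is a simple example: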

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:instance/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeSessions",
                "ssm:GetConnectionStatus",
                "ssm:DescribeInstanceProperties",
                "ec2:DescribeInstances"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:TerminateSession"
            ],
            "Resource": [
                "arn:aws:ssm:*:*:session/${aws:username}-*"
            ]
        }
    ]
}

Going a step further, we can use tags to control which environments a user can access, such as staging or production:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ssm:StartSession"
            ],
            "Resource": [
                "arn:aws:ec2:*:*:instance/*"
            ],
            "Condition": {
                "StringLike": {
                    "ssm:resourceTag/Environment": [
                        "staging"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:TerminateSession"
            ],
            "Resource": [
                "arn:aws:ssm:*:*:session/${aws:username}-*"
            ]
        }
    ]
}
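If you need one such policy per environment, generating the JSON is less error-prone than copying it by hand. The following is a minimal sketch, assuming the Environment tag key used above; the function name build_session_policy is made up for this example and is not part of any AWS SDK.

```python
import json


def build_session_policy(environment):
    """Return a policy document that only allows sessions to instances
    tagged Environment=<environment> (tag key taken from the example above)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ssm:StartSession"],
                "Resource": ["arn:aws:ec2:*:*:instance/*"],
                "Condition": {
                    "StringLike": {"ssm:resourceTag/Environment": [environment]}
                },
            },
            {
                "Effect": "Allow",
                "Action": ["ssm:TerminateSession"],
                "Resource": ["arn:aws:ssm:*:*:session/${aws:username}-*"],
            },
        ],
    }


policy_json = json.dumps(build_session_policy("production"), indent=4)
print(policy_json)
```

The printed document can then be attached to a user or group in whatever way you normally manage IAM policies.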

With the basic setup above in place, you can log in to a machine with the following command:

aws ssm start-session --target i-0b0d92751733d1234

Using scp

Setup

This section notes how to use scp through Session Manager. According to the AWS documentation page session-manager-getting-started-enable-ssh-connections, it works by using ProxyCommand to connect to the EC2 instance directly through an AWS tunnel.

Edit ~/.ssh/config and add:

# SSH over Session Manager
host i-* mi-*
    ProxyCommand sh -c "aws ssm start-session --target %h --document-name AWS-StartSSHSession --parameters 'portNumber=%p'"

Then you can run:

scp -i /path/my-key-pair.pem test123 ubuntu@i-0b0d92751733d1234:~/test123

Note that the connection still uses the key pair that was configured when the instance was created.

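Advanced setup

The method above lets us use scp and ssh, but it is a bit annoying that we still have to configure a key on the EC2 instance. Is there a way around that? Yes, although it takes a slightly tricky approach.

Someone has already published a proxy command script for this, and using it is simple:

Download the script and save it as ~/.ssh/aws-ssm-ec2-proxy-command.sh
Make aws-ssm-ec2-proxy-command.sh executable
Update the entry in ~/.ssh/config:

host i-* mi-*
    ProxyCommand ~/.ssh/aws-ssm-ec2-proxy-command.sh %h %r %p

Now there is no need to pass a key for authentication:

scp test123 ubuntu@i-0b0d92751733d1234:~/test123

The principle is simple. The script uses aws ec2-instance-connect send-ssh-public-key to install a short-lived key (the benefits of this command are covered in the AWS article new-using-amazon-ec2-instance-connect-for-ssh-access-to-your-ec2-instances), and then uses that key to connect to the remote EC2 instance over the same start-session path as before.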

Port forwarding

Another neat trick is port forwarding to reach services on EC2. We often put services in a private subnet, and when developers want to test them they usually have to go through a VPN or start an EC2 instance inside the network. Port forwarding makes this much easier.

aws ssm start-session --target i-0b0d92751733d1234 --document-name AWS-StartPortForwardingSession --parameters '{"portNumber":["80"],"localPortNumber":["9999"]}'

This makes port 80 of the service on the EC2 instance reachable at localhost:9999. For details, see the AWS article new-port-forwarding-using-aws-system-manager-sessions-manager.
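Quoting the JSON in --parameters by hand is easy to get wrong. As a small sketch, the helper below (port_forward_command is a made-up name, not an AWS API) builds the same command as an argument list, so that json.dumps takes care of the quoting. Actually running it assumes the AWS CLI and the Session Manager plugin are installed, which is why the subprocess call is left commented out.

```python
import json
import shlex


def port_forward_command(instance_id, remote_port, local_port):
    """Build the argument list for an SSM port forwarding session."""
    parameters = {
        "portNumber": [str(remote_port)],
        "localPortNumber": [str(local_port)],
    }
    return [
        "aws", "ssm", "start-session",
        "--target", instance_id,
        "--document-name", "AWS-StartPortForwardingSession",
        "--parameters", json.dumps(parameters),
    ]


cmd = port_forward_command("i-0b0d92751733d1234", 80, 9999)
print(shlex.join(cmd))  # shell-quoted form, handy for copy and paste
# import subprocess; subprocess.run(cmd, check=True)  # starts the session
```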

Takeaway

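Session Manager reduces key management and with it the number of security holes
ProxyCommand lets us build an ssh tunnel, so tools such as scp keep working
Port forwarding helps developers test services in a private subnet
Combined with AWS CLI v2, SSO can be used to further improve security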

Some Notes on Caching

2020-03-27T03:58:07.000Z

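Preface

I recently read an Amazon article, Caching challenges and strategies, which discusses the types of cache along with the logic and strategies for using them. It prompted me to organize some notes on caching in distributed systems. If I come across more material I will add it here. I strongly recommend reading the original AWS article and the references collected at the end of these notes; I think everyone can learn a lot from them.

When to use a cache

Most of the time we consider adding a cache to speed up the system's responses or to reduce the load on the backend or database. Adding a cache is also a challenge, because every extra component brings risk and complexity to the whole system. For example, if the overall hit rate is low, the cache may only add latency. Together with the cache availability, cache coherence, and cache invalidation problems that come with it, these are the key factors in deciding whether a cache is worth having.

Types of cache

Local cache
In short, this uses the machine's own memory as the cache. The upside is that it is readily available and simple. The downside is cache consistency across machines, plus a cold start problem right after boot.
External cache
The best-known external caches are Memcached and Redis. They solve the cross-machine consistency problem, but a failed or incorrect cache update can still cause other consistency problems. Adding this component also raises system complexity, since it needs its own monitoring and management. Availability is another concern: when the external cache goes down, the application must be able to cope.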

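Common external cache problems

This section records common external cache problems and their solutions.

Cache penetration

When a large number of requests hit the system at once and the cache has no entry for the requested keys, all of the pressure falls on the backend (most likely the database) and makes the system unstable.

Solutions:

cache empty data:
Cache empty results too. Storing only the cache key for them should not take much space.
bloom filter:
Use a bloom filter to filter out requests for keys that definitely do not exist, reducing the chance of cache penetration.
observe access pattern:
Observe the incoming request pattern and enforce rules on key format, so that keys that look wrong are filtered out immediately. Also check whether a user is trying to scan the database, and have a defense against patterns that keep querying different values over a long period.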
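To make the "cache empty data" idea concrete, here is a minimal in-process sketch. The class NegativeCache, its TTL values, and the dict standing in for the database are all assumptions for illustration; a real setup would store the marker in Redis or Memcached instead.

```python
import time

_MISSING = object()  # marker cached for keys known to be absent from the database


class NegativeCache:
    def __init__(self, loader, ttl=60.0, negative_ttl=5.0):
        self._loader = loader              # function: key -> value, or None if absent
        self._ttl = ttl                    # lifetime of real values
        self._negative_ttl = negative_ttl  # shorter lifetime for "not found"
        self._entries = {}                 # key -> (value or _MISSING, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return None if entry[0] is _MISSING else entry[0]
        value = self._loader(key)
        if value is None:
            # Remember the miss so repeated lookups do not reach the database.
            self._entries[key] = (_MISSING, time.monotonic() + self._negative_ttl)
        else:
            self._entries[key] = (value, time.monotonic() + self._ttl)
        return value


db = {"item:1": "book"}
db_reads = []


def load_from_db(key):
    db_reads.append(key)
    return db.get(key)


cache = NegativeCache(load_from_db)
for _ in range(3):
    cache.get("item:404")
print(len(db_reads))  # number of database reads caused by three lookups of a missing key
```

Giving the "not found" marker a shorter TTL than real values is a design choice in this sketch: it limits how long a key that is created later stays invisible.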

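Cache avalanche

A cache avalanche usually happens when the cache restarts or crashes, or when a large number of entries expire at the same time. A flood of requests then lands on the backend service or database. If the database is knocked over, it may not even be possible to bring it back up, because the heavy traffic keeps overwhelming it.

Incidentally, one reason many entries expire at the same time is that their expiration times (TTL) were set too close together when the cache was warmed up.

Solutions:

Cache high availability
The first idea is to make the cache itself highly available, for example by running Redis in cluster mode, so that a single cache node failure does not cause an avalanche.
Hystrix (circuit breaker, rate limit)
This is a protective measure. When the cache fails on a large scale, the pressure on the backend becomes enormous, and something like a database must never be allowed to collapse. A circuit breaker or rate limiter such as Hystrix can degrade service and let only part of the requests through to the backend.
Expiry with different TTL
Spread the TTLs of keys as much as possible to reduce the number of concurrent requests hitting the database.

A good solution will most likely combine several of the above.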
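Spreading TTLs can be as simple as adding random jitter when writing each key. A minimal sketch follows; the helper name jittered_ttl and the 10% spread are arbitrary choices for illustration.

```python
import random


def jittered_ttl(base_ttl, spread=0.1, rng=random):
    """Return base_ttl scaled by a random factor in [1 - spread, 1 + spread]."""
    return base_ttl * (1 + rng.uniform(-spread, spread))


rng = random.Random(42)  # seeded only to make the example repeatable
ttls = [jittered_ttl(600, spread=0.1, rng=rng) for _ in range(5)]
print([round(t) for t in ttls])
```

Keys written in the same batch then expire over a window of about two minutes instead of in the same second.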

Cache Stampede (Thundering Herd problem)

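A key that is accessed very frequently is a cache hotspot. On a cache miss for that key, the requests hit the backend or database all at once. This also falls under the question of how to do cache invalidation.

For example, suppose the metadata of a popular product is cached. When the entry expires and 1000 people request the product at the same time, those requests may all hit the database at once and bring it down.

Solutions:
Most of the solutions can be found in the article What is Cache Stampede.

Mutex (Locking)
Some articles call this request coalescing. Here it is achieved with a lock, so that only one request at a time can read the database and update the cache. With Redis you can create the lock with SETNX or a distributed lock, but watch out for issues such as releasing the lock.
External Computation
Use an external process, such as a cronjob or a worker + queue setup, to update the cache and handle invalidation. For example, a worker periodically scans database tables to refresh the cache, or a queue triggers the update. Note that the scan-the-database approach may fill the cache with a lot of data nobody needs.
XFetch
The third method comes from the paper Optimal Probabilistic Cache Stampede Prevention, with a slide deck, redisconf17-internet-archive-preventing-cache-stampede-with-redis-and-xfetch, that explains it. The core idea is simple: before the entry expires, one worker recomputes the value and TTL early. The method is efficient because, unlike the mutex approach, it does not need a lock.
Implementations in several languages are available online:
- golang(https://github.com/Onefootball/xfetch-go)
- rust(https://docs.rs/xfetch/1.0.0/xfetch/)
- ruby(https://github.com/Kache/xfetch)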
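As a rough sketch of the XFetch idea, the code below follows the early-expiration check described in the paper: refresh when now - delta * beta * log(rand()) >= expiry, where delta is how long the last recomputation took. The dict used as the cache and the function names are assumptions for illustration; the libraries linked above are the ones to use in practice.

```python
import math
import random
import time


def should_refresh_early(now, expiry, delta, beta=1.0, rng=random.random):
    """Probabilistic early expiration check.

    delta is the duration of the last recomputation; beta > 1 favors earlier
    refreshes. 1.0 - rng() lies in (0, 1], so the logarithm is defined and <= 0.
    """
    return now - delta * beta * math.log(1.0 - rng()) >= expiry


def xfetch(cache, key, ttl, recompute, beta=1.0, now=time.time):
    entry = cache.get(key)
    if entry is None or should_refresh_early(now(), entry["expiry"], entry["delta"], beta):
        start = now()
        value = recompute()
        delta = now() - start  # remember how expensive the recomputation was
        entry = {"value": value, "delta": delta, "expiry": now() + ttl}
        cache[key] = entry
    return entry["value"]


cache = {}
calls = []
first = xfetch(cache, "product:1", 60, lambda: calls.append(1) or "metadata")
second = xfetch(cache, "product:1", 60, lambda: calls.append(1) or "metadata")
print(first, second, len(calls))
```

The closer the entry gets to its expiry, and the more expensive the recomputation, the more likely a single request refreshes it early, so the hot key rarely expires for everyone at once.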

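Updating the cache the right way

This is something people often get wrong, but fortunately there is plenty of good material online. Here I draw directly on the article 缓存更新的套路 to learn the correct approach.

Consider the following cache update strategy. It is very intuitive: when the value is not in the cache, read it from the DB and then update the cache.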

ttl := 60
val := cache.Get("key")
if val == nil {
    // cache miss: read from the database and populate the cache
    val = db.Read("key")
    cache.Set("key", val, ttl)
}
return val

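At first glance nothing is wrong, but under concurrency the result can be surprising.

As the diagram shows, the cache may end up holding dirty data. The CoolShell article cites the approach from Facebook's paper Scaling Memcache at Facebook: on a write, delete the value from the cache and let readers repopulate it. Dirty cache data can still occur, but the probability drops considerably without hurting performance.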
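Here is a minimal sketch of the delete-on-write approach, with two plain dicts standing in for the database and the cache (the Store class and the function names are hypothetical, and this single-threaded version only shows the order of operations, not the remaining race).

```python
class Store:
    """In-memory stand-ins for the database and the cache."""

    def __init__(self):
        self.db = {}
        self.cache = {}


def read(store, key):
    value = store.cache.get(key)
    if value is None:
        # Cache miss: the reader is responsible for repopulating the cache.
        value = store.db.get(key)
        if value is not None:
            store.cache[key] = value
    return value


def write(store, key, value):
    store.db[key] = value
    # Invalidate instead of writing the new value into the cache.
    store.cache.pop(key, None)


store = Store()
write(store, "key", "v1")
read(store, "key")         # populates the cache with v1
write(store, "key", "v2")  # the cached v1 is deleted, not overwritten
print(read(store, "key"))  # the next reader loads the current value from the db
```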

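Other notes

Watch the lifecycle of your data and tune the TTL accordingly. Data that rarely changes but is read often can have a higher TTL, and vice versa.
When updates are infrequent, use a lock and let a reader update the cache.
When updates are frequent, consider refreshing the cache with a queue + worker.

Closing

I wrote this post to refresh my memory and to finally read through my bookmarked articles. Cache deployments usually have many layers, something like browser -> CDN -> local cache -> external cache -> database, so this knowledge is needed in many places. Cache coherence is a classic problem in both hardware and software, and understanding it helps us architect systems better.

Along the way I also found that Java has Ehcache, and that the author of Memcached wrote Groupcache in Go. Both use local machine memory to provide a distributed cache, and I recommend taking a look. I will probably dig into the Groupcache code later.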

How I Analyze S3 Upload Latency Issues

2020-03-16T14:34:25.000Z

TLDR;

I recently helped my company migrate an S3 bucket to improve the image upload speed of our Lambda function. I found that placing the Lambda function and the S3 bucket in the same region reduces latency significantly. Because Lambda is billed by execution time and memory size, reducing S3 upload latency helps directly.

Objective

This article explains why I needed to move our alert bucket from us-east-1 to us-west-2. One of our Lambda functions is latency-sensitive, so I started digging into why uploading a 60 KB image took more than 600 ms. AWS should minimize end-to-end latency between two regions by using its backbone network, so we assumed the situation would not be that bad. Here we run several experiments to find the reason behind it.

Background

We stored all alert images in an S3 bucket in us-east-1 because we previously ran everything in us-east-1. However, most AWS outages happen in that region, as AWS tends to roll out new services and features there frequently. To make our services more robust, we migrated most of our critical services to us-west-2 but left the S3 bucket in us-east-1. We did not notice how much the upload latency from our application to S3 had changed until we looked closely at our edge Lambda function.

Experiments

As of March 2020, we ran several experiments with different configurations (bucket location and SSL setting). The results are summarized in this table.

Path	SSL: false	SSL: true
Lambda (us-east-1) → S3 (us-west-2)	300ms ~ 400ms	500ms ~ 550ms
Lambda (us-east-1) → S3 (us-east-1)	60ms ~ 120ms	60ms ~ 133ms

[Screenshots of the measured upload latency for the four configurations: SSL false and SSL true, each for Lambda (us-east-1) → S3 (us-west-2) and Lambda (us-east-1) → S3 (us-east-1)]

Analysis

We used Python boto3 to upload to S3 and set the logging level to DEBUG to better observe what happens during the upload.

import logging

# Setup logging
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

def main(event, context):
    logging.getLogger().setLevel(logging.DEBUG)
    upload_to_alarm_bucket()
    return

HTTPS connection

When we call s3.put_object to upload data, the first thing the application has to do is establish a new HTTP connection.

When we uploaded a 30 KB image from us-east-1 to us-west-2, establishing the HTTPS connection alone took 250 ms! That looks odd at first, but it actually makes sense. Take a look at the inter-region latency report at https://www.cloudping.co/: the p50 round-trip time between us-west-2 and us-east-1 is 81.35 ms. According to this Cloudflare chart, establishing a new HTTPS connection takes 3 RTTs, which is roughly 240 ms (80 ms * 3).

Wait for 100 continue response

Another interesting thing is that when we put a new object to S3, we see the message Waiting for 100 continue response. According to the S3 documentation, this lets the server perform checks such as authentication or redirection before we upload the data. The client sends Expect: 100-continue in the request headers, and the server replies with either 100 Continue or 417. If we get 100, we can go on to upload the data. If we get 417, we should not upload anything.

The 100 Continue response helps us avoid sending data twice and avoid unnecessary uploads. However, it costs one more RTT, which adds another 80 ms to our latency.
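Putting the numbers together gives a back-of-the-envelope estimate of the fixed cost of one cross-region upload. This is only arithmetic on the figures quoted above (3 RTTs for the handshake, 1 RTT for 100 Continue); it ignores the transfer of the payload itself and any server-side processing, and the function name is made up for this sketch.

```python
RTT_MS = 81.35  # p50 round-trip time between us-west-2 and us-east-1 quoted above


def estimated_overhead_ms(rtt_ms, handshake_rtts=3, expect_continue_rtts=1):
    """Round trips spent before any payload byte is sent."""
    return rtt_ms * (handshake_rtts + expect_continue_rtts)


handshake_ms = RTT_MS * 3
total_ms = estimated_overhead_ms(RTT_MS)
print(f"handshake ~{handshake_ms:.0f} ms, handshake + 100-continue ~{total_ms:.0f} ms")
```

Within one region the round trip is far shorter, so the same four round trips cost much less, which matches the direction of the measurements in the table.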

Dropped connections

The last thing to analyze is dropped connections. Establishing an HTTPS connection is expensive, but we normally assume it does not happen often because HTTP keep-alive avoids recreating connections. Although the boto S3 SDK sets keep-alive in the headers automatically, we still saw many resetting dropped connection messages. After searching Stack Overflow (1, 2), we learned that the S3 server drops idle keep-alive connections after a few seconds. This probably helps keep S3 robust, because:

a lot of idle connections consume a lot of memory
idle connections can prevent others from connecting to S3 once the maximum connection limit is reached

Summary

No matter how big your data is, choosing the S3 bucket nearest to your application can reduce latency significantly.

Golang 10th Anniversary x GTG 45th: Reflections

2019-11-13T14:59:26.000Z

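Last week I was happy to attend the Golang 10th anniversary meetup. As a co-organizer of the event and a sponsor, I had to join the official Golang pre-event meeting, order the pizza, and give a talk, so there was quite a lot to do. I ordered too little pizza, so people who arrived late did not get any. I am really sorry about that, and next time I order pizza I will know what to do.

Information about this meetup is below. If you are interested in joining future events, sign up for a meetup account first:

Community, Golang Taipei Gathering: https://www.meetup.com/golang-taipei-meetup
Event page for this meetup: event link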

How I become Go GDE

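The anniversary meetup opened with a talk by Evan, the leader of the Golang community. I thought the topic suited a tenth anniversary very well: he walked us through how Golang has developed over the past ten years and how he first came to Golang. My own reason for learning Golang has something to do with Evan too. About five years ago I saw that it was the language used in the MIT 6.824 study group Evan was running. After that, more and more infra tools such as etcd, docker, k8s, and terraform turned out to be written in Golang, which got me very interested in the language.

Evan also shared his learning process and methods. Blogging and sharing are probably what I have learned most from him, although my quality and quantity are still far behind (facepalm). Sharing and writing really do make you think again about the parts you are not familiar with, because you are afraid of saying or writing something wrong, and that gives you a chance to find your own blind spots.

Slides: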

Golang taipei #45 10th birthday from Evan Lin

Understanding real world concurrency bugs in go

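My talk was about a paper that classifies bugs in Golang as blocking or non-blocking. Its conclusion is that Golang's syntax and practices do not necessarily lead to fewer blocking bugs than the traditional mutex approach, while non-blocking bugs are indeed much rarer. The paper gives real-world examples taken from popular open source projects such as Docker, Kubernetes, and etcd. I was most interested in reading the bugs themselves. Some come from unfamiliarity with Golang syntax, and they sit on paths that are rarely exercised under normal conditions. From these bugs we can learn many scenarios where mistakes are likely. See my slides or the paper's GitHub repo for details; I highly recommend reading them.

Slides:

Understanding real world concurrency bugs in go (fixed) from cc liu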

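Wrap-up

The Golang meetup has now reached its 45th session, and this one was even tweeted by the meetup Twitter account, which felt great. I hope more and more people and companies in Taiwan write Golang and learn together through the community. I also sincerely hope we can make GopherCon happen next year. I will focus more on finding sponsorship from here on, and I hope everyone will support us.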

COSCUP Talk - HA Prometheus Solution Thanos

2019-08-22T02:23:55.000Z

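Last month I attended COSCUP and gave my first talk there as a speaker. Compared with being a host three years ago, being a speaker was actually a lot easier, and it felt great to see so many familiar faces.

The SDN x Cloud Native x Golang track I joined had a lot of good topics. I shared an open source project related to CNCF and Golang: Thanos. Thanos mainly addresses Prometheus's global view, high availability, and long-term storage problems. Prometheus is the main monitoring component in the cloud native world, and thanks to the community's efforts a single Prometheus server has improved greatly in performance and storage efficiency, so it is now a very mature solution. Deploying multiple Prometheus servers, however, raises some problems. One is how to query different Prometheus servers with PromQL and aggregate or merge their data. Another is long-term storage: how to keep historical data instead of having it only on the SSD of a single Prometheus server. These problems led to open source projects such as Thanos, Cortex, and Uber M3.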

HA prometheus solution - Thanos

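My slides are here:
https://docs.google.com/presentation/d/1KBs4FxYwFL6dsz_JUbPK4ZiKXYjsaLZI21VgVLI54I4

The first few slides explain the problems of a single Prometheus. Even with federation, the architecture still has issues. Data is stored twice, in two places. Scraping the federated Prometheus servers can time out, and in many cases we do not pull everything onto the federating server. All the pressure falls on that federating machine, and we still end up querying individual Prometheus servers, which makes management inconvenient.

Thanos fulfils three wishes: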

Have a global view
Have an HA in place
Unlimited retention

Global View & HA

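See the diagram comparing Thanos with Prometheus federation. Thanos uses the sidecar pattern: even if you already have Prometheus running, the Sidecar component can read from it. The Querier component reads data from the Sidecars through the StoreAPI defined by Thanos. Querier can deduplicate and merge, so data spread across different Prometheus servers is not a problem. Deduplication identifies identical data by label, so the same line is not drawn twice.

This architecture makes HA easy: run two Prometheus servers scraping the same data. If one Prometheus goes down, Thanos Querier can still read from the other, and when both are alive, deduplication removes the duplicates.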
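To illustrate what label-based deduplication means, here is a toy sketch. It is not how Thanos implements it; the data layout, the replica label name, and the "first replica wins" rule are all assumptions made for this example. Two replicas of the same series each miss one sample, and the merged result has no gap and no duplicate line.

```python
def deduplicate(series_list, replica_label="replica"):
    """Merge series that are identical once the replica label is dropped."""
    merged = {}
    for series in series_list:
        labels = {k: v for k, v in series["labels"].items() if k != replica_label}
        key = tuple(sorted(labels.items()))
        samples = merged.setdefault(key, {"labels": labels, "samples": {}})["samples"]
        for timestamp, value in series["samples"]:
            samples.setdefault(timestamp, value)  # first replica wins per timestamp
    return [
        {"labels": entry["labels"], "samples": sorted(entry["samples"].items())}
        for entry in merged.values()
    ]


replica_a = {
    "labels": {"__name__": "up", "job": "api", "replica": "a"},
    "samples": [(0, 1.0), (15, 2.0)],             # replica a missed t=30
}
replica_b = {
    "labels": {"__name__": "up", "job": "api", "replica": "b"},
    "samples": [(0, 1.0), (30, 3.0)],             # replica b missed t=15
}
result = deduplicate([replica_a, replica_b])
print(result)
```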

Unlimited Retention

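The way Thanos implements unlimited retention is also quite simple. The Sidecar reads blocks out of Prometheus and writes them to object storage, and a Store component reads the blocks back from object storage. The clever part is that Querier always reads through the StoreAPI; with this one consistent interface, Thanos becomes very flexible.

One thing to watch out for: by default Prometheus only writes the in-memory block to local storage every two hours, and only then can the Thanos Sidecar upload it to object storage. If Prometheus crashes in between, up to two hours of data is lost. For this reason the Thanos website recommends that Prometheus running on k8s use a dedicated PVC, so that when Prometheus comes back it can recover the data from the WAL. Another pitfall is that Prometheus Remote Read originally had no streaming format, so when the Sidecar read a very large range of data it could cause Prometheus to OOM. That problem was fixed by this PR last month.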

Other Components

Querier, Sidecar, and Store already cover a lot, but Thanos offers more. Here I introduced Compactor and Ruler.

Compactor

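After its rewrite, the Prometheus TSDB provides compaction (compression). Basically, the in-memory data is compressed with delta-of-delta and XOR encoding, based on a paper Facebook published in 2015 that is worth reading if you are interested. With this compression Prometheus can easily store a large number of series and keep them for a long time. The Thanos Compactor may not look very useful at first glance, but it reuses Prometheus's compactor library and extends it with downsampling, aggregating blocks to 5-minute and 1-hour resolution. This makes reading long time ranges faster and uses less memory. For example, to look at half a year of data you do not need the fine resolution of the raw data; the 1-hour data is enough to see the trend. Compactor also manages data deletion, and having a single Compactor handle removal of data from object storage is the better practice anyway.

There is one pitfall when using Compactor: compaction on Prometheus itself must be disabled, otherwise the Thanos Compactor has to do an extra step to restore the data before it can downsample.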
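The idea behind downsampling can be shown with a few lines of code. This is a simplified sketch and not the Thanos implementation: the function name and the set of aggregates kept per bucket (count, sum, min, max) are assumptions chosen for illustration.

```python
from collections import defaultdict


def downsample(samples, resolution_seconds):
    """Aggregate (timestamp, value) samples into fixed-size time buckets."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[timestamp - timestamp % resolution_seconds].append(value)
    return [
        {"start": start, "count": len(values), "sum": sum(values),
         "min": min(values), "max": max(values)}
        for start, values in sorted(buckets.items())
    ]


# One raw sample every 15 seconds for 10 minutes: 40 points.
raw = [(t, float(t // 60)) for t in range(0, 600, 15)]
five_minute = downsample(raw, 300)
print(len(raw), "raw samples ->", len(five_minute), "downsampled points")
```

A query over a long range then reads a handful of aggregated points per series instead of every raw sample, which is where the speed and memory savings come from.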

Ruler

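The Ruler component extends alerting for Alertmanager. With Thanos, a typical Prometheus setup ships blocks older than 2 hours to object storage and shortens Prometheus's own retention. An alert rule configured on Prometheus that looks at a trend longer than 2 hours is then likely to stop working. Also, when querying across clusters, you cannot define a single rule that covers all clusters. Ruler gives us that flexibility: all alert rules can be managed centrally by Ruler.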

Conclusion

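The slides also list some deployment models recommended on the official site, showing that Thanos supports fairly complex scenarios. Quite a few large companies already run it in production, so it counts as a mature solution, and I recommend giving it a try.

In July, Thanos was donated to CNCF and officially became a CNCF Sandbox project. With more resources we can expect it to keep getting better. The community is also quick to resolve issues and respond to feedback, so for anyone who wants to get into Golang open source, I think Thanos is a good project.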

Performance tweaking for fluentd aggregator (EFK stack)

2019-05-02T10:03:09.000Z

Preface

Logging is one of the critical components for developers. Whenever something goes wrong, the first thing we do is check the logs. Fluentd is an open source data collector that provides many input and output plugins to help us organize our logging layer. There are plenty of articles describing the benefits of Fluentd, such as buffering, retries, and error handling. I do not plan to repeat them in this note. Instead, I will focus on how to tune the performance of a Fluentd aggregator, especially in the case I use most: Fluentd writing to Elasticsearch.

A typical Fluentd deployment looks like the following architecture. A sidecar Fluentd container collects application logs and forwards them to a Fluentd aggregator. With the sidecar pattern, Fluentd takes care of error handling for transient network failures. Moreover, the application can write logs to the sidecar asynchronously, which keeps it from being affected when the remote logging system becomes unstable.

To learn more about the benefits, I suggest watching this YouTube video, which gives a really good explanation.

Problem

Since many Fluentd sidecars write their logs to the Fluentd aggregator, sooner or later you will run into performance issues. For example, suppose the aggregator writes logs to Elasticsearch but the write capacity of Elasticsearch is insufficient. You will then see a lot of 503 responses from Elasticsearch, and the Fluentd aggregator has no choice but to keep records in its local buffer (in memory or in files). In the worst case the buffer space runs out and records start being dropped. Two possible solutions come to mind:

Increase the size of Elasticsearch. It is easy for me to resize it (yes, I use AWS managed Elasticsearch), but that means spending more money on Elasticsearch nodes.
Tune the Fluentd aggregator parameters to see whether the bottleneck can be relieved.

So before increasing the Elasticsearch node size, I tried option 2 to see how much performance could be gained by tuning the parameters.

Understand Buffer plugin

This picture is borrowed from the official slides. Let's see how Fluentd works internally. Here we focus only on input and buffer.

Input

When a message comes in, it is assigned a timestamp and a tag. The message itself is wrapped as a record, which is structured JSON. Together, timestamp + tag + record is called an event.

Timestamp: 2019-05-04 01:22:12
Tag: app.production
Record: {
"path": "/api/test",
"user": "john"
}

Buffer

According to the Fluentd documentation, a buffer is essentially a set of chunks. A chunk is filled with incoming events and is written to a file or to memory. The buffer stores chunks in two stages, called stage and queue. Typically the buffer has an enqueue thread that pushes chunks to the queue, and a flush thread that writes chunks to the destination.

Stage

Chunks are allocated and filled at the stage level. Some parameters change how chunks are allocated and flushed:

chunk_limit_size: the maximum size of each chunk
chunk_limit_records: the maximum number of events in each chunk
flush_interval: how often enqueue is invoked; this only applies when flush_mode is set to interval

The enqueue thread moves chunks to the queue based on size and flush interval, so we can choose whether we care more about throughput or latency (sending more data at once, or sending data more often).

Queue

The queue stores chunks, and the flush thread dequeues chunks from it.

flush_thread_count: the number of flush threads; launching more than one lets us flush chunks in parallel
flush_thread_interval: the interval at which the flush thread is invoked
flush_thread_burst_interval: how often the flush thread is invoked when the buffer queue is nearly full

Typically we increase flush_thread_count to raise throughput and to cope with transient network failures. See https://github.com/uken/fluent-plugin-elasticsearch#suggested-to-increase-flush_thread_count-why

Other parameters

total_limit_size: the total buffer size (chunk size + queue size)
overflow_action: the action to take when the buffer is full

Note

The buffer plugin is extremely useful when the output destination provides a bulk or batch API, because we can flush a whole chunk at once through that API instead of sending many separate requests. That is why many Fluentd output plugins make use of buffer plugins. For further detail, I suggest going through the source code.
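The stage and queue model and its interaction with a bulk API can be sketched in a few lines. This is a toy model written for this note, not Fluentd's implementation: the class ChunkBuffer and the bulk_write callable are hypothetical, and only the chunk_limit_records behavior is modeled.

```python
class ChunkBuffer:
    """Toy buffer: events fill a staged chunk, full chunks move to the queue,
    and each queued chunk is flushed with a single bulk call."""

    def __init__(self, chunk_limit_records, bulk_write):
        self.chunk_limit_records = chunk_limit_records
        self.bulk_write = bulk_write  # hypothetical callable taking a list of events
        self.stage = []
        self.queue = []

    def emit(self, event):
        self.stage.append(event)
        if len(self.stage) >= self.chunk_limit_records:
            self.enqueue()

    def enqueue(self):
        # In Fluentd this is also what flush_interval triggers for a partial chunk.
        if self.stage:
            self.queue.append(self.stage)
            self.stage = []

    def flush(self):
        while self.queue:
            self.bulk_write(self.queue.pop(0))  # one request per chunk


requests = []
buffer = ChunkBuffer(chunk_limit_records=3, bulk_write=requests.append)
for i in range(7):
    buffer.emit({"tag": "app.production", "record": {"n": i}})
buffer.enqueue()
buffer.flush()
print([len(chunk) for chunk in requests])
```

Seven events end up in three bulk requests instead of seven single writes, and raising the chunk limit reduces the request count further at the cost of latency.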

Tweaking elasticsearch plugins

Now that we understand how important the buffer plugin is, we can go back to tuning the Elasticsearch plugin. In my use case, I try to collect as many logs as possible with a small Elasticsearch node.

The initial setting looked like this:

<buffer>
  @type file
  path /fluentd/log/buffer

  # chunk + enqueue
  flush_mode interval
  flush_interval 1s

  flush_thread_count 2
  retry_type exponential_backoff
  retry_timeout 1h
  overflow_action drop_oldest_chunk
</buffer>

The problem is that the chunks Fluentd collects are too small, which leads to too many calls to the Elasticsearch write API. Failed Elasticsearch requests also make Fluentd queue up many chunks on disk.

From the AWS ES documentation we know that the maximum HTTP request payload varies by instance type, and for most instance types it is 100MB. So we should make the chunk limit size bigger, but keep it below 100MB. We should also increase flush_interval so that Fluentd can build a big enough chunk before pushing it to the queue. We also adjust flush_thread_count following the Elasticsearch plugin's suggestion.

The modified version:

<buffer>
  @type file
  path /fluentd/log/buffer

  total_limit_size 1024MB

  # chunk + enqueue
  chunk_limit_size 16MB
  flush_mode interval
  flush_interval 5s

  # flush thread
  flush_thread_count 8

  retry_type exponential_backoff
  retry_timeout 1h
  retry_max_interval 30
  overflow_action drop_oldest_chunk
</buffer>

Result

After I changed the settings, the Fluentd aggregator no longer complained about insertion errors or dropped the oldest chunks.
As the following pictures show, memory usage dropped dramatically, which indicates that Fluentd is now working well.

References

Fluentd Webinar: Best kept secret to unify logging on AWS, Docker, GCP, and more!

Logging directly from microservice makes log storages overloaded
  - too many connections
  - too frequent import API call

Aggregation server
  - make log infra more reliable and scalable
  - connection aggregation
  - buffering for less frequent import API calls
  - data persistence during downtime
  - retry & recovery from down time

Spring-Cleaning IAM Permissions with the IAM Access Advisor API

2019-04-08T09:37:08.000Z

Preface

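As an organization grows, a common question on AWS is whether the permissions on our IAM entities are too broad. This usually happens when a developer wants to quickly verify that an application works, and we as admins sometimes grant more than necessary. Once the project has progressed far enough, the permissions it actually needs should have stabilized, but it is hard to sit down with every project owner and review them one by one. The result violates the principle of least privilege, which is to grant only the permissions that are needed.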

IAM access advisor API

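AWS provides a set of APIs for analyzing IAM permissions, and the official AWS blog has several posts introducing them. They fit our need exactly: narrowing down permissions that are never used.

generate-service-last-accessed-details: generates last accessed data for an IAM user, role, group, or policy. Calling this API returns a JobId, and after waiting a while you can fetch the data with get-service-last-accessed-details.
get-service-last-accessed-details: takes the JobId and returns the last accessed data.
get-service-last-accessed-details-with-entities: very similar to the API above, except that you specify --service-namespace to look at one particular service.
list-policies-granting-service-access: shows which policy a permission (for a given service) comes from.

With these APIs we can write a simple script that scans for IAM entities with overly broad permissions.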

Simple Example

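Much of this example is based on config-excess-access-exorcism from trek10inc, with a few simple modifications. The program helps us quickly pinpoint which IAM role has overly broad permissions. The original repo actually aims for something fancier: setting this up as an AWS Config rule, so that AWS scans the IAM entities for us periodically.

First, the function below fetches all service permissions of the IAM entity. Note that paginated data must be fetched too, because some entities have so many permissions that it takes several API calls to get them all.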

import time


def get_iam_last_access_details(iam, arn):
    job = iam.generate_service_last_accessed_details(Arn=arn)
    job_id = job['JobId']
    while True:
        result = iam.get_service_last_accessed_details(JobId=job_id)
        if result['JobStatus'] == 'IN_PROGRESS':
            # The report is generated asynchronously; wait before polling again.
            print("Awaiting job")
            time.sleep(5)
            continue
        if result['JobStatus'] == 'FAILED':
            raise Exception(f"Could not get access information for {arn}")
        return paginate_access_details(iam, job_id, result)


def paginate_access_details(iam, job_id, result):
    # Collect the first page, then follow Marker until IsTruncated is false.
    all_service_info = result['ServicesLastAccessed'][:]
    more_data, marker = result['IsTruncated'], result.get('Marker')
    while more_data:
        page = iam.get_service_last_accessed_details(JobId=job_id, Marker=marker)
        more_data, marker = page['IsTruncated'], page.get('Marker')
        all_service_info.extend(page['ServicesLastAccessed'])
    return all_service_info

A quick test:

detail = get_iam_last_access_details(iam, "arn:aws:iam::AWS_ACCOUNT:role/service-role/AmazonEC2RunCommandRoleForManagedInstances")
pprint(detail)

The output looks like this:

[{   'ServiceName': 'Amazon CloudWatch',
        'ServiceNamespace': 'cloudwatch',
        'TotalAuthenticatedEntities': 0},
    {   'ServiceName': 'AWS Directory Service',
        'ServiceNamespace': 'ds',
        'TotalAuthenticatedEntities': 0},
    {   'ServiceName': 'Amazon EC2',
        'ServiceNamespace': 'ec2',
        'TotalAuthenticatedEntities': 0},
    {   'LastAuthenticated': datetime.datetime(2019, 4, 8, 9, 41, tzinfo=tzutc()),
        'LastAuthenticatedEntity': 'arn:aws:iam::774915305292:role/service-role/AmazonEC2RunCommandRoleForManagedInstances',
        'ServiceName': 'Amazon Message Delivery Service',
        'ServiceNamespace': 'ec2messages',
        'TotalAuthenticatedEntities': 1},
        ...
]

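With this output we can happily start analyzing. The key field is LastAuthenticated: if it is missing, the service has never been used and the permission should be removed. We can also check whether the last use was more than 180 days ago, since a permission unused for that long is probably not needed either.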

import datetime


def never_accessed_services_check(iam, arn):
    service_results = get_iam_last_access_details(iam, arn)
    # Services without LastAuthenticated have never been used by this entity.
    never_accessed = [
        x for x in service_results if 'LastAuthenticated' not in x
    ]
    if never_accessed:
        return (
            'NON_COMPLIANT',
            "Services " + ','.join(f"'{x['ServiceNamespace']}'" for x in never_accessed) + " have never been accessed",
        )

    return 'COMPLIANT', 'IAM entity has accessed all allowed services'


def no_access_in_180_days_check(iam, arn):
    service_results = get_iam_last_access_details(iam, arn)

    # LastAuthenticated is timezone-aware, so compare against an aware "now".
    utc_now = datetime.datetime.now(datetime.timezone.utc)

    older_than_180_days = [
        x for x in service_results
        if 'LastAuthenticated' in x and (utc_now - x['LastAuthenticated']) > datetime.timedelta(days=180)
    ]
    if older_than_180_days:
        return (
            'NON_COMPLIANT',
            "Services " + ','.join(f"'{x['ServiceNamespace']}'" for x in older_than_180_days) + " have not been accessed in the last 180 days",
        )

    return 'COMPLIANT', 'IAM entity has accessed all allowed services in the last 180 days'

Once we know which service is the problem, we can run aws iam list-policies-granting-service-access --arn arn:aws:iam::AWS_ACCOUNT:role/service-role/AmazonEC2RunCommandRoleForManagedInstances --service-namespaces s3 to see which policy grants access to that service.

{
    "PoliciesGrantingServiceAccess": [
        {
            "ServiceNamespace": "s3",
            "Policies": [
                {
                    "PolicyName": "AmazonEC2RoleforSSM",
                    "PolicyType": "MANAGED",
                    "PolicyArn": "arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM"
                }
            ]
        }
    ],
    "IsTruncated": false
}

The following sample code does the same thing:

def get_policies(iam, arn, service_namespace_list):
    result = iam.list_policies_granting_service_access(Arn=arn, ServiceNamespaces=service_namespace_list)
    return paginate_policies(iam, arn, service_namespace_list, result)


def paginate_policies(iam, arn, service_namespace_list, result):
    # Collect the first page, then follow Marker until IsTruncated is false.
    all_service_info = result['PoliciesGrantingServiceAccess'][:]
    more_data, marker = result['IsTruncated'], result.get('Marker')
    while more_data:
        page = iam.list_policies_granting_service_access(Arn=arn, ServiceNamespaces=service_namespace_list, Marker=marker)
        more_data, marker = page['IsTruncated'], page.get('Marker')
        all_service_info.extend(page['PoliciesGrantingServiceAccess'])
    return all_service_info

This finds the policies that need fixing, like this:

[
   {  
      'ServiceNamespace':'s3',
      'Policies':[
         {  
            'PolicyName':'AmazonEC2RoleforSSM',
            'PolicyType':'MANAGED',
            'PolicyArn':'arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM'
         }
      ]
   }
]

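Takeaways

Managing IAM takes real effort. The AWS CLI combined with the Python boto library lets us get twice the result with half the effort. I recommend trying these APIs to scan your own account; I made quite a few unexpected discoveries myself XD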

Reference

Learning to write a correct binary search with loop invariants

2019-03-28T09:12:32.000Z

Preface

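Binary search was, as I recall, one of my first homework assignments when I started programming. The program I wrote back then felt intuitive and I never questioned it, until I recently read in Programming Pearls that roughly 90% of binary search implementations are wrong, and that although the first binary search was published in 1946, the first published version without bugs did not appear until 1962.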

I’ve assigned this problem in courses at Bell Labs and IBM. Professional programmers had a couple of hours to convert the above description into a program in the language of their choice; a high-level pseudocode was fine. At the end of the specified time, almost all the programmers reported that they had correct code for the task. We would then take thirty minutes to examine their code, which the programmers did with test cases. In several classes and with over a hundred programmers, the results varied little: ninety percent of the programmers found bugs in their programs (and I wasn’t always convinced of the correctness of the code in which no bugs were found).

I was amazed: given ample time, only about ten percent of professional programmers were able to get this small program right. But they aren’t the only ones to find this task difficult: in the history in Section 6.2.1 of his Sorting and Searching, Knuth points out that while the first binary search was published in 1946, the first published binary search without bugs did not appear until 1962.

Google also has an article discussing binary search. Let's start with the following binary search program.

func Search(input_arr []int, target int) int {
    s := 0
    e := len(input_arr) - 1

    for s <= e {
        m := (s + e) / 2

        if input_arr[m] < target {
            s = m
        } else {
            e = m - 1
        }
    }

    return s
}

A trained eye immediately spots that m := (s + e) / 2 can overflow. There are two common fixes:

m := s + (e - s)/2
m := int(uint(s+e) >> 1)

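Beyond that, my example has another problem, mainly an off-by-one error. With [1,2,3,4] as input and 3 as the target, it runs into an infinite loop:

s=0, e=3, m=1 and input_arr[1] = 2 < 3, so s = m
s=1, e=3, m=2 and input_arr[2] = 3 >= 3, so e = m - 1
s=1, e=1, m=1 and input_arr[1] = 2 < 3, so s = m leaves s unchanged; s <= e stays true forever and the program loops infinitely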
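
To watch the hang without freezing a terminal, here is a minimal Python port of the buggy loop. The max_steps guard and the name buggy_search are additions for illustration only; the search logic mirrors the Go code above.

```python
def buggy_search(arr, target, max_steps=20):
    """Port of the buggy loop; returns None when it fails to terminate."""
    s, e = 0, len(arr) - 1
    steps = 0
    while s <= e:
        if steps == max_steps:
            return None  # gave up: s and e stopped changing
        steps += 1
        m = (s + e) // 2
        if arr[m] < target:
            s = m  # bug: when m == s the interval does not shrink
        else:
            e = m - 1
    return s


print(buggy_search([1, 2, 3, 4], 3))  # None: stuck at s=1, e=1
```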

These boundary conditions, the +1 / -1 adjustments, are very tricky, because several choices have to agree with each other:

whether the bound of e is len(input_arr) or len(input_arr) - 1
s <= e or s < e
s = m or s = m + 1
e = m or e = m - 1
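
One way to see how tightly these four choices are coupled is to brute-force all sixteen combinations against small sorted arrays. The harness below is a sketch, not something from the referenced articles: run_variant and is_correct are illustrative names, "correct" is taken to mean "returns the lower bound index", and a step budget stands in for detecting an infinite loop.

```python
from itertools import product


def run_variant(arr, target, closed_end, inclusive_loop, low_plus_one, high_minus_one, max_steps=64):
    """Run one combination of the four choices; None means crash or no termination."""
    low = 0
    high = len(arr) - 1 if closed_end else len(arr)
    for _ in range(max_steps):
        still_searching = low <= high if inclusive_loop else low < high
        if not still_searching:
            return low
        mid = low + (high - low) // 2
        if mid >= len(arr):
            return None  # would index past the end of the array
        if arr[mid] < target:
            low = mid + 1 if low_plus_one else mid
        else:
            high = mid - 1 if high_minus_one else mid
    return None  # step budget used up: treated as an infinite loop


def is_correct(variant):
    """Check a variant on every sorted array of length 0..3 over the values 0..2."""
    for length in range(4):
        for arr in product(range(3), repeat=length):
            if list(arr) != sorted(arr):
                continue
            for target in range(-1, 4):
                expected = sum(1 for x in arr if x < target)  # lower bound index
                if run_variant(arr, target, *variant) != expected:
                    return False
    return True


correct_variants = [v for v in product([False, True], repeat=4) if is_correct(v)]
print(correct_variants)
```

Under these assumptions only two combinations survive: the half-open form (e = len, s < e, s = m + 1, e = m) and the closed form (e = len - 1, s <= e, s = m + 1, e = m - 1). Every other mix either loops forever, reads past the array, or returns the wrong index.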

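You can even find templates online made specifically for LeetCode problems. Some people say you can simply check if input_arr[m] == target inside the loop and break out, but that clearly cannot find the lower_bound or upper_bound in a sorted array. This made me wonder whether a more systematic method could help us.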

Loop invariant to the rescue

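Fortunately, I found several articles online (all listed in the references) that helped me understand how to use a loop invariant to solve this problem. I also looked up the definition of a loop invariant in Introduction to Algorithms: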

We use loop invariants to help us understand why an algorithm is correct. We must show three things about a loop invariant:

1. Initialization: It is true prior to the first iteration of the loop.
2. Maintenance: If it is true before an iteration of the loop, it remains true before the next iteration.
3. Termination: When the loop terminates, the invariant gives us a useful property that helps show that the algorithm is correct.

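Overall it feels like mathematical induction: define a property and guarantee that it holds before the loop starts, after each loop iteration, and at termination. That gives us a correct program.

Let's look at a simple example first.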

import "math"

func findMax(a []int) int {
    max := math.MinInt // stands in for -INF

    for i := 0; i < len(a); i++ {
        if a[i] > max {
            max = a[i]
        }
    }

    return max
}

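In this example the loop invariant can be stated as: before iteration i, max is the largest of the first i elements of a (a[0..i-1]). If we verify that every iteration preserves this condition, then at termination (i = len(a)) max is the largest element of the whole array, so the algorithm is correct.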
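
The invariant can also be checked mechanically. The Python sketch below restates the example with the invariant written as assertions; the asserts are for illustration only (they make the loop quadratic) and the function name find_max is an assumption.

```python
def find_max(a):
    best = float("-inf")
    for i, value in enumerate(a):
        # Initialization / maintenance: before iteration i,
        # best is the largest of the first i elements a[0..i-1].
        assert best == max(a[:i], default=float("-inf"))
        if value > best:
            best = value
    # Termination: the invariant with i == len(a) covers the whole list.
    assert best == max(a, default=float("-inf"))
    return best


print(find_max([3, 1, 4, 1, 5]))  # 5
```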

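Writing binary search with a loop invariant

The previous example probably feels too simple to show how this helps us write programs. The binary search example below should make the benefit much clearer.

First we define the problem:

Pre condition:
In binary search we have a sorted list and look for target in it.
sorted list = [3, 5, 6, 13, 18, 21, 23]
target = 18
Post condition:
Determine whether target is in the list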

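There are four ways to define the search interval over the index i:

1. low <  i <  high, i.e. (low, high)
2. low <= i <  high, i.e. [low, high)
3. low <  i <= high, i.e. (low, high]
4. low <= i <= high, i.e. [low, high]

After reading a lot of material I learned that option 2 is the better choice: i ∈ [low,high), the half-open interval that is closed on the left and open on the right, so the right endpoint is not part of the interval. It is also the most intuitive. I highly recommend the Zhihu article 二分找查有幾種寫法? to understand why this interval is preferred; much of what follows clicked for me after reading it.

With this interval chosen, let's start with a basic binary search implementation, which makes the loop invariant easier to explain.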

func Search(input_arr []int, target int) int {
    low := 0
    high := len(input_arr)  // search interval is i ∈ [low,high)

    for low < high {
        mid := low + (high - low) / 2

        if input_arr[mid] == target {
            return mid
        } else if input_arr[mid] < target {  // target is to the right of mid
            low = mid + 1
        } else {                             // target is to the left of mid
            high = mid
        }
    }

    return -1
}

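The loop invariant we set here is closely tied to the interval:

low < high holds only while the search interval [low, high) is non-empty; once it is empty, low == high and we leave the loop
every sub range we narrow down to is again of the form [low, high)

With these conditions we can analyze the boundary condition at the end of the loop. Let's start with a small input to simulate the search interval shrinking.

Example 1

Suppose the array has a single element, [0], and the target is 1. The steps are:

The initial search interval is [0, 1): low = 0, high = 1, mid = 0
Because input_arr[mid] = 0 < 1, low = mid + 1. Now high and low are both 1 and coincide, the search interval is empty, and we leave the loop.
Return -1, meaning the array does not contain the value we want

From this example we can see that writing the exit condition as low <= high, or updating low with low = mid, would both go wrong because they break the loop invariant. The key is to understand how an empty search interval is represented correctly in this program.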

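Example 2

Now that we understand how the loop exits, let's look at a longer input, [3, 5, 6, 13, 18, 21, 23], and search for 18.

From this process we can see that, whether we go into the right or the left half, L and H always move so that the search interval stays [L, H), and the interval keeps shrinking.

Now change the 18 in the array to 19 and search for 18 again. At the end low == high, the loop exits, and the function returns -1. Just like Example 1, [low, high) has become the empty set.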
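
To make the shrinking interval concrete, here is a small Python sketch of the same half-open search that records (low, high, mid) on every iteration. The name search_trace and the trace list are additions for illustration only.

```python
def search_trace(arr, target):
    """Half-open binary search that also records (low, high, mid) per step."""
    low, high = 0, len(arr)
    trace = []
    while low < high:
        mid = low + (high - low) // 2
        trace.append((low, high, mid))
        if arr[mid] == target:
            return mid, trace
        if arr[mid] < target:
            low = mid + 1  # keep [low, high) by skipping mid
        else:
            high = mid     # mid is excluded because the right end is open
    return -1, trace


print(search_trace([3, 5, 6, 13, 18, 21, 23], 18))
print(search_trace([3, 5, 6, 13, 19, 21, 23], 18))
```

Both runs visit the intervals [0, 7), [4, 7) and [4, 5). The first returns index 4; in the second the interval collapses to low == high == 4 and the result is -1.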

Writing lower bound with a loop invariant

The binary search above can only tell whether target is in the sorted array. It cannot find a lower bound or an upper bound. The example below shows the lower bound and upper bound for target 2.

            upper bound
                +
[0, 1, 2, 2, 2, 3, 4, 5]
       ^
       lower bound

Finding the lower bound only takes a small change to our binary search:

func Search(input_arr []int, target int) int {
    low := 0
    high := len(input_arr)  // search interval is i ∈ [low,high)

    for low < high {
        mid := low + (high - low) / 2

        if input_arr[mid] < target {  
            low = mid + 1
        } else {                     
            high = mid
        }
    }

    return low
}

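The loop invariant here is similar to the previous one, with a small variation:

low < high holds only while the search interval [low, high) is non-empty; once it is empty, low == high and we leave the loop
every sub range we narrow down to is again of the form [low, high)
- every value in the right region [high', high) is >= target
- every value in the left region [low, low') is < target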

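Now let the pictures tell the story:

As before, the search interval stays [L, H) (blue).

Because array[mid] >= target, we take H = mid. This creates the right region [high', high) (pink). Every value in it is >= target, so target may well fall inside this region; it matters when we pick the answer at the end.

Next array[mid] < target, which means every value in [low, mid] is less than target, so we set L = mid + 1. The resulting region [low, low') (green) then has the property we defined. The blue interval is still [low', high'); our goal is to shrink it until it disappears while keeping the loop invariant.

Because array[mid] == target, we keep extending the right region; remember that every value in it is >= target.

At the end L and H coincide, as in the earlier examples. Returning the index L or H gives the same result, and you can think of it as taking the first value of the pink region, which is exactly the lower bound we are looking for.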
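
The coloured regions can be written down as assertions. The Python sketch below keeps the same loop, checks on every iteration that everything left of low is < target and everything from high onward is >= target, and compares the result with the standard library's bisect. The upper_bound variant (changing < to <=) is an addition that the text above does not spell out.

```python
import bisect


def lower_bound(arr, target):
    low, high = 0, len(arr)
    while low < high:
        # Invariant: green region arr[:low] < target, pink region arr[high:] >= target.
        assert all(x < target for x in arr[:low])
        assert all(x >= target for x in arr[high:])
        mid = low + (high - low) // 2
        if arr[mid] < target:
            low = mid + 1
        else:
            high = mid
    return low


def upper_bound(arr, target):
    low, high = 0, len(arr)
    while low < high:
        mid = low + (high - low) // 2
        if arr[mid] <= target:  # the only change: values equal to target go left
            low = mid + 1
        else:
            high = mid
    return low


values = [0, 1, 2, 2, 2, 3, 4, 5]
print(lower_bound(values, 2), bisect.bisect_left(values, 2))   # 2 2
print(upper_bound(values, 2), bisect.bisect_right(values, 2))  # 5 5
```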

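Takeaways

Binary search has many variations, but once you know what your search interval looks like you are far less likely to get stuck fiddling with +1 and -1. The lower bound version written last also works as a general binary search, and it is the simpler, less error-prone version. To really understand how the loop invariant defines the interval and how moving low and high produces the next search interval, I still recommend drawing it out with pen and paper. You can also take the closed interval A[low] <= A[i] <= A[high] as an exercise and work out how the program must be written. I wrote the text and figures of this post rather quickly, so please point out anything unclear or wrong :)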

Reference

AWS Shuffle Sharding

2019-03-04T01:37:42.000Z

Preface

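Colm MacCárthaigh is a Senior Principal Engineer at AWS. If you follow his Twitter account you will see many interesting details of AWS's internal architecture design. Recently someone in the og-aws.slack.com discussion asked why an AWS status alert does not necessarily affect every customer in that region. I picked an alert at random: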

Beginning at 11:54 AM PST some Amazon Aurora clusters experienced increased database create times and cluster unavailability in the AP-SOUTHEAST-2 Region. Elevated create times were resolved at 2:27 PM PST, at which point some existing clusters continued to experience availability issues. As of 5:35 PM PST both issues have been resolved and the service is operating normally. In total, the event impacted a little less than 3% of the Aurora databases in the region.

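As you can see, this issue affected only about 3% of Aurora databases, and AWS advises every user to check the Personal Health Dashboard to see whether they were really affected. This made many people curious about how AWS's underlying layers achieve isolation while offering multi-tenancy, so that a few failing servers do not affect everyone. This post expands on Colm MacCárthaigh's tweets; if you are interested you can also read them directly.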

It's no good sharing everything if a single "noisy neighbor" can cause everyone to have a bad experience. We want the opposite! At AWS we are super into compartmentalization and isolation, and mature remediation procedures. Shuffle Sharding is one of our best techniques. O.k. ..
— Colm MacCárthaigh (@colmmacc) August 28, 2018

Shuffle Sharding

Colm MacCárthaigh actually introduced the concept of Shuffle Sharding on the AWS Architecture Blog back in 2014. The examples below are taken from the re:Invent 2018 slides.

Basic architecture

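Suppose you have a service with eight nodes behind a load balancer, and eight different customers arrive. A request from Diamond enters the system and, for some reason, perhaps it hits a bug or is a workload that happens to knock a node over, takes one node down. As luck would have it, because Diamond never receives the response it wants, it keeps retrying and takes down the other nodes as well. The term to discuss here is Blast Radius: the range of customers caught in the explosion. In this example:

every customer is blown away! This is the worst case, and AWS works hard to avoid it when building its services.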

Cell-based architecture

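To keep Diamond from wrecking all the nodes, a simple approach is to split the nodes into groups, that is, cells of two nodes each, and assign two customers to each cell. Diamond can then take down at most the two nodes in its own cell, and in this example only the unlucky customer Heart goes down with it. The Blast Radius improves by 4x, from 100% down to 25%: only 25% of customers are affected.

AWS calls this approach cellularization internally and applies it to many services, for example isolated regions and availability zones.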

Shuffle Sharding

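With that in mind, back to Shuffle Sharding. The idea is very simple: customers do not have to live in a fixed cell. The goal is only to spread each customer's requests over two different nodes, so we pick those two nodes at random. The figure below shows how powerful this is: Diamond still takes down two nodes, but the other customers on those nodes are Heart and Club, and each of them has another node that can still serve their requests. Heart and Club can therefore achieve fault tolerance through retries, and the overall Blast Radius drops to a single customer.

The figure is simplified. Choosing 2 nodes out of 8 gives 28 combinations, so there are 28 possible assignments, and the Blast Radius is computed by considering the probability that a given combination goes down:

The slides also provide a table showing that with Shuffle Sharding the % of customers impacted drops to about 3% (the full-overlap row below)! This is also why AWS recommends the Personal Health Dashboard when a service has problems: the blast radius really is not that wide.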

Overlap	% customer impacted
0	53.6%
1	42.8%
2	3.6%

This is already impressive, but AWS has so many customers that it still cannot tolerate such a high impact rate. So AWS designs for 100 nodes with a shard size of 5. Let's do the math again:

Overlap	% customer impacted
0	77%
1	21%
2	1.8%
3	0.06%
4	0.0006%
5	0.0000013%

The share of customers whose shard fully overlaps the failed one drops to 0.0000013%!
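
Both tables can be reproduced with a few lines of Python. The sketch assumes that every customer's shard is chosen uniformly at random from all possible combinations, so the number of nodes a customer shares with one failed shard follows a hypergeometric distribution; overlap_distribution is an illustrative name, not something from the slides.

```python
from math import comb


def overlap_distribution(nodes, shard_size):
    """P(a random shard shares exactly k nodes with one fixed shard), k = 0..shard_size."""
    total = comb(nodes, shard_size)
    return [
        comb(shard_size, k) * comb(nodes - shard_size, shard_size - k) / total
        for k in range(shard_size + 1)
    ]


for nodes, shard_size in [(8, 2), (100, 5)]:
    print(f"{nodes} nodes, shard size {shard_size}: {comb(nodes, shard_size)} possible shards")
    for k, p in enumerate(overlap_distribution(nodes, shard_size)):
        print(f"  overlap {k}: {p * 100:.7f}%")
```

For 8 nodes and shard size 2 there are 28 shards, and full overlap happens for 1 of them, about 3.6%. For 100 nodes and shard size 5 it is 1 in 75,287,520. Customers with only partial overlap still have healthy nodes left, which is why client retries matter.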

Conclusion

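With Shuffle Sharding, client-side retry is also important, and the math lets you work out the probabilities for a given number of nodes and shard size before designing your architecture. Seen through Shuffle Sharding, the way AWS handles its own internal deployments looks very reasonable and safe: a deployment starts in one AZ of one region, and if monitoring shows no problems it is slowly rolled out to other AZs and then to other regions, so that if something goes wrong very few customers are affected. Understanding AWS's underlying layers also shows why Multi-AZ deployment matters so much: this kind of technique, combined with an application that retries well, raises the reliability of the whole service.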

Reference

Installing Prometheus on EKS with Helm

2019-02-25T08:39:36.000Z

Preface

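I have been playing with EKS for a while. EKS is essentially AWS's Managed Kubernetes: it manages the Kubernetes master nodes for you and you only manage the worker nodes, so many services can still be installed from their existing helm charts. This post covers how to install the Prometheus packages on EKS with helm, plus some basic configuration.

This post covers:

Installing Prometheus-operator with helm, then deploying prometheus & alertmanager through the Operator
Setting helm values to avoid some errors on EKS
Some troubleshooting tips

Installing prometheus with helm

Because the coreos/prometheus-operator helm chart has been deprecated, we install stable/prometheus-operator instead. This chart bundles quite a few components, such as prometheus & alertmanager, and also installs the node-exporter and other pieces that prometheus needs for monitoring, so it is a big package. After installing it, I recommend going back to see exactly what was installed.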

Check the stable/prometheus-operator versions

$ helm search -l stable/prometheus-operator

The latest Chart version at the moment is 4.0.0

NAME                             CHART VERSION   APP VERSION     DESCRIPTION
stable/prometheus-operator      4.0.0           0.29.0          Provides easy monitoring definitions for Kubernetes servi...
stable/prometheus-operator      3.0.0           0.29.0          Provides easy monitoring definitions for Kubernetes servi...
stable/prometheus-operator      2.6.0           0.27.0          Provides easy monitoring definitions for Kubernetes servi...

Install it; here we name the release prom-op

$ helm install --name prom-op --namespace monitoring stable/prometheus-operator

The following command shows what was installed

$ kubectl --namespace monitoring get pods

NAME                                                     READY   STATUS    RESTARTS   AGE
alertmanager-prom-op-prometheus-operato-alertmanager-0   2/2     Running   0          1m
prom-op-grafana-5c59ddfb9d-zqfqt                         2/2     Running   0          2m
prom-op-kube-state-metrics-76786cc9b4-8q4bj              1/1     Running   0          2m
prom-op-prometheus-node-exporter-6jclc                   1/1     Running   0          2m
prom-op-prometheus-node-exporter-bxr49                   1/1     Running   0          2m
prom-op-prometheus-node-exporter-mxtht                   1/1     Running   0          2m
prom-op-prometheus-node-exporter-xd54m                   1/1     Running   0          2m
prom-op-prometheus-operato-operator-6cbf5d5cfd-z6fz4     1/1     Running   0          2m
prometheus-prom-op-prometheus-operato-prometheus-0       3/3     Running   1          1m

My k8s cluster runs 4 nodes, so 4 node-exporter pods are installed, along with prometheus-operator, alertmanager, grafana and kube-state-metrics.

Customizing the Chart

With port forward you can open localhost:9090 and look at the information in prometheus

$ kubectl port-forward svc/prom-op-prometheus-operato-prometheus -n monitoring 9090

There we will see the following errors

Because we cannot monitor the EKS master nodes, the services on the master, such as etcd, kube-apiserver, controller-manager and kube-scheduler, all show errors in prometheus. This is why we need to customize the chart values.

$ curl -o values.yaml https://raw.githubusercontent.com/helm/charts/master/stable/prometheus-operator/values.yaml

After editing the file, apply the overrides with

$ helm upgrade --install prom-op stable/prometheus-operator --namespace monitoring -f values.yaml

Here are the parts I changed. Monitoring for the master services such as etcd, kube-apiserver, controller-manager and kube-scheduler needs to be turned off

kubeApiServer:
  enabled: false

kubeControllerManager:
  enabled: false

kubeEtcd:
  enabled: false

kubeScheduler:
  enabled: false

For kubelet, according to this issue, we need to enable https when running on EKS

kubelet:
  enabled: true
  namespace: kube-system

  serviceMonitor:
    https: true

The coreDns label on EKS is a bit odd: it is still k8s-app: kube-dns rather than coredns

coreDns:
  enabled: true
  service:
    port: 9153
    targetPort: 9153
    selector:
      k8s-app: kube-dns

Also remember to adjust some of the resource settings

resources:
  requests:
    memory: 400Mi

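Configuring additional scrape config

Besides monitoring services inside Kubernetes, Prometheus also offers ways to scrape services outside it. For programs running on existing EC2 instances, for example, we can pull data through the corresponding EC2 service discovery. To do this we need to set the additional config.

It is simple: in the chart values, take the original additionalScrapeConfigs

additionalScrapeConfigs: []

and rewrite it with the config you want to attach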

additionalScrapeConfigs:
  - job_name: placeholder
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
        - https://sentry.umbocv.com/_health/?full

This approach requires editing the helm chart values every time. There is another way that changes the config directly and lets the prometheus config reloader pick it up. Running

kubectl get secret -n monitoring

shows

NAME                              TYPE     DATA   AGE
prom-op-prometheus-scrape-confg   Opaque   1      30s

We can change the additional scrape config by editing the content of this secret directly. For the example above, additional-scrape-configs.yaml looks like this

- job_name: placeholder
  metrics_path: /probe
  params:
    module: [http_2xx]
  static_configs:
    - targets:
      - https://sentry.umbocv.com/_health/?full

Then use the following commands to turn additional-scrape-configs.yaml into a secret yaml that k8s understands, and apply it

$ kubectl create secret generic prom-op-prometheus-scrape-confg --from-file=additional-scrape-configs.yaml --dry-run -oyaml > prometheus-additional-scrape-configs.yaml
$ kubectl apply -f prometheus-additional-scrape-configs.yaml -n monitoring
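
It helps to know what such a secret holds: a Kubernetes Secret stores each file under data as a base64-encoded string. The Python sketch below builds an equivalent manifest by hand to show the shape. build_secret_manifest is a hypothetical helper, the two-line scrape config is a shortened stand-in, and the real kubectl output may contain additional metadata fields.

```python
import base64


def build_secret_manifest(name, namespace, filename, content):
    """Return a dict shaped like a v1 Secret holding one base64-encoded file."""
    encoded = base64.b64encode(content.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "type": "Opaque",
        "metadata": {"name": name, "namespace": namespace},
        "data": {filename: encoded},
    }


scrape_config = "- job_name: placeholder\n  metrics_path: /probe\n"
manifest = build_secret_manifest(
    "prom-op-prometheus-scrape-confg",
    "monitoring",
    "additional-scrape-configs.yaml",
    scrape_config,
)
# Decoding the data field gives back the original file content.
decoded = base64.b64decode(manifest["data"]["additional-scrape-configs.yaml"]).decode("utf-8")
print(decoded)
```

The same round trip explains why the content of a secret has to be base64-decoded when you inspect it with kubectl.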

Configuring the alertmanager template

After deploying with the prometheus-operator helm chart, the UI under status -> rules shows many built-in prometheus rules. To send these alerts to slack you also need to set the alertmanager route config; the built-in config does nothing and routes everything to null

config:
  global:
    resolve_timeout: 5m
  route:
    group_by: ['job']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'null'
    routes:
    - match:
        alertname: Watchdog
      receiver: 'null'
  receivers:
  - name: 'null'

Here we can use Monzo's alertmanager slack template as a reference. Its advantage is that it groups alerts into a single message, and it understands the format of the built-in rules. For example, the rule below uses the label severity: critical, and its annotations are message & runbook_url

alert: KubeAPIDown
expr: absent(up{job="apiserver"}
  == 1)
for: 15m
labels:
  severity: critical
annotations:
  message: KubeAPI has disappeared from Prometheus target discovery.
  runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown

With Monzo's template we can first configure the alertmanager receivers

receivers:
###################################################
## Slack Receivers
- name: slack-code-owners
  slack_configs:
  - channel: '#{{- template "slack.monzo.code_owner_channel" . -}}'
    send_resolved: true
    title: '{{ template "slack.monzo.title" . }}'
    icon_emoji: '{{ template "slack.monzo.icon_emoji" . }}'
    color: '{{ template "slack.monzo.color" . }}'
    text: '{{ template "slack.monzo.text" . }}'
    actions:
    - type: button
      text: 'Runbook :green_book:'
      url: '{{ (index .Alerts 0).Annotations.runbook_url }}'
    - type: button
      text: 'Query :mag:'
      url: '{{ (index .Alerts 0).GeneratorURL }}'
    - type: button
      text: 'Dashboard :grafana:'
      url: '{{ (index .Alerts 0).Annotations.dashboard }}'
    - type: button
      text: 'Silence :no_bell:'
      url: '{{ template "__alert_silence_link" . }}'
    - type: button
      text: '{{ template "slack.monzo.link_button_text" . }}'
      url: '{{ .CommonAnnotations.link_url }}'

In the template defined below, .Annotations.message is what gets displayed for each alert received, so the matching rule alerts can be sent to slack.

# This builds the silence URL.  We exclude the alertname in the range
# to avoid the issue of having trailing comma separator (%2C) at the end
# of the generated URL
{{ define "__alert_silence_link" -}}
    {{ .ExternalURL }}/#/silences/new?filter=%7B
    {{- range .CommonLabels.SortedPairs -}}
        {{- if ne .Name "alertname" -}}
            {{- .Name }}%3D"{{- .Value -}}"%2C%20
        {{- end -}}
    {{- end -}}
    alertname%3D"{{ .CommonLabels.alertname }}"%7D
{{- end }}

{{ define "__alert_severity_prefix" -}}
    {{ if ne .Status "firing" -}}
    :lgtm:
    {{- else if eq .Labels.severity "critical" -}}
    :fire:
    {{- else if eq .Labels.severity "warning" -}}
    :warning:
    {{- else -}}
    :question:
    {{- end }}
{{- end }}

{{ define "__alert_severity_prefix_title" -}}
    {{ if ne .Status "firing" -}}
    :lgtm:
    {{- else if eq .CommonLabels.severity "critical" -}}
    :fire:
    {{- else if eq .CommonLabels.severity "warning" -}}
    :warning:
    {{- else if eq .CommonLabels.severity "info" -}}
    :information_source:
    {{- else -}}
    :question:
    {{- end }}
{{- end }}


{{/* First line of Slack alerts */}}
{{ define "slack.monzo.title" -}}
    [{{ .Status | toUpper -}}
    {{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{- end -}}
    ] {{ template "__alert_severity_prefix_title" . }} {{ .CommonLabels.alertname }}
{{- end }}


{{/* Color of Slack attachment (appears as line next to alert )*/}}
{{ define "slack.monzo.color" -}}
    {{ if eq .Status "firing" -}}
        {{ if eq .CommonLabels.severity "warning" -}}
            warning
        {{- else if eq .CommonLabels.severity "critical" -}}
            danger
        {{- else -}}
            #439FE0
        {{- end -}}
    {{ else -}}
    good
    {{- end }}
{{- end }}


{{/* Emoji to display as user icon (custom emoji supported!) */}}
{{ define "slack.monzo.icon_emoji" }}:prometheus:{{ end }}

{{/* The test to display in the alert */}}
{{ define "slack.monzo.text" -}}
    {{ range .Alerts }}
        {{- if .Annotations.message }}
            {{ .Annotations.message }}
        {{- end }}
        {{- if .Annotations.description }}
            {{ .Annotations.description }}
        {{- end }}
    {{- end }}
{{- end }}



{{- /* If none of the below matches, send to #monitoring-no-owner, and we 
can then assign the expected code_owner to the alert or map the code_owner
to the correct channel */ -}}
{{ define "__get_channel_for_code_owner" -}}
    {{- if eq . "platform-team" -}}
        platform-alerts
    {{- else if eq . "security-team" -}}
        security-alerts
    {{- else -}}
        monitoring-no-owner
    {{- end -}}
{{- end }}

{{- /* Select the channel based on the code_owner. We only expect to get
into this template function if the code_owners label is present on an alert.
This is to defend against us accidentally breaking the routing logic. */ -}}
{{ define "slack.monzo.code_owner_channel" -}}
    {{- if .CommonLabels.code_owner }}
        {{ template "__get_channel_for_code_owner" .CommonLabels.code_owner }}
    {{- else -}}
        monitoring
    {{- end }}
{{- end }}

{{ define "slack.monzo.link_button_text" -}}
    {{- if .CommonAnnotations.link_text -}}
        {{- .CommonAnnotations.link_text -}}
    {{- else -}}
        Link
    {{- end }} :link:
{{- end }}

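There is one more important step that had me stuck for quite a while. The templates are also defined in the values.yaml of the prometheus-operator helm chart, and after defining them you must add

templates:
- '/etc/alertmanager/config/*.tmpl'

A rough example looks like this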

config:
  global:
    resolve_timeout: 5m
  ... (omitted)

templates:
  - '/etc/alertmanager/config/*.tmpl'

templateFiles:
    template_monzo.tmpl: |-

      {{ define "__alert_silence_link" -}}
          {{ .ExternalURL }}/#/silences/new?filter=%7B
          {{- range .CommonLabels.SortedPairs -}}
              {{- if ne .Name "alertname" -}}
                  {{- .Name }}%3D"{{- .Value -}}"%2C%20
              {{- end -}}
          {{- end -}}
          alertname%3D"{{ .CommonLabels.alertname }}"%7D
      {{- end }}
      ... (omitted)

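Troubleshoot

If you never receive an alert, the alertmanager template may be wrong. Use kubectl logs -f po/ -n monitoring -c alertmanager to check whether any error log is produced.

To check the syntax of an alertmanager template you can test with the script below, which is mostly taken from this gist. It lets you verify the template as you edit it, without having to produce real failure conditions.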

#!/bin/bash

name=$RANDOM
url='http://localhost:9093/api/v1/alerts'

echo "firing up alert $name" 

# change url o
curl -XPOST $url -d "[{ 
\"status\": \"firing\",
\"labels\": {
\"alertname\": \"$name\",
\"service\": \"my-service\",
\"severity\":\"warning\",
\"instance\": \"$name.example.net\"
},
\"annotations\": {
\"summary\": \"High latency is high!\"
},
\"generatorURL\": \"http://prometheus.int.example.net/\"
}]"

echo ""

echo "press enter to resolve alert"
read

echo "sending resolve"
curl -XPOST $url -d "[{
\"status\": \"resolved\",
\"labels\": {
\"alertname\": \"$name\",
\"service\": \"my-service\",
\"severity\":\"warning\",
\"instance\": \"$name.example.net\"
},
\"annotations\": {
\"summary\": \"High latency is high!\"
},
\"generatorURL\": \"http://prometheus.int.example.net/\"
}]"

Or use

#!/bin/bash

alerts='[
  {
    "labels": {
       "alertname": "instance_down",
       "instance": "example1"
     },
     "annotations": {
        "info": "The instance example1 is down",
        "summary": "instance example1 is down"
      }
  }
]'

URL="https://alertmanager.mydomain.com"

curl -XPOST -d"$alerts" $URL/api/v1/alerts

You can check whether the content of your secret is correct with

kubectl get secret -n monitoring alertmanager-prom-op-alertmanager -o go-template='{{ index .data "alertmanager.yaml" }}' | base64 --decode

Completely removing prometheus-operator

$ helm delete --purge 
$ kubectl delete crd prometheuses.monitoring.coreos.com
$ kubectl delete crd prometheusrules.monitoring.coreos.com
$ kubectl delete crd servicemonitors.monitoring.coreos.com
$ kubectl delete crd alertmanagers.monitoring.coreos.com

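Afterword

prometheus-operator used to have another pitfall: a servicemonitor had to carry the label release: before the operator would actually pick it up. The 4.0.0 update fixed this annoyance, so I recommend checking regularly what has changed. The prometheus & alertmanager versions also move fast. I will keep updating this post as I think of more.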

Reference

Deploy Prometheus Operator With Thanos

2019-02-10T13:53:12.000Z

Preface

Prometheus is widely adopted as a standard monitoring tool for Kubernetes because it provides many useful features such as dynamic service discovery, powerful queries, and seamless alert notification integration. Many applications and client libraries support Prometheus, which makes life easier for operations. Although things work well with Prometheus, a plain Prometheus deployment cannot easily provide high availability or long-term storage.

Thanos comes to the rescue

Thanos, developed by Improbable, integrates transparently with Prometheus and solves the HA and long-term storage issues without hurting performance. The idea of Thanos is to run a sidecar component next to Prometheus; the sidecar interacts with Prometheus to upload or query metrics. Prometheus operator also supports Thanos natively, which makes it easier to deploy a Prometheus cluster along with Thanos. This solution is quite elegant if you already use prometheus operator to provision your Prometheus cluster.

This article includes the following contents

How to deploy the prometheus operator on the kubernetes
How to deploy the thanos sidecar w/ prometheus.
Achieve HA: using thanos querier
Query historical data: thanos store
Reduce data size: thanos compactor

Install Prometheus through Prometheus operator

There are plenty of articles explaining why to adopt prometheus-operator to provision Prometheus. I recommend reading the references [2] if you are not familiar with prometheus-operator.

1. Install Helm in your environment

MacOS: brew install kubernetes-helm
Linux: sudo snap install helm

2. Initialize helm and install tiller

$ helm init

3. Install coreos prometheus operator

Note that we use stable/prometheus-operator because the coreos/prometheus-operator helm chart is going to be deprecated. Later we need to modify the chart values to provision the Prometheus cluster along with the Thanos sidecar. To install a stable helm chart with custom values, download values.yaml from the GitHub repo.

In this example we name the prometheus operator release prom-op and install it in the monitoring namespace.

$ helm upgrade --install prom-op stable/prometheus-operator --namespace monitoring -f values.yaml

Use the following command to verify that prometheus-operator was provisioned successfully.

$ kubectl --namespace monitoring get pods -l "release=prom-op"

Thanos Deployment

NEED TO KNOW
prometheus-operator should be greater than 0.28.0 to support Thanos 2.0

Thanos Architecture

Official Architecture of Thanos

Our deployment steps

According to the above picture, there are several components of thanos:

Sidecar
Querier
Store
Compactor

The deployment steps:

Deploy Prometheus with the Thanos Sidecar.
Deploy Thanos Querier, which talks to the Prometheus Sidecar through the gossip protocol.
Make sure the Thanos Sidecar can upload Prometheus metrics to the given S3 bucket.
Set up the Thanos Store to retrieve data from long-term storage.
Set up the Compactor to shrink historical data.

Install Thanos sidecar

To install the Thanos sidecar along with prometheus-operator, specify the Thanos sidecar in the chart values as follows:

thanos:
    baseImage: improbable/thanos
    version: v0.2.1
    peers: thanos-peers.monitoring.svc:10900
    objectStorageConfig:
      key: thanos.yaml
      name: thanos-objstore-config

objectStorageConfig can be configured through configuration file thanos.yaml

type: s3
config:
  bucket: test-prometheus-thanos
  endpoint: s3.us-west-2.amazonaws.com
  encryptsse: true

Create the Kubernetes secret with the following command

kubectl -n monitoring create secret generic thanos-objstore-config --from-file=thanos.yaml=/tmp/thanos-config.yaml

Warning: endpoint must be set so that Thanos knows which region the bucket is in.

Verify Thanos Sidecar

$ kubectl get po -n monitoring

$ kubectl describe po/prometheus-prom-op-prometheus-0 -n monitoring

If everything goes well, the prometheus pod contains a thanos-sidecar container

thanos-sidecar:
  Container ID:  docker://e52df9fda7b0c43eea297d273169cf33e4aa49780fd8d5192c23f497c78b2007
  Image:         improbable/thanos:v0.2.1
  Image ID:      docker-pullable://improbable/thanos@sha256:4ee0774316a5d57f78d243fe4afb10e9e889670d3facfdda70aae76f7165a16b
  Ports:         10902/TCP, 10901/TCP, 10900/TCP
  Host Ports:    0/TCP, 0/TCP, 0/TCP
  Args:
    sidecar
    --prometheus.url=http://127.0.0.1:9090
    --tsdb.path=/prometheus
    --cluster.address=[$(POD_IP)]:10900
    --grpc-address=[$(POD_IP)]:10901
    --cluster.peers=thanos-peers.monitoring.svc.cluster.local:10900
  State:          Running
    Started:      Fri, 01 Feb 2019 12:24:38 +0800
  Ready:          True
  Restart Count:  0
  Environment:
    POD_IP:   (v1:status.podIP)
  Mounts:
    /prometheus from prometheus-prom-op-prometheus-db (rw)
    /var/run/secrets/kubernetes.io/serviceaccount from prom-op-prometheus-token-7gvcp (ro)

If you check the sidecar log, you will see messages like the following.

$ kubectl logs -f po/prometheus-prom-op-prometheus-0 -n monitoring -c thanos-sidecar

level=info ts=2019-02-01T09:33:15.173007261Z caller=flags.go:90 msg="StoreAPI address that will be propagated through gossip" address=10.11.29.191:10901
level=info ts=2019-02-01T09:33:20.178094001Z caller=main.go:256 component=sidecar msg="disabled TLS, key and cert must be set to enable"
level=info ts=2019-02-01T09:33:20.178211091Z caller=factory.go:39 msg="loading bucket configuration"
level=info ts=2019-02-01T09:33:20.17855779Z caller=sidecar.go:280 msg="starting sidecar" peer=
level=info ts=2019-02-01T09:33:20.179145313Z caller=sidecar.go:220 component=sidecar msg="Listening for StoreAPI gRPC" address=[10.11.29.191]:10901
level=info ts=2019-02-01T09:33:20.179187469Z caller=main.go:308 msg="Listening for metrics" address=0.0.0.0:10902
level=info ts=2019-02-01T12:33:50.282222532Z caller=shipper.go:201 msg="upload new block" id=01D2MGSADK1860F4APSD7CFZ7C

Install Thanos Querier

The Thanos Querier layer can retrieve metrics from all Prometheus instances at once. It is fully compatible with the original Prometheus PromQL and HTTP APIs, so it can be used with Grafana.

Since there are many yaml files, I put everything in my github repo

$ cd thanos
$ kubectl apply -f querier-deployment.yaml
$ kubectl apply -f querier-service.yaml
$ kubectl apply -f querier-service-monitor.yaml
$ kubectl apply -f thanos-peers-svc.yaml

Install Thanos Store

Thanos Store works with the querier to retrieve historical data from the given bucket. It joins the Thanos cluster on startup.

$ kubectl apply -f thanos-store.yaml

Install Thanos Compactor

Thanos Compactor downsamples all your historical data. It is a really useful component that reduces data size. I recommend reading this well-explained article.

$ kubectl apply -f thanos-compactor.yaml
$ kubectl apply -f thanos-compactor-service.yaml
$ kubectl apply -f thanos-compactor-service-monitor.yaml

Troubleshooting

Peering service didn’t set up properly

You will see this kind of message from the Thanos components

level=error ts=2019-02-01T05:11:40.805153721Z caller=cluster.go:269 component=cluster msg="Refreshing memberlist" err="join peers thanos-peers.monitoring.svc.cluster.local:10900 : 1 error occurred:\n\t* Failed to resolve thanos-peers.monitoring.svc.cluster.local:10900: lookup thanos-peers.monitoring.svc.cluster.local on 172.20.0.10:53: no such host\n\n"

$ kubectl apply -f thanos-peers-svc.yaml