The difference between innodb_stats_persistent_sample_pages and innodb_stats_transient_sample_pages - Daily Report #148

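Come to think of it, starting with "Windows 95", an OS that felt like the edge of the world,
I have used all sorts of OS versions over the years,
but lately the feeling that "version upgrades are kind of exciting" seems to have faded a little.
That is just me, of course.
Today, a few words on a parameter that needs attention when migrating from MySQL 5.5 to 5.6.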

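This is about
innodb_stats_sample_pages
which I touched on briefly in my post on MySQL index operations.
In MySQL 5.5, the InnoDB engine estimates index statistics by sampling this many index pages.
As a result, the index statistics shift "a tiny bit" every time they are recalculated.
From the query optimizer's point of view, that is a bit of a problem.
In MySQL 5.6 this parameter is deprecated and has been split into the following two.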

innodb_stats_persistent_sample_pages
innodb_stats_transient_sample_pages

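persistent and transient:
the names speak for themselves.
In short, you can now choose explicitly between the two.
With transient, index statistics are computed afresh whenever they are needed;
with persistent, statistics computed at "a certain point in time" are stored and reused.
Which of the two is used can be switched with the innodb_stats_persistent parameter.
And, for what it is worth, MySQL seems to recommend persistent.
See the following: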
http://dev.mysql.com/doc/refman/5.6/en/innodb-persistent-stats.html
The documentation is dense, so here are just the important parts.

First, when are the statistics updated?

The configuration option innodb_stats_auto_recalc determines whether the statistics are calculated
automatically whenever a table undergoes substantial changes (to more than 10% of the rows).

So the statistics are apparently recalculated in the background once more than 10% of a table's rows have changed.

If innodb_stats_auto_recalc is disabled, ensure the accuracy of optimizer statistics
by issuing the ANALYZE TABLE statement for each applicable table after making substantial changes to indexed columns.
You might run this statement in your setup scripts after representative data has been loaded into the table, and run it periodically after DML operations significantly change the contents of indexed columns, or on a schedule at times of low activity. When a new index is added to an existing table, index statistics are calculated and added to the innodb_index_stats table regardless of the value of innodb_stats_auto_recalc.

But if innodb_stats_auto_recalc is not ON, it looks like you have to do it by hand with ANALYZE TABLE.
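For the manual case, here is a minimal sketch of generating the ANALYZE TABLE statements to run after a big load. The helper names and the table names are made up for illustration; only the ANALYZE TABLE statement itself comes from the documentation quoted above.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def build_analyze_statements(tables):
    """Return one ANALYZE TABLE statement per table name."""
    return [f"ANALYZE TABLE {quote_identifier(t)};" for t in tables]


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for stmt in build_analyze_statements(["users", "photos"]):
        print(stmt)
```

The output can be piped into the mysql client or run from a scheduled job during a quiet period.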
In MySQL 5.5 the default number of sampled pages was 8,
whereas innodb_stats_persistent_sample_pages defaults to a higher value of 20.
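A reasonable number of pages to sample can be borrowed from the sample size used for interval estimation of a population proportion.
For a population proportion x, precision (margin of error) σ, and confidence level 1−α:

n ≥ x(1 − x) · (z_{α/2})² / σ²
(where z_{α/2} is the upper α/2 point of the standard normal distribution, read from a table)

At a 95% confidence level (α = 0.05), with a population proportion of 0.5 and a precision of about ±10%, that gives roughly

0.5 × 0.5 × 1.96² / 0.1² ≈ 96

so specifying around 100 should do.
If the data is heavily skewed, well, there are various ways to think about it.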
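
To play with other precisions or confidence levels, here is a small sketch of the standard sample-size formula for a proportion, n ≥ x(1 − x) · z² / σ², with the critical value z squared. The function name is my own, and treating sampled pages as independent draws is a simplifying assumption, not something the MySQL documentation states.

```python
import math
from statistics import NormalDist


def required_sample_size(proportion: float, margin: float, confidence: float) -> int:
    """Smallest n with n >= p(1-p) * z^2 / margin^2 for a two-sided interval."""
    alpha = 1.0 - confidence
    # Upper alpha/2 point of the standard normal distribution (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil(proportion * (1.0 - proportion) * z ** 2 / margin ** 2)


if __name__ == "__main__":
    # Proportion 0.5, precision +/-10%, confidence level 95%.
    print(required_sample_size(0.5, 0.10, 0.95))
```

Using a proportion of 0.5 is the conservative choice, since x(1 − x) is largest there.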

So there you go:
even going from 5.5 to 5.6 there are all sorts of changes, which is fun!

俺の報告 (My Report)

A (probably) daily report from an engineer who runs RoomClip.
