28.5. WAL Configuration

PostgreSQL 18.0 Documentation
Chapter 28. Reliability and the Write-Ahead Log (WAL)


There are several WAL-related configuration parameters that affect database performance. This section explains their use. Consult Chapter 19 for general information about setting server configuration parameters.

Checkpoints are points in the sequence of transactions at which it is guaranteed that the heap and index data files have been updated with all information written before that checkpoint. At checkpoint time, all dirty data pages are flushed to disk and a special checkpoint record is written to the WAL file. (The change records were previously flushed to the WAL files.) In the event of a crash, the crash recovery procedure looks at the latest checkpoint record to determine the point in the WAL (known as the redo record) from which it should start the REDO operation. Any changes made to data files before that point are guaranteed to be already on disk. Hence, after a checkpoint, WAL segments preceding the one containing the redo record are no longer needed and can be recycled or removed. (When WAL archiving is being done, the WAL segments must be archived before being recycled or removed.)

The checkpoint requirement of flushing all dirty data pages to disk can cause a significant I/O load. For this reason, checkpoint activity is throttled so that I/O begins at checkpoint start and completes before the next checkpoint is due to start; this minimizes performance degradation during checkpoints.

The server's checkpointer process automatically performs a checkpoint every so often. A checkpoint is begun every checkpoint_timeout seconds, or if max_wal_size is about to be exceeded, whichever comes first. The default settings are 5 minutes and 1 GB, respectively. If no WAL has been written since the previous checkpoint, new checkpoints will be skipped even if checkpoint_timeout has passed. (If WAL archiving is being used and you want to put a lower limit on how often files are archived in order to bound potential data loss, you should adjust the archive_timeout parameter rather than the checkpoint parameters.) It is also possible to force a checkpoint by using the SQL command CHECKPOINT.
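The "whichever comes first" rule can be illustrated with a small Python sketch. This is a deliberately simplified model, not the server's logic: it assumes a constant WAL generation rate and treats max_wal_size itself as the trigger threshold. The function name and the `wal_bytes_per_sec` parameter are hypothetical; the defaults of 5 minutes and 1 GB come from the text above.

```python
def first_checkpoint_trigger(wal_bytes_per_sec,
                             checkpoint_timeout_s=300,
                             max_wal_size_bytes=1024 ** 3):
    """Return (reason, seconds_until_checkpoint) under a simplified model.

    Assumption: WAL is produced at a constant rate and the size-based
    trigger fires exactly when max_wal_size would be reached.
    """
    if wal_bytes_per_sec <= 0:
        # No WAL written since the previous checkpoint: it is skipped.
        return ("skipped", None)
    secs_to_fill = max_wal_size_bytes / wal_bytes_per_sec
    if secs_to_fill < checkpoint_timeout_s:
        return ("wal", secs_to_fill)
    return ("timeout", float(checkpoint_timeout_s))


# 1 MiB/s fills 1 GiB in 1024 s, so the 300 s timeout fires first.
print(first_checkpoint_trigger(1024 ** 2))
# 8 MiB/s fills 1 GiB in 128 s, so WAL volume fires first.
print(first_checkpoint_trigger(8 * 1024 ** 2))
```

In this model a sustained write rate above roughly 3.4 MiB/s makes the checkpoints WAL-driven with the default settings.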

Reducing checkpoint_timeout and/or max_wal_size causes checkpoints to occur more often. This allows faster after-crash recovery, since less work will need to be redone. However, one must balance this against the increased cost of flushing dirty data pages more often. If full_page_writes is set (as is the default), there is another factor to consider. To ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content. In that case, a smaller checkpoint interval increases the volume of output to the WAL, partially negating the goal of using a smaller interval, and in any case causing more disk I/O.
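The effect of full-page writes on WAL volume can be made concrete with a rough Python estimate. It is only a sketch under stated assumptions: the page size is taken to be 8 kB, the same set of `hot_pages` (a hypothetical workload parameter) is modified in every checkpoint cycle, and compression and ordinary change records are ignored.

```python
PAGE_SIZE = 8192  # assumed block size in bytes


def fpi_bytes_per_hour(hot_pages, checkpoint_interval_s):
    """Estimate WAL bytes per hour spent on full-page images.

    Assumption: each of `hot_pages` distinct pages is modified at least
    once per checkpoint cycle, so each is logged in full once per cycle.
    """
    checkpoints_per_hour = 3600 / checkpoint_interval_s
    return checkpoints_per_hour * hot_pages * PAGE_SIZE


# Same workload, checkpoint every 5 minutes versus every minute.
print(fpi_bytes_per_hour(10_000, 300))  # about 0.98 GB per hour
print(fpi_bytes_per_hour(10_000, 60))   # about 4.9 GB per hour
```

Under these assumptions, cutting the interval to a fifth multiplies the full-page-image volume by five, which is the trade-off the paragraph above describes.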

Checkpoints are fairly expensive, first because they require writing out all currently dirty buffers, and second because they result in extra subsequent WAL traffic as discussed above. It is therefore wise to set the checkpointing parameters high enough so that checkpoints don't happen too often. As a simple sanity check on your checkpointing parameters, you can set the checkpoint_warning parameter. If checkpoints happen closer together than checkpoint_warning seconds, a message will be output to the server log recommending increasing max_wal_size. Occasional appearance of such a message is not cause for alarm, but if it appears often then the checkpoint control parameters should be increased. Bulk operations such as large COPY transfers might cause a number of such warnings to appear if you have not set max_wal_size high enough.

To avoid flooding the I/O system with a burst of page writes, writing dirty buffers during a checkpoint is spread over a period of time. That period is controlled by checkpoint_completion_target, which is given as a fraction of the checkpoint interval (configured by using checkpoint_timeout). The I/O rate is adjusted so that the checkpoint finishes when the given fraction of checkpoint_timeout seconds have elapsed, or before max_wal_size is exceeded, whichever is sooner. With the default value of 0.9, PostgreSQL can be expected to complete each checkpoint a bit before the next scheduled checkpoint (at around 90% of the last checkpoint's duration). This spreads out the I/O as much as possible so that the checkpoint I/O load is consistent throughout the checkpoint interval. The disadvantage of this is that prolonging checkpoints affects recovery time, because more WAL segments will need to be kept around for possible use in recovery. A user concerned about the amount of time required to recover might wish to reduce checkpoint_timeout so that checkpoints occur more frequently but still spread the I/O across the checkpoint interval. Alternatively, checkpoint_completion_target could be reduced, but this would result in times of more intense I/O (during the checkpoint) and times of less I/O (after the checkpoint completed but before the next scheduled checkpoint) and therefore is not recommended. Although checkpoint_completion_target could be set as high as 1.0, it is typically recommended to set it to no higher than 0.9 (the default) since checkpoints include some other activities besides writing dirty buffers. A setting of 1.0 is quite likely to result in checkpoints not being completed on time, which would result in performance loss due to unexpected variation in the number of WAL segments needed.
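The arithmetic behind checkpoint_completion_target can be sketched in Python. This only illustrates the time-based case described above (it ignores the max_wal_size limit and the non-write work a checkpoint does); the function name and the `dirty_pages` parameter are hypothetical.

```python
def checkpoint_write_plan(dirty_pages,
                          checkpoint_timeout_s=300,
                          completion_target=0.9):
    """Return (write_window_seconds, pages_per_second).

    The write window is the given fraction of the checkpoint interval;
    the dirty pages are spread evenly over that window.
    """
    write_window_s = checkpoint_timeout_s * completion_target
    return write_window_s, dirty_pages / write_window_s


# Defaults: 27,000 dirty pages spread over 270 s, about 100 pages/s.
print(checkpoint_write_plan(27_000))
# A target of 0.5 compresses the same work into 150 s, about 180 pages/s,
# followed by 150 s with no checkpoint writes.
print(checkpoint_write_plan(27_000, completion_target=0.5))
```

The second call shows why lowering checkpoint_completion_target is discouraged: the same number of pages is written at a higher rate, followed by an idle stretch.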

On Linux and POSIX platforms, checkpoint_flush_after allows you to force OS pages written by the checkpoint to be flushed to disk after a configurable number of bytes. Otherwise, these pages may be kept in the OS's page cache, inducing a stall when fsync is issued at the end of a checkpoint. This setting will often help to reduce transaction latency, but it also can have an adverse effect on performance, particularly for workloads that are bigger than shared_buffers but smaller than the OS's page cache.

The number of WAL segment files in the pg_wal directory depends on min_wal_size, max_wal_size and the amount of WAL generated in previous checkpoint cycles. When old WAL segment files are no longer needed, they are removed or recycled (that is, renamed to become future segments in the numbered sequence). If, due to a short-term peak of WAL output rate, max_wal_size is exceeded, the unneeded segment files will be removed until the system gets back under this limit. Below that limit, the system recycles enough WAL files to cover the estimated need until the next checkpoint, and removes the rest. The estimate is based on a moving average of the number of WAL files used in previous checkpoint cycles. The moving average is increased immediately if the actual usage exceeds the estimate, so it accommodates peak usage rather than average usage to some extent. min_wal_size puts a minimum on the amount of WAL files recycled for future usage; that much WAL is always recycled for future use, even if the system is idle and the WAL usage estimate suggests that little WAL is needed.
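The recycling estimate described above behaves like a moving average that rises at once but decays slowly. The Python sketch below illustrates that shape only. The smoothing factor of 0.9, the function names, and the use of whole segment counts are assumptions made for illustration; the text does not specify them.

```python
import math


def update_estimate(estimate, used, smoothing=0.9):
    """Update the per-cycle WAL usage estimate (in segments).

    Jumps immediately to `used` when actual usage exceeds the estimate,
    otherwise decays toward it. The smoothing factor is an assumption.
    """
    if used > estimate:
        return float(used)
    return smoothing * estimate + (1 - smoothing) * used


def segments_to_recycle(estimate, min_wal_segments):
    """Recycle enough segments for the estimate, never below the floor
    that min_wal_size imposes (expressed here as a segment count)."""
    return max(math.ceil(estimate), min_wal_segments)


est = 10.0
for used in [20, 10, 10, 10]:  # one peak cycle, then quieter cycles
    est = update_estimate(est, used)
    print(round(est, 2), segments_to_recycle(est, min_wal_segments=5))
```

After the peak cycle the estimate stays close to the peak for several cycles, which is how the scheme accommodates peak rather than average usage.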

Independently of max_wal_size, the most recent wal_keep_size megabytes of WAL files plus one additional WAL file are kept at all times. Also, if WAL archiving is used, old segments cannot be removed or recycled until they are archived. If WAL archiving cannot keep up with the pace at which WAL is generated, or if archive_command or archive_library fails repeatedly, old WAL files will accumulate in pg_wal until the situation is resolved. A slow or failed standby server that uses a replication slot will have the same effect (see Section 26.2.6). Similarly, if WAL summarization is enabled, old segments are kept until they are summarized.

In archive recovery or standby mode, the server periodically performs restartpoints, which are similar to checkpoints in normal operation: the server forces all its state to disk, updates the pg_control file to indicate that the already-processed WAL data need not be scanned again, and then recycles any old WAL segment files in the pg_wal directory. Restartpoints can't be performed more frequently than checkpoints on the primary because restartpoints can only be performed at checkpoint records. A restartpoint can be demanded by a schedule or by an external request. The restartpoints_timed counter in the pg_stat_checkpointer view counts the former, while restartpoints_req counts the latter. A restartpoint is triggered by schedule when a checkpoint record is reached if at least checkpoint_timeout seconds have passed since the last performed restartpoint, or when the previous attempt to perform the restartpoint has failed. In the latter case, the next restartpoint will be scheduled in 15 seconds. A restartpoint is triggered by request for reasons similar to those for a checkpoint, but mostly when the WAL size is about to exceed max_wal_size. However, because of limitations on when a restartpoint can be performed, max_wal_size is often exceeded during recovery, by up to one checkpoint cycle's worth of WAL. (max_wal_size is never a hard limit anyway, so you should always leave plenty of headroom to avoid running out of disk space.) The restartpoints_done counter in the pg_stat_checkpointer view counts the restartpoints that have actually been performed.

In some cases, when the WAL size on the primary increases quickly, for instance during a massive INSERT, the restartpoints_req counter on the standby may show a spike. This occurs because requests to create a new restartpoint due to increased WAL consumption cannot be carried out until a safe checkpoint record written since the last restartpoint has been replayed on the standby. This behavior is normal and does not lead to an increase in system resource consumption. Among the restartpoint-related counters, only restartpoints_done indicates that noticeable system resources have been spent.

There are two commonly used internal WAL functions: XLogInsertRecord and XLogFlush. XLogInsertRecord is used to place a new record into the WAL buffers in shared memory. If there is no space for the new record, XLogInsertRecord will have to write (move to kernel cache) a few filled WAL buffers. This is undesirable because XLogInsertRecord is used on every low-level database modification (for example, row insertion) at a time when an exclusive lock is held on affected data pages, so the operation needs to be as fast as possible. What is worse, writing WAL buffers might also force the creation of a new WAL segment, which takes even more time. Normally, WAL buffers should be written and flushed by an XLogFlush request, which is made, for the most part, at transaction commit time to ensure that transaction records are flushed to permanent storage. On systems with high WAL output, XLogFlush requests might not occur often enough to prevent XLogInsertRecord from having to do writes. On such systems one should increase the number of WAL buffers by modifying the wal_buffers parameter. When full_page_writes is set and the system is very busy, setting wal_buffers higher will help smooth response times during the period immediately following each checkpoint.

The commit_delay parameter defines for how many microseconds a group commit leader process will sleep after acquiring a lock within XLogFlush, while group commit followers queue up behind the leader. This delay allows other server processes to add their commit records to the WAL buffers so that all of them will be flushed by the leader's eventual sync operation. No sleep will occur if fsync is not enabled, or if fewer than commit_siblings other sessions are currently in active transactions; this avoids sleeping when it's unlikely that any other session will commit soon. Note that on some platforms, the resolution of a sleep request is ten milliseconds, so that any nonzero commit_delay setting between 1 and 10000 microseconds would have the same effect. Note also that on some platforms, sleep operations may take slightly longer than requested by the parameter.
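The conditions under which the group commit leader sleeps can be written out as a small Python predicate. This is an illustrative sketch, not the server code: the function names are hypothetical, the default of 5 for commit_siblings is an assumption not stated in the text, and the 10 ms timer granularity applies only to the platforms mentioned above.

```python
def leader_sleeps(commit_delay_us, fsync_enabled,
                  other_active_sessions, commit_siblings=5):
    """True if the group commit leader would sleep before flushing.

    commit_siblings=5 is an assumed default for illustration.
    """
    return (commit_delay_us > 0
            and fsync_enabled
            and other_active_sessions >= commit_siblings)


def effective_sleep_us(commit_delay_us, timer_resolution_us=10_000):
    """Round a requested sleep up to the platform's timer resolution."""
    if commit_delay_us <= 0:
        return 0
    ticks = -(-commit_delay_us // timer_resolution_us)  # ceiling division
    return ticks * timer_resolution_us


print(leader_sleeps(1000, True, other_active_sessions=5))   # True
print(leader_sleeps(1000, True, other_active_sessions=4))   # False
print(effective_sleep_us(1), effective_sleep_us(10_000))    # 10000 10000
```

The last line shows why any setting from 1 to 10000 microseconds behaves identically on a platform with ten-millisecond sleep resolution.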

Since the purpose of commit_delay is to allow the cost of each flush operation to be amortized across concurrently committing transactions (potentially at the expense of transaction latency), it is necessary to quantify that cost before the setting can be chosen intelligently. The higher that cost is, the more effective commit_delay is expected to be in increasing transaction throughput, up to a point. The pg_test_fsync program can be used to measure the average time in microseconds that a single WAL flush operation takes. A value of half of the average time the program reports it takes to flush after a single 8kB write operation is often the most effective setting for commit_delay, so this value is recommended as the starting point to use when optimizing for a particular workload. While tuning commit_delay is particularly useful when the WAL is stored on high-latency rotating disks, benefits can be significant even on storage media with very fast sync times, such as solid-state drives or RAID arrays with a battery-backed write cache; but this should definitely be tested against a representative workload. Higher values of commit_siblings should be used in such cases, whereas smaller commit_siblings values are often helpful on higher latency media. Note that it is quite possible that a setting of commit_delay that is too high can increase transaction latency by so much that total transaction throughput suffers.
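The "half of the average flush time" rule of thumb is simple arithmetic, sketched here in Python. The helper names are hypothetical, the input is whatever per-operation flush time you measured for a single 8kB write, and the upper clamp of 100000 microseconds is an assumption about the parameter's allowed range.

```python
def usecs_from_ops_per_sec(ops_per_sec):
    """Convert a measured flush rate into microseconds per flush."""
    return 1_000_000 / ops_per_sec


def commit_delay_starting_point(flush_usecs_per_op, max_delay_us=100_000):
    """Half the average flush time, clamped to an assumed valid range.

    This is only a starting point for tuning against a real workload.
    """
    return max(0, min(max_delay_us, round(flush_usecs_per_op / 2)))


# A rotating disk measured at about 500 flushes/s (2000 us per flush).
print(commit_delay_starting_point(usecs_from_ops_per_sec(500)))  # 1000
# A fast device at about 20,000 flushes/s (50 us per flush).
print(commit_delay_starting_point(usecs_from_ops_per_sec(20_000)))  # 25
```

The resulting number is a first guess only; as the text notes, it should be validated against a representative workload.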

When commit_delay is set to zero (the default), it is still possible for a form of group commit to occur, but each group will consist only of sessions that reach the point where they need to flush their commit records during the window in which the previous flush operation (if any) is occurring. At higher client counts a "gangway effect" tends to occur, so that the effects of group commit become significant even when commit_delay is zero, and thus explicitly setting commit_delay tends to help less. Setting commit_delay can only help when (1) there are some concurrently committing transactions, and (2) throughput is limited to some degree by commit rate; but with high rotational latency this setting can be effective in increasing transaction throughput with as few as two clients (that is, a single committing client with one sibling transaction).

The wal_sync_method parameter determines how PostgreSQL will ask the kernel to force WAL updates out to disk. All the options should be the same in terms of reliability, with the exception of fsync_writethrough, which can sometimes force a flush of the disk cache even when other options do not do so. However, it's quite platform-specific which one will be the fastest. You can test the speeds of different options using the pg_test_fsync program. Note that this parameter is irrelevant if fsync has been turned off.

Enabling the wal_debug configuration parameter (provided that PostgreSQL has been compiled with support for it) will result in each XLogInsertRecord and XLogFlush WAL call being logged to the server log. This option might be replaced by a more general mechanism in the future.

There are two internal functions to write WAL data to disk: XLogWrite and issue_xlog_fsync. When track_wal_io_timing is enabled, the total amounts of time XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are counted as write_time and fsync_time in pg_stat_io for the object wal, respectively. XLogWrite is normally called by XLogInsertRecord (when there is no space for the new record in WAL buffers), XLogFlush and the WAL writer, to write WAL buffers to disk and call issue_xlog_fsync. issue_xlog_fsync is normally called by XLogWrite to sync WAL files to disk. If wal_sync_method is either open_datasync or open_sync, a write operation in XLogWrite guarantees to sync written WAL data to disk and issue_xlog_fsync does nothing. If wal_sync_method is fdatasync, fsync, or fsync_writethrough, the write operation moves WAL buffers to kernel cache and issue_xlog_fsync syncs them to disk. Regardless of the setting of track_wal_io_timing, the number of times XLogWrite writes and issue_xlog_fsync syncs WAL data to disk are also counted as writes and fsyncs in pg_stat_io for the object wal, respectively.
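The division of labor between the two functions can be summarized as a lookup, shown here as a Python sketch. The helper `who_syncs` is hypothetical and exists only to restate the paragraph above; the method names are the wal_sync_method values listed in the text.

```python
# Methods where the write itself guarantees the data reaches disk.
WRITE_SYNCS = {"open_datasync", "open_sync"}
# Methods where the write only reaches kernel cache and a later sync is needed.
FSYNC_SYNCS = {"fdatasync", "fsync", "fsync_writethrough"}


def who_syncs(wal_sync_method):
    """Return the internal function that makes WAL durable for a method."""
    if wal_sync_method in WRITE_SYNCS:
        return "XLogWrite"          # issue_xlog_fsync does nothing
    if wal_sync_method in FSYNC_SYNCS:
        return "issue_xlog_fsync"   # XLogWrite only moves data to kernel cache
    raise ValueError(f"unknown wal_sync_method: {wal_sync_method!r}")


for method in sorted(WRITE_SYNCS | FSYNC_SYNCS):
    print(method, "->", who_syncs(method))
```

This also suggests how to read the timing counters: with the open_* methods, the sync cost is part of the write itself, so it would be expected to appear under write_time rather than fsync_time.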

The recovery_prefetch parameter can be used to reduce I/O wait times during recovery by instructing the kernel to initiate reads of disk blocks that will soon be needed but are not currently in PostgreSQL's buffer pool. The maintenance_io_concurrency and wal_decode_buffer_size settings limit prefetching concurrency and distance, respectively. By default, recovery_prefetch is set to try, which enables the feature on systems that support issuing read-ahead advice.
