65.2. TOAST

PostgreSQL 17.5 Documentation
Chapter 65. Database Physical Storage

This section provides an overview of TOAST (The Oversized-Attribute Storage Technique).

PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread"). The TOAST infrastructure is also used to improve handling of large data values in-memory.

Only certain data types support TOAST; there is no need to impose the overhead on data types that cannot produce large field values. To support TOAST, a data type must have a variable-length (varlena) representation, in which, ordinarily, the first four-byte word of any stored value contains the total length of the value in bytes (including itself). TOAST does not constrain the rest of the data type's representation. The special representations collectively called TOASTed values work by modifying or reinterpreting this initial length word. Therefore, the C-level functions supporting a TOAST-able data type must be careful about how they handle potentially TOASTed input values: an input might not actually consist of a four-byte length word and contents until after it has been detoasted. (This is normally done by invoking PG_DETOAST_DATUM before doing anything with an input value, but in some cases more efficient approaches are possible. See Section 36.13.1 for more detail.)

TOAST usurps two bits of the varlena length word (the high-order bits on big-endian machines, the low-order bits on little-endian machines), thereby limiting the logical size of any value of a TOAST-able data type to 1 GB (2³⁰ - 1 bytes). When both bits are zero, the value is an ordinary un-TOASTed value of the data type, and the remaining bits of the length word give the total datum size (including length word) in bytes. When the highest-order or lowest-order bit is set, the value has only a single-byte header instead of the normal four-byte header, and the remaining bits of that byte give the total datum size (including length byte) in bytes. This alternative supports space-efficient storage of values shorter than 127 bytes, while still allowing the data type to grow to 1 GB at need. Values with single-byte headers aren't aligned on any particular boundary, whereas values with four-byte headers are aligned on at least a four-byte boundary; this omission of alignment padding provides additional space savings that are significant relative to the size of short values. As a special case, if the remaining bits of a single-byte header are all zero (which would be impossible for a self-inclusive length), the value is a pointer to out-of-line data, with several possible alternatives as described below. The type and size of such a TOAST pointer are determined by a code stored in the second byte of the datum. Lastly, when the highest-order or lowest-order bit is clear but the adjacent bit is set, the content of the datum has been compressed and must be decompressed before use. In this case the remaining bits of the four-byte length word give the total size of the compressed datum, not the original data. Note that compression is also possible for out-of-line data, but the varlena header does not tell whether it has occurred; the content of the TOAST pointer tells that instead.
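The header rules above can be made concrete with a small sketch. The following Python is an illustrative model only, not PostgreSQL's code: it assumes a little-endian machine, so the two flag bits are the low-order bits of the first byte, and the function name `classify_varlena` and the returned labels are invented for this example.

```python
import struct


def classify_varlena(buf: bytes):
    """Classify a datum by its header bits (little-endian layout assumed).

    Returns a (kind, value) pair, where value is the total datum size in
    bytes, or the second-byte code for a TOAST pointer.
    """
    first = buf[0]
    if first & 0x01:
        # Lowest-order bit set: single-byte header.
        size = first >> 1
        if size == 0:
            # Remaining bits all zero: a TOAST pointer, whose type and size
            # are determined by a code in the second byte.
            return ("toast_pointer", buf[1])
        return ("short", size)
    # Lowest-order bit clear: normal four-byte header.
    (word,) = struct.unpack_from("<I", buf, 0)
    size = word >> 2  # the remaining 30 bits hold the total size
    if first & 0x02:
        # Adjacent bit set: in-line compressed; size is the compressed size.
        return ("compressed", size)
    return ("plain", size)
```

Because only 30 bits of the four-byte word remain for the length, the largest size this model can represent is 2³⁰ - 1 bytes, which matches the 1 GB limit stated above.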

The compression technique used for either in-line or out-of-line compressed data can be selected for each column by setting the COMPRESSION column option in CREATE TABLE or ALTER TABLE. The default for columns with no explicit setting is to consult the default_toast_compression parameter at the time data is inserted.

As mentioned, there are multiple types of TOAST pointer datums. The oldest and most common type is a pointer to out-of-line data stored in a TOAST table that is separate from, but associated with, the table containing the TOAST pointer datum itself. These on-disk pointer datums are created by the TOAST management code (in access/common/toast_internals.c) when a tuple to be stored on disk is too large to be stored as-is. Further details appear in Section 65.2.1. Alternatively, a TOAST pointer datum can contain a pointer to out-of-line data that appears elsewhere in memory. Such datums are necessarily short-lived, and will never appear on-disk, but they are very useful for avoiding copying and redundant processing of large data values. Further details appear in Section 65.2.2.

65.2.1. Out-of-Line, On-Disk TOAST Storage

If any of the columns of a table are TOAST-able, the table will have an associated TOAST table, whose OID is stored in the table's pg_class.reltoastrelid entry. On-disk TOASTed values are kept in the TOAST table, as described in more detail below.

Out-of-line values are divided (after compression if used) into chunks of at most TOAST_MAX_CHUNK_SIZE bytes (by default this value is chosen so that four chunk rows will fit on a page, making it about 2000 bytes). Each chunk is stored as a separate row in the TOAST table belonging to the owning table. Every TOAST table has the columns chunk_id (an OID identifying the particular TOASTed value), chunk_seq (a sequence number for the chunk within its value), and chunk_data (the actual data of the chunk). A unique index on chunk_id and chunk_seq provides fast retrieval of the values. A pointer datum representing an out-of-line on-disk TOASTed value therefore needs to store the OID of the TOAST table in which to look and the OID of the specific value (its chunk_id). For convenience, pointer datums also store the logical datum size (original uncompressed data length), physical stored size (different if compression was applied), and the compression method used, if any. Allowing for the varlena header bytes, the total size of an on-disk TOAST pointer datum is therefore 18 bytes regardless of the actual size of the represented value.
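As a rough illustration of the chunking and of the 18-byte figure, the sketch below splits a value into numbered chunks and adds up a plausible pointer layout. It is a model under stated assumptions, not PostgreSQL's implementation: `CHUNK_SIZE` uses the approximate 2000-byte figure from the text rather than the exact TOAST_MAX_CHUNK_SIZE, the helper names are hypothetical, and the pointer payload is assumed to be four 4-byte fields (with the compression method sharing a word with the stored size) plus two varlena header bytes.

```python
import struct

CHUNK_SIZE = 2000  # assumption: the "about 2000 bytes" figure from the text


def split_into_chunks(value: bytes, chunk_size: int = CHUNK_SIZE):
    """Return (chunk_seq, chunk_data) pairs for one out-of-line value."""
    return [
        (seq, value[start:start + chunk_size])
        for seq, start in enumerate(range(0, len(value), chunk_size))
    ]


def reassemble(chunks):
    """Rebuild the value by ordering the chunks on chunk_seq."""
    return b"".join(data for _, data in sorted(chunks))


# Assumed pointer payload: logical size, stored size plus compression method,
# value OID (chunk_id), and TOAST table OID, each four bytes wide.
POINTER_PAYLOAD = struct.calcsize("<iIII")
VARLENA_HEADER_BYTES = 2  # single-byte header plus the second-byte code
POINTER_SIZE = POINTER_PAYLOAD + VARLENA_HEADER_BYTES
```

Under these assumptions a 5120-byte value yields three chunk rows, while the pointer left behind in the main table stays at 18 bytes however large the value is.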

The TOAST management code is triggered only when a row value to be stored in a table is wider than TOAST_TUPLE_THRESHOLD bytes (normally 2 kB). The TOAST code will compress and/or move field values out-of-line until the row value is shorter than TOAST_TUPLE_TARGET bytes (also normally 2 kB, adjustable) or no more gains can be had. During an UPDATE operation, values of unchanged fields are normally preserved as-is; so an UPDATE of a row with out-of-line values incurs no TOAST costs if none of the out-of-line values change.

The TOAST management code recognizes four different strategies for storing TOAST-able columns on disk:

PLAIN prevents either compression or out-of-line storage. This is the only possible strategy for columns of non-TOAST-able data types.
EXTENDED allows both compression and out-of-line storage. This is the default for most TOAST-able data types. Compression will be attempted first, then out-of-line storage if the row is still too big.
EXTERNAL allows out-of-line storage but not compression. Use of EXTERNAL will make substring operations on wide text and bytea columns faster (at the penalty of increased storage space) because these operations are optimized to fetch only the required parts of the out-of-line value when it is not compressed.
MAIN allows compression but not out-of-line storage. (Actually, out-of-line storage will still be performed for such columns, but only as a last resort when there is no other way to make the row small enough to fit on a page.)

Each TOAST-able data type specifies a default strategy for columns of that data type, but the strategy for a given table column can be altered with ALTER TABLE ... SET STORAGE.

TOAST_TUPLE_TARGET can be adjusted for each table using ALTER TABLE ... SET (toast_tuple_target = N).
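The interaction between the threshold, the target, and the four strategies can be sketched as a toy model. This is not PostgreSQL's actual algorithm: the `Field` class, the `toast_row` function, the fixed compression `ratio`, and the exact ordering of passes are all assumptions made for illustration, chosen only to follow the descriptions above (compress EXTENDED columns first, then move EXTENDED and EXTERNAL columns out of line, and touch MAIN columns last).

```python
from dataclasses import dataclass

POINTER_SIZE = 18  # size of an on-disk TOAST pointer, as stated in the text


@dataclass
class Field:
    name: str
    size: int        # current in-row size in bytes
    strategy: str    # "PLAIN", "EXTENDED", "EXTERNAL", or "MAIN"
    compressed: bool = False
    external: bool = False


def toast_row(fields, threshold=2000, target=2000, ratio=0.5):
    """Shrink a row in place, largest eligible field first (toy model)."""

    def row_size():
        return sum(f.size for f in fields)

    if row_size() <= threshold:
        return fields  # row is not wide enough to trigger the TOAST code

    passes = [
        ({"EXTENDED"}, "compress"),
        ({"EXTENDED", "EXTERNAL"}, "move_out"),
        ({"MAIN"}, "compress"),
        ({"MAIN"}, "move_out"),  # last resort for MAIN columns
    ]
    for strategies, action in passes:
        while row_size() > target:
            candidates = [
                f for f in fields
                if f.strategy in strategies
                and not f.external
                and f.size > POINTER_SIZE
                and not (action == "compress" and f.compressed)
            ]
            if not candidates:
                break  # no more gains can be had from this pass
            largest = max(candidates, key=lambda f: f.size)
            if action == "compress":
                # Assumed fixed ratio; real compression depends on the data.
                largest.size = max(1, int(largest.size * ratio))
                largest.compressed = True
            else:
                largest.size = POINTER_SIZE
                largest.external = True
    return fields
```

In this model PLAIN columns are never touched, and a MAIN column is moved out of line only when compressing it was not enough to bring the row under the target.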

This scheme has a number of advantages compared to a more straightforward approach such as allowing row values to span pages. Assuming that queries are usually qualified by comparisons against relatively small key values, most of the work of the executor will be done using the main row entry. The big values of TOASTed attributes will only be pulled out (if selected at all) at the time the result set is sent to the client. Thus, the main table is much smaller and more of its rows fit in the shared buffer cache than would be the case without any out-of-line storage. Sort sets shrink also, and sorts will more often be done entirely in memory. A little test showed that a table containing typical HTML pages and their URLs was stored in about half of the raw data size including the TOAST table, and that the main table contained only about 10% of the entire data (the URLs and some small HTML pages). There was no run time difference compared to an un-TOASTed comparison table, in which all the HTML pages were cut down to 7 kB to fit.

65.2.2. Out-of-Line, In-Memory TOAST Storage

TOAST pointers can point to data that is not on disk, but is elsewhere in the memory of the current server process. Such pointers obviously cannot be long-lived, but they are nonetheless useful. There are currently two sub-cases: pointers to indirect data and pointers to expanded data.

Indirect TOAST pointers simply point at a non-indirect varlena value stored somewhere in memory. This case was originally created merely as a proof of concept, but it is currently used during logical decoding to avoid possibly having to create physical tuples exceeding 1 GB (as pulling all out-of-line field values into the tuple might do). The case is of limited use since the creator of the pointer datum is entirely responsible for ensuring that the referenced data survives for as long as the pointer could exist, and there is no infrastructure to help with this.

Expanded TOAST pointers are useful for complex data types whose on-disk representation is not especially suited for computational purposes. As an example, the standard varlena representation of a PostgreSQL array includes dimensionality information, a nulls bitmap if there are any null elements, then the values of all the elements in order. When the element type itself is variable-length, the only way to find the N'th element is to scan through all the preceding elements. This representation is appropriate for on-disk storage because of its compactness, but for computations with the array it's much nicer to have an "expanded" or "deconstructed" representation in which all the element starting locations have been identified. The TOAST pointer mechanism supports this need by allowing a pass-by-reference Datum to point to either a standard varlena value (the on-disk representation) or a TOAST pointer that points to an expanded representation somewhere in memory. The details of this expanded representation are up to the data type, though it must have a standard header and meet the other API requirements given in src/include/utils/expandeddatum.h. C-level functions working with the data type can choose to handle either representation. Functions that do not know about the expanded representation, but simply apply PG_DETOAST_DATUM to their inputs, will automatically receive the traditional varlena representation; so support for an expanded representation can be introduced incrementally, one function at a time.
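The cost difference between the compact and the expanded form can be seen in a small sketch. The flat format used here (each element preceded by a one-byte length, so elements must be shorter than 256 bytes) is a simplification invented for this example and is not PostgreSQL's array layout; all function names are hypothetical.

```python
def pack(elements):
    """Compact form: each element is stored as a length byte plus its data."""
    return b"".join(bytes([len(e)]) + e for e in elements)


def nth_flat(buf: bytes, n: int) -> bytes:
    """Find element n by walking over all preceding elements."""
    pos = 0
    for _ in range(n):
        pos += 1 + buf[pos]  # skip the length byte and the element body
    return buf[pos + 1:pos + 1 + buf[pos]]


def expand(buf: bytes):
    """Expanded form: record every element's starting offset once."""
    offsets = []
    pos = 0
    while pos < len(buf):
        offsets.append(pos)
        pos += 1 + buf[pos]
    return offsets


def nth_expanded(buf: bytes, offsets, n: int) -> bytes:
    """Find element n directly from the precomputed offsets."""
    pos = offsets[n]
    return buf[pos + 1:pos + 1 + buf[pos]]
```

The flat lookup does work proportional to n on every call, whereas the expanded form pays for one full scan up front and then answers each lookup with a single index operation.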

TOAST pointers to expanded values are further broken down into read-write and read-only pointers. The pointed-to representation is the same either way, but a function that receives a read-write pointer is allowed to modify the referenced value in-place, whereas one that receives a read-only pointer must not; it must first create a copy if it wants to make a modified version of the value. This distinction and some associated conventions make it possible to avoid unnecessary copying of expanded values during query execution.
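The read-write versus read-only rule amounts to a copy-on-write convention, which the following sketch models in miniature. The `ExpandedPointer` class and `append_element` function are hypothetical stand-ins, and a Python list plays the role of the expanded value; nothing here reflects the actual C API.

```python
from dataclasses import dataclass


@dataclass
class ExpandedPointer:
    elements: list    # stands in for the expanded in-memory representation
    read_write: bool  # True if the holder may modify the value in place


def append_element(ptr: ExpandedPointer, value) -> ExpandedPointer:
    """Return a pointer to the value with one more element appended."""
    if ptr.read_write:
        # Read-write pointer: modify the referenced value in place.
        ptr.elements.append(value)
        return ptr
    # Read-only pointer: copy first, then modify the private copy.
    copy = list(ptr.elements)
    copy.append(value)
    return ExpandedPointer(copy, read_write=True)
```

A chain of such calls that starts from a read-only input copies the data once, after which every later step works on the private read-write copy.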

For all types of in-memory TOAST pointer, the TOAST management code ensures that no such pointer datum can accidentally get stored on disk. In-memory TOAST pointers are automatically expanded to normal in-line varlena values before storage, and then possibly converted to on-disk TOAST pointers if the containing tuple would otherwise be too big.
