12.9. テキスト検索に好ましいインデックス種類

PostgreSQL 17.5文書
		第12章全文検索	誤訳等の報告
前へ	上へ	12.9. テキスト検索に好ましいインデックス種類	次へ

12.9. テキスト検索に好ましいインデックス種類 #

<title>Preferred Index Types for Text Search</title>

There are two kinds of indexes that can be used to speed up full text searches: <link linkend="gin"><acronym>GIN</acronym></link> and <link linkend="gist"><acronym>GiST</acronym></link>. Note that indexes are not mandatory for full text searching, but in cases where a column is searched on a regular basis, an index is usually desirable. 全文検索を高速化するために、2種類のインデックスが使えます。 GINとGiSTです。全文検索でインデックスが必須ではありませんが、日常的に検索される列には、インデックスを使った方が良いでしょう。

To create such an index, do one of: このようなインデックスを作成するには、次のいずれかを実行します。

CREATE INDEX name ON table USING GIN (column);: Creates a GIN (Generalized Inverted Index)-based index. The <replaceable>column</replaceable> must be of <type>tsvector</type> type. GIN (Generalized Inverted Index)インデックスを作ります。 columnはtsvector型でなければなりません。
CREATE INDEX name ON table USING GIST (column [ { DEFAULT | tsvector_ops } (siglen = number) ] );: Creates a GiST (Generalized Search Tree)-based index. The <replaceable>column</replaceable> can be of <type>tsvector</type> or <type>tsquery</type> type. Optional integer parameter <literal>siglen</literal> determines signature length in bytes (see below for details). GiST (Generalized Search Tree)インデックスを作ります。columnは tsvector またはtsquery 型です。オプションの整数パラメータsiglenは署名の長さをバイト単位で決定します(詳細は以下を参照してください)。

GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words. GIN indexes store only the words (lexemes) of <type>tsvector</type> values, and not their weight labels. Thus a table row recheck is needed when using a query that involves weights. GINインデックスの方がより好ましいテキスト検索インデックス形式です。転置インデックスなので、マッチした位置の圧縮されたリストと合わせて各単語(語彙素)へのインデックスエントリを含みます。複数単語での検索は最初のマッチを見つけることができ、その後、追加の単語がない行を削除するのにインデックスを使えます。 GINインデックスはtsvector値の単語(語彙素)のみを格納しており、重み付けラベルは格納していません。そのため、重みを含む問い合わせを使う場合には、テーブル行の再検査が必要です。

A GiST index is <firstterm>lossy</firstterm>, meaning that the index might produce false matches, and it is necessary to check the actual table row to eliminate such false matches. (<productname>PostgreSQL</productname> does this automatically when needed.) GiST indexes are lossy because each document is represented in the index by a fixed-length signature. The signature length in bytes is determined by the value of the optional integer parameter <literal>siglen</literal>. The default signature length (when <literal>siglen</literal> is not specified) is 124 bytes, the maximum signature length is 2024 bytes. The signature is generated by hashing each word into a single bit in an n-bit string, with all these bits OR-ed together to produce an n-bit document signature. When two words hash to the same bit position there will be a false match. If all words in the query have matches (real or false) then the table row must be retrieved to see if the match is correct. Longer signatures lead to a more precise search (scanning a smaller fraction of the index and fewer heap pages), at the cost of a larger index. GiSTインデックスは、非可逆です。つまり、インデックスは間違った結果を返すかも知れないので、間違った結果を排除するために、テーブルの行をチェックすることが必要です。 (PostgreSQLはこの処理が必要とされた時に自動的に行います。) GiSTインデックスが非可逆なのは、インデックス中の各文書が固定長の署名によって表現されているからです。署名のバイト単位の長さはオプションの整数のパラメータsiglenの値で決まります。 (siglenが指定されない場合)デフォルトの署名の長さは124バイトで、最大の署名の長さは2024バイトです。署名は、各々の単語をハッシュして単一なビットにして、これらのビットをnビットの文書署名にORし、nビットの列中のビットにすることで実現されています。 2つの単語が同じビット位置を生成すると、間違った一致が起こります。問い合わせ対象のすべての単語が照合すると(それが正しいか間違っているかは別として)、その照合が正しいものかどうかテーブルの行を取得して調べなければなりません。長い署名では、インデックスはより大きくなってしまいますが、(インデックスのより小さな部分とより少ないヒープページをスキャンすることで)検索がより正確になります。

A GiST index can be covering, i.e., use the <literal>INCLUDE</literal> clause. Included columns can have data types without any GiST operator class. Included attributes will be stored uncompressed. GiSTインデックスはカバリングにできます、すなわちINCLUDE句を使えます。列には、GiST演算子クラスを持たないデータ型をINCLUDEで含めることができます。含まれる属性は圧縮されずに格納されます。

Lossiness causes performance degradation due to unnecessary fetches of table records that turn out to be false matches. Since random access to table records is slow, this limits the usefulness of GiST indexes. The likelihood of false matches depends on several factors, in particular the number of unique words, so using dictionaries to reduce this number is recommended. 非可逆性は、間違った照合によるテーブルからの不必要なデータ取得のため、性能を劣化させます。テーブルへのランダムアクセスは遅いので、GiSTインデックスの有用性は制限されています。誤った照合がどの位あるかという可能性はいくつか要因によりますが、とりわけユニークな単語の数に依存します。ですから、辞書を使ってユニークな単語の数を減らすことをお勧めします。

Note that <acronym>GIN</acronym> index build time can often be improved by increasing <xref linkend="guc-maintenance-work-mem"/>, while <acronym>GiST</acronym> index build time is not sensitive to that parameter. GINインデックスの構築時間はmaintenance_work_memを増やすことによってしばしば改善することができることに注意してください。一方GiSTインデックスの構築時間にはあまりそのパラメータは効きません。

Partitioning of big collections and the proper use of GIN and GiST indexes allows the implementation of very fast searches with online update. Partitioning can be done at the database level using table inheritance, or by distributing documents over servers and collecting external search results, e.g., via <link linkend="ddl-foreign-data">Foreign Data</link> access. The latter is possible because ranking functions use only local information. 大きなデータをパーティショニングし、GIN、GiSTインデックスを適切に使うことによってオンラインの更新を伴いながら、非常に高速な検索を実現することができます。パーティショニングは、以下のどちらかの方法でデータベースレベルで実現できます。(1)テーブルの継承を使う。(2)文書を複数のサーバに分散させ、外部の検索結果を集約する。たとえば外部データアクセスを使います。 2の方法は、ランキング関数がローカルな情報しか使わないため可能です。

前へ	上へ	次へ
12.8. テキスト検索のテストとデバッグ	ホーム	12.10. psqlサポート