12.4. Additional Features

PostgreSQL 17.5 Documentation


This section describes additional functions and operators that are useful in connection with text search.

12.4.1. Manipulating Documents

Section 12.3.1 showed how raw textual documents can be converted into tsvector values. PostgreSQL also provides functions and operators that can be used to manipulate documents that are already in tsvector form.

tsvector || tsvector

The tsvector concatenation operator returns a vector which combines the lexemes and positional information of the two vectors given as arguments. Positions and weight labels are retained during the concatenation. Positions appearing in the right-hand vector are offset by the largest position mentioned in the left-hand vector, so that the result is nearly equivalent to the result of performing to_tsvector on the concatenation of the two original document strings. (The equivalence is not exact, because any stop-words removed from the end of the left-hand argument will not affect the result, whereas they would have affected the positions of the lexemes in the right-hand argument if textual concatenation were used.)

One advantage of using concatenation in the vector form, rather than concatenating text before applying to_tsvector, is that you can use different configurations to parse different sections of the document. Also, because the setweight function marks all lexemes of the given vector the same way, it is necessary to parse the text and do setweight before concatenating if you want to label different parts of the document with different weights.
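As a hedged illustration of the concatenation rules above, the first query below joins two literal tsvector values, so no parser configuration is involved. The second parses two parts of a document with different configurations and weights before joining them; the sample strings and the pairing of the english and simple configurations are assumptions made for this sketch.

```sql
-- Positions in the right-hand vector are shifted by 2,
-- the largest position in the left-hand vector.
SELECT 'a:1 b:2'::tsvector || 'c:1 d:2 b:3'::tsvector;
-- Result: 'a':1 'b':2,5 'c':3 'd':4

-- Parse each part with its own configuration, label it, then concatenate.
SELECT setweight(to_tsvector('pg_catalog.english', 'Crab Nebula'), 'A') ||
       setweight(to_tsvector('pg_catalog.simple', 'M1 NGC 1952'), 'B');
```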

setweight(vector tsvector, weight "char") returns tsvector

setweight returns a copy of the input vector in which every position has been labeled with the given weight, either A, B, C, or D. (D is the default for new vectors and as such is not displayed on output.) These labels are retained when vectors are concatenated, allowing words from different parts of a document to be weighted differently by ranking functions.

Note that weight labels apply to positions, not lexemes. If the input vector has been stripped of positions then setweight does nothing.
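A small sketch of both points, using literal vectors so that the result does not depend on any text search configuration:

```sql
-- Every position receives weight A, replacing the earlier B label on 'rat'.
SELECT setweight('fat:2,4 cat:3 rat:5B'::tsvector, 'A');
-- Result: 'cat':3A 'fat':2A,4A 'rat':5A

-- A stripped vector has no positions, so there is nothing to label.
SELECT setweight(strip('fat:2,4 cat:3'::tsvector), 'A');
-- Result: 'cat' 'fat'
```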

length(vector tsvector) returns integer

Returns the number of lexemes stored in the vector.

strip(vector tsvector) returns tsvector

Returns a vector that lists the same lexemes as the given vector, but lacks any position or weight information. The result is usually much smaller than an unstripped vector, but it is also less useful. Relevance ranking does not work as well on stripped vectors as unstripped ones. Also, the <-> (FOLLOWED BY) tsquery operator will never match stripped input, since it cannot determine the distance between lexeme occurrences.
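The following sketch exercises length and strip on a literal vector, then shows the phrase-search limitation on a stripped vector. The sample text 'fatal error' is only an illustration.

```sql
SELECT length('fat:2,4 cat:3 rat:5A'::tsvector);
-- Result: 3

SELECT strip('fat:2,4 cat:3 rat:5A'::tsvector);
-- Result: 'cat' 'fat' 'rat'

-- Phrase search needs positions: the same document matches
-- before stripping but not after.
SELECT to_tsvector('pg_catalog.english', 'fatal error')
           @@ to_tsquery('pg_catalog.english', 'fatal <-> error') AS unstripped,
       strip(to_tsvector('pg_catalog.english', 'fatal error'))
           @@ to_tsquery('pg_catalog.english', 'fatal <-> error') AS stripped;
-- unstripped = t, stripped = f
```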

A full list of tsvector-related functions is available in Table 9.43.

12.4.2. Manipulating Queries

Section 12.3.2 showed how raw textual queries can be converted into tsquery values. PostgreSQL also provides functions and operators that can be used to manipulate queries that are already in tsquery form.

tsquery && tsquery

Returns the AND-combination of the two given queries.

tsquery || tsquery

Returns the OR-combination of the two given queries.

!! tsquery

Returns the negation (NOT) of the given query.
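A brief sketch of the three Boolean operators described so far, applied to literal tsquery values:

```sql
SELECT 'fat | rat'::tsquery && 'cat'::tsquery;
-- Result: ( 'fat' | 'rat' ) & 'cat'

SELECT 'fat | rat'::tsquery || 'cat'::tsquery;
-- Result: 'fat' | 'rat' | 'cat'

SELECT !! 'cat'::tsquery;
-- Result: !'cat'
```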

tsquery <-> tsquery

Returns a query that searches for a match to the first given query immediately followed by a match to the second given query, using the <-> (FOLLOWED BY) tsquery operator. For example:

SELECT to_tsquery('fat') <-> to_tsquery('cat | rat');
          ?column?
----------------------------
 'fat' <-> ( 'cat' | 'rat' )

tsquery_phrase(query1 tsquery, query2 tsquery [, distance integer ]) returns tsquery

Returns a query that searches for a match to the first given query followed by a match to the second given query at a distance of exactly distance lexemes, using the <N> tsquery operator. For example:

SELECT tsquery_phrase(to_tsquery('fat'), to_tsquery('cat'), 10);
  tsquery_phrase
------------------
 'fat' <10> 'cat'
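To make the distance semantics concrete, here is a sketch using literal values. The vector 'fat:1 cat:3' is an assumed sample in which the two lexemes are exactly two positions apart.

```sql
-- Omitting distance gives the same query as the <-> operator.
SELECT tsquery_phrase('fat'::tsquery, 'cat'::tsquery);
-- Result: 'fat' <-> 'cat'

-- <2> requires 'cat' exactly two positions after 'fat'.
SELECT 'fat:1 cat:3'::tsvector
           @@ tsquery_phrase('fat'::tsquery, 'cat'::tsquery, 2) AS two_apart,
       'fat:1 cat:3'::tsvector
           @@ tsquery_phrase('fat'::tsquery, 'cat'::tsquery, 1) AS adjacent;
-- two_apart = t, adjacent = f
```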

numnode(query tsquery) returns integer

Returns the number of nodes (lexemes plus operators) in a tsquery. This function is useful to determine if the query is meaningful (returns > 0), or contains only stop words (returns 0). Examples:

SELECT numnode(plainto_tsquery('the any'));
NOTICE:  query contains only stopword(s) or doesn't contain lexeme(s), ignored
 numnode
---------
       0

SELECT numnode('foo & bar'::tsquery);
 numnode
---------
       3

querytree(query tsquery) returns text

Returns the portion of a tsquery that can be used for searching an index. This function is useful for detecting unindexable queries, for example those containing only stop words or only negated terms. For example:

SELECT querytree(to_tsquery('defined'));
 querytree
-----------
 'defin'

SELECT querytree(to_tsquery('!defined'));
 querytree
-----------
 T
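For a query that mixes an indexable term with a negated one, querytree keeps only the indexable part. A minimal sketch with a literal tsquery:

```sql
-- Only the non-negated branch can drive an index search.
SELECT querytree('foo & ! bar'::tsquery);
-- Result: 'foo'
```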

12.4.2.1. Query Rewriting

The ts_rewrite family of functions searches a given tsquery for occurrences of a target subquery, and replaces each occurrence with a substitute subquery. In essence this operation is a tsquery-specific version of substring replacement. A target and substitute combination can be thought of as a query rewrite rule. A collection of such rewrite rules can be a powerful search aid. For example, you can expand the search using synonyms (e.g., new york, big apple, nyc, gotham) or narrow the search to direct the user to some hot topic. There is some overlap in functionality between this feature and thesaurus dictionaries (Section 12.6.4). However, you can modify a set of rewrite rules on-the-fly without reindexing, whereas updating a thesaurus requires reindexing to be effective.

ts_rewrite (query tsquery, target tsquery, substitute tsquery) returns tsquery

This form of ts_rewrite simply applies a single rewrite rule: target is replaced by substitute wherever it appears in query. For example:

SELECT ts_rewrite('a & b'::tsquery, 'a'::tsquery, 'c'::tsquery);
 ts_rewrite
------------
 'b' & 'c'

ts_rewrite (query tsquery, select text) returns tsquery

This form of ts_rewrite accepts a starting query and an SQL select command, which is given as a text string. The select must yield two columns of tsquery type. For each row of the select result, occurrences of the first column value (the target) are replaced by the second column value (the substitute) within the current query value. For example:

CREATE TABLE aliases (t tsquery PRIMARY KEY, s tsquery);
INSERT INTO aliases VALUES('a', 'c');

SELECT ts_rewrite('a & b'::tsquery, 'SELECT t,s FROM aliases');
 ts_rewrite
------------
 'b' & 'c'

Note that when multiple rewrite rules are applied in this way, the order of application can be important; so in practice you will want the source query (the select argument) to ORDER BY some ordering key.
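One way to make the rule order explicit is sketched below. The table ordered_aliases and its priority column are hypothetical names introduced only for this illustration; they are not part of PostgreSQL.

```sql
-- Hypothetical rule table with an explicit ordering key.
CREATE TABLE ordered_aliases (
    priority integer PRIMARY KEY,
    t        tsquery,
    s        tsquery
);

INSERT INTO ordered_aliases VALUES
    (1, 'a', 'c'),
    (2, 'c', 'd');

-- Rules are applied in priority order: a is rewritten to c first,
-- and the second rule then sees that c and rewrites it to d.
SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t, s FROM ordered_aliases ORDER BY priority');
```

With the opposite order, the c-to-d rule would run before any c exists in the query, so the c produced by the first rule would be expected to remain. This is why a deterministic ORDER BY matters.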

Let's consider a real-life astronomical example. We'll expand query supernovae using table-driven rewriting rules:

CREATE TABLE aliases (t tsquery primary key, s tsquery);
INSERT INTO aliases VALUES(to_tsquery('supernovae'), to_tsquery('supernovae|sn'));

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
           ts_rewrite
---------------------------------
 'crab' & ( 'supernova' | 'sn' )

We can change the rewriting rules just by updating the table:

UPDATE aliases
SET s = to_tsquery('supernovae|sn & !nebulae')
WHERE t = to_tsquery('supernovae');

SELECT ts_rewrite(to_tsquery('supernovae & crab'), 'SELECT * FROM aliases');
                 ts_rewrite
---------------------------------------------
 'crab' & ( 'supernova' | 'sn' & !'nebula' )

Rewriting can be slow when there are many rewriting rules, since it checks every rule for a possible match. To filter out obvious non-candidate rules we can use the containment operators for the tsquery type. In the example below, we select only those rules which might match the original query:

SELECT ts_rewrite('a & b'::tsquery,
                  'SELECT t,s FROM aliases WHERE ''a & b''::tsquery @> t');
 ts_rewrite
------------
 'b' & 'c'
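The filter above relies on the @> containment operator. A minimal sketch of its behavior on literal queries:

```sql
-- The left query contains the right one when every lexeme of the
-- right query also appears in the left query.
SELECT 'cat & rat'::tsquery @> 'cat'::tsquery AS contains,
       'cat'::tsquery @> 'cat & rat'::tsquery AS reversed;
-- contains = t, reversed = f
```

Because containment looks only at the lexemes and not at how they are combined, the filter is best treated as a coarse pre-selection: it discards rules that cannot match, and ts_rewrite still decides whether each remaining rule applies.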

12.4.3. Triggers for Automatic Updates

Note

The method described in this section has been obsoleted by the use of stored generated columns, as described in Section 12.2.2.
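For comparison, a stored generated column might look like the sketch below. The table name messages_generated is hypothetical, chosen so that it does not clash with the messages table used in the trigger examples of this section.

```sql
-- The tsv column is recomputed automatically whenever title or body changes.
CREATE TABLE messages_generated (
    title text,
    body  text,
    tsv   tsvector GENERATED ALWAYS AS (
              to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
          ) STORED
);

INSERT INTO messages_generated (title, body)
VALUES ('title here', 'the body text is here');

SELECT title FROM messages_generated
WHERE tsv @@ to_tsquery('english', 'title & body');
```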

When using a separate column to store the tsvector representation of your documents, it is necessary to create a trigger to update the tsvector column when the document content columns change. Two built-in trigger functions are available for this, or you can write your own.

tsvector_update_trigger(tsvector_column_name, config_name, text_column_name [, ... ])
tsvector_update_trigger_column(tsvector_column_name, config_column_name, text_column_name [, ... ])

These trigger functions automatically compute a tsvector column from one or more textual columns, under the control of parameters specified in the CREATE TRIGGER command. An example of their use is:

CREATE TABLE messages (
    title       text,
    body        text,
    tsv         tsvector
);

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON messages FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger(tsv, 'pg_catalog.english', title, body);

INSERT INTO messages VALUES('title here', 'the body text is here');

SELECT * FROM messages;
   title    |         body          |            tsv
------------+-----------------------+----------------------------
 title here | the body text is here | 'bodi':4 'text':5 'titl':1

SELECT title, body FROM messages WHERE tsv @@ to_tsquery('title & body');
   title    |         body
------------+-----------------------
 title here | the body text is here

Having created this trigger, any change in title or body will automatically be reflected into tsv, without the application having to worry about it.

The first trigger argument must be the name of the tsvector column to be updated. The second argument specifies the text search configuration to be used to perform the conversion. For tsvector_update_trigger, the configuration name is simply given as the second trigger argument. It must be schema-qualified as shown above, so that the trigger behavior will not change with changes in search_path. For tsvector_update_trigger_column, the second trigger argument is the name of another table column, which must be of type regconfig. This allows a per-row selection of configuration to be made. The remaining argument(s) are the names of textual columns (of type text, varchar, or char). These will be included in the document in the order given. NULL values will be skipped (but the other columns will still be indexed).
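A sketch of the per-row variant follows. The table messages_ml, the trigger name tsvectorupdate_ml, and the column name config are hypothetical; only tsvector_update_trigger_column itself comes from the text above.

```sql
-- Each row names the configuration used to parse its own text.
CREATE TABLE messages_ml (
    title  text,
    body   text,
    config regconfig NOT NULL,
    tsv    tsvector
);

CREATE TRIGGER tsvectorupdate_ml BEFORE INSERT OR UPDATE
ON messages_ml FOR EACH ROW EXECUTE FUNCTION
tsvector_update_trigger_column(tsv, config, title, body);

-- The two rows are parsed with different configurations.
INSERT INTO messages_ml (title, body, config) VALUES
    ('title here', 'the body text is here', 'pg_catalog.english'),
    ('title here', 'the body text is here', 'pg_catalog.simple');

SELECT config, tsv FROM messages_ml;
```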

A limitation of these built-in triggers is that they treat all the input columns alike. To process columns differently (for example, to weight title differently from body), it is necessary to write a custom trigger. Here is an example using PL/pgSQL as the trigger language:

CREATE FUNCTION messages_trigger() RETURNS trigger AS $$
begin
  new.tsv :=
     setweight(to_tsvector('pg_catalog.english', coalesce(new.title,'')), 'A') ||
     setweight(to_tsvector('pg_catalog.english', coalesce(new.body,'')), 'D');
  return new;
end
$$ LANGUAGE plpgsql;

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
    ON messages FOR EACH ROW EXECUTE FUNCTION messages_trigger();

Keep in mind that it is important to specify the configuration name explicitly when creating tsvector values inside triggers, so that the column's contents will not be affected by changes to default_text_search_config. Failure to do this is likely to lead to problems such as search results changing after a dump and restore.
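Once the custom trigger is in place, the weights it assigns can influence ranking. The query below is a sketch that assumes the messages table and messages_trigger function defined above; the search terms are arbitrary, and the configuration is named explicitly on the query side as well.

```sql
-- Rank matching rows; title lexemes carry weight A and so
-- contribute more to ts_rank than body lexemes with weight D.
SELECT title,
       ts_rank(tsv, to_tsquery('pg_catalog.english', 'title | body')) AS rank
FROM messages
WHERE tsv @@ to_tsquery('pg_catalog.english', 'title | body')
ORDER BY rank DESC;
```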

12.4.4. Gathering Document Statistics

The function ts_stat is useful for checking your configuration and for finding stop-word candidates.

ts_stat(sqlquery text, [ weights text, ]
        OUT word text, OUT ndoc integer,
        OUT nentry integer) returns setof record

sqlquery is a text value containing an SQL query which must return a single tsvector column. ts_stat executes the query and returns statistics about each distinct lexeme (word) contained in the tsvector data. The columns returned are:

word text: the value of a lexeme
ndoc integer: number of documents (tsvectors) the word occurred in
nentry integer: total number of occurrences of the word

If weights is supplied, only occurrences having one of those weights are counted.

For example, to find the ten most frequent words in a document collection:

SELECT * FROM ts_stat('SELECT vector FROM apod')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;

The same, but counting only word occurrences with weight A or B:

SELECT * FROM ts_stat('SELECT vector FROM apod', 'ab')
ORDER BY nentry DESC, ndoc DESC, word
LIMIT 10;
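To look for stop-word candidates, one possible approach is to list lexemes that occur in a large share of the documents. The sketch below assumes the messages table with its tsv column from Section 12.4.3; the 50% threshold is an arbitrary assumption to be tuned per collection.

```sql
-- Lexemes present in more than half of all documents.
SELECT word, ndoc, nentry
FROM ts_stat('SELECT tsv FROM messages')
WHERE ndoc > 0.5 * (SELECT count(*) FROM messages)
ORDER BY ndoc DESC, word;
```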
