PostgreSQL 18.0 Documentation

12.3. Controlling Text Search

To implement full text searching there must be a function to create a tsvector from a document and a tsquery from a user query. Results also need to be returned in a useful order, so there must be a function that compares documents with respect to their relevance to the query. It is also important to be able to display the results nicely. PostgreSQL provides support for all of these functions.

12.3.1. Parsing Documents

PostgreSQL provides the function to_tsvector for converting a document to the tsvector data type.

to_tsvector([ config regconfig, ] document text) returns tsvector

to_tsvector parses a textual document into tokens, reduces the tokens to lexemes, and returns a tsvector which lists the lexemes together with their positions in the document. The document is processed according to the specified or default text search configuration. Here is a simple example:

SELECT to_tsvector('english', 'a fat  cat sat on a mat - it ate a fat rats');
                  to_tsvector
-----------------------------------------------------
 'ate':9 'cat':3 'fat':2,11 'mat':7 'rat':12 'sat':4

In the example above we see that the resulting tsvector does not contain the words "a", "on", or "it", the word "rats" became "rat", and the punctuation sign "-" was ignored.

The to_tsvector function internally calls a parser which breaks the document text into tokens and assigns a type to each token. For each token, a list of dictionaries (Section 12.6) is consulted, where the list can vary depending on the token type. The first dictionary that recognizes the token emits one or more normalized lexemes to represent the token. For example, "rats" became "rat" because one of the dictionaries recognized that the word "rats" is a plural form of "rat". Some words are recognized as stop words (Section 12.6.1), which causes them to be ignored since they occur too frequently to be useful in searching. In our example these are "a", "on", and "it". If no dictionary in the list recognizes the token then it is also ignored. In this example that happened to the punctuation sign "-" because there are in fact no dictionaries assigned for its token type (Space symbols), meaning space tokens will never be indexed. The choices of parser, dictionaries, and which types of tokens to index are determined by the selected text search configuration (Section 12.7). It is possible to have many different configurations in the same database, and predefined configurations are available for various languages. In our example we used the default configuration english for the English language.
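The token-by-token processing described above can be inspected directly. The following is a minimal sketch that uses the ts_debug function (documented elsewhere in this chapter, not in this section) to list each token, its type alias, the dictionaries consulted, and the resulting lexemes. The sample sentence is only an illustration.

```sql
-- One row per token produced by the parser for the english configuration.
-- alias        : token type assigned by the parser (for example asciiword, blank)
-- dictionaries : dictionaries configured for that token type
-- lexemes      : normalized output of the dictionary that recognized the token
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('english', 'a fat cat - it ate rats');
```

In the result, a stop word such as "a" should appear with an empty lexeme array, while the space and punctuation tokens should show no dictionaries at all, which is why they never reach the tsvector.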

The function setweight can be used to label the entries of a tsvector with a given weight, where a weight is one of the letters A, B, C, or D. This is typically used to mark entries coming from different parts of a document, such as title versus body. Later, this information can be used for ranking of search results.

Because to_tsvector(NULL) will return NULL, it is recommended to use coalesce whenever a field might be null. Here is the recommended method for creating a tsvector from a structured document:

UPDATE tt SET ti =
    setweight(to_tsvector(coalesce(title,'')), 'A')    ||
    setweight(to_tsvector(coalesce(keyword,'')), 'B')  ||
    setweight(to_tsvector(coalesce(abstract,'')), 'C') ||
    setweight(to_tsvector(coalesce(body,'')), 'D');

Here we have used setweight to label the source of each lexeme in the finished tsvector, and then merged the labeled tsvector values using the tsvector concatenation operator ||. (Section 12.4.1 gives details about these operations.)
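To see why coalesce matters in the statement above, the following sketch shows how a single NULL field would otherwise turn the whole concatenated value into NULL. The literal strings are illustrative only, and the explicit NULL::text casts are there just to keep the example unambiguous.

```sql
SELECT
  -- to_tsvector returns NULL for NULL input
  to_tsvector('english', NULL::text) IS NULL                        AS null_in_null_out,
  -- concatenating with a NULL tsvector yields NULL, losing the other parts
  (to_tsvector('english', 'fat cat')
     || to_tsvector('english', NULL::text)) IS NULL                 AS concat_is_null,
  -- with coalesce, the missing field simply contributes nothing
  to_tsvector('english', 'fat cat')
     || to_tsvector('english', coalesce(NULL::text, ''))            AS with_coalesce;
```

The first two columns should both be true, while the third keeps the lexemes of the non-null part.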

12.3.2. Parsing Queries

PostgreSQL provides the functions to_tsquery, plainto_tsquery, phraseto_tsquery and websearch_to_tsquery for converting a query to the tsquery data type. to_tsquery offers access to more features than either plainto_tsquery or phraseto_tsquery, but it is less forgiving about its input. websearch_to_tsquery is a simplified version of to_tsquery with an alternative syntax, similar to the one used by web search engines.

to_tsquery([ config regconfig, ] querytext text) returns tsquery

to_tsquery creates a tsquery value from querytext, which must consist of single tokens separated by the tsquery operators & (AND), | (OR), ! (NOT), and <-> (FOLLOWED BY), possibly grouped using parentheses. In other words, the input to to_tsquery must already follow the general rules for tsquery input, as described in Section 8.11.2. The difference is that while basic tsquery input takes the tokens at face value, to_tsquery normalizes each token into a lexeme using the specified or default configuration, and discards any tokens that are stop words according to the configuration. For example:

SELECT to_tsquery('english', 'The & Fat & Rats');
  to_tsquery
---------------
 'fat' & 'rat'
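For contrast, the following sketch feeds the same text to a plain cast to tsquery, which takes the tokens at face value, and to to_tsquery, which normalizes them. The column aliases are arbitrary.

```sql
SELECT 'The & Fat & Rats'::tsquery               AS cast_only,   -- tokens kept as written
       to_tsquery('english', 'The & Fat & Rats') AS normalized;  -- stemmed, stop word removed
```

The cast should keep 'The', 'Fat' and 'Rats' unchanged, so it would not match a tsvector built with to_tsvector, which stores the lexemes 'fat' and 'rat'.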

As in basic tsquery input, weight(s) can be attached to each lexeme to restrict it to match only tsvector lexemes of those weight(s). For example:

SELECT to_tsquery('english', 'Fat | Rats:AB');
    to_tsquery
------------------
 'fat' | 'rat':AB

Also, * can be attached to a lexeme to specify prefix matching:

SELECT to_tsquery('supern:*A & star:A*B');
        to_tsquery
--------------------------
 'supern':*A & 'star':*AB

Such a lexeme will match any word in a tsvector that begins with the given string.

to_tsquery can also accept single-quoted phrases. This is primarily useful when the configuration includes a thesaurus dictionary that may trigger on such phrases. In the example below, a thesaurus contains the rule "supernovae stars : sn":

SELECT to_tsquery('''supernovae stars'' & !crab');
  to_tsquery
---------------
 'sn' & !'crab'

Without quotes, to_tsquery will generate a syntax error for tokens that are not separated by an AND, OR, or FOLLOWED BY operator.
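As a small illustration of that rule, the commented-out statement below is expected to fail with a syntax error, while the two variants with an explicit operator are accepted. The input strings are arbitrary examples.

```sql
-- Two bare tokens with no operator between them: rejected with a syntax error.
-- SELECT to_tsquery('english', 'fat rats');

-- The same tokens joined by an explicit operator are accepted.
SELECT to_tsquery('english', 'fat & rats')   AS explicit_and,
       to_tsquery('english', 'fat <-> rats') AS followed_by;
```

When the input comes from users who cannot be expected to type operators, plainto_tsquery, phraseto_tsquery, or websearch_to_tsquery, described next, are usually the more forgiving choice.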

plainto_tsquery([ config regconfig, ] querytext text) returns tsquery

plainto_tsquery transforms the unformatted text querytext to a tsquery value. The text is parsed and normalized much as for to_tsvector, then the & (AND) tsquery operator is inserted between surviving words.

Example:

SELECT plainto_tsquery('english', 'The Fat Rats');
 plainto_tsquery
-----------------
 'fat' & 'rat'

Note that plainto_tsquery will not recognize tsquery operators, weight labels, or prefix-match labels in its input:

SELECT plainto_tsquery('english', 'The Fat & Rats:C');
   plainto_tsquery
---------------------
 'fat' & 'rat' & 'c'

Here, all the input punctuation was discarded.

phraseto_tsquery([ config regconfig, ] querytext text) returns tsquery

phraseto_tsquery behaves much like plainto_tsquery, except that it inserts the <-> (FOLLOWED BY) operator between surviving words instead of the & (AND) operator. Also, stop words are not simply discarded, but are accounted for by inserting <N> operators rather than <-> operators. This function is useful when searching for exact lexeme sequences, since the FOLLOWED BY operators check lexeme order, not just the presence of all the lexemes.

Example:

SELECT phraseto_tsquery('english', 'The Fat Rats');
 phraseto_tsquery
------------------
 'fat' <-> 'rat'
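The example above drops a leading stop word, so no <N> operator appears. The following sketch places a stop word between two surviving words to show the distance operator, and then checks the resulting query against two sample documents. The sample sentences are illustrative only.

```sql
-- "the" between "ate" and "rats" is a stop word, so the gap is recorded as <2>.
SELECT phraseto_tsquery('english', 'cats ate the rats') AS q;
-- 'cat' <-> 'ate' <2> 'rat'

SELECT to_tsvector('english', 'the cats ate the rats')
         @@ phraseto_tsquery('english', 'cats ate the rats') AS same_spacing,      -- true
       to_tsvector('english', 'cats ate rats')
         @@ phraseto_tsquery('english', 'cats ate the rats') AS different_spacing; -- false
```

The second document contains all three lexemes in the right order, but "rats" directly follows "ate", so the <2> distance is not satisfied.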

Like plainto_tsquery, the phraseto_tsquery function will not recognize tsquery operators, weight labels, or prefix-match labels in its input:

SELECT phraseto_tsquery('english', 'The Fat & Rats:C');
      phraseto_tsquery
-----------------------------
 'fat' <-> 'rat' <-> 'c'

websearch_to_tsquery([ config regconfig, ] querytext text) returns tsquery

websearch_to_tsquery creates a tsquery value from querytext using an alternative syntax in which simple unformatted text is a valid query. Unlike plainto_tsquery and phraseto_tsquery, it also recognizes certain operators. Moreover, this function will never raise syntax errors, which makes it possible to use raw user-supplied input for search. The following syntax is supported:

unquoted text: text not inside quote marks will be converted to terms separated by & operators, as if processed by plainto_tsquery.
"quoted text": text inside quote marks will be converted to terms separated by <-> operators, as if processed by phraseto_tsquery.
OR: the word "or" will be converted to the | operator.
-: a dash will be converted to the ! operator.

Other punctuation is ignored. So like plainto_tsquery and phraseto_tsquery, the websearch_to_tsquery function will not recognize tsquery operators, weight labels, or prefix-match labels in its input.

Examples:

SELECT websearch_to_tsquery('english', 'The fat rats');
 websearch_to_tsquery
----------------------
 'fat' & 'rat'
(1 row)

SELECT websearch_to_tsquery('english', '"supernovae stars" -crab');
       websearch_to_tsquery
----------------------------------
 'supernova' <-> 'star' & !'crab'
(1 row)

SELECT websearch_to_tsquery('english', '"sad cat" or "fat rat"');
       websearch_to_tsquery
-----------------------------------
 'sad' <-> 'cat' | 'fat' <-> 'rat'
(1 row)

SELECT websearch_to_tsquery('english', 'signal -"segmentation fault"');
         websearch_to_tsquery
---------------------------------------
 'signal' & !( 'segment' <-> 'fault' )
(1 row)

SELECT websearch_to_tsquery('english', '""" )( dummy \\ query <->');
 websearch_to_tsquery
----------------------
 'dummi' & 'queri'
(1 row)
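The examples above only show the generated tsquery values. As a minimal sketch of how such a query might be used in a search, the following statement filters a small inline set of documents. The docs relation, its columns, and the search string are hypothetical, and in an application the search string would normally be passed as a bound parameter rather than a literal.

```sql
WITH docs(id, body) AS (
  VALUES (1, 'The fat rats ate the crab'),
         (2, 'A sad cat sat on a mat'),
         (3, 'Fat cats chase rats and dogs')
)
SELECT id, body
FROM docs
-- phrase "fat rats" must occur, and "dog" must not
WHERE to_tsvector('english', body)
      @@ websearch_to_tsquery('english', '"fat rats" -dog');
```

Only the first document should be returned: the second lacks the phrase, and the third has "fat" and "rats" but not adjacent to each other, and it also mentions dogs.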

12.3.3. Ranking Search Results

Ranking attempts to measure how relevant documents are to a particular query, so that when there are many matches the most relevant ones can be shown first. PostgreSQL provides two predefined ranking functions, which take into account lexical, proximity, and structural information; that is, they consider how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is. However, the concept of relevancy is vague and very application-specific. Different applications might require additional information for ranking, e.g., document modification time. The built-in ranking functions are only examples. You can write your own ranking functions and/or combine their results with additional factors to fit your specific needs.

The two ranking functions currently available are:

ts_rank([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

Ranks vectors based on the frequency of their matching lexemes.

ts_rank_cd([ weights float4[], ] vector tsvector, query tsquery [, normalization integer ]) returns float4

This function computes the cover density ranking for the given document vector and query, as described in Clarke, Cormack, and Tudhope's "Relevance Ranking for One to Three Term Queries" in the journal "Information Processing and Management", 1999. Cover density is similar to ts_rank ranking except that the proximity of matching lexemes to each other is taken into consideration.

This function requires lexeme positional information to perform its calculation. Therefore, it ignores any "stripped" lexemes in the tsvector. If there are no unstripped lexemes in the input, the result will be zero. (See Section 12.4.1 for more information about the strip function and positional information in tsvectors.)

For both these functions, the optional weights argument offers the ability to weigh word instances more or less heavily depending on how they are labeled. The weight arrays specify how heavily to weigh each category of word, in the order:

{D-weight, C-weight, B-weight, A-weight}

If no weights are provided, then these defaults are used:

{0.1, 0.2, 0.4, 1.0}

Typically weights are used to mark words from special areas of the document, like the title or an initial abstract, so they can be treated with more or less importance than words in the document body.
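The effect of the weights array can be sketched as follows. The document is built inline from a hypothetical title (weight A) and body (weight D), and the third weights array is an arbitrary choice that favors D-labeled words. Note the array order {D, C, B, A} given above.

```sql
WITH doc AS (
  SELECT setweight(to_tsvector('english', 'Dark matter'), 'A') ||
         setweight(to_tsvector('english', 'A survey of galaxies and matter'), 'D') AS vec,
         to_tsquery('english', 'matter') AS q
)
SELECT ts_rank(vec, q)                                    AS default_weights,
       -- spelling out the documented defaults should give the same value
       ts_rank('{0.1, 0.2, 0.4, 1.0}'::float4[], vec, q) AS explicit_defaults,
       -- hypothetical weighting that treats body (D) words as most important
       ts_rank('{1.0, 0.2, 0.4, 0.1}'::float4[], vec, q) AS body_favored
FROM doc;
```

The first two columns should be identical, while the third changes because the same occurrences of "matter" are now weighed differently.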

Since a longer document has a greater chance of containing a query term, it is reasonable to take into account document size, e.g., a hundred-word document with five instances of a search word is probably more relevant than a thousand-word document with five instances. Both ranking functions take an integer normalization option that specifies whether and how a document's length should impact its rank. The integer option controls several behaviors, so it is a bit mask: you can specify one or more behaviors using | (for example, 2|4).

0 (the default) ignores the document length
1 divides the rank by 1 + the logarithm of the document length
2 divides the rank by the document length
4 divides the rank by the mean harmonic distance between extents (this is implemented only by ts_rank_cd)
8 divides the rank by the number of unique words in document
16 divides the rank by 1 + the logarithm of the number of unique words in document
32 divides the rank by itself + 1

If more than one flag bit is specified, the transformations are applied in the order listed.

It is important to note that the ranking functions do not use any global information, so it is impossible to produce a fair normalization to 1% or 100% as sometimes desired. Normalization option 32 (rank/(rank+1)) can be applied to scale all ranks into the range zero to one, but of course this is just a cosmetic change; it will not affect the ordering of the search results.
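The normalization option can be explored with a short sketch. The first statement only checks the arithmetic of option 32 and of combining flags with |; the second applies several option values to an inline sample document, which is purely illustrative.

```sql
-- Option 32 maps a rank r to r / (r + 1); flags are combined with bitwise OR.
SELECT round(3.1 / (3.1 + 1), 4) AS option_32_applied_to_3_1,  -- 0.7561
       2|4                       AS length_and_distance_flags; -- 6

WITH d AS (
  SELECT to_tsvector('english', 'dark matter and more dark matter') AS vec,
         to_tsquery('english', 'dark & matter')                     AS q
)
SELECT ts_rank_cd(vec, q, 0)   AS unnormalized,
       ts_rank_cd(vec, q, 2|4) AS by_length_and_distance,
       ts_rank_cd(vec, q, 32)  AS scaled_zero_to_one
FROM d;
```

Because r / (r + 1) grows monotonically with r, option 32 on its own leaves the ordering of results unchanged, as the text above points out.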

Here is an example that selects only the ten highest-ranked matches:

SELECT title, ts_rank_cd(textsearch, query) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |   rank
-----------------------------------------------+----------
 Neutrinos in the Sun                          |      3.1
 The Sudbury Neutrino Detector                 |      2.4
 A MACHO View of Galactic Dark Matter          |  2.01317
 Hot Gas and Dark Matter                       |  1.91171
 The Virgo Cluster: Hot Plasma and Dark Matter |  1.90953
 Rafting for Solar Neutrinos                   |      1.9
 NGC 4650A: Strange Galaxy and Dark Matter     |  1.85774
 Hot Gas and Dark Matter                       |   1.6123
 Ice Fishing for Cosmic Neutrinos              |      1.6
 Weak Lensing Distorts the Universe            | 0.818218

This is the same example using normalized ranking:

SELECT title, ts_rank_cd(textsearch, query, 32 /* rank/(rank+1) */ ) AS rank
FROM apod, to_tsquery('neutrino|(dark & matter)') query
WHERE  query @@ textsearch
ORDER BY rank DESC
LIMIT 10;
                     title                     |        rank
-----------------------------------------------+-------------------
 Neutrinos in the Sun                          | 0.756097569485493
 The Sudbury Neutrino Detector                 | 0.705882361190954
 A MACHO View of Galactic Dark Matter          | 0.668123210574724
 Hot Gas and Dark Matter                       |  0.65655958650282
 The Virgo Cluster: Hot Plasma and Dark Matter | 0.656301290640973
 Rafting for Solar Neutrinos                   | 0.655172410958162
 NGC 4650A: Strange Galaxy and Dark Matter     | 0.650072921219637
 Hot Gas and Dark Matter                       | 0.617195790024749
 Ice Fishing for Cosmic Neutrinos              | 0.615384618911517
 Weak Lensing Distorts the Universe            | 0.450010798361481
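As noted at the start of this section, the built-in rank can be combined with application-specific factors such as document modification time. The following is only a sketch of that idea: the docs relation, its columns, the fixed reference date, and the recency formula are all hypothetical choices, not something PostgreSQL prescribes.

```sql
WITH docs(title, body, modified) AS (
  VALUES ('Old neutrino notes', 'neutrino detectors and neutrino flux', DATE '2001-01-01'),
         ('New neutrino notes', 'neutrino detectors and neutrino flux', DATE '2024-01-01')
)
SELECT title,
       ts_rank(to_tsvector('english', body), query) AS text_rank,
       -- hypothetical recency factor: 1 / (1 + age in years), measured
       -- against a fixed reference date so the result is reproducible
       ts_rank(to_tsvector('english', body), query)
         * (1.0 / (1 + (DATE '2025-01-01' - modified) / 365.0))::float8 AS combined
FROM docs, to_tsquery('english', 'neutrino') AS query
WHERE query @@ to_tsvector('english', body)
ORDER BY combined DESC;
```

Both rows have the same text rank because their bodies are identical, so the ordering here is decided by the recency factor alone and the newer document comes first.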

Ranking can be expensive since it requires consulting the tsvector of each matching document, which can be I/O bound and therefore slow. Unfortunately, it is almost impossible to avoid since practical queries often result in large numbers of matches.

12.3.4. Highlighting Results

To present search results it is ideal to show a part of each document and how it is related to the query. Usually, search engines show fragments of the document with marked search terms. PostgreSQL provides a function ts_headline that implements this functionality.

ts_headline([ config regconfig, ] document text, query tsquery [, options text ]) returns text

ts_headline accepts a document along with a query, and returns an excerpt from the document in which terms from the query are highlighted. Specifically, the function will use the query to select relevant text fragments, and then highlight all words that appear in the query, even if those word positions do not match the query's restrictions. The configuration to be used to parse the document can be specified by config; if config is omitted, the default_text_search_config configuration is used.

If an options string is specified it must consist of a comma-separated list of one or more option=value pairs. The available options are:

MaxWords, MinWords (integers): these numbers determine the longest and shortest headlines to output. The default values are 35 and 15.
ShortWord (integer): words of this length or less will be dropped at the start and end of a headline, unless they are query terms. The default value of three eliminates common English articles.
HighlightAll (boolean): if true the whole document will be used as the headline, ignoring the preceding three parameters. The default is false.
MaxFragments (integer): maximum number of text fragments to display. The default value of zero selects a non-fragment-based headline generation method. A value greater than zero selects fragment-based headline generation (see below).
StartSel, StopSel (strings): the strings with which to delimit query words appearing in the document, to distinguish them from other excerpted words. The default values are "<b>" and "</b>", which can be suitable for HTML output (but see the warning below).
FragmentDelimiter (string): when more than one fragment is displayed, the fragments will be separated by this string. The default is " ... ".

Warning: Cross-site Scripting (XSS) Safety

The output from ts_headline is not guaranteed to be safe for direct inclusion in web pages. When HighlightAll is false (the default), some simple XML tags are removed from the document, but this is not guaranteed to remove all HTML markup. Therefore, this does not provide an effective defense against attacks such as cross-site scripting (XSS) attacks, when working with untrusted input. To guard against such attacks, all HTML markup should be removed from the input document, or an HTML sanitizer should be used on the output.
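One possible way to act on this warning inside SQL is sketched below. It is not an HTML sanitizer: instead it escapes every markup character in the headline and only afterwards turns placeholder markers into the highlighting tags. The marker strings @@HL@@ and @@/HL@@ and the sample document are assumptions made for this illustration, and the escaping covers only text placed inside an element body, not attribute values. A vetted sanitizer in the application layer remains the safer choice.

```sql
WITH input(doc) AS (
  VALUES ('A <i>fat</i> cat & a <script>alert(1)</script> mouse')
),
marked AS (
  -- use markers without markup characters instead of the default <b> and </b>
  SELECT ts_headline('english', doc, to_tsquery('english', 'cat'),
                     'StartSel=@@HL@@, StopSel=@@/HL@@') AS h
  FROM input
)
SELECT replace(replace(
         -- escape & first, then < and >, so no markup from the document survives
         replace(replace(replace(h, '&', '&amp;'), '<', '&lt;'), '>', '&gt;'),
         -- only now introduce the highlighting tags
         '@@HL@@', '<b>'), '@@/HL@@', '</b>') AS escaped_headline
FROM marked;
```

With this ordering, the only tags left in the result are the highlighting tags introduced in the final step. If the document itself contained the marker strings, the worst outcome under these assumptions would be an extra pair of bold tags.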

These option names are recognized case-insensitively. You must double-quote string values if they contain spaces or commas.
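Both rules can be seen in a short sketch. The option names below are deliberately written in mixed case, and the delimiter strings contain spaces, so they are wrapped in double quotes inside the options string. The document text and the chosen delimiters are arbitrary.

```sql
SELECT ts_headline('english',
  'Search terms may occur many times in a document',
  to_tsquery('english', 'search & term'),
  -- values containing spaces must be double-quoted; names are case-insensitive
  'startsel="[[ ", stopsel=" ]]", MAXWORDS=8, minwords=3');
```

Without the double quotes, the space would end the value early and the delimiter would lose its padding.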

In non-fragment-based headline generation, ts_headline locates matches for the given query and chooses a single one to display, preferring matches that have more query words within the allowed headline length. In fragment-based headline generation, ts_headline locates the query matches and splits each match into "fragments" of no more than MaxWords words each, preferring fragments with more query words, and when possible "stretching" fragments to include surrounding words. The fragment-based mode is thus more useful when the query matches span large sections of the document, or when it's desirable to display multiple matches. In either mode, if no query matches can be identified, then a single fragment of the first MinWords words in the document will be displayed.

For example:

SELECT ts_headline('english',
  'The most common type of search
is to find all documents containing given query terms
and return them in order of their similarity to the
query.',
  to_tsquery('english', 'query & similarity'));
                        ts_headline
------------------------------------------------------------
 containing given <b>query</b> terms                       +
 and return them in order of their <b>similarity</b> to the+
 <b>query</b>.

SELECT ts_headline('english',
  'Search terms may occur
many times in a document,
requiring ranking of the search matches to decide which
occurrences to display in the result.',
  to_tsquery('english', 'search & term'),
  'MaxFragments=10, MaxWords=7, MinWords=3, StartSel=<<, StopSel=>>');
                        ts_headline
------------------------------------------------------------
 <<Search>> <<terms>> may occur                            +
 many times ... ranking of the <<search>> matches to decide

ts_headline uses the original document, not a tsvector summary, so it can be slow and should be used with care.
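A common way to limit that cost is to rank and trim the result set first and call ts_headline only for the rows that will actually be shown. The following is a minimal sketch of that pattern; the docs relation, its columns, and the query term are hypothetical.

```sql
WITH docs(id, body) AS (
  VALUES (1, 'Bright stars and distant galaxies'),
         (2, 'A recipe for bread'),
         (3, 'Stars, more stars, and a star cluster')
),
top AS (
  -- rank and limit first; no headline is generated in this step
  SELECT id, body, q,
         ts_rank_cd(to_tsvector('english', body), q) AS rank
  FROM docs, to_tsquery('english', 'star') AS q
  WHERE to_tsvector('english', body) @@ q
  ORDER BY rank DESC
  LIMIT 10
)
-- ts_headline now runs on at most ten documents
SELECT id, ts_headline('english', body, q) AS headline, rank
FROM top
ORDER BY rank DESC;
```

In a real schema the tsvector would usually come from a stored, indexed column rather than being recomputed with to_tsvector for every row as in this self-contained example.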
