8.11. Text Search Types

PostgreSQL 17.6 Documentation


PostgreSQL provides two data types that are designed to support full text search, which is the activity of searching through a collection of natural-language documents to locate those that best match a query. The `tsvector` type represents a document in a form optimized for text search; the `tsquery` type similarly represents a text query. Chapter 12 provides a detailed explanation of this facility, and Section 9.13 summarizes the related functions and operators.

8.11.1. `tsvector`

A `tsvector` value is a sorted list of distinct lexemes, which are words that have been normalized to merge different variants of the same word (see Chapter 12 for details). Sorting and duplicate elimination are done automatically during input, as shown in this example:

SELECT 'a fat cat sat on a mat and ate a fat rat'::tsvector;
                      tsvector
----------------------------------------------------
 'a' 'and' 'ate' 'cat' 'fat' 'mat' 'on' 'rat' 'sat'

To represent lexemes containing whitespace or punctuation, surround them with quotes:

SELECT $$the lexeme '    ' contains spaces$$::tsvector;
                 tsvector
-------------------------------------------
 '    ' 'contains' 'lexeme' 'spaces' 'the'

(We use dollar-quoted string literals in this example and the next one to avoid the confusion of having to double quote marks within the literals.) Embedded quotes and backslashes must be doubled:

SELECT $$the lexeme 'Joe''s' contains a quote$$::tsvector;
                    tsvector
------------------------------------------------
 'Joe''s' 'a' 'contains' 'lexeme' 'quote' 'the'

Optionally, integer positions can be attached to lexemes:

SELECT 'a:1 fat:2 cat:3 sat:4 on:5 a:6 mat:7 and:8 ate:9 a:10 fat:11 rat:12'::tsvector;
                                  tsvector
-------------------------------------------------------------------------------
 'a':1,6,10 'and':8 'ate':9 'cat':3 'fat':2,11 'mat':7 'on':5 'rat':12 'sat':4

A position normally indicates the source word's location in the document. Positional information can be used for proximity ranking. Position values can range from 1 to 16383; larger numbers are silently set to 16383. Duplicate positions for the same lexeme are discarded.
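The two position rules can be tried with a small input. This is only a sketch; the lexemes and position values are made up for illustration:

```sql
-- 'fat' repeats position 2; 'cat' uses a position above the 16383 limit.
SELECT 'fat:2 fat:2 cat:20000'::tsvector;
```

Given the rules stated above, the result should list `'cat'` at position 16383 and `'fat'` only once at position 2.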

Lexemes that have positions can further be labeled with a weight, which can be `A`, `B`, `C`, or `D`. `D` is the default and hence is not shown on output:

SELECT 'a:1A fat:2B,4C cat:5D'::tsvector;
          tsvector
----------------------------
 'a':1A 'cat':5 'fat':2B,4C

Weights are typically used to reflect document structure, for example by marking title words differently from body words. Text search ranking functions can assign different priorities to the different weight markers.
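As a rough sketch of how a weight can influence ranking, the following compares the same lexeme labeled `A` and left at the default `D`. It assumes the `ts_rank` function with its default weight settings; the column aliases `title_rank` and `body_rank` are hypothetical names chosen for this illustration:

```sql
-- Same lexeme and position; only the weight label differs.
SELECT ts_rank('fat:1A'::tsvector, 'fat'::tsquery) AS title_rank,
       ts_rank('fat:1'::tsvector,  'fat'::tsquery) AS body_rank;
```

With the default weights, `title_rank` would be expected to come out higher than `body_rank`. The exact values depend on the ranking function and the weights it is given.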

It is important to understand that the `tsvector` type itself does not perform any word normalization; it assumes the words it is given are normalized appropriately for the application. For example:

SELECT 'The Fat Rats'::tsvector;
      tsvector
--------------------
 'Fat' 'Rats' 'The'

For most English-text-searching applications the above words would be considered non-normalized, but `tsvector` doesn't care. Raw document text should usually be passed through `to_tsvector` to normalize the words appropriately for searching:

SELECT to_tsvector('english', 'The Fat Rats');
   to_tsvector
-----------------
 'fat':2 'rat':3
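The practical consequence shows up when both forms are matched against the same query. The following is a minimal sketch, assuming the `english` configuration used in the example above; the column aliases are made up for illustration:

```sql
-- A raw cast keeps 'Rats' verbatim, so the lexeme 'rat' is not found.
-- to_tsvector stems 'Rats' to 'rat', so the same query matches.
SELECT 'The Fat Rats'::tsvector @@ 'rat'::tsquery
           AS raw_cast_matches,      -- expected: f
       to_tsvector('english', 'The Fat Rats') @@ 'rat'::tsquery
           AS normalized_matches;    -- expected: t
```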

Again, see Chapter 12 for more detail.

8.11.2. `tsquery`

A `tsquery` value stores lexemes that are to be searched for, and can combine them using the Boolean operators `&` (AND), `|` (OR), and `!` (NOT), as well as the phrase search operator `<->` (FOLLOWED BY). There is also a variant `<N>` of the FOLLOWED BY operator, where N is an integer constant that specifies the distance between the two lexemes being searched for. `<->` is equivalent to `<1>`.
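The distance semantics of FOLLOWED BY can be sketched against a `tsvector` with explicit positions. The lexemes, positions, and column aliases below are made up for illustration:

```sql
-- Positions: fat=1, cat=2, rat=3.
SELECT 'fat:1 cat:2 rat:3'::tsvector @@ 'fat <-> cat'::tsquery
           AS adjacent,        -- expected: t
       'fat:1 cat:2 rat:3'::tsvector @@ 'fat <2> rat'::tsquery
           AS two_apart,       -- expected: t
       'fat:1 cat:2 rat:3'::tsvector @@ 'fat <-> rat'::tsquery
           AS not_adjacent;    -- expected: f
```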

Parentheses can be used to enforce grouping of these operators. In the absence of parentheses, `!` (NOT) binds most tightly, `<->` (FOLLOWED BY) next most tightly, then `&` (AND), with `|` (OR) binding the least tightly.
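A quick way to see the effect of precedence is to match one document against the same operators grouped differently. This is a sketch with made-up lexemes and column aliases:

```sql
-- Without parentheses, & binds tighter than |, so the first query
-- is read as fat | (rat & dog).
SELECT 'fat cat'::tsvector @@ 'fat | rat & dog'::tsquery
           AS unparenthesized,   -- expected: t
       'fat cat'::tsvector @@ 'fat | (rat & dog)'::tsquery
           AS and_first,         -- expected: t
       'fat cat'::tsvector @@ '(fat | rat) & dog'::tsquery
           AS or_first;          -- expected: f
```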

Here are some examples:

SELECT 'fat & rat'::tsquery;
    tsquery
---------------
 'fat' & 'rat'

SELECT 'fat & (rat | cat)'::tsquery;
          tsquery
---------------------------
 'fat' & ( 'rat' | 'cat' )

SELECT 'fat & rat & ! cat'::tsquery;
        tsquery
------------------------
 'fat' & 'rat' & !'cat'

Optionally, lexemes in a `tsquery` can be labeled with one or more weight letters, which restricts them to match only `tsvector` lexemes with one of those weights:

SELECT 'fat:ab & cat'::tsquery;
    tsquery
------------------
 'fat':AB & 'cat'
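To illustrate the restriction, the sketch below matches a weighted `tsvector` against queries with different weight labels. The input values and column aliases are made up for illustration:

```sql
-- 'fat' carries weight A at position 1; 'cat' has the default weight D.
SELECT 'fat:1A cat:2'::tsvector @@ 'fat:A'::tsquery
           AS a_matches,    -- expected: t
       'fat:1A cat:2'::tsvector @@ 'fat:B'::tsquery
           AS b_matches,    -- expected: f
       'fat:1A cat:2'::tsvector @@ 'fat:AB'::tsquery
           AS ab_matches;   -- expected: t
```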

Also, lexemes in a `tsquery` can be labeled with `*` to specify prefix matching:

SELECT 'super:*'::tsquery;
  tsquery
-----------
 'super':*

This query will match any word in a `tsvector` that begins with "super".
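For instance, a longer word that starts with the prefix matches, while a shorter word that is itself only a prefix of "super" does not. This is a sketch with made-up lexemes and column aliases:

```sql
SELECT 'supernova:1 star:2'::tsvector @@ 'super:*'::tsquery
           AS longer_word,    -- expected: t
       'sup:1 star:2'::tsvector @@ 'super:*'::tsquery
           AS shorter_word;   -- expected: f
```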

Quoting rules for lexemes are the same as described previously for lexemes in `tsvector`; and, as with `tsvector`, any required normalization of words must be done before converting to the `tsquery` type. The `to_tsquery` function is convenient for performing such normalization:

SELECT to_tsquery('Fat:ab & Cats');
    to_tsquery
------------------
 'fat':AB & 'cat'

Note that `to_tsquery` will process prefixes in the same way as other words, which means this comparison returns true:

SELECT to_tsvector( 'postgraduate' ) @@ to_tsquery( 'postgres:*' );
 ?column?
----------
 t

because `postgres` gets stemmed to `postgr`:

SELECT to_tsvector( 'postgraduate' ), to_tsquery( 'postgres:*' );
  to_tsvector  | to_tsquery
---------------+------------
 'postgradu':1 | 'postgr':*

which will match the stemmed form of `postgraduate`.
