PostgreSQL 17.5 Documentation
Chapter 12. Full Text Search

12.6. Dictionaries

Dictionaries are used to eliminate words that should not be considered in a search (stop words), and to normalize words so that different derived forms of the same word will match. A successfully normalized word is called a lexeme. Aside from improving search quality, normalization and removal of stop words reduce the size of the tsvector representation of a document, thereby improving performance. Normalization does not always have linguistic meaning and usually depends on application semantics.

Some examples of normalization:

- Linguistic: Ispell dictionaries try to reduce input words to a normalized form; stemmer dictionaries remove word endings.
- URL locations can be canonicalized to make equivalent URLs match:
  - http://www.pgsql.ru/db/mw/index.html
  - http://www.pgsql.ru/db/mw/
  - http://www.pgsql.ru/db/../db/mw/index.html
- Color names can be replaced by their hexadecimal values, e.g., red, green, blue, magenta -> FF0000, 00FF00, 0000FF, FF00FF.
- If indexing numbers, we can remove some fractional digits to reduce the range of possible numbers, so for example 3.14159265359, 3.1415926, and 3.14 will be the same after normalization if only two digits are kept after the decimal point.

A dictionary is a program that accepts a token as input and returns:

- an array of lexemes if the input token is known to the dictionary (notice that one token can produce more than one lexeme)
- a single lexeme with the TSL_FILTER flag set, to replace the original token with a new token to be passed to subsequent dictionaries (a dictionary that does this is called a filtering dictionary)
- an empty array if the dictionary knows the token, but it is a stop word
- NULL if the dictionary does not recognize the input token
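
Some of these outcomes can be observed directly with ts_lexize, which runs a single dictionary on a single token. The following is a minimal sketch that assumes only the built-in english_stem and simple dictionaries of a stock installation; the sample words are arbitrary.

```sql
-- Known word: the dictionary returns an array of lexemes.
SELECT ts_lexize('english_stem', 'stars');   -- {star}

-- Stop word: the dictionary knows the token and returns an empty array.
SELECT ts_lexize('english_stem', 'a');       -- {}

-- The built-in simple dictionary has no stop word file configured,
-- so it returns the lower-cased token as the lexeme.
SELECT ts_lexize('simple', 'YeS');           -- {yes}
```

The NULL outcome needs a dictionary that can decline a token; Section 12.6.2 shows one by setting Accept = false on a simple dictionary.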

PostgreSQL provides predefined dictionaries for many languages. There are also several predefined templates that can be used to create new dictionaries with custom parameters. Each predefined dictionary template is described below. If no existing template is suitable, it is possible to create new ones; see the contrib/ area of the PostgreSQL distribution for examples.

A text search configuration binds a parser together with a set of dictionaries to process the parser's output tokens. For each token type that the parser can return, a separate list of dictionaries is specified by the configuration. When a token of that type is found by the parser, each dictionary in the list is consulted in turn, until some dictionary recognizes it as a known word. If it is identified as a stop word, or if no dictionary recognizes the token, it will be discarded and not indexed or searched for. Normally, the first dictionary that returns a non-NULL output determines the result, and any remaining dictionaries are not consulted; but a filtering dictionary can replace the given word with a modified word, which is then passed to subsequent dictionaries.

The general rule for configuring a list of dictionaries is to place the narrowest, most specific dictionary first, then the more general dictionaries, finishing with a very general dictionary, like a Snowball stemmer or simple, which recognizes everything. For example, for an astronomy-specific search (astro_en configuration) one could bind token type asciiword (ASCII word) to a synonym dictionary of astronomical terms, a general English dictionary and a Snowball English stemmer:

ALTER TEXT SEARCH CONFIGURATION astro_en
    ADD MAPPING FOR asciiword WITH astrosyn, english_ispell, english_stem;

A filtering dictionary can be placed anywhere in the list, except at the end, where it would be useless. Filtering dictionaries are useful to partially normalize words to simplify the task of later dictionaries. For example, a filtering dictionary could be used to remove accents from accented letters, as is done by the unaccent module.
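
As a sketch of a filtering dictionary placed ahead of a stemmer, the following follows the pattern given in the unaccent module's documentation. It assumes the unaccent extension is available for installation; the configuration name fr is only an illustrative choice.

```sql
CREATE EXTENSION IF NOT EXISTS unaccent;

-- Copy the built-in French configuration so the original stays untouched.
CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );

-- unaccent strips accents and passes the result on (filtering dictionary);
-- french_stem then stems the unaccented token.
ALTER TEXT SEARCH CONFIGURATION fr
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;

SELECT to_tsvector('fr', 'Hôtels de la Mer');
```

The unaccent documentation reports 'hotel':1 'mer':4 for this query: the accent is removed before stemming, and the stop words are dropped by french_stem.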

12.6.1. Stop Words

Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching. For example, every English text contains words like a and the, so it is useless to store them in an index. However, stop words do affect the positions in tsvector, which in turn affect ranking:

SELECT to_tsvector('english', 'in the list of stop words');
        to_tsvector
----------------------------
 'list':3 'stop':5 'word':6

The missing positions 1, 2, and 4 are because of stop words. Ranks calculated for documents with and without stop words are quite different:

SELECT ts_rank_cd (to_tsvector('english', 'in the list of stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
       0.05

SELECT ts_rank_cd (to_tsvector('english', 'list stop words'), to_tsquery('list & stop'));
 ts_rank_cd
------------
        0.1

It is up to the specific dictionary how it treats stop words. For example, ispell dictionaries first normalize words and then look at the list of stop words, while Snowball stemmers first check the list of stop words. The reason for the different behavior is an attempt to decrease noise.
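
To see which positions the stop words occupied in the earlier example, the same text can be run through the built-in simple configuration, which in a default installation applies no stop word list and no stemming. A small sketch for comparison:

```sql
-- english: stop words are dropped, but surviving lexemes keep their positions.
SELECT to_tsvector('english', 'in the list of stop words');
-- 'list':3 'stop':5 'word':6

-- simple: no stop word list and no stemming, so every word is kept.
SELECT to_tsvector('simple', 'in the list of stop words');
-- 'in':1 'list':3 'of':4 'stop':5 'the':2 'words':6
```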

12.6.2. Simple Dictionary

The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words. If it is found in the file then an empty array is returned, causing the token to be discarded. If not, the lower-cased form of the word is returned as the normalized lexeme. Alternatively, the dictionary can be configured to report non-stop-words as unrecognized, allowing them to be passed on to the next dictionary in the list.

Here is an example of a dictionary definition using the simple template:

CREATE TEXT SEARCH DICTIONARY public.simple_dict (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = english
);

Here, english is the base name of a file of stop words. The file's full name will be $SHAREDIR/tsearch_data/english.stop, where $SHAREDIR means the PostgreSQL installation's shared-data directory, often /usr/local/share/postgresql (use pg_config --sharedir to determine it if you're not sure). The file format is simply a list of words, one per line. Blank lines and trailing spaces are ignored, and upper case is folded to lower case, but no other processing is done on the file contents.

Now we can test our dictionary:

SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------
 {yes}

SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}

We can also choose to return NULL, instead of the lower-cased word, if it is not found in the stop words file. This behavior is selected by setting the dictionary's Accept parameter to false. Continuing the example:

ALTER TEXT SEARCH DICTIONARY public.simple_dict ( Accept = false );

SELECT ts_lexize('public.simple_dict', 'YeS');
 ts_lexize
-----------


SELECT ts_lexize('public.simple_dict', 'The');
 ts_lexize
-----------
 {}

With the default setting of Accept = true, it is only useful to place a simple dictionary at the end of a list of dictionaries, since it will never pass on any token to a following dictionary. Conversely, Accept = false is only useful when there is at least one following dictionary.
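
A sketch of the Accept = false case inside a dictionary list, continuing with public.simple_dict as altered above. The configuration name my_cfg is hypothetical.

```sql
-- Work on a copy of the built-in English configuration.
CREATE TEXT SEARCH CONFIGURATION public.my_cfg ( COPY = pg_catalog.english );

-- simple_dict (Accept = false) only removes stop words;
-- every other token falls through to english_stem.
ALTER TEXT SEARCH CONFIGURATION public.my_cfg
    ALTER MAPPING FOR asciiword
    WITH public.simple_dict, english_stem;

-- Shows which dictionary produced the lexemes for each token.
SELECT token, dictionary, lexemes
FROM ts_debug('public.my_cfg', 'The Stars');
```

With this mapping, The should be reported as handled by simple_dict with an empty lexeme array, while Stars should pass through to english_stem.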

Caution

Most types of dictionaries rely on configuration files, such as files of stop words. These files must be stored in UTF-8 encoding. They will be translated to the actual database encoding, if that is different, when they are read into the server.

Caution

Normally, a database session will read a dictionary configuration file only once, when it is first used within the session. If you modify a configuration file and want to force existing sessions to pick up the new contents, issue an ALTER TEXT SEARCH DICTIONARY command on the dictionary. This can be a "dummy" update that doesn't actually change any parameter values.
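
For instance, a dummy update for the public.simple_dict dictionary defined earlier might look like the following sketch, which re-specifies the stop word file with the value it already has:

```sql
-- Re-specify an option with its current value: no parameter changes,
-- but the dictionary's configuration files are re-read on next use.
ALTER TEXT SEARCH DICTIONARY public.simple_dict ( StopWords = english );
```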

12.6.3. Synonym Dictionary

This dictionary template is used to create dictionaries that replace a word with a synonym. Phrases are not supported (use the thesaurus template, Section 12.6.4, for that). A synonym dictionary can be used to overcome linguistic problems, for example, to prevent an English stemmer dictionary from reducing the word "Paris" to "pari". It is enough to have a Paris paris line in the synonym dictionary and put it before the english_stem dictionary. For example:

SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |  dictionaries  |  dictionary  | lexemes
-----------+-----------------+-------+----------------+--------------+---------
 asciiword | Word, all ASCII | Paris | {english_stem} | english_stem | {pari}

CREATE TEXT SEARCH DICTIONARY my_synonym (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms
);

ALTER TEXT SEARCH CONFIGURATION english
    ALTER MAPPING FOR asciiword
    WITH my_synonym, english_stem;

SELECT * FROM ts_debug('english', 'Paris');
   alias   |   description   | token |       dictionaries        | dictionary | lexemes
-----------+-----------------+-------+---------------------------+------------+---------
 asciiword | Word, all ASCII | Paris | {my_synonym,english_stem} | my_synonym | {paris}

The only parameter required by the synonym template is SYNONYMS, which is the base name of its configuration file (my_synonyms in the above example). The file's full name will be $SHAREDIR/tsearch_data/my_synonyms.syn (where $SHAREDIR means the PostgreSQL installation's shared-data directory). The file format is just one line per word to be substituted, with the word followed by its synonym, separated by white space. Blank lines and trailing spaces are ignored.

The synonym template also has an optional parameter CaseSensitive, which defaults to false. When CaseSensitive is false, words in the synonym file are folded to lower case, as are input tokens. When it is true, words and tokens are not folded to lower case, but are compared as-is.
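
A sketch of the CaseSensitive option, reusing the my_synonyms file from the example above and assuming it contains the line Paris paris. The dictionary name my_synonym_cs is hypothetical.

```sql
-- Same synonym file as my_synonym, but entries and tokens are compared as-is.
CREATE TEXT SEARCH DICTIONARY my_synonym_cs (
    TEMPLATE = synonym,
    SYNONYMS = my_synonyms,
    CaseSensitive = true
);

-- The token is spelled exactly like the file entry, so it matches.
SELECT ts_lexize('my_synonym_cs', 'Paris');

-- No entry has this exact case, so the dictionary does not recognize it.
SELECT ts_lexize('my_synonym_cs', 'paris');
```

With the default CaseSensitive = false, both calls would match, because the file entry and the input token are folded to lower case before comparison.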

An asterisk (*) can be placed at the end of a synonym in the configuration file. This indicates that the synonym is a prefix. The asterisk is ignored when the entry is used in to_tsvector(), but when it is used in to_tsquery(), the result will be a query item with the prefix match marker (see Section 12.3.2). For example, suppose we have these entries in $SHAREDIR/tsearch_data/synonym_sample.syn:

postgres        pgsql
postgresql      pgsql
postgre pgsql
gogle   googl
indices index*

Then we will get these results:

mydb=# CREATE TEXT SEARCH DICTIONARY syn (template=synonym, synonyms='synonym_sample');
mydb=# SELECT ts_lexize('syn', 'indices');
 ts_lexize
-----------
 {index}
(1 row)

mydb=# CREATE TEXT SEARCH CONFIGURATION tst (copy=simple);
mydb=# ALTER TEXT SEARCH CONFIGURATION tst ALTER MAPPING FOR asciiword WITH syn;
mydb=# SELECT to_tsvector('tst', 'indices');
 to_tsvector
-------------
 'index':1
(1 row)

mydb=# SELECT to_tsquery('tst', 'indices');
 to_tsquery
------------
 'index':*
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector;
            tsvector
---------------------------------
 'are' 'indexes' 'useful' 'very'
(1 row)

mydb=# SELECT 'indexes are very useful'::tsvector @@ to_tsquery('tst', 'indices');
 ?column?
----------
 t
(1 row)

12.6.4. Thesaurus Dictionary

A thesaurus dictionary (sometimes abbreviated as TZ) is a collection of words that includes information about the relationships of words and phrases, i.e., broader terms (BT), narrower terms (NT), preferred terms, non-preferred terms, related terms, etc.

Basically a thesaurus dictionary replaces all non-preferred terms by one preferred term and, optionally, preserves the original terms for indexing as well. PostgreSQL's current implementation of the thesaurus dictionary is an extension of the synonym dictionary with added phrase support. A thesaurus dictionary requires a configuration file of the following format:

# this is a comment
sample word(s) : indexed word(s)
more sample word(s) : more indexed word(s)
...

where the colon (:) symbol acts as a delimiter between a phrase and its replacement.

A thesaurus dictionary uses a subdictionary (which is specified in the dictionary's configuration) to normalize the input text before checking for phrase matches. It is only possible to select one subdictionary. An error is reported if the subdictionary fails to recognize a word. In that case, you should remove the use of the word or teach the subdictionary about it. You can place an asterisk (*) at the beginning of an indexed word to skip applying the subdictionary to it, but all sample words must be known to the subdictionary.

The thesaurus dictionary chooses the longest match if there are multiple phrases matching the input, and ties are broken by using the last definition.

Specific stop words recognized by the subdictionary cannot be specified; instead use ? to mark the location where any stop word can appear. For example, assuming that a and the are stop words according to the subdictionary:

? one ? two : swsw

matches a one the two and the one a two; both would be replaced by swsw.

Since a thesaurus dictionary has the capability to recognize phrases, it must remember its state and interact with the parser. It uses the token types assigned to it in the configuration to decide whether it should handle the next word or stop accumulating. The thesaurus dictionary must be configured carefully. For example, if the thesaurus dictionary is assigned to handle only the asciiword token type, then a thesaurus dictionary definition like one 7 will not work, since token type uint is not assigned to the thesaurus dictionary.
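
One way to make a definition such as one 7 usable is to assign the thesaurus dictionary to the uint token type as well. The following is a sketch only: the configuration name ths_cfg is hypothetical, thesaurus_simple is the dictionary defined in Section 12.6.4.1 and is assumed to exist already, and the trailing dictionaries are an arbitrary choice.

```sql
-- Work on a copy of the built-in English configuration.
CREATE TEXT SEARCH CONFIGURATION public.ths_cfg ( COPY = pg_catalog.english );

-- ASCII words go to the thesaurus first, then to the stemmer.
ALTER TEXT SEARCH CONFIGURATION public.ths_cfg
    ALTER MAPPING FOR asciiword
    WITH thesaurus_simple, english_stem;

-- Unsigned integers also reach the thesaurus, so a phrase that mixes
-- both token types can be accumulated; simple handles the remainder.
ALTER TEXT SEARCH CONFIGURATION public.ths_cfg
    ALTER MAPPING FOR uint
    WITH thesaurus_simple, simple;
```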

Caution

Thesauruses are used during indexing, so any change in the thesaurus dictionary's parameters requires reindexing. For most other dictionary types, small changes such as adding or removing stop words do not force reindexing.
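
What reindexing involves depends on how the tsvector values are stored. The sketch below assumes a hypothetical table docs with a text column body, a plain (non-generated) tsvector column body_tsv, an expression index docs_body_fts_idx, and a thesaurus that is part of the english configuration; none of these names come from this section.

```sql
-- Expression index on to_tsvector(...): rebuild it so entries are recomputed.
REINDEX INDEX docs_body_fts_idx;

-- Stored tsvector column: recompute the values; an index on the
-- column is maintained automatically by the UPDATE.
UPDATE docs SET body_tsv = to_tsvector('english', body);
```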

12.6.4.1. Thesaurus Configuration

To define a new thesaurus dictionary, use the thesaurus template. For example:

CREATE TEXT SEARCH DICTIONARY thesaurus_simple (
    TEMPLATE = thesaurus,
    DictFile = mythesaurus,
    Dictionary = pg_catalog.english_stem
);

Here:

- thesaurus_simple is the new dictionary's name.
- mythesaurus is the base name of the thesaurus configuration file. (Its full name will be $SHAREDIR/tsearch_data/mythesaurus.ths, where $SHAREDIR means the installation shared-data directory.)
- pg_catalog.english_stem is the subdictionary (here, a Snowball English stemmer) to use for thesaurus normalization. Notice that the subdictionary will have its own configuration (for example, stop words), which is not shown here.

Now it is possible to bind the thesaurus dictionary thesaurus_simple to the desired token types in a configuration, for example:

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_simple;

12.6.4.2. Thesaurus Example

Consider a simple astronomical thesaurus thesaurus_astro, which contains some astronomical word combinations:

supernovae stars : sn
crab nebulae : crab

Below we create a dictionary and bind some token types to an astronomical thesaurus and English stemmer:

CREATE TEXT SEARCH DICTIONARY thesaurus_astro (
    TEMPLATE = thesaurus,
    DictFile = thesaurus_astro,
    Dictionary = english_stem
);

ALTER TEXT SEARCH CONFIGURATION russian
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart
    WITH thesaurus_astro, english_stem;

Now we can see how it works. ts_lexize is not very useful for testing a thesaurus, because it treats its input as a single token. Instead we can use plainto_tsquery and to_tsvector, which will break their input strings into multiple tokens:

SELECT plainto_tsquery('supernova star');
 plainto_tsquery
-----------------
 'sn'

SELECT to_tsvector('supernova star');
 to_tsvector
-------------
 'sn':1

In principle, you can use to_tsquery if you quote the argument:

SELECT to_tsquery('''supernova star''');
 to_tsquery
------------
 'sn'

Notice that supernova star matches supernovae stars in thesaurus_astro because we specified the english_stem stemmer in the thesaurus definition. The stemmer removed the e and s.

To index the original phrase as well as the substitute, just include it in the right-hand part of the definition:

supernovae stars : sn supernovae stars

SELECT plainto_tsquery('supernova star');
       plainto_tsquery
-----------------------------
 'sn' & 'supernova' & 'star'

12.6.5. Ispell Dictionary

The Ispell dictionary template supports morphological dictionaries, which can normalize many different linguistic forms of a word into the same lexeme. For example, an English Ispell dictionary can match all declensions and conjugations of the search term bank, e.g., banking, banked, banks, banks', and bank's.

The standard PostgreSQL distribution does not include any Ispell configuration files. Dictionaries for a large number of languages are available from Ispell (https://www.cs.hmc.edu/~geoff/ispell.html). Also, some more modern dictionary file formats are supported: MySpell (https://en.wikipedia.org/wiki/MySpell; OO < 2.0.1) and Hunspell (https://hunspell.github.io/; OO >= 2.0.2). A large list of dictionaries is available on the OpenOffice Wiki (https://wiki.openoffice.org/wiki/Dictionaries).

To create an Ispell dictionary, perform these steps:

- Download the dictionary configuration files. OpenOffice extension files have the .oxt extension. It is necessary to extract the .aff and .dic files and change their extensions to .affix and .dict. For some dictionary files it is also necessary to convert the characters to UTF-8 encoding, with commands such as these (for example, for a Norwegian language dictionary):
```
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.affix nn_NO.aff
iconv -f ISO_8859-1 -t UTF-8 -o nn_no.dict nn_NO.dic
```
- Copy the files to the $SHAREDIR/tsearch_data directory.

- Load the files into PostgreSQL with the following command:

CREATE TEXT SEARCH DICTIONARY english_hunspell (
    TEMPLATE = ispell,
    DictFile = en_us,
    AffFile = en_us,
    Stopwords = english);

Here, DictFile, AffFile, and StopWords specify the base names of the dictionary, affixes, and stop-words files. The stop-words file has the same format explained above for the simple dictionary type. The format of the other files is not specified here but is available from the above-mentioned web sites.

Ispell dictionaries usually recognize a limited set of words, so they should be followed by another broader dictionary; for example, a Snowball dictionary, which recognizes everything.
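
A sketch of such a dictionary list, using the english_hunspell dictionary created above followed by the built-in english_stem stemmer. The configuration name en_ispell_cfg is hypothetical, and the choice of token types is only one plausible selection.

```sql
-- Work on a copy of the built-in English configuration.
CREATE TEXT SEARCH CONFIGURATION public.en_ispell_cfg ( COPY = pg_catalog.english );

-- Words the Ispell dictionary does not know fall through to the stemmer.
ALTER TEXT SEARCH CONFIGURATION public.en_ispell_cfg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH english_hunspell, english_stem;
```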

The .affix file of Ispell has the following structure:

prefixes
flag *A:
    .           >   RE      # As in enter > reenter
suffixes
flag T:
    E           >   ST      # As in late > latest
    [^AEIOU]Y   >   -Y,IEST # As in dirty > dirtiest
    [AEIOU]Y    >   EST     # As in gray > grayest
    [^EY]       >   EST     # As in small > smallest

And the .dict file has the following structure:

lapse/ADGRS
lard/DGRS
large/PRTY
lark/MRS

The format of the .dict file is:

basic_form/affix_class_name

In the .affix file every affix flag is described in the following format:

condition > [-stripping_letters,] adding_affix

Here, condition has a format similar to the format of regular expressions. It can use groupings [...] and [^...]. For example, [AEIOU]Y means that the last letter of the word is "y" and the penultimate letter is "a", "e", "i", "o" or "u". [^EY] means that the last letter is neither "e" nor "y".

Ispell dictionaries support splitting compound words, a useful feature. Notice that the affix file should specify a special flag using the compoundwords controlled statement that marks dictionary words that can participate in compound formation:

compoundwords  controlled z

Here are some examples for the Norwegian language:

SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
   {over,buljong,terning,pakk,mester,assistent}
SELECT ts_lexize('norwegian_ispell', 'sjokoladefabrikk');
   {sjokoladefabrikk,sjokolade,fabrikk}

The MySpell format is a subset of Hunspell. The .affix file of Hunspell has the following structure:

PFX A Y 1
PFX A   0     re         .
SFX T N 4
SFX T   0     st         e
SFX T   y     iest       [^aeiou]y
SFX T   0     est        [aeiou]y
SFX T   0     est        [^ey]

The first line of an affix class is the header. The fields of an affix rule are listed after the header:

- parameter name (PFX or SFX)
- flag (name of the affix class)
- characters to strip from the beginning (for a prefix) or end (for a suffix) of the word
- affix to add
- condition, in a format similar to that of regular expressions

The .dict file looks like the .dict file of Ispell:

larder/M
lardy/RT
large/RSPMYT
largehearted

Note

MySpell does not support compound words. Hunspell has sophisticated support for compound words. At present, PostgreSQL implements only the basic compound word operations of Hunspell.

12.6.6. Snowball Dictionary

The Snowball dictionary template is based on a project by Martin Porter, inventor of the popular Porter's stemming algorithm for the English language. Snowball now provides stemming algorithms for many languages (see the Snowball site, https://snowballstem.org/, for more information). Each algorithm understands how to reduce common variant forms of words to a base, or stem, spelling within its language. A Snowball dictionary requires a language parameter to identify which stemmer to use, and optionally can specify a stopword file name that gives a list of words to eliminate. (PostgreSQL's standard stopword lists are also provided by the Snowball project.) For example, there is a built-in definition equivalent to:

CREATE TEXT SEARCH DICTIONARY english_stem (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english
);

The stopword file format is the same as already explained.

A Snowball dictionary recognizes everything, whether or not it is able to simplify the word, so it should be placed at the end of the dictionary list. It is useless to have it before any other dictionary because a token will never pass through it to the next dictionary.
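
To illustrate the optional stop word parameter, the following sketch defines a second English stemmer without a stop word file and compares it with the built-in english_stem. The dictionary name english_stem_nostop is hypothetical.

```sql
-- Same stemmer as english_stem, but no stop word file is specified.
CREATE TEXT SEARCH DICTIONARY english_stem_nostop (
    TEMPLATE = snowball,
    Language = english
);

-- Compare how the two dictionaries treat a stop word and an ordinary word.
SELECT ts_lexize('english_stem', 'the')          AS with_stop,
       ts_lexize('english_stem_nostop', 'the')   AS without_stop,
       ts_lexize('english_stem_nostop', 'stars') AS stemmed;
```

Neither dictionary returns NULL for these inputs, which is why a Snowball dictionary belongs at the end of the list.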
