F.46. unaccent — 発音区分記号を取り除く全文検索用辞書

PostgreSQL 17.5文書
		付録F 追加で提供されるモジュールと拡張	誤訳等の報告
前へ	上へ	F.46. unaccent — 発音区分記号を取り除く全文検索用辞書	次へ

F.46. unaccent — 発音区分記号を取り除く全文検索用辞書 #

<title>unaccent — a text search dictionary which removes diacritics</title>

<filename>unaccent</filename> is a text search dictionary that removes accents (diacritic signs) from lexemes. It's a filtering dictionary, which means its output is always passed to the next dictionary (if any), unlike the normal behavior of dictionaries. This allows accent-insensitive processing for full text search. unaccentは語彙素からアクセント(発音区分記号)を取り除く全文検索用の辞書です。これはフィルタ処理を行う辞書、つまり、標準の動作と異なり、その出力が常に次の辞書（もしあれば）に渡されるものです。これにより全文検索においてアクセントを無視した処理を行うことができます。

The current implementation of <filename>unaccent</filename> cannot be used as a normalizing dictionary for the <filename>thesaurus</filename> dictionary. 現在のunaccentの実装ではthesaurus辞書向けの正規化用辞書として使用することはできません。

This module is considered <quote>trusted</quote>, that is, it can be installed by non-superusers who have <literal>CREATE</literal> privilege on the current database. このモジュールは「trusted」と見なされます。つまり、現在のデータベースに対してCREATE権限を持つ非スーパーユーザがインストールできます。

F.46.1. 設定 #

<title>Configuration</title>

An <literal>unaccent</literal> dictionary accepts the following options: unaccent辞書は以下のオプションを受け付けます。

<literal>RULES</literal> is the base name of the file containing the list of translation rules. This file must be stored in <filename>$SHAREDIR/tsearch_data/</filename> (where <literal>$SHAREDIR</literal> means the <productname>PostgreSQL</productname> installation's shared-data directory). Its name must end in <literal>.rules</literal> (which is not to be included in the <literal>RULES</literal> parameter). RULESは変換規則の一覧を含むファイルのベースネームです。このファイルは$SHAREDIR/tsearch_data/内に格納しなければなりません。（ここで$SHAREDIRはPostgreSQLインストレーションの共有データディレクトリを意味します。）この名前は.rulesで終わらなければなりません。（.rulesはRULESパラメータには含まれません。）

The rules file has the following format: rulesファイルの書式は以下の通りです。

Each line represents one translation rule, consisting of a character with accent followed by a character without accent. The first is translated into the second. For example, 各行は、アクセント付き文字とその後にアクセントを取り除いた文字から構成される、1つの変換規則です。一つ目が二つ目に変換されます。以下に例を示します。
```
À        A
Á        A
Â        A
Ã        A
Ä        A
Å        A
Æ        AE
```
The two characters must be separated by whitespace, and any leading or trailing whitespace on a line is ignored. 2つの文字は空白で分けられていなければならず、行の先頭や末尾の空白は無視されます。
Alternatively, if only one character is given on a line, instances of that character are deleted; this is useful in languages where accents are represented by separate characters. あるいは、一行に一文字だけ指定された場合、その文字は削除されます。これは、アクセントが分かれた文字で表現される言語では便利です。
Actually, each <quote>character</quote> can be any string not containing whitespace, so <filename>unaccent</filename> dictionaries could be used for other sorts of substring substitutions besides diacritic removal. 実のところ、各「文字」は空白を含まなければいかなる文字列でも良いので、unaccent辞書は発音区別符号の除去に加えて、部分文字列の置換などに使うこともできます。
Some characters, like numeric symbols, may require whitespaces in their translation rule. It is possible to use double quotes around the translated characters in this case. A double quote needs to be escaped with a second double quote when including one in the translated character. For example: 数字などの一部の文字では、変換規則に空白が必要な場合があります。この場合、変換された文字を囲むのに二重引用符が使えます。変換された文字に二重引用符を含める場合は、もう一つの二重引用符でエスケープすることが必要です。例えば以下の通りです。
```
¼      " 1/4"
½      " 1/2"
¾      " 3/4"
“       """"
”       """"
```
As with other <productname>PostgreSQL</productname> text search configuration files, the rules file must be stored in UTF-8 encoding. The data is automatically translated into the current database's encoding when loaded. Any lines containing untranslatable characters are silently ignored, so that rules files can contain rules that are not applicable in the current encoding. 他のPostgreSQLテキスト検索設定ファイルと同じように、rulesファイルはUTF-8エンコーディングで保存しなければなりません。データはロード時に自動的に現在のデータベースのエンコーディングに変換されます。 rulesファイルが現在のエンコーディングで適用できない規則も含むことができるように、変換できない文字を含む行は単に無視されます。

A more complete example, which is directly useful for most European languages, can be found in <filename>unaccent.rules</filename>, which is installed in <filename>$SHAREDIR/tsearch_data/</filename> when the <filename>unaccent</filename> module is installed. This rules file translates characters with accents to the same characters without accents, and it also expands ligatures into the equivalent series of simple characters (for example, Æ to AE). unaccent.rulesは、ほとんどの欧州圏の言語で直接使用できる、より複雑な例です。これはunaccentモジュールをインストールした時に$SHAREDIR/tsearch_data/にインストールされます。このrulesファイルは、アクセント記号のある文字をアクセント記号のない同じ文字に変換し、また、合字を同等な普通の文字の並びに(例えば、ÆをAEに)展開します。

F.46.2. 使用方法 #

<title>Usage</title>

Installing the <literal>unaccent</literal> extension creates a text search template <literal>unaccent</literal> and a dictionary <literal>unaccent</literal> based on it. The <literal>unaccent</literal> dictionary has the default parameter setting <literal>RULES='unaccent'</literal>, which makes it immediately usable with the standard <filename>unaccent.rules</filename> file. If you wish, you can alter the parameter, for example unaccent拡張をインストールすることで、unaccent全文検索テンプレートとそれに基づくデフォルトのパラメータを持つunaccent辞書が生成されます。 unaccent辞書はRULES='unaccent'というデフォルトパラメータ設定を持ちます。これは標準のunaccent.rulesファイルを即座に使用可能にします。次の例のようにパラメータを変更することができます。

mydb=# ALTER TEXT SEARCH DICTIONARY unaccent (RULES='my_rules');

or create new dictionaries based on the template. また、このテンプレートに基づいた辞書を新規に作成することができます。

To test the dictionary, you can try: 以下を行うことで、辞書の動作を確認することができます。

mydb=# select ts_lexize('unaccent','Hôtel');
 ts_lexize
-----------
 {Hotel}
(1 row)

Here is an example showing how to insert the <filename>unaccent</filename> dictionary into a text search configuration: 全文検索設定にunaccent辞書を組み込む方法を示す例を以下に示します。

mydb=# CREATE TEXT SEARCH CONFIGURATION fr ( COPY = french );
mydb=# ALTER TEXT SEARCH CONFIGURATION fr
        ALTER MAPPING FOR hword, hword_part, word
        WITH unaccent, french_stem;
mydb=# select to_tsvector('fr','Hôtels de la Mer');
    to_tsvector
-------------------
 'hotel':1 'mer':4
(1 row)

mydb=# select to_tsvector('fr','Hôtel de la Mer') @@ to_tsquery('fr','Hotels');
 ?column?
----------
 t
(1 row)

mydb=# select ts_headline('fr','Hôtel de la Mer',to_tsquery('fr','Hotels'));
      ts_headline
------------------------
 <b>Hôtel</b> de la Mer
(1 row)

F.46.3. 関数 #

<title>Functions</title>

The <function>unaccent()</function> function removes accents (diacritic signs) from a given string. Basically, it's a wrapper around <filename>unaccent</filename>-type dictionaries, but it can be used outside normal text search contexts. unaccent関数は与えられた文字列からアクセント（発音区別符号）を取り除きます。基本的にこれはunaccent型の辞書のラッパーです。しかし通常の全文検索以外の文脈で使用することができます。

unaccent([dictionary regdictionary, ] string text) returns text

If the <replaceable class="parameter">dictionary</replaceable> argument is omitted, the text search dictionary named <literal>unaccent</literal> and appearing in the same schema as the <function>unaccent()</function> function itself is used. 引数dictionaryが省略された場合、unaccentという名前でunaccent()関数自体と同じスキーマにある全文検索用の辞書が使われます。

For example: 下記は使用例です。

SELECT unaccent('unaccent', 'Hôtel');
SELECT unaccent('Hôtel');

前へ	上へ	次へ
F.45. tsm_system_time — `TABLESAMPLE`に対する`SYSTEM_TIME`サンプリングメソッド	ホーム	F.47. uuid-ossp — UUID生成器