F.16. fuzzystrmatch — 文字列の類似度と距離を決定する

PostgreSQL 17.6文書
		付録F 追加で提供されるモジュールと拡張	誤訳等の報告
前へ	上へ	F.16. fuzzystrmatch — 文字列の類似度と距離を決定する	次へ

F.16. fuzzystrmatch — 文字列の類似度と距離を決定する #

<title>fuzzystrmatch — determine string similarities and distance</title>

The <filename>fuzzystrmatch</filename> module provides several functions to determine similarities and distance between strings. fuzzystrmatchモジュールは、文字列間の類似度や相違度を決める複数の関数を提供します。

注意

At present, the <function>soundex</function>, <function>metaphone</function>, <function>dmetaphone</function>, and <function>dmetaphone_alt</function> functions do not work well with multibyte encodings (such as UTF-8). Use <function>daitch_mokotoff</function> or <function>levenshtein</function> with such data. 現時点で、soundex、metaphone、dmetaphone、dmetaphone_altは（UTF-8のような）マルチバイト符号化方式では充分に動作しません。このようなデータにはdaitch_mokotoffまたはlevenshteinを使用してください。

This module is considered <quote>trusted</quote>, that is, it can be installed by non-superusers who have <literal>CREATE</literal> privilege on the current database. このモジュールは「trusted」と見なされます。つまり、現在のデータベースに対してCREATE権限を持つ非スーパーユーザがインストールできます。

F.16.1. Soundex #

The Soundex system is a method of matching similar-sounding names by converting them to the same code. It was initially used by the United States Census in 1880, 1900, and 1910. Note that Soundex is not very useful for non-English names. Soundexシステムは、同一コードに変換することで似ているように見える名称を一致させる手法です。これは、1880年、1900年、1910年の米国国勢調査で初めて使用されました。 Soundexは非英語圏の名称では特に有用なものではないことに注意してください。

The <filename>fuzzystrmatch</filename> module provides two functions for working with Soundex codes: fuzzystrmatchはSoundexコードを使用して動作する2つの関数を提供します。

soundex(text) returns text
difference(text, text) returns int

The <function>soundex</function> function converts a string to its Soundex code. The <function>difference</function> function converts two strings to their Soundex codes and then reports the number of matching code positions. Since Soundex codes have four characters, the result ranges from zero to four, with zero being no match and four being an exact match. (Thus, the function is misnamed — <function>similarity</function> would have been a better name.) soundex関数は文字列をSoundexコードに変換します。 difference関数は2つの文字列をそのSoundexコードに変換し、コード位置が一致する個数を報告します。 Soundexコードは4文字からなりますので、結果は0から4までの範囲になります。 0はまったく一致しないことを、4は完全に一致することを示します。（したがってこの関数の名前は間違っています。similarityの方がより優れた名前だったかもしれません。）

Here are some usage examples: 以下に使用例をいくつか示します。

SELECT soundex('hello world!');

SELECT soundex('Anne'), soundex('Ann'), difference('Anne', 'Ann');
SELECT soundex('Anne'), soundex('Andrew'), difference('Anne', 'Andrew');
SELECT soundex('Anne'), soundex('Margaret'), difference('Anne', 'Margaret');

CREATE TABLE s (nm text);

INSERT INTO s VALUES ('john');
INSERT INTO s VALUES ('joan');
INSERT INTO s VALUES ('wobbly');
INSERT INTO s VALUES ('jack');

SELECT * FROM s WHERE soundex(nm) = soundex('john');

SELECT * FROM s WHERE difference(s.nm, 'john') > 2;

F.16.2. Daitch-Mokotoff Soundex #

Like the original Soundex system, Daitch-Mokotoff Soundex matches similar-sounding names by converting them to the same code. However, Daitch-Mokotoff Soundex is significantly more useful for non-English names than the original system. Major improvements over the original system include: 従来Soundexシステムと同様に、Daitch-Mokotoff Soundexは、似たような名称を同じコードに変換することで一致させます。ただし、Daitch-Mokotoff Soundexは、非英語圏の名称に対しては、従来システムよりもはるかに便利です。従来システムに対する主な改善点は以下のとおりです。

The code is based on the first six meaningful letters rather than four. コードは4文字ではなく意味のある最初の6文字に基づいています。
A letter or combination of letters maps into ten possible codes rather than seven. 文字または文字の組み合わせは、7つではなく10の可能なコードにマップされます。
Where two consecutive letters have a single sound, they are coded as a single number. 2つの連続した文字が1つの音を持つ場合、それらは1つの数字としてコード化されます。
When a letter or combination of letters may have different sounds, multiple codes are emitted to cover all possibilities. 文字または文字の組み合わせには異なる音がある場合、すべての可能性をカバーするために複数のコードが出力されます。

This function generates the Daitch-Mokotoff soundex codes for its input: この関数は入力に対するDaitch-Mokotoff soundexコードを生成します。

daitch_mokotoff(source text) returns text[]

The result may contain one or more codes depending on how many plausible pronunciations there are, so it is represented as an array. 結果は、考えられる発音がいくつあるかによって1つ以上のコードを含む可能性があるため、配列として表現されます。

Since a Daitch-Mokotoff soundex code consists of only 6 digits, <parameter>source</parameter> should be preferably a single word or name. Daitch-Mokotoff soundexコードは6桁の数字のみで構成されるため、sourceは単語または名前であることが好ましいです。

Here are some examples: 以下に例をいくつか示します。

SELECT daitch_mokotoff('George');
 daitch_mokotoff
-----------------
 {595000}

SELECT daitch_mokotoff('John');
 daitch_mokotoff
-----------------
 {160000,460000}

SELECT daitch_mokotoff('Bierschbach');
                      daitch_mokotoff
-----------------------------------------------------------
 {794575,794574,794750,794740,745750,745740,747500,747400}

SELECT daitch_mokotoff('Schwartzenegger');
 daitch_mokotoff
-----------------
 {479465}

For matching of single names, returned text arrays can be matched directly using the <literal>&&</literal> operator: any overlap can be considered a match. A GIN index may be used for efficiency, see <xref linkend="gin"/> and this example: 単一の名前の一致には、返されたテキスト配列を直接&&演算子を使用して一致させられます。重複はマッチとみなされます。効率のためにGINインデックスを使用できます。 64.4および以下の例を参照してください。

CREATE TABLE s (nm text);
CREATE INDEX ix_s_dm ON s USING gin (daitch_mokotoff(nm)) WITH (fastupdate = off);

INSERT INTO s (nm) VALUES
  ('Schwartzenegger'),
  ('John'),
  ('James'),
  ('Steinman'),
  ('Steinmetz');

SELECT * FROM s WHERE daitch_mokotoff(nm) && daitch_mokotoff('Swartzenegger');
SELECT * FROM s WHERE daitch_mokotoff(nm) && daitch_mokotoff('Jane');
SELECT * FROM s WHERE daitch_mokotoff(nm) && daitch_mokotoff('Jens');

For indexing and matching of any number of names in any order, Full Text Search features can be used. See <xref linkend="textsearch"/> and this example: 任意の順序で任意の数の名前のインデックス付けと一致には、全文検索機能を使用できます。第12章および以下の例を参照してください。

CREATE FUNCTION soundex_tsvector(v_name text) RETURNS tsvector
BEGIN ATOMIC
  SELECT to_tsvector('simple',
                     string_agg(array_to_string(daitch_mokotoff(n), ' '), ' '))
  FROM regexp_split_to_table(v_name, '\s+') AS n;
END;

CREATE FUNCTION soundex_tsquery(v_name text) RETURNS tsquery
BEGIN ATOMIC
  SELECT string_agg('(' || array_to_string(daitch_mokotoff(n), '|') || ')', '&')::tsquery
  FROM regexp_split_to_table(v_name, '\s+') AS n;
END;

CREATE TABLE s (nm text);
CREATE INDEX ix_s_txt ON s USING gin (soundex_tsvector(nm)) WITH (fastupdate = off);

INSERT INTO s (nm) VALUES
  ('John Doe'),
  ('Jane Roe'),
  ('Public John Q.'),
  ('George Best'),
  ('John Yamson');

SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john');
SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('jane doe');
SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('john public');
SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('besst, giorgio');
SELECT * FROM s WHERE soundex_tsvector(nm) @@ soundex_tsquery('Jameson John');

If it is desired to avoid recalculation of soundex codes during index rechecks, an index on a separate column can be used instead of an index on an expression. A stored generated column can be used for this; see <xref linkend="ddl-generated-columns"/>. インデックス再チェック中にsoundexコードの再計算を避ける場合は、式のインデックスの代わりに別の列のインデックスを使用できます。これには、格納された生成列を使用できます。 5.4を参照してください。

F.16.3. レーベンシュタイン(Levenshtein) #

<title>Levenshtein</title>

This function calculates the Levenshtein distance between two strings: この関数は2つの文字列間のレーベンシュタイン距離(Levenshtein distance)を計算します。

levenshtein(source text, target text, ins_cost int, del_cost int, sub_cost int) returns int
levenshtein(source text, target text) returns int
levenshtein_less_equal(source text, target text, ins_cost int, del_cost int, sub_cost int, max_d int) returns int
levenshtein_less_equal(source text, target text, max_d int) returns int

Both <literal>source</literal> and <literal>target</literal> can be any non-null string, with a maximum of 255 characters. The cost parameters specify how much to charge for a character insertion, deletion, or substitution, respectively. You can omit the cost parameters, as in the second version of the function; in that case they all default to 1. sourceおよびtargetは255文字までの任意の非NULL文字列を取ることができます。コストパラメータはそれぞれ、文字の挿入、削除、置換に負わせる文字数を指定します。この関数の2番目のバージョンのようにコストパラメータを省略することができます。この場合デフォルトですべて1になります。

<function>levenshtein_less_equal</function> is an accelerated version of the Levenshtein function for use when only small distances are of interest. If the actual distance is less than or equal to <literal>max_d</literal>, then <function>levenshtein_less_equal</function> returns the correct distance; otherwise it returns some value greater than <literal>max_d</literal>. If <literal>max_d</literal> is negative then the behavior is the same as <function>levenshtein</function>. levenshtein_less_equalは小さな距離だけを問題にする場合についてのlevenshtein関数の高速化版です。実際の距離がmax_d以下の場合、levenshtein_less_equalは正しい値を返しますが、そうでなければ、max_dより大きい何らかの値を返します。 max_dが負の場合は、levenshteinと同じ動作になります。

Examples: 以下に例を示します。

test=# SELECT levenshtein('GUMBO', 'GAMBOL');
 levenshtein
-------------
           2
(1 row)

test=# SELECT levenshtein('GUMBO', 'GAMBOL', 2, 1, 1);
 levenshtein
-------------
           3
(1 row)

test=# SELECT levenshtein_less_equal('extensive', 'exhaustive', 2);
 levenshtein_less_equal
------------------------
                      3
(1 row)

test=# SELECT levenshtein_less_equal('extensive', 'exhaustive', 4);
 levenshtein_less_equal
------------------------
                      4
(1 row)

F.16.4. Metaphone #

Metaphone, like Soundex, is based on the idea of constructing a representative code for an input string. Two strings are then deemed similar if they have the same codes. Metaphoneは、Soundex同様、入力文字に対する対応するコードを構築するという考えに基づいたものです。 2つの文字列が同一コードを持つ場合、類似とみなされます。

This function calculates the metaphone code of an input string: 以下の関数は入力文字列に対するmetaphoneコードを計算します。

metaphone(source text, max_output_length int) returns text

<literal>source</literal> has to be a non-null string with a maximum of 255 characters. <literal>max_output_length</literal> sets the maximum length of the output metaphone code; if longer, the output is truncated to this length. sourceは255文字までの非NULL文字列を取ることができます。 max_output_lengthは出力metaphoneコードの最大長を設定します。出力は長すぎるとこの長さに切り詰められます。

Example: 以下に例を示します。

test=# SELECT metaphone('GUMBO', 4);
 metaphone
-----------
 KM
(1 row)

F.16.5. Double Metaphone #

The Double Metaphone system computes two <quote>sounds like</quote> strings for a given input string — a <quote>primary</quote> and an <quote>alternate</quote>. In most cases they are the same, but for non-English names especially they can be a bit different, depending on pronunciation. These functions compute the primary and alternate codes: Double Metaphoneシステムは与えられた入力文字列に対する、「primary」と「alternate」という2つの「似たように見える」文字列を計算します。ほとんどの場合、これらは同じですが、英語以外の名称では特に発音に応じて多少異なる場合があります。以下の関数はprimaryコードとalternateコードを計算します。

dmetaphone(source text) returns text
dmetaphone_alt(source text) returns text

There is no length limit on the input strings. 入力文字列長に関する制限はありません。

Example: 以下に例を示します。

test=# SELECT dmetaphone('gumbo');
 dmetaphone
------------
 KMP
(1 row)

前へ	上へ	次へ
F.15. file_fdw — サーバのファイルシステムにあるデータファイルにアクセスする	ホーム	F.17. hstore — hstoreキー/値データ型