12.5. Parsers

PostgreSQL 17.5 Documentation


Text search parsers are responsible for splitting raw document text into tokens and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all; it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than there is for custom dictionaries. At present PostgreSQL provides just one built-in parser, which has been found to be useful for a wide range of applications.

The built-in parser is named `pg_catalog.default`. It recognizes 23 token types, shown in Table 12.1.
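The list of token types can also be read from a running server instead of from the table. The following is a minimal sketch that assumes the built-in parser is reachable under the name `default`; it uses the `ts_token_type` function, which returns one row per token type.

```sql
-- List every token type the built-in parser can emit,
-- with its numeric id, alias, and description.
SELECT tokid, alias, description
FROM ts_token_type('default')
ORDER BY tokid;
```

The `alias` column corresponds to the names used in the first column of Table 12.1.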

Table 12.1. Default Parser's Token Types

Alias	Description	Example
`asciiword`	Word, all ASCII letters	`elephant`
`word`	Word, all letters	`mañana`
`numword`	Word, letters and digits	`beta1`
`asciihword`	Hyphenated word, all ASCII	`up-to-date`
`hword`	Hyphenated word, all letters	`lógico-matemática`
`numhword`	Hyphenated word, letters and digits	`postgresql-beta1`
`hword_asciipart`	Hyphenated word part, all ASCII	`postgresql` in the context `postgresql-beta1`
`hword_part`	Hyphenated word part, all letters	`lógico` or `matemática` in the context `lógico-matemática`
`hword_numpart`	Hyphenated word part, letters and digits	`beta1` in the context `postgresql-beta1`
`email`	Email address	`foo@example.com`
`protocol`	Protocol head	`http://`
`url`	URL	`example.com/stuff/index.html`
`host`	Host	`example.com`
`url_path`	URL path	`/stuff/index.html`, in the context of a URL
`file`	File or path name	`/usr/local/foo.txt`, if not within a URL
`sfloat`	Scientific notation	`-1.234e56`
`float`	Decimal notation	`-1.234`
`int`	Signed integer	`-1234`
`uint`	Unsigned integer	`1234`
`version`	Version number	`8.3.0`
`tag`	XML tag	`<a href="dictionaries.html">`
`entity`	XML entity	`&amp;`
`blank`	Space symbols	(any whitespace or punctuation not otherwise recognized)

Note

The parser's notion of a "letter" is determined by the database's locale setting, specifically `lc_ctype`. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types `word` and `asciiword` should be treated alike.

`email` does not support all valid email characters as defined by RFC 5322 (https://datatracker.ietf.org/doc/html/rfc5322). Specifically, the only non-alphanumeric characters supported for email user names are period, dash, and underscore.

`tag` does not support all valid tag names as defined by the W3C Recommendation, XML (https://www.w3.org/TR/xml/). Specifically, the only tag names supported are those starting with an ASCII letter, underscore, or colon, and containing only letters, digits, hyphens, underscores, periods, and colons. `tag` also includes XML comments starting with `<!--` and ending with `-->`, and XML declarations (but note that this includes anything starting with `<?x` and ending with `>`).
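The limits described in these notes can be checked against a particular database. The queries below are a sketch only: the sample strings are made up for illustration, the `simple` configuration is assumed to be present (it is part of a standard installation), and no output is shown because the classification of non-ASCII letters depends on the database's locale.

```sql
-- Ctype locale of the current database, which governs what the parser
-- treats as a letter.
SELECT datname, datctype
FROM pg_database
WHERE datname = current_database();

-- A user name containing only period and underscore as punctuation.
SELECT alias, token FROM ts_debug('simple', 'user.name_1@example.com');

-- A user name containing '+', which the notes say is not supported by email.
SELECT alias, token FROM ts_debug('simple', 'user+tag@example.com');

-- An XML tag, a comment, and some text between them.
SELECT alias, token
FROM ts_debug('simple', '<a href="dictionaries.html">link</a> <!-- note -->');
```

Comparing the `alias` column of the second and third queries shows whether the whole address is still reported as a single `email` token.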

It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component:

SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1

This behavior is desirable since it allows searches to work for both the whole compound word and for components. Here is another instructive example:

SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
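To see how overlapping tokens affect searching, the hyphenated example can be indexed and queried directly. This is a minimal sketch that assumes the `simple` configuration, chosen here so that no stemming or stop-word removal obscures the tokens; outputs are omitted and should be checked on your own server.

```sql
-- Lexemes stored for the compound word: the whole word plus its parts.
SELECT to_tsvector('simple', 'foo-bar-beta1');

-- Search by a single component and by the whole compound word.
SELECT to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'bar')
           AS matches_part,
       to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'foo-bar-beta1')
           AS matches_whole;

-- Raw parser output in document order, with type ids resolved to aliases.
SELECT t.alias, p.token
FROM ts_parse('default', 'http://example.com/stuff/index.html')
         WITH ORDINALITY AS p(tokid, token, ord)
JOIN ts_token_type('default') AS t ON t.tokid = p.tokid
ORDER BY p.ord;
```

The last query uses `ts_parse` rather than `ts_debug`, so it exercises the parser alone without involving any dictionaries.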
