How to index a string array column for a pg_trgm `'term' % ANY (array_column)` query?

Why this doesn't work

The index type (i.e. the operator class) gin_trgm_ops is based on the % operator, which works on two text arguments:

CREATE OPERATOR trgm.%(
  PROCEDURE = trgm.similarity_op,
  LEFTARG = text,
  RIGHTARG = text,
  COMMUTATOR = %,
  RESTRICT = contsel,
  JOIN = contjoinsel);

You cannot use gin_trgm_ops for arrays. An index defined on an array column will never work with any(array[...]), because the individual elements of the arrays are not indexed. Indexing an array requires a different type of index, namely a GIN array index.
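To illustrate the difference, here is a minimal sketch of a plain GIN array index on the test table used in this answer. It assumes the default array operator class, which handles whole-element operators such as @>, so it can help with exact words but not with similar ones. The search word praesentium is just an illustrative value.

```sql
-- plain GIN index on the array column (default array operator class)
create index on test using gin (names);

-- supported by that index: arrays containing the exact element
select count(*)
from test
where names @> array['praesentium'];

-- not supported by it: trigram similarity against individual elements
select count(*)
from test
where 'praesent' % any(names);
```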

Fortunately, the gin_trgm_ops index is designed cleverly enough that it also works with the like and ilike operators, which can be used as an alternative solution (an example is described below).

Test table

The table has two columns (id serial primary key, names text[]) and contains 100000 Latin sentences split into array elements.
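A table of this shape could be set up roughly as follows. This is only a sketch: the two sample sentences are illustrative stand-ins, not the actual 100000-row data set, and the way the sentences were generated is not described here.

```sql
create table test (
    id serial primary key,
    names text[]
);

-- each sentence becomes one row, each word one array element
insert into test (names)
select string_to_array(sentence, ' ')
from (values
    ('fugiat odio aut quis dolorem'),
    ('praesentium distinctio modi nulla')
) as s(sentence);
```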

select count(*), sum(cardinality(names))::int words from test;

 count  |  words  
--------+---------
 100000 | 1799389

select * from test limit 1;

 id |                                                     names                                                     
----+---------------------------------------------------------------------------------------------------------------
  1 | {fugiat,odio,aut,quis,dolorem,exercitationem,fugiat,voluptates,facere,error,debitis,ut,nam,et,voluptatem,eum}

Searching for the word fragment praesent returns 7051 rows in 2400 ms:

explain analyse
select count(*)
from test
where 'praesent' % any(names);

                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5479.49..5479.50 rows=1 width=0) (actual time=2400.866..2400.866 rows=1 loops=1)
   ->  Seq Scan on test  (cost=0.00..5477.00 rows=996 width=0) (actual time=1.464..2400.271 rows=7051 loops=1)
         Filter: ('praesent'::text % ANY (names))
         Rows Removed by Filter: 92949
 Planning time: 1.038 ms
 Execution time: 2400.916 ms

Materialized view

One solution is to normalize the model, which involves creating a new table with a single name per row. Such a restructuring may be difficult to implement, and sometimes impossible, because of existing queries, views, functions or other dependencies. A similar effect can be achieved without changing the table structure by using a materialized view.

create materialized view test_names as
    select id, name, name_id
    from test
    cross join unnest(names) with ordinality u(name, name_id)
    with data;

With ordinality is not necessary, but it can be useful when aggregating the names in the same order as in the main table. Querying test_names gives the same results as the main table, in the same time.

After creating the index, the execution time drops many times over:
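For instance, the original arrays can be rebuilt from the view by aggregating in name_id order. This is a sketch; the output alias names is chosen here only for illustration.

```sql
-- name_id comes from "with ordinality", so it preserves element order
select id, array_agg(name order by name_id) as names
from test_names
group by id
order by id;
```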

create index on test_names using gin (name gin_trgm_ops);

explain analyse
select count(distinct id)
from test_names
where 'praesent' % name;

                                                                QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4888.89..4888.90 rows=1 width=4) (actual time=56.045..56.045 rows=1 loops=1)
   ->  Bitmap Heap Scan on test_names  (cost=141.95..4884.39 rows=1799 width=4) (actual time=10.513..54.987 rows=7230 loops=1)
         Recheck Cond: ('praesent'::text % name)
         Rows Removed by Index Recheck: 7219
         Heap Blocks: exact=8122
         ->  Bitmap Index Scan on test_names_name_idx  (cost=0.00..141.50 rows=1799 width=0) (actual time=9.512..9.512 rows=14449 loops=1)
               Index Cond: ('praesent'::text % name)
 Planning time: 2.990 ms
 Execution time: 56.521 ms

The solution has a few drawbacks. Because the view is materialized, the data is stored twice in the database. You have to remember to refresh the view after changes to the main table. And queries may become more complicated because of the need to join the view to the main table.
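The refresh and the join back to the main table might look like this. It is a sketch using the table and view defined above; how often to refresh depends on how the main table changes.

```sql
-- re-populate the view after the main table has changed
refresh materialized view test_names;

-- fetch full rows of the main table for the matching names
select t.*
from test t
where t.id in (
    select id
    from test_names
    where 'praesent' % name
);
```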

Using `ilike`

We can use ilike on arrays represented as text. To create an index on the whole array, we need an immutable function:

create function text(text[])
returns text language sql immutable as
$$ select $1::text $$;

create index on test using gin (text(names) gin_trgm_ops);

and use the function in queries:

explain analyse
select count(*)
from test
where text(names) ilike '%praesent%';

                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=117.06..117.07 rows=1 width=0) (actual time=60.585..60.585 rows=1 loops=1)
   ->  Bitmap Heap Scan on test  (cost=76.08..117.03 rows=10 width=0) (actual time=2.560..60.161 rows=7051 loops=1)
         Recheck Cond: (text(names) ~~* '%praesent%'::text)
         Heap Blocks: exact=2899
         ->  Bitmap Index Scan on test_text_idx  (cost=0.00..76.08 rows=10 width=0) (actual time=2.160..2.160 rows=7051 loops=1)
               Index Cond: (text(names) ~~* '%praesent%'::text)
 Planning time: 3.301 ms
 Execution time: 60.876 ms

60 ms versus 2400 ms: quite a good result without the need to create an additional relation.

This solution seems simpler and requires less work, provided that ilike, which is a less precise tool than the trgm % operator, is sufficient.

Why should we use ilike rather than % for whole arrays as text? The similarity depends largely on the length of the texts. It is very difficult to choose an appropriate limit for searching for a word in long texts of varying length. E.g. with limit = 0.3 we get these results:

with data(txt) as (
values
    ('praesentium,distinctio,modi,nulla,commodi,tempore'),
    ('praesentium,distinctio,modi,nulla,commodi'),
    ('praesentium,distinctio,modi,nulla'),
    ('praesentium,distinctio,modi'),
    ('praesentium,distinctio'),
    ('praesentium')
)
select length(txt), similarity('praesent', txt), 'praesent' % txt "matched?"
from data;

 length | similarity | matched? 
--------+------------+----------
     49 |   0.166667 | f           <--!
     41 |        0.2 | f           <--!
     33 |   0.228571 | f           <--!
     27 |   0.275862 | f           <--!
     22 |   0.333333 | t
     11 |   0.615385 | t
(6 rows)
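The limit here is the similarity threshold that the % operator compares against. A minimal sketch of inspecting and changing it with the pg_trgm functions show_limit() and set_limit() follows. It assumes the extension's functions are reachable on the search_path. Newer PostgreSQL versions document the pg_trgm.similarity_threshold setting for the same purpose.

```sql
-- current threshold used by the % operator
select show_limit();

-- set the threshold for the current session
select set_limit(0.3);

-- compare the raw similarity with the operator's verdict
select similarity('praesent', 'praesentium,distinctio'),
       'praesent' % 'praesentium,distinctio' as "matched?";
```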
