SW engineering, engineering management and the business of software

subscribe for more
stuff like this:

Postgres Text Search 101

Simple primer on the topic of Postgres text search. No code examples, it’s intended to be conceptual.

Simple, Perfect Equality Checks

If you just want to match exactly the contents of a TEXT or VARCHAR field, just use a normal Postgres index on that column.

However most text search tends to be keyword-based partial or fuzzy.

Full-Text Search and Trigram based searching

Full-text search (FTS) matches words exactly as well as understanding the notion of grammar-based modifiers:

  • FTS will match “balance” with any of “balance, balanced, balances, balancing,” etc.
  • FTS will NOT match “university” with “uni”
  • FTS will NOT match “university” with any misspelling, such as “uniservity, universiyt” etc.

You have to tell FTS which language so understands which grammar rules to apply. An index built with English rules would not work as well matching text written in Hawaiian.

Trigram matching is usually done with the LIKE and ILIKE operators and the pg_trgm add on module.

pg_trgm breaks down words into 1-3 letter chunks: “fred” -> f, fr, fre, red, re, ed, d, etc…

Trigrams are great for optionally doing “fuzzy” matching: misspellings or partial words. This tends to catch grammer-based modifications such as plurals or verb conjugation as well.

Trigrams are typically use with weights and scores. The closer the search term is to an indexed term, the higher the weighted score.

  • Trigram based search:
    • Will match “university” with “uni”
    • Can match “university” with any misspelling, such as “uniservity, universiyt” etc. with a lower match score.
    • Can match “balance” with any of “balanced, balances, balancing,” etc. with a lower match score.

Occasionally you see usage of both in parallel. Conceptually, imagine the search results of a search engine being the output of FTS, and the “Did you mean: some_other_spelling” the output of trigram-based fuzzy matching.

GIST vs GIN indices

The docs have a good rundown: https://www.postgresql.org/docs/current/textsearch-indexes.html

In general, GIN is the preferred indexing flavor.

GIN indexes tend to have slightly faster lookups.
GIN also tend to be slightly larger and slower to built.

GIN does not support weights, but GiST does. GiST indices are “lossy”, meaning you may, very occasionally, get false matches.

For me, I almost always use GIN, except when building trigram indices for fuzzy matching I use GiST to get efficient weighted scoring.

EXPLAIN is your friend.

You want to see things like “Index Scan” or “Heap Scan”:

Limit  (cost=0.27..114.25 rows=100 width=75)
->  Index Scan using institutes_searchable_text_trgm_gist_idx on institutes  (cost=0.27..2538.49 rows=2227 width=75)
    Order By: ((((((COALESCE(name, ''::character varying))::text || ' '::text) || COALESCE(alias, ''::text)) || ' '::text) || (COALESCE(city, ''::character varying))::text) <->> 'search phrase'::text)
Filter: (word_similarity(’search phrase::text, (((((COALESCE(name, ''::character varying))::text || ' '::text) || COALESCE(alias, ''::text)) || ' '::text) || (COALESCE(city, ''::character varying))::text)) > '0.4'::double precision)

Limit  (cost=8.79..188.84 rows=67 width=91)
->  Bitmap Heap Scan on institutes  (cost=8.79..188.84 rows=67 width=91)
        Recheck Cond: ((((((COALESCE(name, ''::character varying))::text || ' '::text) || COALESCE(alias, ''::text)) || ' '::text) || (COALESCE(city, ''::character varying))::text) ~~* '%mary%'::text)
->  Bitmap Index Scan on institutes_searchable_text_trgm_gist_idx  (cost=0.00..8.78 rows=67 width=0)
          Index Cond: ((((((COALESCE(name, ''::character varying))::text || ' '::text) || COALESCE(alias, ''::text)) || ' '::text) || (COALESCE(city, ''::character varying))::text) ~~* '%mary%'::text)

If you see “Seq Scan”, you need to index the appropriate columns/tables until you don’t:

Limit  (cost=0.00..567.32 rows=53 width=91)
  ->  Seq Scan on institutes  (cost=0.00..567.32 rows=53 width=91)
        Filter: ((((((COALESCE(name, ''::character varying))::text || ' '::text) || COALESCE(alias, ''::text)) || ' '::text) || (COALESCE(city, ''::character varying))::text) ~~* '%mary%'::text)

The cost=0.00..567.32 chunk is important. That’s postgres telling you what it estimates the startup cost and total cost of the query to be. Typically only the total cost matters for most purposes.

The above were run on a table with just 6000 or so rows. Normally there is a much bigger delta between sequential and index scan costs.

If you use EXPLAIN ANALYZE, you get actual costs, rather than estimates.

There are some cool tools to help you parse the output like: https://explain.depesz.com

Optimization is a rabbit hole

If you can eliminate sequential scans, you’ve done the low hanging fruit. After that, it’s usually measure, trial and iterate.

Importantly, the type of data (table structure, data types, etc) and type of search matters a lot and will dictate any deep optimization strategies.

Here’s an example of fairly deep optimization with great results: https://alexklibisz.com/2022/02/18/optimizing-postgres-trigram-search

in lieu of comments, you should follow me on twitter at twitter/amattn and on twitch.tv at twitch.tv/amattn. I'm happy to chat about content here anytime.

the fine print:
aboutarchivemastodontwittertwitchconsulting or speaking inquiries
© matt nunogawa 2010 - 2023 / all rights reserved