Rails with Postgres - Fuzzy Searches
Overview
Recently I was working on a project at work that required finding the appropriate record with incomplete information (that might be either mispelled or within multiple columns - thus LIKE
and ILIKE
are insufficient).
My co-worker Gernot Kogler, introduced me to the trigram scoring searches using similarity
and word_similarity
- this is a simple and very effective way to do fuzzy searches.
If interested, a good article Optimizing Postgres Text Search with Trigrams that explains the details of the scoring and optimizing the search speed.
In fact, there are several ways to do fuzzy searches in Postgres, this this article covers pg_trgrm
(trigram) extension - efficient scored partial matches. In the coclusion there is a list of alternatives to pg_trgrm.
TLDR - setup
You can find the code for this project at rails_pg_fuzzy_search so you can just do:
git clone https://github.com/btihen-dev/rails_pg_fuzzy_search
cd rails_pg_fuzzy_search
bin/rails db:create
bin/rails db:migrate
bin/rails db:seed
Or you can create the environment yourself using the following commands:
bin/rails new fuzzy -d postgresql -T
cd fuzzy
bin/rails db:create
bin/rails generate migration AddPgTrgmExtensionToDb
# update migration with `enable_extension "pg_trgm"`
# db/migrate/20241031122103_add_pg_trgm_extension_to_db.rb
class AddPgTrgmExtensionToDb < ActiveRecord::Migration[7.2]
def change
enable_extension "pg_trgm"
end
end
bin/rails generate model Person last_name:string:index first_name:string:index
bin/rails generate model Role job_title:string department:string person:references
bin/rails db:migrate
# see `Seeding` section for a seed file
bin/rails db:seed
TLDR - Examples
SQL - Single Table - Single-Column
bin/rails db
SELECT id, last_name, first_name,
similarity(last_name, 'Johns') AS score
FROM people
WHERE similarity(last_name, 'Johns') > 0.1
ORDER BY score DESC;
id | last_name | first_name | score
---+---------+---------+---------
2 | Johnson | John | 0.5555556
6 | Johnson | Emma | 0.5555556
7 | Johnston | Emilia | 0.5
3 | Johanson | Jonathan | 0.25
9 | Jones | Olivia | 0.2
(5 rows)
Single Table - Single-Column
bin/rails c
threshold = 0.2
compare_string = 'Johns'
compare_quoted = ActiveRecord::Base.connection.quote(compare_string)
Person.select(Arel.sql("*, similarity(last_name, #{compare_quoted}) AS score"))
.where("similarity(last_name, ?) > ?", compare_string, threshold)
.order(Arel.sql("similarity(last_name, #{compare_quoted}) DESC"))
.limit(5)
# or easier to read
compare_string = 'Johns'
compare_quoted = ActiveRecord::Base.connection.quote(compare_string)
similarity_calc = "similarity(last_name, #{compare_quoted})"
Person.select("*, #{similarity_calc} AS score")
.where("last_name % ?", compare_string)
.order("score DESC")
.limit(5)
Single Table - Multi-Column
use CONCAT_WS
to build a single text from the db fields:
bin/rails c
threshold = 0.2
compare_string = 'Emily Johns'
compare_quoted = ActiveRecord::Base.connection.quote(compare_string)
concat_fields = "CONCAT_WS(' ', first_name, last_name)"
similarity_calc = "similarity(#{concat_fields}, #{compare_quoted})"
Person.select("*, #{similarity_calc} AS score")
.where("#{similarity_calc} > ?", threshold)
.order("score DESC")
.limit(5)
# or
Person.select("*, #{similarity_calc} AS score")
.where("#{concat_fields} % #{compare_quoted}")
.order("score DESC")
.limit(5)
Multi-Table - Multi-Column
bin/rails c
compare_string = 'Emily, a research scientist'
compare_quoted = ActiveRecord::Base.connection.quote(compare_string)
concat_fields = "CONCAT_WS(' ', first_name, last_name, job_title, department)"
similarity_calc = "similarity(#{concat_fields}, #{compare_quoted})"
Person.joins(:roles)
.select("*, #{similarity_calc} AS score")
.order("score DESC")
.limit(3)
# or
Person.joins(:roles)
.select("*, #{similarity_calc} AS score")
.where("#{concat_fields} % #{compare_quoted}")
.order("score DESC")
.limit(3)
Read the rest of article for an explanation and code setup that handles these queries.
Getting Started
Let’s create a test project that finds the best matching person in our database, from a ‘description’.
Simple Rails Project
# macos 15 seems to need this
ulimit -n 10240
rbenv install 3.3.5
rbenv local 3.3.5
# create a new rails project (with PG & without minitest)
rails new fuzzy -d postgresql -T
# let's check the gem file to be sure we are using PG,
# we expect to find:
gem "pg", "~> 1.1"
# lets be sure we are using uptodate gems:
cd fuzzy
bundle update
# currently if we check Gemfile.lock we will see:
rails (7.2.2)
# setup the database with
bin/rails db:create
# assuming all went well
git init
git add .
git commit -m "initial commit"
Add DB extensions
To do similarity searches we need the extension: pg_trgm
bin/rails generate migration AddPgTrgmExtensionToDb
# db/migrate/20241031122103_add_pg_trgm_extension_to_db.rb
class AddPgTrgmExtensionToDb < ActiveRecord::Migration[7.2]
def change
enable_extension "pg_trgm" # for similarity searches
end
# or if you prefer raw sql
# def up
# execute("CREATE EXTENSION IF NOT EXISTS pg_trgm;")
# end
# def down
# execute("DROP EXTENSION IF EXISTS pg_trgm;")
# end
end
now migrate
bin/rails db:migrate
Create Models
Let’s create a simple person schema and learn to do a fuzzy search.
bin/rails generate model Person last_name:string:index first_name:string:index
bin/rails generate model Role job_title:string department:string person:references
Update Migration
# db/migrate/20241031124235_create_people.rb
class CreatePeople < ActiveRecord::Migration[7.2]
def change
create_table :people do |t|
t.string :last_name, null: false
t.string :first_name, null: false
t.timestamps
end
add_index :people, [:last_name, :first_name], unique: true
end
end
# and
# db/migrate/20241031124236_create_roles.rb
class CreateRoles < ActiveRecord::Migration[7.2]
def change
create_table :roles do |t|
t.string :job_title, null: false
t.string :department, null: false
t.references :person, null: false, foreign_key: true
t.timestamps
end
add_index :roles, [:person_id, :job_title, :department], unique: true
end
end
now we can migrate:
bin/rails db:migrate
Add Relationships
By adding relationships we can simplify our queries.
# app/models/person.rb
class Person < ApplicationRecord
has_many :roles
validates :first_name, presence: true
validates :last_name, presence: true
validates :first_name, uniqueness: { scope: [:last_name] }
end
# and
# app/models/role.rb
class Role < ApplicationRecord
belongs_to :person
validates :job_title, presence: true
validates :department, presence: true
validates :person_id, uniqueness: { scope: [:job_title, :department] }
end
Seeding
Now let’s make some records for testing using the seeds.rb
file.
# db/seeds.rb
# Step 1: People Data
people_data = [
{
last_name: "Smith",
first_name: "John",
job_title: "Software Engineer",
department: "Product"
},
{
last_name: "Johnson",
first_name: "John",
job_title: "Network Engineer",
department: "Operations"
},
{
last_name: "Johanson",
first_name: "Jonathan",
job_title: "QA Engineer",
department: "Quality"
},
{
last_name: "Smithers",
first_name: "Charles",
job_title: "Tester",
department: "Quality"
},
{
last_name: "Brown",
first_name: "Charlie",
job_title: "Designer",
department: "Product"
},
{
last_name: "Johnson",
first_name: "Emma",
job_title: "Network Engineer",
username: "emmajohnson",
email: "Emma.Johnson@example.com",
department: "Operations"
},
{
last_name: "Johnston",
first_name: "Emilia",
job_title: "Data Scientist",
department: "Research"
},
{
last_name: "Williams",
first_name: "Liam",
job_title: "DevOps Engineer",
department: "Operations"
},
{
last_name: "Jones",
first_name: "Olivia",
job_title: "UX Researcher",
department: "Research"
},
{
last_name: "Miller",
first_name: "Noah",
job_title: "Software Engineer",
department: "Product"
},
{
last_name: "Davis",
first_name: "Ava",
job_title: "Software Engineer",
department: "Product"
},
{
last_name: "Garcia",
first_name: "Sophia",
job_title: "QA Engineer",
department: "Quality"
},
{
last_name: "Rodriguez",
first_name: "Isabella",
job_title: "Tech Support",
department: "Customers"
},
{
last_name: "Martinez",
first_name: "Mason",
job_title: "Business Analyst",
department: "Research"
},
{
last_name: "Hernandez",
first_name: "Lucas",
job_title: "Systems Administrator",
department: "Operations"
},
{
last_name: "Lopez",
first_name: "Amelia",
job_title: "Product Manager",
department: "Business"
},
{
last_name: "Gonzalez",
first_name: "James",
job_title: "Network Engineer",
department: "Operations"
},
{
last_name: "Wilson",
first_name: "Elijah",
job_title: "Cloud Architect",
department: "Operations"
}
]
# Step 2: Insert People and Assign Accounts & Departments
people_data.each do |person_data|
# Insert the Person record
person =
Person.create(
first_name: person_data[:first_name],
last_name: person_data[:last_name]
)
# Insert the corresponding Account record
Role.create!(
person_id: person.id,
job_title: person_data[:job_title],
department: person_data[:department]
)
end
now let’s run the seed file:
bin/rails db:seed
Test Setup
let’s see if we can access our records using iex
:
bin/rails console
# tes basic models and seed
Person.first
Role.last
# test relationships
Person.first.role
Role.last.person
If all this works we are ready to go:
git add .
git commit -m "database setup and added people"
Simple Fuzzy Search SQL
The following list consists of the features that the extension comes with, which helps in doing trigram searches:
from the docs:
similarity(text, text)
- real - Returns a number that indicates how similar the two arguments are. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical).word_similarity(text, text)
- real - Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below.strict_word_similarity(text, text)
- real - Same as word_similarity(text, text), but forces extent boundaries to match word boundaries. Since we don’t have cross-word trigrams, this function actually returns greatest similarity between first string and any continuous extent of words of the second string.
Debuggin
show_trgm(text)
- text[] - Returns an array of all the trigrams in the given string. (In practice this is seldom useful except for debugging.)
Shows / Sets Default Threshold
show_limit()
- real - Returns the current similarity threshold used by the%
operator. This sets the minimum similarity between two words for them to be considered similar enough to be misspellings of each other, for example (deprecated).set_limit(real)
- real - Sets the current similarity threshold that is used by the%
operator. The threshold must be between 0 and 1 (default is 0.3). Returns the same value passed in (deprecated).
SQL Examples
Trigrams are substrings of three consecutive characters. For example, the word “Johns” has the following trigrams: " j"," jo", “joh”, “ohn”, “hns”, “ns “. We can use show_trgm('Johns')
to see the trigrams used by pg_trgm.
SELECT show_trgm('Johns') AS trigrams;
trigrams
------------------------------
{" j"," jo","hns","joh","ns ","ohn"}
(1 row)
Lets try out the various functions this extension gives us:
bin/rails db
-- trying our 3 methods of similarity
SELECT similarity('word', 'two words');
similarity
----------
0.36363637
SELECT word_similarity('word', 'two words');
word_similarity
-------------
0.8
SELECT strict_word_similarity('word', 'two words');
strict_word_similarity
-------------------
0.5714286
(1 row)
Let’s try something more useful and compare a db field to a string - using:
similarity(last_name, 'Johns')
which returns a (‘real’) a similarity score between 0 and 1.
NOTE: where clauses must return a binary value (true or false), thus to use a where clause, it looks like:
WHERE similarity(last_name, 'Johns') > 0.1
thus a basic query would look like:
bin/rails db
SELECT id, last_name, first_name,
similarity(last_name, 'Johns') AS score
FROM people
WHERE similarity(last_name, 'Johns') > 0.1
ORDER BY score DESC;
id | last_name | first_name | score
---+---------+---------+---------
2 | Johnson | John | 0.5555556
6 | Johnson | Emma | 0.5555556
7 | Johnston | Emilia | 0.5
3 | Johanson | Jonathan | 0.25
9 | Jones | Olivia | 0.2
(5 rows)
To simplify (and use the default threshold of 0.3) we can rewrite the query using a %
boolean operator:
SELECT id, last_name, first_name,
similarity(last_name, 'Johns') AS score
FROM people
WHERE last_name % 'Johns'
ORDER BY score DESC;
id | last_name | first_name | score
---+---------+----------+---------
2 | Johnson | John | 0.5555556
6 | Johnson | Emma | 0.5555556
7 | Johnston | Emilia | 0.5
We don’t actually need the where
clause, lets see what happens if we remove it:
SELECT id, last_name, first_name,
similarity(last_name, 'Johns') AS score
FROM people
ORDER BY score DESC;
id | last_name | first_name | score
---+---------+----------+-----------
6 | Johnson | Emma | 0.5555556
2 | Johnson | John | 0.5555556
7 | Johnston | Emilia | 0.5
3 | Johanson | Jonathan | 0.25
9 | Jones | Olivia | 0.2
11 | Davis | Ava | 0
12 | Garcia | Sophia | 0
13 | Rodriguez | Isabella | 0
14 | Martinez | Mason | 0
15 | Hernandez | Lucas | 0
16 | Lopez | Amelia | 0
17 | Gonzalez | James | 0
1 | Smith | John | 0
18 | Wilson | Elijah | 0
4 | Smithers | Charles | 0
5 | Brown | Charlie | 0
8 | Williams | Liam | 0
10 | Miller | Noah | 0
(18 rows)
We see our results return every record - no matter the score. Usually we are only interested in the top results, so we can use LIMIT
(along with ORDER BY
) to limit the top results (this is probably faster than using the where
clause):
SELECT id, last_name, first_name,
similarity(last_name, 'Johns') AS score
FROM people
ORDER BY score DESC
LIMIT 4;
id | last_name | first_name | score
----+-----------+------------+-----------
6 | Johnson | Emma | 0.5555556
2 | Johnson | John | 0.5555556
7 | Johnston | Emilia | 0.5
3 | Johanson | Jonathan | 0.25
(4 rows)
OK - this is looking good - let’s do this for the full name (first and last name). To do this we need to concatenate the first and last name together (with a space as a separator). We can use CONCAT_WS
to do this:
SELECT id, first_name, last_name,
similarity(CONCAT_WS(' ', first_name, last_name), 'Emily Johns') AS score
FROM people
ORDER BY score DESC
LIMIT 5;
id | first_name | last_name | score
----+------------+-----------+------------
7 | Emilia | Johnston | 0.47368422
6 | Emma | Johnson | 0.3888889
2 | John | Johnson | 0.3125
1 | John | Smith | 0.21052632
3 | Jonathan | Johanson | 0.125
Interestingly, if we switch the order of the field names compared to the given string (at least for two fields), we get the same results:
SELECT id, last_name, first_name,
similarity(CONCAT_WS(' ', last_name, first_name), 'Emily Johns') AS score
FROM people
ORDER BY score DESC
LIMIT 5;
id | last_name | first_name | score
---+---------+----------+----------
7 | Johnston | Emilia | 0.47368422
6 | Johnson | Emma | 0.3888889
2 | Johnson | John | 0.3125
1 | Smith | John | 0.21052632
3 | Johanson | Jonathan | 0.125
(5 rows)
Simple Fuzzy Search Rails
Single Field Example
threshold = 0.3
similarity_string = 'Emily Johns'
similarity_quoted = ActiveRecord::Base.connection.quote(similarity_string)
Person.where("similarity(last_name, ?) > ?", similarity_string, threshold)
.order("similarity(last_name, #{similarity_quoted}) DESC")
.limit(5)
# Dangerous query method (method whose arguments are used as raw SQL) called with non-attribute argument(s): "similarity(last_name, 'Emily Johns') DESC".This method should not be called with user-provided values, such as request parameters or model attributes. Known-safe values can be passed by wrapping them in Arel.sql(). (ActiveRecord::UnknownAttributeReference)
Person.where("similarity(last_name, ?) > ?", similarity_string, threshold)
.order(Arel.sql("similarity(last_name, #{similarity_quoted}) DESC"))
.limit(5)
[#<Person:0x000000011dfce108
id: 2,
last_name: "Johnson",
first_name: "John",
created_at: "2024-10-31 13:42:15.380507000 +0000",
updated_at: "2024-10-31 13:42:15.380507000 +0000">,
#<Person:0x000000011dfcdfc8
id: 6,
last_name: "Johnson",
first_name: "Emma",
created_at: "2024-10-31 13:42:15.456393000 +0000",
updated_at: "2024-10-31 13:42:15.456393000 +0000">,
#<Person:0x000000011dfcde88
id: 7,
last_name: "Johnston",
first_name: "Emilia",
created_at: "2024-10-31 13:42:15.473551000 +0000",
updated_at: "2024-10-31 13:42:15.473551000 +0000">]
We only see three because other results are below the threshold, to see this we can use select
to add the score to the results (now that we have an alias score
for the similarity
function we can simplify the ORDER BY
):
threshold = 0.3
similarity_string = 'Emily Johns'
similarity_quoted = ActiveRecord::Base.connection.quote(similarity_string)
Person.select("*, similarity(last_name, #{similarity_quoted}) AS score")
.where("similarity(last_name, ?) > ?", similarity_string, threshold)
.order("score DESC")
.limit(5)
[#<Person:0x000000011dfce108
id: 2,
last_name: "Johnson",
first_name: "John",
created_at: "2024-10-31 13:42:15.380507000 +0000",
updated_at: "2024-10-31 13:42:15.380507000 +0000",
score: 0.33333334>,
#<Person:0x000000011dfcdfc8
id: 6,
last_name: "Johnson",
first_name: "Emma",
created_at: "2024-10-31 13:42:15.456393000 +0000",
updated_at: "2024-10-31 13:42:15.456393000 +0000",
score: 0.33333334>,
#<Person:0x000000011dfcde88
id: 7,
last_name: "Johnston",
first_name: "Emilia",
created_at: "2024-10-31 13:42:15.473551000 +0000",
updated_at: "2024-10-31 13:42:15.473551000 +0000",
score: 0.3125>]
Boolean Search Operators (shortens where clauses)
Using %
operator we can further simplify the query by replacing:
where("similarity(last_name, ?) > ?", similarity_string, threshold)
with the boolean operator:
where("? % ?", last_name, similarity_string)
The difference is that the %
operator ONLY uses the default threshold of 0.3 or whatever what last set using set_limit(0.3)
so if you want to dynamically change the threshold its probably easier and clearer to use the similarity function as used above.
Here is an example using the %
operator:
compare_string = 'Emily Johns'
compare_quoted = ActiveRecord::Base.connection.quote(compare_string)
similarity_string = "similarity(last_name, #{compare_quoted})"
Person.select("*, #{similarity_string} AS score, show_limit() as threshold")
.where("last_name % ?", compare_quoted)
.order("score DESC")
.limit(5)
[#<Person:0x000000015305a510
id: 2,
last_name: "Johnson",
first_name: "John",
created_at: "2024-10-31 18:30:08.640962000 +0000",
updated_at: "2024-10-31 18:30:08.640962000 +0000",
score: 0.33333334,
threshold: 0.3>,
#<Person:0x000000015305a3d0
id: 6,
last_name: "Johnson",
first_name: "Emma",
created_at: "2024-10-31 18:30:08.757941000 +0000",
updated_at: "2024-10-31 18:30:08.757941000 +0000",
score: 0.33333334,
threshold: 0.3>,
#<Person:0x000000015305a290
id: 7,
last_name: "Johnston",
first_name: "Emilia",
created_at: "2024-10-31 18:30:08.773495000 +0000",
updated_at: "2024-10-31 18:30:08.773495000 +0000",
score: 0.3125,
threshold: 0.3>]
Here is a summary of all the boolean operators:
text % text
→ Returns true if its arguments have a similarity that is greater than the current similarity threshold set by pg_trgm.similarity_threshold.text <% text
→ Returns true if the similarity between the trigram set in the first argument and a continuous extent of an ordered trigram set in the second argument is greater than the current word similarity threshold set by pg_trgm.word_similarity_threshold parameter. (text %> text
→ Commutator of the <% operator)text <<% text
→ Returns true if its second argument has a continuous extent of an ordered trigram set that matches word boundaries, and its similarity to the trigram set of the first argument is greater than the current strict word similarity threshold set by the pg_trgm.strict_word_similarity_threshold parameter. (text %>> text
→ Commutator of the «% operator)
Without with
If we only want a few top results back and not all values above the threshold we can simply drop the where
clause and use limit
to limit the results:
compare_string = 'Emily Johns'
compare_quoted = ActiveRecord::Base.connection.quote(compare_string)
similarity_string = "similarity(last_name, #{compare_quoted})"
Person.select("*, #{similarity_string} AS score")
.order("score DESC")
.limit(5)
[#<Person:0x000000015313a7a0
id: 2,
last_name: "Johnson",
first_name: "John",
created_at: "2024-10-31 18:30:08.640962000 +0000",
updated_at: "2024-10-31 18:30:08.640962000 +0000",
score: 0.33333334>,
#<Person:0x000000015313a660
id: 6,
last_name: "Johnson",
first_name: "Emma",
created_at: "2024-10-31 18:30:08.757941000 +0000",
updated_at: "2024-10-31 18:30:08.757941000 +0000",
score: 0.33333334>,
#<Person:0x000000015313a520
id: 7,
last_name: "Johnston",
first_name: "Emilia",
created_at: "2024-10-31 18:30:08.773495000 +0000",
updated_at: "2024-10-31 18:30:08.773495000 +0000",
score: 0.3125>,
#<Person:0x000000015313a3e0
id: 3,
last_name: "Johanson",
first_name: "Jonathan",
created_at: "2024-10-31 18:30:08.674588000 +0000",
updated_at: "2024-10-31 18:30:08.674588000 +0000",
score: 0.16666667>,
#<Person:0x000000015313a2a0
id: 9,
last_name: "Jones",
first_name: "Olivia",
created_at: "2024-10-31 18:30:08.820875000 +0000",
updated_at: "2024-10-31 18:30:08.820875000 +0000",
score: 0.125>]
That’s much more readable!
Two Fields Example
Only searching the last_name
doesn’t fully help us narrow our choices. Let’s try searching the first_name
and last_name
together:
We can use CONCAT_WS
to consolidate multiple fields into a single value (and compare that to the given string) so the similarity function would look like:
similarity(CONCAT_WS(' ', last_name, first_name), 'Emily Johns')
where CONCAT_WS(' ', last_name, first_name)
is our first argument compared to
'Emily Johns'
which is our second argument.
Here is an example:
Person
.select(
"*, similarity(CONCAT_WS(' ', last_name, first_name), 'Emily Johns') AS score"
).order("score DESC")
.limit(3)
[#<Person:0x000000011ddf1060
id: 7,
last_name: "Johnston",
first_name: "Emilia",
created_at: "2024-10-31 18:30:08.773495000 +0000",
updated_at: "2024-10-31 18:30:08.773495000 +0000",
score: 0.47368422>,
#<Person:0x000000011ddf0f20
id: 6,
last_name: "Johnson",
first_name: "Emma",
created_at: "2024-10-31 18:30:08.757941000 +0000",
updated_at: "2024-10-31 18:30:08.757941000 +0000",
score: 0.3888889>,
#<Person:0x000000011ddf0de0
id: 2,
last_name: "Johnson",
first_name: "John",
created_at: "2024-10-31 18:30:08.640962000 +0000",
updated_at: "2024-10-31 18:30:08.640962000 +0000",
score: 0.3125>]
Again, this is hard to read - let’s try making it a bit more readable:
compare_string = 'Emily Johns'
compare_quoted = ActiveRecord::Base.connection.quote(compare_string)
db_fields_concat = "CONCAT_WS(' ', last_name, first_name)"
similarity_calc = "similarity(#{db_fields_concat}, #{compare_quoted})"
Person.select("*, #{similarity_calc} AS score")
.order("score DESC")
.limit(3)
[#<Person:0x000000011dad49e0
id: 7,
last_name: "Johnston",
first_name: "Emilia",
created_at: "2024-10-31 18:30:08.773495000 +0000",
updated_at: "2024-10-31 18:30:08.773495000 +0000",
score: 0.47368422>,
#<Person:0x000000011dad48a0
id: 6,
last_name: "Johnson",
first_name: "Emma",
created_at: "2024-10-31 18:30:08.757941000 +0000",
updated_at: "2024-10-31 18:30:08.757941000 +0000",
score: 0.3888889>,
#<Person:0x000000011dad4760
id: 2,
last_name: "Johnson",
first_name: "John",
created_at: "2024-10-31 18:30:08.640962000 +0000",
updated_at: "2024-10-31 18:30:08.640962000 +0000",
score: 0.3125>]
Whew, that’s pretty readable again!
Multi-Table Fuzzy Search
Lets say we want to search for a person by last_name, first_name, title and department (we aren’t sure what data we have available).
compare_string = 'Emily, a research scientist'
compare_quoted = ActiveRecord::Base.connection.quote(compare_string)
db_fields_concat = "CONCAT_WS(' ', first_name, first_name, job_title, department)"
similarity_calc = "similarity(#{db_fields_concat}, #{compare_quoted})"
Person.joins(:roles)
.select("*, #{similarity_calc} AS score")
.order("score DESC")
.limit(3)
[#<Person:0x000000012277de90
id: 7,
last_name: "Johnston",
first_name: "Emilia",
created_at: "2024-10-31 18:30:08.788015000 +0000",
updated_at: "2024-10-31 18:30:08.788015000 +0000",
job_title: "Data Scientist",
department: "Research",
person_id: 7,
score: 0.6571429>,
#<Person:0x000000012277dc10
id: 6,
last_name: "Johnson",
first_name: "Emma",
created_at: "2024-10-31 18:30:08.765276000 +0000",
updated_at: "2024-10-31 18:30:08.765276000 +0000",
job_title: "Data Scientist",
department: "Research",
person_id: 6,
score: 0.6>,
#<Person:0x000000012277dad0
id: 14,
last_name: "Martinez",
first_name: "Mason",
created_at: "2024-10-31 18:30:08.910685000 +0000",
updated_at: "2024-10-31 18:30:08.910685000 +0000",
job_title: "Business Analyst",
department: "Research",
person_id: 14,
score: 0.22916667>]
Sweet, this works well!
EXTRAS
Efficiency & Performance Ideas
- Trigram Indexes using (
gist_trgm_ops
) - approximates a set of trigrams as a bitmap signature.CREATE INDEX trgm_idx ON people USING GIST (last_name gist_trgm_ops);
CREATE INDEX trgm_idx ON people USING GIN (last_name gin_trgm_ops);
CREATE INDEX trgm_idx ON people USING GIST (last_name gist_trgm_ops(siglen=32));
Example of creating such an index with a signature length of 32 bytes:
SELECT first_name, similarity(first_name, 'Emily') AS sml
FROM people
WHERE first_name % 'Emily'
ORDER BY sml DESC, first_name;
NOTES:
- Beginning in PostgreSQL 9.1, these index types also support index searches for LIKE and ILIKE, for example:
SELECT * FROM people WHERE first_name ILIKE '%emil%';
- Beginning in PostgreSQL 9.3, these index types also support index searches for regular-expression matches (~ and ~* operators), for example:
SELECT * FROM people WHERE first_name ~ '(Emily|John)';
Distance Searching (instead of similarity)
Use distance
instead of similarity
, we can use the <->
operator instead of (1.0 - similarity(last_name, 'Johns'))
to compare a db field to a string - using:
SELECT id, last_name, first_name,
(1.0 - similarity(last_name, 'Johns')) AS score
FROM people
WHERE last_name % 'Johns'
ORDER BY score DESC;
id | last_name | first_name | score
---+---------+----------+----------------
7 | Johnston | Emilia | 0.5
2 | Johnson | John | 0.4444444179534912
6 | Johnson | Emma | 0.4444444179534912
(3 rows)
SELECT id, last_name, first_name,
last_name <-> 'Johns' AS score
FROM people
WHERE last_name % 'Johns'
ORDER BY score DESC;
id | last_name | first_name | score
---+---------+----------+----------
7 | Johnston | Emilia | 0.5
2 | Johnson | John | 0.44444442
6 | Johnson | Emma | 0.44444442
Here is a summary of the short-cut distance operators (1.0 - similarity()
):
text <-> text
→ real - distance using similarity()text <<-> text
→ real - distance using word_similarity()text <->> text
→ real - Commutator of the «-> operatortext <<<-> text
→ real - distance using strict_word_similarity()text <->>> text
→ real - Commutator of the «<-> operator.
Conclusion
similarity
, word_similarity
& strict_word_similarity
- are straight forward ways to find records from partial information.
Depending on your use case, you may want to consider the following other methods:
LIKE
&ILIKE
- sub-string string matching- SIMILAR TO
string SIMILAR TO pattern [ESCAPE escape-character]
string NOT SIMILAR TO pattern [ESCAPE escape-character]
- POSIX Regular Expressions
text ~ text → boolean
ie. ('thomas' ~ 't.*ma'
= true) - String matches regular expression, case sensitivelytext ~* text → boolean
i.e. ('thomas' ~* 'T.*ma'
= true) - string matches regular expression, case-insensitivelytext !~ text → boolean
i.e. ('thomas' !~ 't.*max'
= true) - String does not match regular expression, case sensitivelytext !~* text → boolean
i.e. ('thomas' !~* 'T.*ma'
= false) - - String does not match regular expression, case-insensitively
fuzzystrmatch - determine string similarities and distance
Soundex
(do not use with UTF-8)Metaphone
(do not use with UTF-8)Double Metaphone
(do not use with UTF-8)Levenshtein
(work with UTF-8)Daitch-Mokotoff Soundex
(work with UTF-8)
Full Text (Document) Search - big topic - can in some cases replace something like elastic search.
pg_trgm - efficient scored trigram matching
similarity
- Returns a number that indicates how similar the two arguments are. The range of the result is zero (indicating that the two strings are completely dissimilar) to one (indicating that the two strings are identical).word_similarity
- Returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string. For details, see the explanation below.strict_word_similarity
- ame as word_similarity, but forces extent boundaries to match word boundaries. Since we don’t have cross-word trigrams, this function actually returns greatest similarity between first string and any continuous extent of words of the second string.
NOTE: There is a gem called pg_search that provides the trigram search functionality, but given how straight-forward this is, I’d recommend just doing it yourself.
Resources & Articles
- pg_trgm docs
- Postgres Fuzzy Search (Trigrams)
- Optimizing Postgres Text Search with Trigrams
- Awesome Autocomplete: Trigram Search in Rails and PostgreSQL
Alternatives
- Pattern Matching
- LIKE/ILIKE - returns true if the string matches the supplied pattern
- Regular Expressions (Regex) - returns true if the regex pattern is matched
- SIMILAR TO - a cross between LIKE and REGEX
- FuzzyStrMatch
- Soundex - matching similar-sounding names by converting them to the same code (best for English)
- Daitch-Mokotoff Soundex - matching similar-sounding names by converting them to the same code (good for non-English)
- Levenshtein - the distance between two strings (number of edits to make them the same)
- Metaphone - like Soundex, creates a representative ‘sound’ code for an input string and matches ‘sounds’
- Double Metaphone - creates two ‘sounds like’ strings for a given input string — a ‘primary’ and an ‘alternate’ - espescially helpful with non-English texts
- Full-Text / Document Search