Phoenix / Ecto - Fuzzy Postgres Search using SIMILARITY
Simple Effective Scored Fuzzy Searches
Overview
Recently I was working on a project at work that required finding the appropriate record with incomplete information (that might be either mispelled or within multiple columns - thus LIKE
and ILIKE
are insufficient).
My co-worker Gernot Kogler, introduced me to the trigram scoring searches using similarity
and word_similarity
- this is a simple and very effective way to do fuzzy searches.
If interested, a good article Optimizing Postgres Text Search with Trigrams that explains the details of the scoring and optimizing the search speed.
In fact, there are several ways to do fuzzy searches in Postgres, this is probably an incomplete list, but a good starting place for those who want to explore further.
LIKE
&ILIKE
- single column (exact partial matches)grep
- searches matchessimilarity
&word_similarity
- multiple columns (scored partial)pg_trgrm
(trigram) extension - efficient scored partial matchesfuzzystrmatch
- fuzzy searcheslevenshtein
- Full Text (Document) Search
In this article, we will only explore similarity
and word_similarity
.
Getting Started
Let’s create a test project that finds the best matching person in our database, from a ‘description’.
Simple Rails Project
# macos 15 seems to need this
ulimit -n 10240
rbenv istall 3.3.5
rbenv local 3.3.5
rails new fuzzy -d postgresql
# let's check the gem file to be sure we are using PG,
# we expect to find:
gem "pg", "~> 1.1"
# lets be sure we are using uptodate gems:
cd fuzzy
bundle update
# currently if we check Gemfile.lock we will see:
rails (7.2.2)
# setup the database with
bin/rails db:create
# assuming all went well
git init
git add .
git commit -m "initial commit"
Add DB extensions
To do similarity searches we need the extension: pg_trgm
bin/rails generate migration AddPgTrgmExtensionToDb
# priv/repo/migrations/20241003142425_add_pg_extensions.exs
defmodule Fuzzy.Repo.Migrations.AddPgExtensions do
use Ecto.Migration
def change do
execute("CREATE EXTENSION IF NOT EXISTS citext", "DROP EXTENSION IF EXISTS citext")
execute("CREATE EXTENSION IF NOT EXISTS pg_trgm", "DROP EXTENSION IF EXISTS pg_trgm")
end
end
now migrate
mix ecto.migrate
Create Person
Let’s create a simple person schema and learn to do a fuzzy search.
mix phx.gen.schema Person people last_name:string first_name:string job_title:string department:string
mix phx.gen.schema Account accounts status:enum:active:inactive username:string:unique password:string:redact person_id:references:people
Update Migration
now let’s ensure that username is case insensitive (using the citext
extension). We change :username
column’s type from :string
to :citext
.
# priv/repo/migrations/20241003150857_create_accounts.exs
defmodule Fuzzy.Repo.Migrations.CreateAccounts do
use Ecto.Migration
def change do
create table(:accounts) do
add :status, :string
# add :username, :string
add :username, :citext # now for Ecto this is a case insensitive string
# ^^^^^^^
add :password, :string
add :person_id, references(:people, on_delete: :nothing)
timestamps(type: :utc_datetime)
end
create unique_index(:accounts, [:username])
create index(:accounts, [:person_id])
end
end
now we can migrate:
mix ecto.migrate
Add Relationships
By adding relationships we can simplify our queries.
# lib/fuzzy/person.ex
defmodule Fuzzy.Person do
use Ecto.Schema
import Ecto.Changeset
schema "people" do
field :last_name, :string
field :first_name, :string
field :job_title, :string
field :department, :string
# add a relationship to the account
has_one :account, Fuzzy.Account
timestamps(type: :utc_datetime)
end
@doc false
def changeset(person, attrs) do
person
|> cast(attrs, [:last_name, :first_name, :job_title, :department])
|> validate_required([:last_name, :first_name, :job_title, :department])
end
end
# lib/fuzzy/account.ex
defmodule Fuzzy.Account do
use Ecto.Schema
import Ecto.Changeset
schema "accounts" do
field :status, Ecto.Enum, values: [:active, :inactive]
field :username, :string
field :password, :string, redact: true
# field :person_id, :id
belongs_to :person, Fuzzy.Person
timestamps(type: :utc_datetime)
end
@doc false
def changeset(account, attrs) do
account
|> cast(attrs, [:status, :username, :password, :person_id])
|> validate_required([:status, :username, :password, :person_id])
|> unique_constraint(:username)
end
end
Seeding
Now let’s make some records for testing using the seeds.exs
file.
# priv/repo/seeds.exs
alias Fuzzy.{Repo, Person, Account}
# Step 1: People Data
people_data = [
%{
last_name: "Smith",
first_name: "John",
title: "Software Engineer",
status: :active,
username: "johnsmith",
password: "password123",
department: "Product"
},
%{
last_name: "Johnson",
first_name: "John",
title: "Network Engineer",
status: :active,
username: "johnjohnson",
password: "password123",
department: "Operations"
},
%{
last_name: "Johanson",
first_name: "Jonathan",
title: "QA Engineer",
status: :active,
username: "jonathanjohanson",
password: "password123",
department: "Quality"
},
%{
last_name: "Smith",
first_name: "Charles",
title: "Tester",
status: :active,
username: "charlessmith",
password: "password123",
department: "Quality"
},
%{
last_name: "Brown",
first_name: "Charlie",
title: "Designer",
status: :inactive,
username: "charliebrown",
password: "password123",
department: "Product"
},
%{
last_name: "Johnson",
first_name: "Emma",
title: "Data Scientist",
status: :active,
username: "emmajohnson",
password: "password123",
department: "Research"
},
%{
last_name: "Johnston",
first_name: "Emilia",
title: "Data Scientist",
status: :active,
username: "emiliajohnston",
password: "password123",
department: "Research"
},
%{
last_name: "Williams",
first_name: "Liam",
title: "DevOps Engineer",
status: :inactive,
username: "liamwilliams",
password: "password123",
department: "Operations"
},
%{
last_name: "Jones",
first_name: "Olivia",
title: "UX Researcher",
status: :active,
username: "oliviajones",
password: "password123",
department: "Research"
},
%{
last_name: "Miller",
first_name: "Noah",
title: "Software Engineer",
status: :inactive,
username: "noahmiller",
password: "password123",
department: "Product"
},
%{
last_name: "Davis",
first_name: "Ava",
title: "Software Engineer",
status: :active,
username: "avadavis",
password: "password123",
department: "Product"
},
%{
last_name: "Garcia",
first_name: "Sophia",
title: "QA Engineer",
status: :inactive,
username: "sophiagarcia",
password: "password123",
department: "Quality"
},
%{
last_name: "Rodriguez",
first_name: "Isabella",
title: "Tech Support",
status: :active,
username: "isabellarodriguez",
password: "password123",
department: "Customers"
},
%{
last_name: "Martinez",
first_name: "Mason",
title: "Business Analyst",
status: :active,
username: "masonmartinez",
password: "password123",
department: "Research"
},
%{
last_name: "Hernandez",
first_name: "Lucas",
title: "Systems Administrator",
status: :inactive,
username: "lucashernandez",
password: "password123",
department: "Operations"
},
%{
last_name: "Lopez",
first_name: "Amelia",
title: "Product Manager",
status: :active,
username: "amelialopez",
password: "password123",
department: "Business"
},
%{
last_name: "Gonzalez",
first_name: "James",
title: "Network Engineer",
status: :inactive,
username: "jamesgonzalez",
password: "password123",
department: "Operations"
},
%{
last_name: "Wilson",
first_name: "Elijah",
title: "Cloud Architect",
status: :active,
username: "elijahwilson",
password: "password123",
department: "Operations"
}
]
# Step 2: Insert People and Assign Accounts & Departments
Enum.each(people_data, fn person_data ->
# Insert the Person record
person =
%Person{}
|> Person.changeset(%{
first_name: person_data.first_name,
last_name: person_data.last_name,
title: person_data.title,
department: person_data.department})
|> Repo.insert!()
# Insert the corresponding Account record
%Account{}
|> Account.changeset(%{
person_id: person.id,
username: person_data.username,
password: person_data.password,
status: person_data.status
})
|> Repo.insert!()
end)
now let’s run the seed file:
mix run priv/repo/seeds.exs
Test Setup
let’s see if we can access our records using iex
:
# enter iex (within the context of phoenix)
iex -S mix phx.server
# load the Query module and simplify with aliases
import Ecto.Query
alias Fuzzy.Account
alias Fuzzy.Person
alias Fuzzy.Repo
# Using pipeline syntax
last_pipeline_person =
(
Person
|> order_by(desc: :inserted_at)
|> Repo.all()
|> Repo.preload(:account)
)
# using Ecto from details:
last_from_person =
Repo.all(from p in Person, order_by: [desc: p.inserted_at], preload: [:account])
If all this works we are ready to go:
git add .
git commit -m "database setup and added people"
Simple Fuzzy Search
Now that we have some records, let’s create a simple fuzzy search module.
Simple Fuzzy first_name
Search
# enter iex (within the context of phoenix)
iex -S mix phx.server
# load the Query module and simplify with aliases
import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person
search_string = "John"
threshold = 0.05
from(p in Person,
where:
fragment(
"word_similarity(?, ?) > ?",
p.first_name,
^search_string,
^threshold
),
select: %{
score:
fragment(
"word_similarity(?, ?)",
p.first_name,
^search_string
),
person: p
},
order_by: [
desc: fragment(
"word_similarity(?, ?)",
p.first_name,
^search_string
)
]
)
|> Repo.all()
[
%{
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 1,
last_name: "Smith",
first_name: "John",
title: "Software Engineer",
department: "Product",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
},
score: 1.0
},
%{
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 2,
last_name: "Johnson",
first_name: "John",
title: "Network Engineer",
department: "Operations",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
},
score: 1.0
},
%{
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 3,
last_name: "Johanson",
first_name: "Jonathan",
title: "QA Engineer",
department: "Quality",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
},
score: 0.2222222238779068
},
%{
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 17,
last_name: "Gonzalez",
first_name: "James",
title: "Network Engineer",
department: "Operations",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
},
score: 0.1666666716337204
}
]
If you get the following error, Then you need to fix (or install) the pg_trgm
extension.
** (Postgrex.Error) ERROR 42883 (undefined_function) function word_similarity(character varying, unknown) does not exist
Play with the search string a bit (and the threshold) to see how it works. You will probably find that search strings LONGER than 3 characters work best.
A score of 1 means a perfect match (100% quality match). The closer to 0, the worse, the match. The closer to 1, the better, the match.
You may also notice we have two records with a 100% match, but they have different lastnames, roles, etc. In order to refine our search - let’s explore multi-column fuzzy searches.
Multi-Column Fuzzy Search
Let’s try matching on first and last name using:
iex -S mix phx.server
import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person
# if ewe lower the threshold we get more results
search_string = "john smith"
threshold = 0.4
from(p in Person,
where:
fragment(
"word_similarity(CONCAT_WS(' ', ?, ?), ?) > ?",
p.first_name,
p.last_name,
^search_string,
^threshold
),
select: %{
score:
fragment(
"word_similarity(CONCAT_WS(' ', ?, ?), ?)",
p.first_name,
p.last_name,
^search_string
),
person: p
},
order_by: [
desc: fragment(
"word_similarity(CONCAT_WS(' ', ?, ?), ?)",
p.first_name,
p.last_name,
^search_string
)
]
)
|> Repo.all()
[
%{
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 1,
last_name: "Smith",
first_name: "John",
title: "Software Engineer",
department: "Product",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
},
score: 1.0
},
%{
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 2,
last_name: "Johnson",
first_name: "John",
title: "Network Engineer",
department: "Operations",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
},
score: 0.5555555820465088
},
%{
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 4,
last_name: "Smith",
first_name: "Charles",
title: "Tester",
department: "Quality",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
},
score: 0.4285714328289032
}
]
you will see that longer strings comparisons are more likely to match (and shorter strings, the lower the match scores).
you can / should of course include all the person fields.
Multi-Table Fuzzy Search
To improve the search we can add multiple columns to the search (including accross tables).
Let’s search first and last name, job title and username.
iex -S mix phx.server
import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person
# if ewe lower the threshold we get more results
search_string = "john an engineer in product"
threshold = 0.3
from(p in Person,
join: a in assoc(p, :account), # with associations defined in the schema
# join: a in Account, on: a.person_id == p.id, # without associations in schema
where:
# needs a threshold for a binary result (include in results or not)
fragment(
"similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?) > ?",
# concat columns with a space between them
a.username,
p.first_name,
p.last_name,
p.title,
p.department,
# input params
^search_string,
^threshold
),
select: %{
score:
fragment(
"similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
# concat columns with a space between them
a.username,
p.first_name,
p.last_name,
p.title,
p.department,
# input params
^search_string
),
person: p
},
order_by:
[
desc: fragment(
"similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
# concat columns with a space between them
a.username,
p.first_name,
p.last_name,
p.title,
p.department,
# input params
^search_string
)
]
) |> Repo.all()
# returns:
[
%{
score: 0.5,
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 1,
last_name: "Smith",
first_name: "John",
title: "Software Engineer",
department: "Product",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
}
},
%{
score: 0.375,
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 11,
last_name: "Davis",
first_name: "Ava",
title: "Software Engineer",
department: "Product",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
}
},
%{
score: 0.3400000035762787,
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 10,
last_name: "Miller",
first_name: "Noah",
title: "Software Engineer",
department: "Product",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
}
}
]
Now the queries are more complicated, but also more useful.
Fuzzy Search without Threshold
using limit
iex -S mix phx.server
import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person
# if ewe lower the threshold we get more results
search_string = "john an engineer in product"
from(p in Person,
join: a in assoc(p, :account), # with associations defined in the schema
# join: a in Account, on: a.person_id == p.id, # without associations in schema
select: %{
score:
fragment(
"similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
# concat columns with a space between them
a.username,
p.first_name,
p.last_name,
p.title,
p.department,
# input params
^search_string
),
person: p
},
order_by:
[
desc: fragment(
"similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
# concat columns with a space between them
a.username,
p.first_name,
p.last_name,
p.title,
p.department,
# input params
^search_string
)
],
limit: 2
) |> Repo.all()
[
%{
score: 0.5,
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 1,
last_name: "Smith",
first_name: "John",
title: "Software Engineer",
department: "Product",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
}
},
%{
score: 0.375,
person: %Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 11,
last_name: "Davis",
first_name: "Ava",
title: "Software Engineer",
department: "Product",
account: #Ecto.Association.NotLoaded<association :account is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
}
}
]
Only Sort & Limit
iex -S mix phx.server
import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person
# if ewe lower the threshold we get more results
search_string = "john an engineer in product"
# or just in sorting
from(p in Person,
join: a in assoc(p, :account), # with associations defined in the schema
select: p,
preload: [:account],
order_by:
[
desc: fragment(
"similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
# concat columns with a space between them
a.username,
p.first_name,
p.last_name,
p.title,
p.department,
# input params
^search_string
)
],
limit: 2
) |> Repo.all()
[
%Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 1,
last_name: "Smith",
first_name: "John",
title: "Software Engineer",
department: "Product",
account: #Fuzzy.Account<
__meta__: #Ecto.Schema.Metadata<:loaded, "accounts">,
id: 1,
status: :active,
username: "johnsmith",
person_id: 1,
person: #Ecto.Association.NotLoaded<association :person is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z],
...
>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
},
%Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 11,
last_name: "Davis",
first_name: "Ava",
title: "Software Engineer",
department: "Product",
account: #Fuzzy.Account<
__meta__: #Ecto.Schema.Metadata<:loaded, "accounts">,
id: 11,
status: :active,
username: "avadavis",
person_id: 11,
person: #Ecto.Association.NotLoaded<association :person is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z],
...
>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
}
]
Or if you only every want the top response:
iex -S mix phx.server
import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person
# if ewe lower the threshold we get more results
search_string = "john an engineer in product"
from(p in Person,
join: a in assoc(p, :account), # with associations defined in the schema
select: p,
preload: [:account],
order_by:
[
desc: fragment(
"similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
# concat columns with a space between them
a.username,
p.first_name,
p.last_name,
p.title,
p.department,
# input params
^search_string
)
],
limit: 1
) |> Repo.one()
%Fuzzy.Person{
__meta__: #Ecto.Schema.Metadata<:loaded, "people">,
id: 1,
last_name: "Smith",
first_name: "John",
title: "Software Engineer",
department: "Product",
account: #Fuzzy.Account<
__meta__: #Ecto.Schema.Metadata<:loaded, "accounts">,
id: 1,
status: :active,
username: "johnsmith",
person_id: 1,
person: #Ecto.Association.NotLoaded<association :person is not loaded>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z],
...
>,
inserted_at: ~U[2024-10-04 09:31:56Z],
updated_at: ~U[2024-10-04 09:31:56Z]
}
I personally prefer at least returning the score to assess the match quality even when I set the threshold.
Conclusion
This is a straight forward way to find records from incomplete information and when used in coordination with other strategies this can be a powerful way to help users locate data.
PS - generally speaking 0.3 is considered a generous, yet healthy cutoff.