Phoenix / Ecto - Fuzzy Postgres Search using SIMILARITY

Simple Effective Scored Fuzzy Searches

Overview

Recently I was working on a project at work that required finding the appropriate record with incomplete information (that might be either mispelled or within multiple columns - thus LIKE and ILIKE are insufficient).

My co-worker Gernot Kogler, introduced me to the trigram scoring searches using similarity and word_similarity - this is a simple and very effective way to do fuzzy searches.

If interested, a good article Optimizing Postgres Text Search with Trigrams that explains the details of the scoring and optimizing the search speed.

In fact, there are several ways to do fuzzy searches in Postgres, this is probably an incomplete list, but a good starting place for those who want to explore further.

  • LIKE & ILIKE - single column (exact partial matches)
  • grep - searches matches
  • similarity & word_similarity - multiple columns (scored partial)
  • pg_trgrm (trigram) extension - efficient scored partial matches
  • fuzzystrmatch - fuzzy searches
  • levenshtein
  • Full Text (Document) Search

In this article, we will only explore similarity and word_similarity.

Getting Started

Let’s create a test project that finds the best matching person in our database, from a ‘description’.

Simple Rails Project

# macos 15 seems to need this
ulimit -n 10240

rbenv istall 3.3.5
rbenv local 3.3.5
rails new fuzzy -d postgresql

# let's check the gem file to be sure we are using PG,
# we expect to find:
gem "pg", "~> 1.1"

# lets be sure we are using uptodate gems:
cd fuzzy
bundle update

# currently if we check Gemfile.lock we will see:
rails (7.2.2)

# setup the database with
bin/rails db:create

# assuming all went well
git init
git add .
git commit -m "initial commit"

Add DB extensions

To do similarity searches we need the extension: pg_trgm

bin/rails generate migration AddPgTrgmExtensionToDb
# priv/repo/migrations/20241003142425_add_pg_extensions.exs
defmodule Fuzzy.Repo.Migrations.AddPgExtensions do
  use Ecto.Migration

  def change do
    execute("CREATE EXTENSION IF NOT EXISTS citext", "DROP EXTENSION IF EXISTS citext")
    execute("CREATE EXTENSION IF NOT EXISTS pg_trgm", "DROP EXTENSION IF EXISTS pg_trgm")
  end
end

now migrate

mix ecto.migrate

Create Person

Let’s create a simple person schema and learn to do a fuzzy search.

mix phx.gen.schema Person people last_name:string first_name:string job_title:string department:string

mix phx.gen.schema Account accounts status:enum:active:inactive username:string:unique password:string:redact person_id:references:people

Update Migration

now let’s ensure that username is case insensitive (using the citext extension). We change :username column’s type from :string to :citext.

# priv/repo/migrations/20241003150857_create_accounts.exs
defmodule Fuzzy.Repo.Migrations.CreateAccounts do
  use Ecto.Migration

  def change do
    create table(:accounts) do
      add :status, :string
      # add :username, :string
      add :username, :citext # now for Ecto this is a case insensitive string
      #              ^^^^^^^
      add :password, :string
      add :person_id, references(:people, on_delete: :nothing)

      timestamps(type: :utc_datetime)
    end

    create unique_index(:accounts, [:username])
    create index(:accounts, [:person_id])
  end
end

now we can migrate:

mix ecto.migrate

Add Relationships

By adding relationships we can simplify our queries.

# lib/fuzzy/person.ex
defmodule Fuzzy.Person do
  use Ecto.Schema
  import Ecto.Changeset

  schema "people" do
    field :last_name, :string
    field :first_name, :string
    field :job_title, :string
    field :department, :string

    # add a relationship to the account
    has_one :account, Fuzzy.Account

    timestamps(type: :utc_datetime)
  end

  @doc false
  def changeset(person, attrs) do
    person
    |> cast(attrs, [:last_name, :first_name, :job_title, :department])
    |> validate_required([:last_name, :first_name, :job_title, :department])
  end
end

# lib/fuzzy/account.ex
defmodule Fuzzy.Account do
  use Ecto.Schema
  import Ecto.Changeset

  schema "accounts" do
    field :status, Ecto.Enum, values: [:active, :inactive]
    field :username, :string
    field :password, :string, redact: true
    # field :person_id, :id
    belongs_to :person, Fuzzy.Person

    timestamps(type: :utc_datetime)
  end

  @doc false
  def changeset(account, attrs) do
    account
    |> cast(attrs, [:status, :username, :password, :person_id])
    |> validate_required([:status, :username, :password, :person_id])
    |> unique_constraint(:username)
  end
end

Seeding

Now let’s make some records for testing using the seeds.exs file.

# priv/repo/seeds.exs
alias Fuzzy.{Repo, Person, Account}

# Step 1: People Data
people_data = [
  %{
    last_name: "Smith",
    first_name: "John",
    title: "Software Engineer",
    status: :active,
    username: "johnsmith",
    password: "password123",
    department: "Product"
  },
  %{
    last_name: "Johnson",
    first_name: "John",
    title: "Network Engineer",
    status: :active,
    username: "johnjohnson",
    password: "password123",
    department: "Operations"
  },
  %{
    last_name: "Johanson",
    first_name: "Jonathan",
    title: "QA Engineer",
    status: :active,
    username: "jonathanjohanson",
    password: "password123",
    department: "Quality"
  },
  %{
    last_name: "Smith",
    first_name: "Charles",
    title: "Tester",
    status: :active,
    username: "charlessmith",
    password: "password123",
    department: "Quality"
  },
  %{
    last_name: "Brown",
    first_name: "Charlie",
    title: "Designer",
    status: :inactive,
    username: "charliebrown",
    password: "password123",
    department: "Product"
  },
  %{
    last_name: "Johnson",
    first_name: "Emma",
    title: "Data Scientist",
    status: :active,
    username: "emmajohnson",
    password: "password123",
    department: "Research"
  },
  %{
    last_name: "Johnston",
    first_name: "Emilia",
    title: "Data Scientist",
    status: :active,
    username: "emiliajohnston",
    password: "password123",
    department: "Research"
  },
  %{
    last_name: "Williams",
    first_name: "Liam",
    title: "DevOps Engineer",
    status: :inactive,
    username: "liamwilliams",
    password: "password123",
    department: "Operations"
  },
  %{
    last_name: "Jones",
    first_name: "Olivia",
    title: "UX Researcher",
    status: :active,
    username: "oliviajones",
    password: "password123",
    department: "Research"
  },
  %{
    last_name: "Miller",
    first_name: "Noah",
    title: "Software Engineer",
    status: :inactive,
    username: "noahmiller",
    password: "password123",
    department: "Product"
  },
  %{
    last_name: "Davis",
    first_name: "Ava",
    title: "Software Engineer",
    status: :active,
    username: "avadavis",
    password: "password123",
    department: "Product"
  },
  %{
    last_name: "Garcia",
    first_name: "Sophia",
    title: "QA Engineer",
    status: :inactive,
    username: "sophiagarcia",
    password: "password123",
    department: "Quality"
  },
  %{
    last_name: "Rodriguez",
    first_name: "Isabella",
    title: "Tech Support",
    status: :active,
    username: "isabellarodriguez",
    password: "password123",
    department: "Customers"
  },
  %{
    last_name: "Martinez",
    first_name: "Mason",
    title: "Business Analyst",
    status: :active,
    username: "masonmartinez",
    password: "password123",
    department: "Research"
  },
  %{
    last_name: "Hernandez",
    first_name: "Lucas",
    title: "Systems Administrator",
    status: :inactive,
    username: "lucashernandez",
    password: "password123",
    department: "Operations"
  },
  %{
    last_name: "Lopez",
    first_name: "Amelia",
    title: "Product Manager",
    status: :active,
    username: "amelialopez",
    password: "password123",
    department: "Business"
  },
  %{
    last_name: "Gonzalez",
    first_name: "James",
    title: "Network Engineer",
    status: :inactive,
    username: "jamesgonzalez",
    password: "password123",
    department: "Operations"
  },
  %{
    last_name: "Wilson",
    first_name: "Elijah",
    title: "Cloud Architect",
    status: :active,
    username: "elijahwilson",
    password: "password123",
    department: "Operations"
  }
]

# Step 2: Insert People and Assign Accounts & Departments
Enum.each(people_data, fn person_data ->
  # Insert the Person record
  person =
    %Person{}
    |> Person.changeset(%{
      first_name: person_data.first_name,
      last_name: person_data.last_name,
      title: person_data.title,
      department: person_data.department})
    |> Repo.insert!()

  # Insert the corresponding Account record
  %Account{}
  |> Account.changeset(%{
    person_id: person.id,
    username: person_data.username,
    password: person_data.password,
    status: person_data.status
  })
  |> Repo.insert!()
end)

now let’s run the seed file:

mix run priv/repo/seeds.exs

Test Setup

let’s see if we can access our records using iex:

# enter iex (within the context of phoenix)
iex -S mix phx.server

# load the Query module and simplify with aliases
import Ecto.Query
alias Fuzzy.Account
alias Fuzzy.Person
alias Fuzzy.Repo

# Using pipeline syntax
last_pipeline_person =
  (
    Person
    |> order_by(desc: :inserted_at)
    |> Repo.all()
    |> Repo.preload(:account)
  )

# using Ecto from details:
last_from_person =
  Repo.all(from p in Person, order_by: [desc: p.inserted_at], preload: [:account])

If all this works we are ready to go:

git add .
git commit -m "database setup and added people"

Now that we have some records, let’s create a simple fuzzy search module.

# enter iex (within the context of phoenix)
iex -S mix phx.server

# load the Query module and simplify with aliases
import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person

search_string = "John"
threshold = 0.05

from(p in Person,
  where:
    fragment(
      "word_similarity(?, ?) > ?",
      p.first_name,
      ^search_string,
      ^threshold
    ),
  select: %{
    score:
      fragment(
        "word_similarity(?, ?)",
        p.first_name,
        ^search_string
      ),
    person: p
  },
  order_by: [
    desc: fragment(
      "word_similarity(?, ?)",
      p.first_name,
      ^search_string
    )
  ]
)
|> Repo.all()

[
  %{
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 1,
      last_name: "Smith",
      first_name: "John",
      title: "Software Engineer",
      department: "Product",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    },
    score: 1.0
  },
  %{
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 2,
      last_name: "Johnson",
      first_name: "John",
      title: "Network Engineer",
      department: "Operations",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    },
    score: 1.0
  },
  %{
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 3,
      last_name: "Johanson",
      first_name: "Jonathan",
      title: "QA Engineer",
      department: "Quality",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    },
    score: 0.2222222238779068
  },
  %{
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 17,
      last_name: "Gonzalez",
      first_name: "James",
      title: "Network Engineer",
      department: "Operations",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    },
    score: 0.1666666716337204
  }
]

If you get the following error, Then you need to fix (or install) the pg_trgm extension.

** (Postgrex.Error) ERROR 42883 (undefined_function) function word_similarity(character varying, unknown) does not exist

Play with the search string a bit (and the threshold) to see how it works. You will probably find that search strings LONGER than 3 characters work best.

A score of 1 means a perfect match (100% quality match). The closer to 0, the worse, the match. The closer to 1, the better, the match.

You may also notice we have two records with a 100% match, but they have different lastnames, roles, etc. In order to refine our search - let’s explore multi-column fuzzy searches.

Let’s try matching on first and last name using:

iex -S mix phx.server

import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person

# if ewe lower the threshold we get more results
search_string = "john smith"
threshold = 0.4

from(p in Person,
  where:
    fragment(
      "word_similarity(CONCAT_WS(' ', ?, ?), ?) > ?",
      p.first_name,
      p.last_name,
      ^search_string,
      ^threshold
    ),
  select: %{
    score:
      fragment(
        "word_similarity(CONCAT_WS(' ', ?, ?), ?)",
        p.first_name,
        p.last_name,
        ^search_string
      ),
    person: p
  },
  order_by: [
    desc: fragment(
      "word_similarity(CONCAT_WS(' ', ?, ?), ?)",
      p.first_name,
      p.last_name,
      ^search_string
    )
  ]
)
|> Repo.all()


[
  %{
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 1,
      last_name: "Smith",
      first_name: "John",
      title: "Software Engineer",
      department: "Product",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    },
    score: 1.0
  },
  %{
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 2,
      last_name: "Johnson",
      first_name: "John",
      title: "Network Engineer",
      department: "Operations",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    },
    score: 0.5555555820465088
  },
  %{
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 4,
      last_name: "Smith",
      first_name: "Charles",
      title: "Tester",
      department: "Quality",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    },
    score: 0.4285714328289032
  }
]

you will see that longer strings comparisons are more likely to match (and shorter strings, the lower the match scores).

you can / should of course include all the person fields.

To improve the search we can add multiple columns to the search (including accross tables).

Let’s search first and last name, job title and username.

iex -S mix phx.server

import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person

# if ewe lower the threshold we get more results
search_string = "john an engineer in product"
threshold = 0.3

from(p in Person,
  join: a in assoc(p, :account), # with associations defined in the schema
  # join: a in Account, on: a.person_id == p.id, # without associations in schema
  where:
    # needs a threshold for a binary result (include in results or not)
    fragment(
      "similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?) > ?",
       # concat columns with a space between them
        a.username,
      p.first_name,
      p.last_name,
      p.title,
      p.department,
      # input params
      ^search_string,
      ^threshold
    ),
  select: %{
    score:
      fragment(
        "similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
        # concat columns with a space between them
        a.username,
        p.first_name,
        p.last_name,
        p.title,
        p.department,
        # input params
        ^search_string
      ),
    person: p
  },
  order_by:
    [
      desc: fragment(
        "similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
        # concat columns with a space between them
        a.username,
        p.first_name,
        p.last_name,
        p.title,
        p.department,
        # input params
        ^search_string
      )
    ]
) |> Repo.all()

# returns:
[
  %{
    score: 0.5,
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 1,
      last_name: "Smith",
      first_name: "John",
      title: "Software Engineer",
      department: "Product",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    }
  },
  %{
    score: 0.375,
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 11,
      last_name: "Davis",
      first_name: "Ava",
      title: "Software Engineer",
      department: "Product",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    }
  },
  %{
    score: 0.3400000035762787,
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 10,
      last_name: "Miller",
      first_name: "Noah",
      title: "Software Engineer",
      department: "Product",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    }
  }
]

Now the queries are more complicated, but also more useful.

Fuzzy Search without Threshold

using limit

iex -S mix phx.server

import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person

# if ewe lower the threshold we get more results
search_string = "john an engineer in product"

from(p in Person,
  join: a in assoc(p, :account), # with associations defined in the schema
  # join: a in Account, on: a.person_id == p.id, # without associations in schema
  select: %{
    score:
      fragment(
        "similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
        # concat columns with a space between them
        a.username,
        p.first_name,
        p.last_name,
        p.title,
        p.department,
        # input params
        ^search_string
      ),
    person: p
  },
  order_by:
    [
      desc: fragment(
        "similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
        # concat columns with a space between them
        a.username,
        p.first_name,
        p.last_name,
        p.title,
        p.department,
        # input params
        ^search_string
      )
    ],
  limit: 2
) |> Repo.all()

[
  %{
    score: 0.5,
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 1,
      last_name: "Smith",
      first_name: "John",
      title: "Software Engineer",
      department: "Product",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    }
  },
  %{
    score: 0.375,
    person: %Fuzzy.Person{
      __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
      id: 11,
      last_name: "Davis",
      first_name: "Ava",
      title: "Software Engineer",
      department: "Product",
      account: #Ecto.Association.NotLoaded<association :account is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z]
    }
  }
]

Only Sort & Limit

iex -S mix phx.server

import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person

# if ewe lower the threshold we get more results
search_string = "john an engineer in product"

# or just in sorting
from(p in Person,
  join: a in assoc(p, :account), # with associations defined in the schema
  select:  p,
  preload: [:account],
  order_by:
    [
      desc: fragment(
        "similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
        # concat columns with a space between them
        a.username,
        p.first_name,
        p.last_name,
        p.title,
        p.department,
        # input params
        ^search_string
      )
    ],
  limit: 2
) |> Repo.all()

[
  %Fuzzy.Person{
    __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
    id: 1,
    last_name: "Smith",
    first_name: "John",
    title: "Software Engineer",
    department: "Product",
    account: #Fuzzy.Account<
      __meta__: #Ecto.Schema.Metadata<:loaded, "accounts">,
      id: 1,
      status: :active,
      username: "johnsmith",
      person_id: 1,
      person: #Ecto.Association.NotLoaded<association :person is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z],
      ...
    >,
    inserted_at: ~U[2024-10-04 09:31:56Z],
    updated_at: ~U[2024-10-04 09:31:56Z]
  },
  %Fuzzy.Person{
    __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
    id: 11,
    last_name: "Davis",
    first_name: "Ava",
    title: "Software Engineer",
    department: "Product",
    account: #Fuzzy.Account<
      __meta__: #Ecto.Schema.Metadata<:loaded, "accounts">,
      id: 11,
      status: :active,
      username: "avadavis",
      person_id: 11,
      person: #Ecto.Association.NotLoaded<association :person is not loaded>,
      inserted_at: ~U[2024-10-04 09:31:56Z],
      updated_at: ~U[2024-10-04 09:31:56Z],
      ...
    >,
    inserted_at: ~U[2024-10-04 09:31:56Z],
    updated_at: ~U[2024-10-04 09:31:56Z]
  }
]

Or if you only every want the top response:

iex -S mix phx.server

import Ecto.Query
alias Fuzzy.Repo
alias Fuzzy.Person

# if ewe lower the threshold we get more results
search_string = "john an engineer in product"

from(p in Person,
  join: a in assoc(p, :account), # with associations defined in the schema
  select:  p,
  preload: [:account],
  order_by:
    [
      desc: fragment(
        "similarity(CONCAT_WS(' ', ?, ?, ?, ?, ?), ?)",
        # concat columns with a space between them
        a.username,
        p.first_name,
        p.last_name,
        p.title,
        p.department,
        # input params
        ^search_string
      )
    ],
  limit: 1
) |> Repo.one()

%Fuzzy.Person{
  __meta__: #Ecto.Schema.Metadata<:loaded, "people">,
  id: 1,
  last_name: "Smith",
  first_name: "John",
  title: "Software Engineer",
  department: "Product",
  account: #Fuzzy.Account<
    __meta__: #Ecto.Schema.Metadata<:loaded, "accounts">,
    id: 1,
    status: :active,
    username: "johnsmith",
    person_id: 1,
    person: #Ecto.Association.NotLoaded<association :person is not loaded>,
    inserted_at: ~U[2024-10-04 09:31:56Z],
    updated_at: ~U[2024-10-04 09:31:56Z],
    ...
  >,
  inserted_at: ~U[2024-10-04 09:31:56Z],
  updated_at: ~U[2024-10-04 09:31:56Z]
}

I personally prefer at least returning the score to assess the match quality even when I set the threshold.

Conclusion

This is a straight forward way to find records from incomplete information and when used in coordination with other strategies this can be a powerful way to help users locate data.

PS - generally speaking 0.3 is considered a generous, yet healthy cutoff.

Resources & Articles

Docs

Bill Tihen
Bill Tihen
Developer, Data Enthusiast, Educator and Nature’s Friend

very curious – known to explore knownledge and nature