Overview

ondoki uses a hybrid search system that combines PostgreSQL full-text search with pgvector semantic search. Results are merged using Reciprocal Rank Fusion (RRF) and boosted by popularity and recency signals. A trigram fallback handles typos.

How It Works

User Query
    │
    ├─── Full-Text Search (tsvector)
    │       Keyword matching with prefix support
    │
    ├─── Semantic Search (pgvector)
    │       Vector similarity using embeddings
    │
    ├─── Trigram Fallback
    │       Fuzzy matching for typos
    │
    ▼
RRF Fusion
    │
    ├─── Combine rankings from all sources
    ├─── Apply boosts (views, recency, type)
    │
    ▼
Ranked Results with Highlighted Snippets

Full-text search uses PostgreSQL’s built-in capabilities:
  • tsvector columns on documents, workflows, and workflow steps
  • plainto_tsquery for query parsing
  • Prefix matching — the last word in the query is treated as a prefix, enabling as-you-type search (e.g., “deploy back” matches “deploy backend”)
  • Indexed fields: document names, workflow names, summaries, step descriptions, tags, guide content
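
The prefix behaviour described above can be sketched in a few lines. This is an illustration only: the helper name `build_prefix_tsquery` is hypothetical, and it emits the `:*` prefix syntax understood by PostgreSQL's `to_tsquery`. How ondoki actually combines prefix matching with `plainto_tsquery` is not stated here.

```python
import re


def build_prefix_tsquery(query: str) -> str:
    """Build a to_tsquery-style string whose last word matches as a prefix.

    Hypothetical helper, not ondoki's actual implementation.
    """
    # Keep only word characters so user input cannot inject tsquery operators.
    terms = re.findall(r"\w+", query.lower())
    if not terms:
        return ""
    # All terms must match (&); the last one gets :* so it matches as a prefix.
    *complete, last = terms
    return " & ".join(complete + [f"{last}:*"])


print(build_prefix_tsquery("deploy back"))  # deploy & back:*
```

A string such as `deploy & back:*` matches documents containing "deploy" and any word starting with "back", which is what makes as-you-type search work.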

Semantic search uses pgvector for vector similarity search:
  • 1536-dimensional embeddings (OpenAI-compatible)
  • Cosine distance for similarity measurement
  • Indexed content types: workflows (whole + per-step), documents, document chunks, knowledge sources
  • Content hashing (SHA-256) to skip re-embedding unchanged content
Semantic search requires an embedding model to be available. Embeddings are generated automatically when content is created or updated. If pgvector is not available, ondoki gracefully falls back to full-text search only.
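
To make the similarity measure concrete, here is a minimal pure-Python sketch of cosine distance, defined as 1 minus cosine similarity (the quantity pgvector's `<=>` operator computes). The function name is an assumption for illustration; in ondoki the comparison runs inside PostgreSQL on 1536-dimensional vectors, not in application code.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity: 0 for same direction, 2 for opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Tiny 2-dimensional vectors stand in for real embeddings.
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Smaller distances mean closer meaning, so semantic results are ordered by ascending distance.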

Reciprocal Rank Fusion (RRF)

RRF combines rankings from multiple search sources without needing to normalize scores:
RRF_score = Σ (1 / (k + rank_i))
Here k is a constant (typically 60) and rank_i is the item's position in the i-th ranked list. The sum runs over every list in which the item appears, so items ranked highly by multiple sources get the highest combined scores.
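
The formula above can be sketched as a short function. This is a generic RRF illustration under stated assumptions, not ondoki's code: the name `rrf_fuse` is made up, ranks are 1-based, and each source supplies a list of result IDs ordered best first.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked ID lists with Reciprocal Rank Fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        # Ranks are 1-based: the top result of each list contributes 1 / (k + 1).
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


full_text = ["a", "b", "c"]
semantic = ["b", "d", "a"]
fused = rrf_fuse([full_text, semantic])
print([item_id for item_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Items "a" and "b" appear in both lists, so they outrank "c" and "d", which each appear in only one. Only rank positions are used, so the raw scores of the two sources never need to be on the same scale.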

Ranking Boosts

After RRF fusion, additional boosts are applied:
| Boost | Factor | Description |
| --- | --- | --- |
| View count | Popularity | More-viewed resources rank higher |
| Recency | Time decay | Recently updated content is preferred |
| Resource type | Type weight | Workflows may be weighted over documents |
| Exact title match | Title bonus | Exact title matches get a significant boost |
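
One way such boosts could be layered on top of an RRF score is as multiplicative factors. The sketch below is purely illustrative: the function name, every weight, the 30-day half-life, and the multiplicative form are assumptions, since the actual formulas and constants are not documented here.

```python
import math

# Hypothetical type weights; the real values are not documented.
TYPE_WEIGHTS = {"workflow": 1.1, "document": 1.0, "step": 1.0}


def boosted_score(
    rrf_score: float,
    view_count: int,
    age_days: float,
    resource_type: str,
    exact_title_match: bool,
    half_life_days: float = 30.0,  # assumed decay rate
) -> float:
    """Apply illustrative popularity, recency, type, and title boosts."""
    # Popularity: logarithmic so that very popular items do not dominate.
    popularity = 1.0 + 0.05 * math.log1p(view_count)
    # Recency: exponential time decay, halving every `half_life_days`.
    recency = 1.0 + 0.2 * 0.5 ** (age_days / half_life_days)
    type_weight = TYPE_WEIGHTS.get(resource_type, 1.0)
    title_bonus = 2.0 if exact_title_match else 1.0
    return rrf_score * popularity * recency * type_weight * title_bonus
```

With these assumed factors, a result updated yesterday outranks an otherwise identical one updated a year ago, and an exact title match doubles the score.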

Trigram Fallback

If full-text and semantic search return insufficient results, ondoki falls back to PostgreSQL trigram matching (pg_trgm). This handles:
  • Typos (e.g., “deploymnet” → “deployment”)
  • Partial word matches
  • Character transpositions
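
To show why trigram matching tolerates typos, here is a simplified Python approximation of how pg_trgm scores similarity: each word is lowercased and padded with two leading spaces and one trailing space, split into three-character groups, and two strings are compared by the share of trigrams they have in common. The function names are illustrative, and the real extension runs inside PostgreSQL and may differ in edge cases.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return the set of padded, lowercased trigrams for every word in text."""
    grams: set[str] = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(round(trigram_similarity("deploymnet", "deployment"), 2))  # 0.47
```

The misspelled "deploymnet" still shares 7 of 15 distinct trigrams with "deployment", which is above pg_trgm's default similarity threshold of 0.3, so the typo is still matched.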

Search API

Endpoint: GET /api/v1/search/unified-v2
| Parameter | Type | Description |
| --- | --- | --- |
| q | string | Search query |
| project_id | string | Scope to a specific project |
| limit | integer | Max results (default: 20) |

Response fields per result:
| Field | Description |
| --- | --- |
| type | workflow, document, or step |
| id | Resource ID |
| title | Resource title |
| snippet | Text excerpt with <mark> highlighted matches |
| score | Combined relevance score |
| matched_fields | Which fields matched (name, summary, etc.) |
| updated_at | Last update timestamp |

What Gets Indexed

| Resource | Indexed Fields |
| --- | --- |
| Documents | Name, extracted plain text content |
| Workflows | Name, summary, tags, guide markdown |
| Workflow Steps | Description, generated title, generated description |
| Knowledge Sources | Processed content from uploaded files |

Full-Text Index

Plain text is extracted from TipTap JSON content and stored in search_text. A search_tsv tsvector column is maintained for fast queries.
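
The extraction step can be pictured as a recursive walk over the TipTap document tree, collecting the `text` of every text node. The sketch below is a simplified assumption of how that might look, not ondoki's extractor: the function name is made up, and non-text inline nodes such as hard breaks or mentions are ignored.

```python
def tiptap_to_text(node: dict) -> str:
    """Flatten a TipTap (ProseMirror-style) JSON node into plain text."""
    if node.get("type") == "text":
        return node.get("text", "")
    children = node.get("content", [])
    parts = [tiptap_to_text(child) for child in children]
    # Inline text runs are concatenated directly; blocks are joined by newlines
    # so words from adjacent paragraphs do not run together.
    is_inline_run = all(child.get("type") == "text" for child in children)
    separator = "" if is_inline_run else "\n"
    return separator.join(part for part in parts if part)


doc = {
    "type": "doc",
    "content": [
        {"type": "heading", "content": [{"type": "text", "text": "Deploy backend"}]},
        {
            "type": "paragraph",
            "content": [
                {"type": "text", "text": "Run "},
                {"type": "text", "text": "make deploy", "marks": [{"type": "code"}]},
            ],
        },
    ],
}
print(tiptap_to_text(doc))
# Deploy backend
# Run make deploy
```

Formatting marks such as bold or code are dropped, which is what a keyword index needs: only the words themselves end up in search_text.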

Semantic Index

Embeddings are generated for each resource and stored in the embedding table with the following fields:
  • source_type — what kind of resource
  • source_id — which specific resource
  • content_hash — SHA-256 of the content (to skip re-embedding)
  • embedding — 1536-dimensional vector
  • metadata — additional context (JSON)