
feat(datafusion): auto-register built-in table functions on catalog registration #324

Merged
JingsongLi merged 9 commits into apache:main from shyjsarah:feat/register-table-function
May 19, 2026


@shyjsarah
Contributor

@shyjsarah shyjsarah commented May 18, 2026

Purpose

Linked issue: close #xxx

The vector_search and full_text_search table-valued functions could only be made available by manually calling their register_* function on a SessionContext. SQLContext users (including the Python binding / pypaimon) had no way to reach register_udtf, so these functions were effectively unusable through SQLContext.

This registers them automatically, in Rust, when a catalog is registered, so no caller-side setup is needed.

Brief change log

  • SQLContext::register_catalog now registers vector_search and full_text_search against the catalog being registered. Any SQLContext with a catalog gets them with no extra call.
  • full_text_search registration is #[cfg(feature = "fulltext")]-gated, so builds without the feature are unaffected.
  • The Python binding (bindings/python/Cargo.toml) enables the fulltext feature on paimon-datafusion so full_text_search is compiled into the binding.

Note: referenced_files_size / physical_files_size are intentionally not covered — they are system tables (table$referenced_files_size), not UDTFs, so they need no registration.
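The behavior described in the change log can be pictured with a small toy model. This is only a sketch: `ToySQLContext`, `FULLTEXT_ENABLED`, and `_register_builtin_table_functions` are hypothetical names invented for illustration, and the module-level flag merely stands in for the Rust `#[cfg(feature = "fulltext")]` gate. It is not the paimon-datafusion or pypaimon API.

```python
from typing import Dict

# Hypothetical stand-in for the Cargo feature `fulltext`.
FULLTEXT_ENABLED = True


class ToySQLContext:
    """Toy model only; not the real SQLContext."""

    def __init__(self) -> None:
        self.catalogs: Dict[str, object] = {}
        # Maps a table-function name to the catalog it is bound to.
        self.table_functions: Dict[str, str] = {}

    def register_catalog(self, name: str, catalog: object) -> None:
        self.catalogs[name] = catalog
        # Built-ins are registered as a side effect of catalog
        # registration, so callers make no separate register_* call.
        self._register_builtin_table_functions(name)

    def _register_builtin_table_functions(self, catalog_name: str) -> None:
        self.table_functions["vector_search"] = catalog_name
        if FULLTEXT_ENABLED:  # models the feature gate
            self.table_functions["full_text_search"] = catalog_name


ctx = ToySQLContext()
ctx.register_catalog("paimon", object())
registered = sorted(ctx.table_functions)
```

With the flag switched off, only `vector_search` would be registered, which mirrors how a build without the feature is unaffected.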

Tests

bindings/python/tests/test_datafusion.py, test_table_functions_registered_with_catalog: after register_catalog, calling vector_search / full_text_search with the wrong argument count surfaces each function's own validation error, proving it is registered (an unregistered name would instead fail with "table function not found").
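The probing idea behind this test can be sketched in isolation: call the function with a deliberately wrong number of arguments and classify the failure. The `ToyEngine` class, its exception types, its error strings, and the arity of 3 are all assumptions made up for this illustration; the real test runs SQL through the Python binding and inspects the real error messages.

```python
class ToyEngine:
    """Toy engine; error types and messages are illustrative only."""

    def __init__(self):
        # name -> expected argument count (arbitrary placeholder values)
        self._udtfs = {}

    def register(self, name, arity):
        self._udtfs[name] = arity

    def call(self, name, *args):
        if name not in self._udtfs:
            raise LookupError(f"table function '{name}' not found")
        expected = self._udtfs[name]
        if len(args) != expected:
            raise ValueError(f"{name} expects {expected} arguments, got {len(args)}")
        return []


def is_registered(engine, name):
    """Probe with zero arguments and classify the resulting error."""
    try:
        engine.call(name)
    except LookupError:
        return False  # the engine does not know the name at all
    except ValueError:
        return True   # the function's own validation ran, so it exists
    return True


engine = ToyEngine()
engine.register("vector_search", 3)  # arity chosen arbitrarily
found = is_registered(engine, "vector_search")
missing = is_registered(engine, "full_text_search")
```

The probe never needs a real index or data, since argument validation fails before any search would run.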

API and Format

  • Behavior change: SQLContext::register_catalog now also registers the vector_search and full_text_search table functions. No new public API, no signature change.
  • Build: the Python binding enables paimon-datafusion/fulltext (adds the pure-Rust tantivy dependency).
  • No storage format change.

Documentation

docs/src/sql.md is updated: the Vector Search / Full-Text Search registration sections now state that a SQLContext registers these functions automatically when a catalog is registered, and that the explicit register_* call is only needed with a raw SessionContext.

shyjsarah and others added 2 commits May 18, 2026 01:38
Add `SQLContext.register_table_function(name, default_database=None)`
to the Python binding so Paimon table-valued functions can be
registered from Python — the binding previously had no way to reach
`register_udtf`.

A single dispatch method keeps the API surface stable: it currently
supports `vector_search` and `full_text_search`, and the same `match`
will pick up `referenced_files_size` / `physical_files_size` once
those land, without changing the Python signature.

The function binds to the current catalog. So the binding can obtain
that catalog without keeping a duplicate handle of its own,
`SQLContext::current_catalog` is made public. The binding also enables
the `fulltext` feature so `register_full_text_search` is available.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
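
For context, the single-dispatch design this first commit describes (later dropped per review in favor of automatic registration) might look roughly like the toy below. Everything here is a hypothetical model: `ToySQLContext`, the `SUPPORTED` tuple, and the chosen exception types are assumptions, and only the method name and its `default_database` keyword come from the commit message.

```python
class ToySQLContext:
    """Toy model of the dispatch design; not the real pypaimon binding."""

    # Supporting a new function means extending this one tuple; the
    # method signature below stays the same.
    SUPPORTED = ("vector_search", "full_text_search")

    def __init__(self):
        self.current_catalog = None
        self.registered = {}

    def register_catalog(self, name):
        self.current_catalog = name

    def register_table_function(self, name, default_database=None):
        if self.current_catalog is None:
            raise RuntimeError("register a catalog before registering table functions")
        if name not in self.SUPPORTED:
            raise ValueError(f"unknown table function: {name!r}")
        # The function binds to whichever catalog is current.
        self.registered[name] = (self.current_catalog, default_database)


ctx = ToySQLContext()
ctx.register_catalog("paimon")
ctx.register_table_function("vector_search")
ctx.register_table_function("full_text_search", default_database="default")
```
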
Add tests for `SQLContext.register_table_function`:
- vector_search / full_text_search register without error
- the optional default_database keyword is accepted
- an unknown function name raises a clear error
- calling it before any catalog is registered raises

Registration alone touches neither the Lumina nor Tantivy runtime,
so these tests are deterministic and need no index fixtures.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor

@JingsongLi JingsongLi left a comment


We should register it in Catalog by default in Rust. This is a legacy work from before. Can you modify it?

shyjsarah and others added 5 commits May 18, 2026 19:17
Per review: register the built-in table-valued functions in Rust by
default when a catalog is registered, instead of exposing an explicit
register_table_function method on the Python binding.

SQLContext::register_catalog now registers vector_search,
full_text_search, referenced_files_size and physical_files_size against
the catalog being registered, so every SQLContext user gets them with
no extra call. The Python register_table_function method and the
SQLContext::current_catalog visibility change are reverted; the binding
keeps the fulltext feature so full_text_search compiles in.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The register_full_text_search call fits within the line width on a
single line; rustfmt rejected the wrapped form.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Upstream apache#325 converted referenced_files_size / physical_files_size
from table functions to system tables, so they no longer have
register_* functions. register_catalog now auto-registers only the
remaining UDTFs — vector_search and full_text_search.

The binding test is reworked accordingly: it verifies the two UDTFs
are registered by triggering their own argument-count validation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shyjsarah shyjsarah changed the title from "feat(python): expose register_table_function for Paimon UDTFs" to "feat(datafusion): auto-register built-in table functions on catalog registration" May 19, 2026
Comment thread crates/integrations/datafusion/src/sql_context.rs Outdated
shyjsarah and others added 2 commits May 18, 2026 23:31
Per review: pull the inline built-in table-function registration in
register_catalog into a dedicated function. It is the single place
that knows the built-in table functions — new ones are added there.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The Vector Search / Full-Text Search registration sections still told
readers to call register_* manually. With a SQLContext that is now
automatic on register_catalog; the explicit call is only needed with a
raw SessionContext.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Contributor

@JingsongLi JingsongLi left a comment


+1

@JingsongLi JingsongLi merged commit df6074a into apache:main May 19, 2026
8 checks passed