Skip to content

feat(spill): optimize external sort with leveled merger to reduce write amplification#288

Open
zjw1111 wants to merge 10 commits into
alibaba:mainfrom
zjw1111:optimize-spill-performance
Open

feat(spill): optimize external sort with leveled merger to reduce write amplification#288
zjw1111 wants to merge 10 commits into
alibaba:mainfrom
zjw1111:optimize-spill-performance

Conversation

@zjw1111
Copy link
Copy Markdown
Collaborator

@zjw1111 zjw1111 commented May 18, 2026

Purpose

Optimize external sort spill performance by introducing a LeveledMerger (LSM-tree-like structure) to manage spill files. This reduces read/write amplification from O(N/K) to O(log_K(N)) compared to the previous naive sequential merge approach.

Key changes:

  • Add LeveledMerger class that organizes spill files into levels and triggers compaction when a level accumulates >= max_fan_in files
  • Add dynamic estimation of spill parameters (spill_batch_size_, actual_max_fan_in_) based on actual row sizes to better utilize memory budget
  • Add validation for local-sort.max-num-file-handles config (must be >= 2)
  • Fix state inconsistency: MergeAndReplaceFiles now only deletes input files on success, relying on SpillChannelManager::Reset() for cleanup on failure
  • Add divide-by-zero guard in EstimateSpillParameters()

Tests

  • LeveledMergerTest.NoCompactionBelowFanIn
  • LeveledMergerTest.CompactionTriggeredAtFanIn
  • LeveledMergerTest.MinimalFanInTwo
  • LeveledMergerTest.MultiLevelCompaction
  • LeveledMergerTest.ManyFilesWithFanInTwo
  • LeveledMergerTest.FinalCleanupReducesFileCount
  • LeveledMergerTest.FinalCleanupMergesSmallestFirst
  • LeveledMergerTest.FinalCleanupNoOpWhenAlreadyBelowTarget
  • LeveledMergerTest.FinalCleanupConvergesToTarget
  • LeveledMergerTest.MergeFnFailurePreservesState
  • LeveledMergerTest.ClearRemovesAllFiles
  • LeveledMergerTest.SetMaxFanInAffectsCompaction
  • LeveledMergerTest.CompactionOnlyTakesFanInFilesFromLevel
  • SortBufferTest.TestInMemorySortBufferEstimateMemoryUseForEachRow

API and Format

No API or storage format changes.

Documentation

No.

Generative AI tooling

Generated-by: Claude Code (Claude Opus 4.6)

…te amplification

Introduce LeveledMerger (LSM-tree-like structure) to manage spill files in levels,
reducing read/write amplification from O(N/K) to O(log_K(N)). Also adds dynamic
estimation of spill parameters based on actual row sizes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 18, 2026 06:13
zjw1111 and others added 4 commits May 18, 2026 14:39
- Add Doxygen comments to public methods in LeveledMerger and InMemorySortBuffer
- Replace bare 'int' with 'int32_t' in leveled_merger_test.cpp
- Remove unused #include <numeric>
- Add integer type convention to docs/code-style.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- WriteBufferTest.TestMergeSpilledFilesSkipsWithSingleFile: update
  min file handles from 1 to 2 (now enforced minimum)
- WriteBufferTest.TestSpillDiskQuotaEnforcement Case 3: use single
  spill_file_size as quota since leveled compaction changes disk usage
- WriteInteTest: relax intermediate file count assertions to allow
  leveled merger's multi-level structure (files cleaned at read time)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-free

When FlushWriteBuffer fails mid-operation (e.g., IO error during
rolling_writer->Write()), the producer thread may leave a KeyValue
cached in merge_function_wrapper_. That KeyValue holds Arrow data
referencing SpillReader::arrow_pool_ via raw pointer. After the
async producer/consumer is closed and SpillReaders are destroyed,
the cached KeyValue becomes a dangling reference, causing SEGV
during MergeTreeWriter destruction.

Fix: call merge_function_wrapper_->Reset() in the write_guard
ScopeGuard to release cached Arrow data before SpillReaders are
destroyed.

Also adapts MergeTreeWriterTest assertions to accommodate leveled
compaction file count behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread src/paimon/core/mergetree/external_sort_buffer.h
Comment thread src/paimon/core/mergetree/external_sort_buffer.cpp
Comment thread src/paimon/core/core_options.h
Comment thread src/paimon/core/mergetree/external_sort_buffer.cpp
Comment thread src/paimon/core/mergetree/spill_file_merger.cpp
Comment thread src/paimon/core/mergetree/spill_file_merger.cpp
Comment thread src/paimon/core/mergetree/merge_tree_writer.cpp
Comment thread src/paimon/core/mergetree/merge_tree_writer_test.cpp
zjw1111 and others added 3 commits May 18, 2026 23:09
…t use-after-free

Move the merge_function_wrapper_->Reset() into SortMergeReader::Close()
(both LoserTree and MinHeap variants) so that cached KeyValue data is
released before underlying readers are destroyed. This prevents Arrow
Buffer objects from calling Free() on a dangling pool pointer when
SpillReader is later destructed.

Also reorder the cleanup in MergeTreeCompactRewriter::RewriteCompaction
to call reader->Close() before merge_file_split_read_.reset(), ensuring
the Reset() triggered by Close() runs while all resource providers are
still alive.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arity

- Rename class LeveledMerger → SpillFileMerger, files leveled_merger → spill_file_merger
- Rename Compact/Compaction methods to Merge (RunMergeIfNeeded, RunFinalMergeIfNeeded, etc.)
- Move kMinFanIn validation from CoreOptions to ExternalSortBuffer::Create
- Improve variable names (l→level_idx, a/b→lhs/rhs, n→files_to_merge, f→file)
- Convert 6 spill tests from TEST_P to TEST_F (no parameterization needed)
- Fix 4 misused TEST_P in btree_global_index and file_system tests
- Make test assertions exact instead of range-based

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add TestSpillWithIOException exercising IO error injection across all
  spill code paths (write, intermediate merge, final merge, flush)
- Add SetMaxFanInToLargerValueSuppressesCompaction verifying dynamic fan-in
- Improve inline comments explaining leveled merge behavior in spill tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@zjw1111 zjw1111 requested review from Copilot and removed request for Copilot May 19, 2026 07:01
zjw1111 and others added 2 commits May 19, 2026 16:59
…mments

- Rename compact/compaction to merge in spill-related comments and test names
- Refine Write/FlushMemory return value variable names for clearer semantics
- Add Write 3 case to TestSpillDiskQuotaEnforcement
- Add comments in sort_buffer_test.cpp for Clear and empty batch behavior
- Fix ASSERT_LE to ASSERT_EQ for deterministic spill file counts in inte tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants