Feature: file cache: max size#945

Open
autoantwort wants to merge 57 commits into
microsoft:mainfrom
autoantwort:feature/file-cache-max-size

@autoantwort
Contributor

@autoantwort autoantwort commented Mar 8, 2023

Fixes microsoft/vcpkg#19452

If no settings are found, a new settings file is created. You can set the following properties:

"properties": {
"delete-policy": {
"description": "After which policy objects should be deleted.",
"enum": [
"None",
"OldestAccessDate",
"OldestModificationDate"
]
},
"max-size-in-gb": {
"description": "The maximum size of the cache in gigabytes.",
"type": "number",
"minimum": 0
},
"max-age-in-days": {
"description": "The maximum age of the cache in days.",
"type": "number",
"minimum": 0
},
"keep-available-in-percentage": {
"description": "How much space should be kept available on the disk in percentage.",
"type": "number",
"minimum": 0
}
},

To ensure that the limits are respected when multiple instances are running, the following is done:
0. Every instance has a unique "sync" file in "<root>/sync/<random_number>".

  1. Append a line of the format "<name_of_object>;<file_size>\n" to its own sync file
  2. Read the new entries from the other sync files
  3. Check whether files must be deleted to respect the limits
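
Step 3 can be illustrated with a small, self-contained sketch. It assumes an oldest-first policy (as with "OldestAccessDate" or "OldestModificationDate") and a size limit only. `CacheEntry`, `OlderFirst` and `pick_victims` are hypothetical names for illustration and are not taken from the PR.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record for one cache entry (illustrative, not the PR's type).
struct CacheEntry
{
    std::string name;
    std::uint64_t size;     // size in bytes
    std::int64_t last_time; // access or modification time, depending on the policy
};

// Orders the priority queue so that the entry with the smallest time is on top.
struct OlderFirst
{
    bool operator()(const CacheEntry& a, const CacheEntry& b) const { return a.last_time > b.last_time; }
};

// Returns the names of the entries to delete, oldest first, so that the
// remaining entries plus `incoming` bytes fit into `max_size` bytes.
std::vector<std::string> pick_victims(std::vector<CacheEntry> entries, std::uint64_t incoming, std::uint64_t max_size)
{
    std::uint64_t total = 0;
    std::priority_queue<CacheEntry, std::vector<CacheEntry>, OlderFirst> oldest_first;
    for (auto& entry : entries)
    {
        total += entry.size;
        oldest_first.push(std::move(entry));
    }

    std::vector<std::string> victims;
    while (!oldest_first.empty() && total + incoming > max_size)
    {
        victims.push_back(oldest_first.top().name);
        total -= oldest_first.top().size;
        oldest_first.pop();
    }
    return victims;
}
```

The age and free-disk-space limits would add further conditions to the loop. They are left out here to keep the sketch short.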

@autoantwort autoantwort marked this pull request as ready for review August 19, 2023 16:19
@autoantwort
Contributor Author

This is not 100% ready for merge, but ready for review. When I get a design approval I will handle all edge case errors and do localization.

Member

@BillyONeal BillyONeal left a comment

> 0. Every instance has a unique "sync" file in "<root>/sync/<random_number>".
> 1. Append a line to sync file of the format "<name_of_object>;<file_size>\n"
> 2. Read the new entries from other sync files
> 3. Check if files must be deleted to respect the limits

I do not believe this is correct. There is nothing to stop 2 instances from concurrently reading all the sync files, deciding that something needs to be deleted, writing their intent to delete to their own sync files, and now there is no way to resolve the ambiguity on who 'won' that race.

In particular, this is assuming that writes or appends within a file will be atomic, which is a feature most file systems do not provide and several network file systems extremely do not provide. The only operation which we can assume is atomic is the creation or removal of a file system entry; as in rename.

I think you need something like a read/write oplock here, where instances trying to remove entries from the cache are writers, and instances trying to do anything else are readers. Only one instance should be trying to delete out of the cache at a time, and if any instance is doing so, we need to make sure we don't delete a cache entry that other instances are potentially touching.
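
As a rough illustration of the "only creation or removal of a file system entry is atomic" point, the sketch below builds an exclusive (writer-only) lock on top of directory creation. It is not the read/write lock suggested above and it is not code from this PR. `DirLock` is a hypothetical name, and a crashed process would leave a stale lock behind that this sketch does not handle.

```cpp
#include <filesystem>
#include <system_error>
#include <utility>

namespace fs = std::filesystem;

// Hypothetical RAII guard: the lock is held for as long as the directory exists.
class DirLock
{
public:
    explicit DirLock(fs::path lock_path) : path_(std::move(lock_path))
    {
        std::error_code ec;
        // create_directory returns false when the directory already exists,
        // so only one caller observes `true` for a given path.
        owned_ = fs::create_directory(path_, ec) && !ec;
    }

    ~DirLock()
    {
        if (owned_)
        {
            std::error_code ec;
            fs::remove(path_, ec); // release the lock; errors are ignored in this sketch
        }
    }

    DirLock(const DirLock&) = delete;
    DirLock& operator=(const DirLock&) = delete;

    bool owned() const { return owned_; }

private:
    fs::path path_;
    bool owned_ = false;
};
```

An instance that wants to evict entries would construct a `DirLock` on an agreed path and skip eviction when `owned()` is false. Protecting readers as well would need additional bookkeeping.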

Comment thread src/vcpkg/binarycaching.cpp Outdated
cache.own_sync_file = get_own_sync_file(cache.sync_root_dir);
if (cache.folder_settings.delete_policy != FolderSettings::DeletePolicy::None)
{
std::unordered_map<std::string, uint64_t> file_sizes;
Member

I don't believe that in general this works correctly, due to concurrent insertions into the cache. Cache entries are not meaningfully part of the cache until they have been renamed into place.

Contributor Author

This map also serves as a cache so that I don't have to call fs.file_size() for every cache entry. If a file is no longer in the binary cache, its cached size is not used.

Comment thread src/vcpkg/binarycaching.cpp Outdated
}
}

size_t push_success(const BinaryPackageWriteInfo& request, MessageSink& msg_sink) override
Member

I don't think 'during push_success' is the right time to be doing this. It probably should be a one time pass that looks at the cache(s) after all cache operations this particular vcpkg instance will do on this run.

Contributor Author

But then the cache could be larger than its max size in the meantime?

@autoantwort
Contributor Author

autoantwort commented Oct 25, 2023

> I do not believe this is correct. There is nothing to stop 2 instances from concurrently reading all the sync files, deciding that something needs to be deleted, writing their intent to delete to their own sync files, and now there is no way to resolve the ambiguity on who 'won' that race.

The instances only write to the sync files what they want to add, not what they want to delete.
So if you want to add a file with hash xxx and size 123 to the binary cache you append the following to your own sync file:
xxx;123\n

> In particular, this is assuming that writes or appends within a file will be atomic, which is a feature most file systems do not provide and several network file systems extremely do not provide. The only operation which we can assume is atomic is the creation or removal of a file system entry; as in rename.

No, this code does not assume this. I explicitly handle this case ... I just realized that I wanted to implement this but hadn't done so yet ... 🤦 🙈 Edit: Now implemented
If this is implemented, the implementation is safe:

Start:
Current binary cache size: 9.5 GB
Two instances want to add 0.6 GB

| Step | Instance 1 | Instance 2 |
| --- | --- | --- |
| 1 | Add half a line to own sync file | |
| 2 | | Add line to own sync file |
| 3 | | Read other sync file |
| 4 | | Delete files until there is space for the 0.6 GB file |
| 5 | Write rest of line | |
| 6 | Read other sync file | |
| 7 | Delete files until there is space for the 0.6 GB file | |
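
The interleaving above only works if a reader ignores a half-written line until its terminating "\n" has arrived. A minimal sketch of such a reader follows, assuming the "<name_of_object>;<file_size>\n" line format from the PR description. `SyncEntry` and `parse_sync` are hypothetical names, and the PR's actual implementation may differ.

```cpp
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <system_error>
#include <utility>
#include <vector>

// Hypothetical parsed form of one sync-file line.
struct SyncEntry
{
    std::string name;
    std::uint64_t size;
};

// Parses complete "<name>;<size>\n" lines from `content`, starting at `offset`.
// Returns the parsed entries and the offset of the first unconsumed byte.
// A trailing line without '\n' is left unconsumed so it can be re-read later.
std::pair<std::vector<SyncEntry>, std::size_t> parse_sync(std::string_view content, std::size_t offset)
{
    std::vector<SyncEntry> entries;
    while (offset < content.size())
    {
        const auto newline = content.find('\n', offset);
        if (newline == std::string_view::npos) break; // partial line: try again on the next read

        const auto line = content.substr(offset, newline - offset);
        const auto sep = line.find(';');
        if (sep != std::string_view::npos)
        {
            const auto digits = line.substr(sep + 1);
            std::uint64_t size = 0;
            const auto result = std::from_chars(digits.data(), digits.data() + digits.size(), size);
            // Accept the line only if the whole size field is a valid number.
            if (result.ec == std::errc{} && result.ptr == digits.data() + digits.size())
            {
                entries.push_back({std::string(line.substr(0, sep)), size});
            }
        }
        offset = newline + 1;
    }
    return {std::move(entries), offset};
}
```

The caller keeps the returned offset per sync file and passes it back in on the next read, so the rest of the line written at step 5 is picked up at step 6.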

@julianxhokaxhiu

Has any progress been made in this direction? It's becoming quite an effort to manually clean up the vcpkg cache directory from time to time. Thank you in advance 🙏🏻

@autoantwort
Contributor Author

@julianxhokaxhiu You could compile your own version of vcpkg-tool 🙈 I have been using this on macOS/Linux/Windows daily for years and have never had any problems.

Copilot AI review requested due to automatic review settings May 16, 2026 16:19
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces configurable eviction for the on-disk “files” binary cache by adding a settings file (settings.json) + schema, tracking cache entries across concurrent vcpkg instances via per-process sync files, and extending the filesystem/json infrastructure needed to enforce size/age/free-space limits.

Changes:

  • Add a FilesCacheManager to coordinate cache-size enforcement (including multi-process sync updates) for the files binary provider.
  • Add JSON helpers (Json::parse_file, PositiveNumberDeserializer) and new localized messages for settings parsing/reporting.
  • Extend filesystem APIs with access-time getters/setters and space() info; adjust JSON numeric stringify formatting (test updates included).

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 23 comments.

| File | Description |
| --- | --- |
| src/vcpkg/binarycaching.cpp | Adds file-cache settings parsing and a multi-process cache manager that evicts entries based on configured limits. |
| src/vcpkg/base/json.cpp | Adds parse_file(), introduces PositiveNumberDeserializer, and changes JSON number stringify formatting. |
| src/vcpkg/base/files.cpp | Adds filesystem support for access times, disk space queries, and overwrite control for file creation. |
| src/vcpkg/base/downloads.cpp | Updates WriteFilePointer construction to match new overwrite-aware signature. |
| src/vcpkg-test/metrics.cpp | Updates expected JSON payload formatting due to numeric stringify changes. |
| src/vcpkg-test/files.cpp | Adds tests for new filesystem time APIs and strengthens temp directory setup. |
| locales/messages.json | Adds new localized message strings used by settings parsing/reporting. |
| include/vcpkg/base/message-data.inc.h | Declares new message IDs used by settings parsing/reporting. |
| include/vcpkg/base/jsonreader.h | Declares PositiveNumberDeserializer. |
| include/vcpkg/base/json.h | Declares Json::parse_file() and adds needed forward declarations. |
| include/vcpkg/base/fwd/files.h | Introduces Overwrite enum. |
| include/vcpkg/base/files.h | Updates WriteFilePointer API; adds space_info, access-time and space APIs. |
| docs/file-cache-settings.schema.json | Adds JSON schema for settings.json controlling cache eviction behavior. |
Comments suppressed due to low confidence (8)

docs/file-cache-settings.schema.json:26

  • Schema allows 0 for max-age-in-days (minimum: 0), but the current deserializer rejects 0. If 0 is intended to mean “no age limit”, the parser should accept it; otherwise, make the schema require > 0.
    "max-age-in-days": {
      "description": "The maximum age of the cache in days.",
      "type": "number",
      "minimum": 0
    },

docs/file-cache-settings.schema.json:31

  • Schema allows 0 for keep-available-in-percentage (minimum: 0), but the current parser rejects 0 even though the eviction code treats 0 as disabling the check. Align schema and parsing so 0 works as documented/intended.
    "keep-available-in-percentage": {
      "description": "How much space should be kept available on the disk in percentage.",
      "type": "number",
      "minimum": 0
    }

src/vcpkg/binarycaching.cpp:471

  • make_space_for() reads sync updates and calls folder_settings.last_time(...) even when eviction is disabled (delete_policy == None). With the current last_time() implementation this will crash. Skip sync processing when eviction is disabled or make last_time() safe for None.
            // 2. Read changes from other instances
            get_sync_updates([&](auto id, auto size) {
                const auto archive_path = archives_root_dir / files_archive_subpath(id.to_string());
                auto last_time = folder_settings.last_time(fs, archive_path, IgnoreErrors{});
                auto size_as_int = Strings::strto<uint64_t>(size).value_or_exit(VCPKG_LINE_INFO);

src/vcpkg/binarycaching.cpp:542

  • file_added_to_cache() always calls folder_settings.last_time(...), which will crash for delete-policy: "None". If eviction is disabled, this should be a no-op (or should use a safe timestamp source).
        void file_added_to_cache(const Path& file_path, uint64_t file_size)
        {
            auto last_time = folder_settings.last_time(fs, file_path, IgnoreErrors{});
            file_data.push(FileData{file_path, file_size, last_time});
            current_size += file_size;

src/vcpkg/binarycaching.cpp:337

  • KEEP_AVAILABLE_PERCENTAGE is parsed twice. Remove the duplicate optional_object_field call to avoid confusion and make future edits less error-prone.
            reader.optional_object_field(obj,
                                         KEEP_AVAILABLE_PERCENTAGE,
                                         folder_settings.keep_available_percentage,
                                         Json::PositiveNumberDeserializer::instance);

src/vcpkg/binarycaching.cpp:307

  • The parse error text "Unexped DeletePolicy" has a typo and is hard to understand. Replace it with a correctly spelled, descriptive message (ideally via the message system).
            r.add_generic_error(type_name(), LocalizedString::from_raw("Unexped DeletePolicy"));
            return nullopt;

src/vcpkg/base/files.cpp:4121

  • filetime_to_int64() multiplies by 100 to return nanoseconds after applying an epoch shift. If callers compare this to file_time_now()/last_write_time() on Windows (which typically use 100ns file_clock ticks), eviction and age calculations will be wrong. Prefer returning the raw FILETIME tick count (or otherwise match file_time_type’s convention).
        static int64_t filetime_to_int64(FILETIME filetime)
        {
            ULARGE_INTEGER large_integer;
            large_integer.HighPart = filetime.dwHighDateTime;
            large_integer.LowPart = filetime.dwLowDateTime;

src/vcpkg/base/files.cpp:4156

  • last_access_time() returns the converted value from filetime_to_int64(). Ensure the returned timestamp is in the same units/epoch as last_write_time()/file_time_now() on Windows so access-time eviction policies and comparisons work correctly.
            FILETIME last_access_time;
            if (!GetFileTime(fh.h_file, nullptr, &last_access_time, nullptr))
            {
                ec.assign(GetLastError(), std::system_category());
                return {};

Comment on lines +480 to +481
const auto oldest_date =
(folder_settings.max_age.count() ? fs.file_time_now() - folder_settings.max_age.count() : 0);
Comment on lines +650 to +651
if (fs.file_time_now() - fs.last_write_time(file, VCPKG_LINE_INFO) >
duration_cast<nanoseconds>(24h).count())
auto file_handle = fs.open_for_read(file, VCPKG_LINE_INFO);
file_handle.try_seek_to(cur_size).value_or_exit(VCPKG_LINE_INFO);
std::error_code ec;
auto file_content = file_handle.read_to_end(ec);
current_size += size_as_int;
});

// 3. Delete files if not enough space is available
}
max_size_in_bytes -= file_size;
Debug::print(fmt::format("{:<25}{:>20}\n", "max_cache_size", max_size_in_bytes));
// 5. Delete files until the constraints are fulfilled
Comment thread src/vcpkg/base/json.cpp

Optional<double> PositiveNumberDeserializer::visit_number(Reader&, double value) const
{
if (value <= 0)
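
The `value <= 0` check above is what rejects 0, although the schema declares "minimum": 0. One way to align the two, sketched here as a free function rather than the PR's deserializer class, is to reject only negative (and non-finite) values. `accept_non_negative` is a hypothetical name, and `std::optional` stands in for vcpkg's own `Optional` type.

```cpp
#include <cmath>
#include <optional>

// Hypothetical stand-in for the deserializer's number check.
// Accepts 0 so that a schema with "minimum": 0 and the parser agree.
std::optional<double> accept_non_negative(double value)
{
    if (!std::isfinite(value) || value < 0)
    {
        return std::nullopt; // negative, NaN or infinite: reject
    }
    return value;
}
```

Whether 0 should mean "limit disabled" is a design decision for the PR. If it should not be accepted, the alternative is to tighten the schema instead.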
Comment thread src/vcpkg/base/files.cpp
Comment on lines +4110 to +4131
// FILETIME contains a 64-bit value representing the number of 100-nanosecond intervals since January 1, 1601
// (UTC). shift epoch by 400 years to fit into int64_t (can hold 292 years)
static constexpr uint64_t epoch_shift =
std::chrono::duration_cast<std::chrono::duration<uint64_t, std::nano>>(std::chrono::hours{24 * 365 * 400})
.count() /
100;

static int64_t filetime_to_int64(FILETIME filetime)
{
ULARGE_INTEGER large_integer;
large_integer.HighPart = filetime.dwHighDateTime;
large_integer.LowPart = filetime.dwLowDateTime;
large_integer.QuadPart -= epoch_shift;
return large_integer.QuadPart * 100;
}

static FILETIME int64_to_filetime(int64_t value)
{
ULARGE_INTEGER large_integer;
FILETIME filetime;
large_integer.QuadPart = static_cast<uint64_t>(value / 100);
large_integer.QuadPart += epoch_shift;
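
For comparison, a common convention is to rebase FILETIME ticks to the Unix epoch using the exact offset between 1601-01-01 and 1970-01-01, instead of an approximate 400 x 365 day shift. The sketch below works on the raw 64-bit tick count, so it needs no Windows headers. The function names are hypothetical, and it is not implied that vcpkg's other time functions use this epoch or unit.

```cpp
#include <cstdint>

// 100 ns ticks between 1601-01-01 and 1970-01-01 (UTC):
// 134774 days * 86400 s/day * 10^7 ticks/s.
constexpr std::uint64_t ticks_1601_to_1970 = 116444736000000000ULL;

// Converts a raw FILETIME value (high and low parts already combined)
// to nanoseconds since the Unix epoch. Values before 1970 come out negative.
constexpr std::int64_t filetime_ticks_to_unix_ns(std::uint64_t ticks)
{
    return (static_cast<std::int64_t>(ticks) - static_cast<std::int64_t>(ticks_1601_to_1970)) * 100;
}

// Inverse conversion; sub-tick precision (below 100 ns) is truncated.
constexpr std::uint64_t unix_ns_to_filetime_ticks(std::int64_t unix_ns)
{
    return static_cast<std::uint64_t>(unix_ns / 100 + static_cast<std::int64_t>(ticks_1601_to_1970));
}
```

Whichever convention is chosen, the values returned here must use the same unit and epoch as the timestamps they are compared against, which is the concern raised in the review comment.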
Comment on lines +428 to +432
std::unordered_map<std::string, uint64_t> file_sizes;
get_sync_updates(
[&](auto id, auto size) {
auto size_as_int = Strings::strto<uint64_t>(size).value_or_exit(VCPKG_LINE_INFO);
file_sizes.emplace(id.to_string(), size_as_int);
Comment on lines +628 to +632
while (true)
{
Path path = sync_root_dir / fmt::format("{}", rand());
std::error_code ec;
WriteFilePointer wp(path, Append::NO, Overwrite::NO, ec);
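
The loop above picks a name and relies on exclusive creation to detect collisions. If `rand()` is never seeded, every process draws the same sequence, so the first attempts will collide whenever several instances are running. A standard-library sketch of the same idea with a non-deterministic seed follows. `create_unique_sync_file` is a hypothetical name, and `std::fopen` with mode "wx" stands in for vcpkg's `WriteFilePointer` with `Overwrite::NO`.

```cpp
#include <cstdio>
#include <filesystem>
#include <random>
#include <string>

namespace fs = std::filesystem;

// Creates a new, uniquely named empty file in `sync_dir` and returns its path.
// Returns an empty path if no file could be created after a bounded number of tries.
fs::path create_unique_sync_file(const fs::path& sync_dir)
{
    std::random_device seed_source;
    std::mt19937_64 generator(seed_source());
    for (int attempt = 0; attempt < 100; ++attempt)
    {
        fs::path candidate = sync_dir / std::to_string(generator());
        // "wx" is the C11 exclusive mode: fopen fails if the file already exists.
        if (std::FILE* file = std::fopen(candidate.string().c_str(), "wx"))
        {
            std::fclose(file);
            return candidate;
        }
    }
    return {};
}
```

The bound on attempts avoids spinning forever when the directory is missing or not writable, a case the unbounded `while (true)` would have to handle through the error code.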
constexpr static StringLiteral MAX_AGE_DAYS = "max-age-in-days";
constexpr static StringLiteral KEEP_AVAILABLE_PERCENTAGE = "keep-available-in-percentage";
constexpr static StringLiteral DELETE_POLICY = "delete-policy";
constexpr static StringLiteral MODIFICATION_DATE_UPDATE_INTERVAL = "modification-date-update-interval";