Skip to content

fix(limit-count): upgrade redis-cluster lib so NOSCRIPT is not treated as node failure#13579

Open
shreemaan-abhishek wants to merge 1 commit into
apache:masterfrom
shreemaan-abhishek:fix/redis-cluster-noscript
Open

fix(limit-count): upgrade redis-cluster lib so NOSCRIPT is not treated as node failure#13579
shreemaan-abhishek wants to merge 1 commit into
apache:masterfrom
shreemaan-abhishek:fix/redis-cluster-noscript

Conversation

@shreemaan-abhishek

Copy link
Copy Markdown
Contributor

Description

The limit-count redis-cluster path executes the rate-limit script via evalsha with a fallback to eval when the server returns NOSCRIPT (added in #13363).

However, the bundled resty-redis-cluster = 1.05-1 treats any command error other than MOVED/ASK/CLUSTERDOWN (including NOSCRIPT) as a possible node failure and triggers a full cluster slot refresh:

elseif string.sub(err, 1, 11) == "CLUSTERDOWN" then
    return nil, "Cannot executing command, cluster status is failed!"
else
    --There might be node fail, we should also refresh slot cache
    self:refresh_slots()   -- NOSCRIPT lands here
    return nil, err
end

So in cluster mode every NOSCRIPT (first call on a fresh node, after SCRIPT FLUSH, or after failover to a replica that does not yet have the script cached) forces an unnecessary refresh_slots() over the whole cluster before the eval fallback runs. The request still succeeds, but each cache miss pays for a needless cluster-wide topology refresh.

lua-resty-redis-cluster = 1.3.3-0 handles NOSCRIPT explicitly and returns the error without refreshing slots, which matches the plugin's evalsha fallback:

elseif string.sub(err, 1, 8) == "NOSCRIPT" then
    return nil, err

This PR upgrades the dependency accordingly and adds a cluster test that flushes the script cache and verifies the request still succeeds through the eval fallback.

Note: APISIX already depends on the API7-maintained api7-lua-resty-redis-connector, so adopting the maintained lua-resty-redis-cluster fork (published on luarocks.org) is consistent with existing dependencies.

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (if not, please discuss on the APISIX mailing list first)

…d as node failure

The limit-count redis-cluster path runs evalsha with a NOSCRIPT fallback
to eval (apache#13363). The bundled resty-redis-cluster 1.05 treats any error
other than MOVED/ASK/CLUSTERDOWN, including NOSCRIPT, as a node failure
and triggers a full cluster slot refresh. So every NOSCRIPT (fresh node,
SCRIPT FLUSH, or failover to a replica missing the script) forces an
unnecessary refresh_slots() before the eval fallback runs.

lua-resty-redis-cluster 1.3.3 handles NOSCRIPT explicitly and returns the
error without refreshing slots, matching the plugin's evalsha fallback.

Add a cluster test that flushes the script cache and asserts the request
still succeeds via the eval fallback.
@dosubot dosubot Bot added size:M This PR changes 30-99 lines, ignoring generated files. dependencies Pull requests that update a dependency file labels Jun 20, 2026

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR upgrades the Redis Cluster Lua client dependency used by APISIX so that NOSCRIPT errors from EVALSHA are no longer treated as node failures (avoiding unnecessary cluster slot refreshes), and adds a regression test to ensure the limit-count Redis Cluster path correctly falls back to EVAL after script cache flush.

Changes:

  • Upgrade LuaRocks dependency from resty-redis-cluster to lua-resty-redis-cluster (v1.3.3-0) to avoid slot refresh on NOSCRIPT.
  • Add a Redis Cluster test that flushes scripts across the cluster and asserts the EVALSHAEVAL fallback path succeeds.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
apisix-master-0.rockspec Updates the Redis Cluster client dependency to a version that handles NOSCRIPT without forcing a slot refresh.
t/plugin/limit-count-redis-cluster.t Adds coverage for NOSCRIPT fallback behavior in Redis Cluster by flushing script caches and validating a successful request + expected log.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread t/plugin/limit-count-redis-cluster.t
@membphis

Copy link
Copy Markdown
Member

Blocking concern: this dependency upgrade appears to change Redis Cluster timeout/retry behavior implicitly, beyond the intended NOSCRIPT fix.

The NOSCRIPT handling in lua-resty-redis-cluster = 1.3.3-0 matches the limit-count evalsha -> eval fallback, so that part looks good. However, the replacement is not behavior-neutral compared with resty-redis-cluster = 1.05-1:

  • The old library defaults to DEFAULT_MAX_REDIRECTION = 5 and DEFAULT_MAX_CONNECTION_ATTEMPTS = 3; the new library defaults to 2 and 2.
  • APISIX currently passes connect_timeout/read_timeout/send_timeout = conf.redis_timeout from apisix/utils/rediscluster.lua, and the plugin docs/schema expose redis_timeout as applying to redis-cluster.
  • In the new library, try_hosts_slots() and handle_command_with_retry() use fixed connect timeouts (100ms for the first attempt, 800ms for later attempts) instead of consistently using config.connect_timeout.

This can make policy = "redis-cluster" fail earlier than before in high-latency, TLS, cross-AZ, rolling replacement, MOVED/ASK-heavy, or transient node-failure scenarios. More importantly, increasing redis_timeout may no longer fully affect the Redis Cluster connection path as documented.

Could we either preserve the previous APISIX behavior in the wrapper, for example by explicitly setting the old retry defaults and ensuring redis_timeout is still honored by the cluster connect paths, or document the compatibility change and add tests that cover the new timeout/retry semantics? Without that, this dependency upgrade has a user-visible compatibility risk that is not disclosed by the PR title/description.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Pull requests that update a dependency file size:M This PR changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants