Add more configuration params #1019

Open
vishesh92 wants to merge 1 commit into kantord:main from vishesh92:add-more-configuration-params

vishesh92 commented Oct 13, 2025

This pull request introduces a comprehensive set of configuration options for performance tuning and search quality in SeaGOAT. It adds new server-side configuration sections for vector search, text search, engine processing, and query defaults, allowing users to fine-tune behavior for different repository sizes and hardware constraints. The codebase is updated to use these new config values throughout, replacing previous hardcoded limits and improving flexibility.

Configuration System Enhancements

  • Added new configuration sections (chroma, ripgrep, engine, query) to docs/configuration.md and seagoat/utils/config.py, with schema validation and sensible defaults for vector search, text search, engine processing, and query parameters.
  • Updated documentation to include detailed descriptions of new settings and practical performance tuning examples for various scenarios (large repos, memory-constrained systems, faster analysis, better search quality).

Engine and Query Processing

  • The engine now dynamically determines the minimum number of chunks to analyze and the number of worker threads based on configuration, replacing fixed values.
  • Query endpoints in seagoat/server.py use configurable defaults for result limits and context lines, improving usability and consistency.

Vector Search Improvements

  • All ChromaDB vector search logic now uses configurable values for maximum vector distance, chunk fetch limits, and result over-fetching, replacing previous constants.

Ripgrep Text Search Improvements

  • Ripgrep caching and search logic now use configurable file size and memory-mapped cache limits, improving scalability and resource control.
  • Ripgrep search results are scored using the configured vector distance for consistency with vector search.

These changes make SeaGOAT much more configurable, allowing users to optimize performance and search quality for their specific use case.


JiwaniZakir left a comment

The rename from embedding_function to embeddingFunction in the chroma config block (visible in the docs diff) is a silent breaking change for any existing users who have already configured this field — there's no deprecation warning, fallback handling, or migration note. At minimum, the code that reads this config should check both keys and warn if the old snake_case form is found.
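A minimal sketch of the fallback suggested above, assuming the chroma section is available as a plain dict. The helper name read_embedding_function is hypothetical and not part of SeaGOAT; only the two key spellings come from the comment.

```python
import warnings
from typing import Any, Dict, Optional

OLD_KEY = "embedding_function"  # snake_case spelling mentioned in the review
NEW_KEY = "embeddingFunction"   # camelCase spelling mentioned in the review


def read_embedding_function(chroma_config: Dict[str, Any]) -> Optional[Any]:
    """Hypothetical helper: prefer the new key, fall back to the old one."""
    if NEW_KEY in chroma_config:
        return chroma_config[NEW_KEY]
    if OLD_KEY in chroma_config:
        # FutureWarning is shown to end users by default, unlike
        # DeprecationWarning, which is hidden outside of __main__.
        warnings.warn(
            f"'{OLD_KEY}' is deprecated; use '{NEW_KEY}' instead",
            FutureWarning,
            stacklevel=2,
        )
        return chroma_config[OLD_KEY]
    # Neither key is set; the caller decides which default to apply.
    return None
```

Existing configs would keep working with a visible warning, and the old key could be dropped in a later release.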

In engine.py, a new ThreadPoolExecutor is created on every call to query() but is never explicitly shut down. With maxWorkers now configurable up to 32, repeated queries could accumulate thread pool overhead. The executor should be used as a context manager (a plain with block, since ThreadPoolExecutor is not an async context manager, or a try/finally that calls .shutdown(wait=False)) or reused as an instance variable.
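One way the per-query executor could be scoped, sketched with a hypothetical run_sources function rather than the actual query() code, which is not shown in this conversation:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_sources(fetchers: List[Callable[[], list]], max_workers: int) -> list:
    """Hypothetical stand-in for the per-query fan-out described above."""
    # Leaving the with block calls executor.shutdown(wait=True), so the
    # worker threads are released even if one of the fetchers raises.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch) for fetch in fetchers]
        results: list = []
        # Collect in submission order so the output is deterministic.
        for future in futures:
            results.extend(future.result())
    return results
```

For example, run_sources([lambda: [1], lambda: [2, 3]], max_workers=2) returns [1, 2, 3]. Keeping a single executor on the engine instance would avoid recreating threads per query, at the cost of needing an explicit shutdown when the engine is torn down.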

Finally, the config values read in _create_vector_embeddings (e.g., self.config["server"]["engine"]["minChunksToAnalyze"]["percentage"]) lack any runtime validation — a user setting percentage to 0.0 and minValue to 0 would result in minimum_chunks_to_analyze = 0, silently producing empty analysis. Given that ranges are documented (e.g., 0.01-1.0), there should be a validation step — either at config load time or with a clear error before use.
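A sketch of what such a check might look like at config load time. The function name is hypothetical; the sketch assumes the documented 0.01-1.0 range applies to percentage and that minValue should be at least 1, neither of which is confirmed in this conversation.

```python
from typing import Any, Dict


def validate_min_chunks_config(config: Dict[str, Any]) -> None:
    """Hypothetical validator; raises ValueError on out-of-range values."""
    settings = config["server"]["engine"]["minChunksToAnalyze"]
    percentage = settings["percentage"]
    min_value = settings["minValue"]

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(percentage, bool) or not isinstance(percentage, (int, float)):
        raise ValueError("minChunksToAnalyze.percentage must be a number")
    # Assumed range: the 0.01-1.0 interval quoted from the docs.
    if not 0.01 <= percentage <= 1.0:
        raise ValueError(
            f"minChunksToAnalyze.percentage must be in [0.01, 1.0], got {percentage}"
        )

    if isinstance(min_value, bool) or not isinstance(min_value, int):
        raise ValueError("minChunksToAnalyze.minValue must be an integer")
    # Assumed lower bound: at least one chunk must always be analyzed.
    if min_value < 1:
        raise ValueError(
            f"minChunksToAnalyze.minValue must be at least 1, got {min_value}"
        )
```

With a check like this, the percentage 0.0 and minValue 0 combination fails loudly at startup instead of producing an empty analysis.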
