
MAVANN: Metadata-Aware Vector Approximate Nearest Neighbor

Cosdata Research

1. Introduction

This document details the design and implementation of metadata-based filtering capabilities in the Cosdata vector database, specifically for HNSW dense vectors. The core innovation of MAVANN (Metadata-Aware Vector Approximate Nearest Neighbor) lies in extending vector dimensions to encode metadata directly within the vector space. By appending metadata dimensions to the base vector representation, MAVANN enables efficient filtering while preserving the accuracy of similarity search.

2. Comparison with Traditional Filtering Approaches

2.1. Pre-Filtering

Pre-filtering applies constraints before the similarity search begins, reducing the candidate pool. While effective, it suffers from:

  • Reduced Search Space: By filtering before the vector search, it artificially limits the potential matches and may miss semantically relevant results that don't exactly match the metadata criteria.
  • Performance Issues: Maintaining separate metadata indexes alongside vector indexes can be slow, especially with high-cardinality metadata fields. Query planning overhead also adds complexity.

2.2. Post-Filtering

Post-filtering applies metadata constraints after retrieving the nearest neighbors. While preserving ANN speed, it introduces:

  • Inefficient Resource Usage: Full vector similarity search is performed before filtering, leading to wasted computation on discarded results.
  • Result Quality Issues: If too many top matches are filtered out, the final result set may be insufficient, requiring retrieval of more candidates.
  • Architectural Complexity: Requires separate filtering logic, complicating caching, optimization, and ranking.

2.3. MAVANN's Novel Approach

MAVANN eliminates the inefficiencies of pre- and post-filtering by integrating metadata into the vector representation itself. By encoding metadata within the vector space, filtering becomes an inherent part of the ANN search process, avoiding unnecessary computation and improving precision.

3. Vector Structure in MAVANN

Instead of treating metadata as separate filters applied before or after the similarity search, MAVANN expands the vector space by appending metadata dimensions to the core feature vector. This means each vector has two segments:

  • Values Segment: The primary dimensions capturing the core vector representation (e.g., an embedding from a neural network).
  • Metadata Segment: Additional dimensions encoding metadata, such as categorical tags, numerical constraints, or temporal information.

Formally, for an original vector v with d dimensions, MAVANN extends it to v' with d + m dimensions, where m is the number of metadata dimensions.
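As a minimal sketch, assuming NumPy arrays and a zero-filled metadata segment for vectors that carry no metadata (the function name and defaults are illustrative, not the Cosdata implementation):

```python
import numpy as np

def extend_vector(values, num_metadata_dims, metadata_values=None):
    """Append a metadata segment of length m to a d-dimensional values segment.

    When no metadata is supplied, the segment is zero-filled so that every
    vector in the collection shares the same d + m dimensionality.
    """
    metadata_segment = np.zeros(num_metadata_dims)
    if metadata_values is not None:
        metadata_segment[: len(metadata_values)] = metadata_values
    return np.concatenate([values, metadata_segment])

# A 4-dimensional embedding extended with 3 metadata dimensions -> 7 dimensions.
v = np.array([0.12, -0.53, 0.31, 0.77])
v_prime = extend_vector(v, num_metadata_dims=3)
assert v_prime.shape == (7,)
```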

4. Phased Distance Metric Computation

MAVANN enables a staged similarity computation, optimizing performance while maintaining precision. This staged approach is particularly beneficial in hierarchical search structures like HNSW and is relevant to both indexing and query processing.

Step 1: Metadata Segment Evaluation

  • Compute the distance metric (e.g., cosine similarity, Euclidean distance) only on the metadata segment.
  • If the metadata segment satisfies the required constraints (e.g., similarity threshold is met), proceed to the next step.

Step 2: Full Vector Similarity Computation

  • Compute the similarity using the full vector (values + metadata).
  • This ensures that filtering is integrated into the ANN search without requiring explicit pre- or post-processing.
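A minimal sketch of this two-step computation, using a plain dot product as the distance metric and a hypothetical threshold parameter (both are assumptions, not the exact Cosdata implementation):

```python
import numpy as np

def phased_similarity(query, candidate, values_dims, metadata_threshold):
    """Two-step similarity: cheap metadata-segment check first, full vector second.

    Returns None when the candidate's metadata segment does not satisfy the
    constraint, so the caller can prune it without the full computation.
    """
    # Step 1: score only the metadata segment (dimensions after the values segment).
    metadata_score = float(np.dot(query[values_dims:], candidate[values_dims:]))
    if metadata_score < metadata_threshold:
        return None  # Constraint not met; skip the full similarity computation.

    # Step 2: similarity over the full vector (values + metadata).
    return float(np.dot(query, candidate))
```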

5. Index Structure

5.1. Base Vector Representation

All vectors in the system, whether used for pure similarity search or metadata filtering, must maintain the same dimensionality for HNSW graph construction and search. To achieve this, MAVANN extends the values segment with additional metadata segment dimensions, ensuring consistent indexing across all vectors.

5.2. Metadata Segment Encoding

The metadata segment follows a carefully designed encoding scheme that balances accuracy, efficiency, and storage requirements. Each metadata field requires a specific number of additional dimensions based on its cardinality (number of possible values).

Dimension Allocation

For each metadata field, the number of required dimensions is the smallest n such that 2^n is at least the field's cardinality, i.e. ceil(log2(cardinality)). This allocation supports an efficient binary encoding of the field's values (a short sketch follows these examples). Examples include:

  • A field representing months (12 possible values) requires 4 dimensions, since 2^4 = 16 ≥ 12.
  • A field for days of the week (7 values) requires 3 dimensions, since 2^3 = 8 ≥ 7.
  • A binary field (2 values) requires 1 dimension, since 2^1 = 2.
  • A field with 100 possible values requires 7 dimensions, since 2^7 = 128 ≥ 100.
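A short sketch of this allocation rule, reproducing the examples above:

```python
import math

def metadata_dims(cardinality):
    """Dimensions for a field: the exponent of the smallest power of 2 >= cardinality."""
    return max(1, math.ceil(math.log2(cardinality)))

assert metadata_dims(12) == 4   # months: 2^4 = 16 >= 12
assert metadata_dims(7) == 3    # days of the week: 2^3 = 8 >= 7
assert metadata_dims(2) == 1    # binary field: 2^1 = 2
assert metadata_dims(100) == 7  # 2^7 = 128 >= 100
```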

5.3. Metadata Schema

At the time of creating a collection, the user specifies a metadata schema, which includes:

  • Metadata fields that require filtering.
  • The set of unique values for each metadata field; values are mapped to integers using an implicit lexicographical ordering.
  • AND/OR query support specifications to optimize logical index creation.

The metadata schema is stored in LMDB. If an insert request contains metadata values outside the schema, an error is returned.
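As an illustration, a hypothetical schema payload might look like the following; the field names and key names are assumptions made for this sketch, not the actual Cosdata API:

```python
# Hypothetical metadata schema supplied at collection-creation time.
# Field names and key names are illustrative only, not the actual Cosdata API.
metadata_schema = {
    "fields": [
        {"name": "month", "values": list(range(1, 13))},           # 12 values -> 4 dimensions
        {"name": "category", "values": ["news", "blog", "docs"]},  # 3 values  -> 2 dimensions
    ],
    # Field combinations that must support AND queries; these determine which
    # high-weight vector copies are created at index time.
    "supported_conditions": [
        {"op": "and", "fields": ["month", "category"]},
    ],
}
```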

Updating Metadata Schema

If new values need to be added to an existing field, the power-of-2 rounding may have left spare dimensions available. If no room exists, a reindexing process is required. Adding a new metadata field always necessitates reindexing, since it changes the vector dimensionality.

5.4. Metadata Filtering Index

The unified index structure contains multiple copies of each vector, where each copy is optimized for different metadata filtering scenarios. While these appear logically separate, they are physically stored within a single HNSW graph structure sharing a common root node. The key aspects are:

  • Base Vector Copy: Contains metadata dimensions but no high-weight values.
  • Combined Field Copies: For AND query support, vectors are copied with high-weight values for multiple field combinations. The field combinations that must be supported are specified in the metadata_schema at collection-creation time.
  • All-Fields Combined Copy: For OR query support, a single copy with high-weight values for all fields is sufficient, which avoids creating copies for individual fields. If the metadata_schema includes AND query support across all fields, this copy is created anyway.

Vector Extension and Indexing Process

The indexing process requires multiple passes for each vector:

  1. First Pass: Index the base vector with metadata dimensions (no high weights)
  2. Subsequent Passes: For each required metadata field combination (AND query support):
    • Create a copy of the vector
    • Set appropriate high-weight values in the metadata dimensions for all relevant fields
    • Index this copy
  3. Additional Pass (Optional): For the all-fields combination (OR query support):
    • Create a copy with high-weight values for all fields
    • Index this combination copy
    • This pass is optional: if the metadata_schema requires AND support across all fields, this copy was already created in step 2

All these passes contribute to building the unified HNSW graph structure. The metadata dimensions ensure consistent vector dimensionality across all copies, while the high-weight values enable effective filtering during queries.
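A sketch of these passes is shown below. The high-weight magnitude, the helper names, and the choice of unweighted binary bits for the base copy are assumptions made for illustration:

```python
import numpy as np

HIGH_WEIGHT = 64.0  # Illustrative magnitude; the real value is an implementation detail.

def encode_field(value, dims, weight):
    """Binary-encode an integer-mapped field value into `dims` dimensions, scaled by `weight`."""
    bits = [(value >> i) & 1 for i in range(dims)]
    return weight * np.array(bits, dtype=float)

def build_copies(values, metadata, field_dims, and_combinations, or_support):
    """Yield every vector copy to index for one input vector.

    values:            the base embedding (NumPy array)
    metadata:          {field_name: integer-mapped value}
    field_dims:        {field_name: number of metadata dimensions}
    and_combinations:  field-name tuples that need AND query support
    or_support:        whether an all-fields copy is needed for OR queries
    """
    field_order = list(field_dims)

    def make_copy(high_weight_fields):
        segments = []
        for name in field_order:
            weight = HIGH_WEIGHT if name in high_weight_fields else 1.0
            segments.append(encode_field(metadata[name], field_dims[name], weight))
        return np.concatenate([values] + segments)

    # Pass 1: base copy with metadata dimensions but no high weights
    # (unweighted bits here; how the base copy fills these dimensions is an assumption).
    yield make_copy(set())
    # Pass 2: one copy per AND field combination declared in the metadata_schema.
    for combo in and_combinations:
        yield make_copy(set(combo))
    # Pass 3 (optional): all-fields copy for OR queries, unless already produced in pass 2.
    all_fields = set(field_order)
    if or_support and all(set(c) != all_fields for c in and_combinations):
        yield make_copy(all_fields)
```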

6. Query Processing

6.2. Metadata Filtering Queries

For metadata-filtered searches, the query vector is constructed with appropriate +1/-1 values in the metadata dimensions:

  • +1 for matching the desired value's position
  • -1 for other positions to prevent false matches

The system automatically routes the query to the appropriate vector copies based on the filtering criteria:

  • Single-field filters use the all-fields copy.
  • A naive implementation of OR conditions requires multiple searches against the all-fields copy, with the results merged.
  • AND conditions use the pre-computed combination copies.
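The routing described above can be sketched as follows, using a hypothetical filter-expression format (the tuple encoding and copy identifiers are assumptions):

```python
def route_query(filter_expr, and_copies):
    """Choose which vector copies to search for a given filter expression.

    filter_expr: ("eq", field, value), ("and", [eq, ...]) or ("or", [eq, ...])
    and_copies:  set of frozensets of field names with pre-built AND copies
    Returns a list of (copy_id, sub_filter) searches; the caller merges results.
    """
    op = filter_expr[0]
    if op == "eq":
        # Single-field filter: one search against the all-fields copy.
        return [("all_fields", filter_expr)]
    if op == "and":
        fields = frozenset(f for _, f, _ in filter_expr[1])
        assert fields in and_copies, "AND combination must be declared in the metadata_schema"
        return [(fields, filter_expr)]
    if op == "or":
        # Naive OR handling: one search per branch against the all-fields copy, merged later.
        return [("all_fields", branch) for branch in filter_expr[1]]
    raise ValueError(f"unsupported operator: {op}")
```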

6.3. Query Vector Encoding Details

The effectiveness of metadata filtering relies on a carefully designed query vector encoding scheme that uses +1/-1 values to ensure precise matching. This encoding scheme is fundamental to supporting both equality and inequality filters.

Equality Filter Encoding

When searching for vectors with a specific metadata value, the system employs a binary encoding strategy across the dimensions allocated for that field. For example, when filtering for value 1 in a field, the query vector would have:

  • A positive value (+1) in the position corresponding to bit 0
  • A negative value (-1) in the position corresponding to bit 1
  • Similar negative values in all other bit positions for that field

This encoding ensures accurate discrimination between different values. For instance, when searching for value 1, a vector with value 3 (binary 11) will not match because the negative query value at position 1 will create a repelling force in the dot product calculation, effectively eliminating false matches.
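A small sketch of the equality encoding and the resulting dot products, using an illustrative high-weight magnitude of 64 for the indexed copies:

```python
import numpy as np

def encode_equality_query(value, dims):
    """+1 where the target value's binary bit is 1, -1 where it is 0."""
    return np.array([1.0 if (value >> i) & 1 else -1.0 for i in range(dims)])

def encode_indexed_value(value, dims, weight=64.0):
    """High-weight binary encoding used in the indexed copies (weight is illustrative)."""
    return weight * np.array([(value >> i) & 1 for i in range(dims)], dtype=float)

# Filtering for value 1 in a 2-dimensional field (possible values 0..3).
query = encode_equality_query(1, dims=2)          # [+1, -1]: bit 0 set, bit 1 unset
print(np.dot(query, encode_indexed_value(1, 2)))  #  64.0 -> attracted, matches
print(np.dot(query, encode_indexed_value(3, 2)))  #   0.0 -> bit 1 repels, no boost
```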

Inequality Filter Encoding

For inequality filters (field != x), the system inverts the encoding used in equality filters. Taking the same example of filtering for "not equal to 1":

  • The positive and negative values from the equality encoding are inverted
  • Position 0 becomes -1
  • Position 1 becomes +1
  • Other positions retain appropriate values to maintain filtering accuracy

During dot product calculations, these inverted values create attractive forces for all values except the one being excluded, effectively implementing the inequality constraint.
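The corresponding sketch for an inequality filter, again with an illustrative weight of 64:

```python
import numpy as np

def encode_inequality_query(excluded, dims):
    """Inverted equality encoding: -1 where the excluded value's bit is 1, +1 where it is 0."""
    return np.array([-1.0 if (excluded >> i) & 1 else 1.0 for i in range(dims)])

def encode_indexed_value(value, dims, weight=64.0):
    return weight * np.array([(value >> i) & 1 for i in range(dims)], dtype=float)

# Filter "field != 1" over a 2-dimensional field.
query = encode_inequality_query(1, dims=2)        # [-1, +1]
print(np.dot(query, encode_indexed_value(1, 2)))  # -64.0 -> the excluded value is repelled
print(np.dot(query, encode_indexed_value(2, 2)))  #  64.0 -> other values are attracted
```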

The high-weight values used in the indexed vectors, combined with the +1/-1 encoding in query vectors, create substantial differences in dot product results between matching and non-matching vectors. This ensures reliable filtering even in the presence of approximate nearest neighbor search.

7. Performance Considerations

MAVANN achieves efficient ANN filtering without degrading search performance:

  • Metadata dimensions do not impact pure similarity searches.
  • Staged filtering minimizes unnecessary similarity computations.
  • High-weight metadata values create definitive separation in dot product calculations.
  • The unified index of logical vector copies balances storage overhead against retrieval speed.

8. Conclusion

MAVANN integrates metadata filtering seamlessly into the ANN search process by expanding the vector space and encoding metadata within the vector representation. By leveraging metadata-aware indexing and staged similarity computation, MAVANN ensures efficient and accurate retrieval while preserving the high performance of HNSW-based vector search. Unlike traditional pre-filtering and post-filtering methods, MAVANN optimally balances precision, recall, and computational efficiency, making it a superior approach for metadata-aware ANN search.