A Comprehensive Guide to Vector Databases

While vector databases have been available for over a decade, they’ve become significantly more prominent in the last few years. One of the main reasons for their recent rapid emergence is the recognition of their suitability for use within generative AI applications. This is best exemplified by the tens of millions in capital raised by companies such as ChromaDB, Weaviate, and Pinecone, as investors seek to capitalise on the integral role vector databases are set to play in the coming AI boom.

In this guide, we delve into vector databases, detailing how they work, what their benefits are, and how they can be used to enhance the functionality and development of generative AI applications.

What are Vector Databases?

A vector embedding is an array of numbers used to represent data, such as text, an image, or an audio file. Each embedding has a fixed number of dimensions, ranging from a few to thousands depending on the complexity of the data, and each dimension captures a feature or attribute of the data point. Vector embeddings provide a way to represent unstructured data numerically while retaining its semantic meaning, which makes them ideal for machine processing and understanding.

A simple and relatable example of an embedding is how colours are represented in an RGB colour scheme. In this case, an embedding has three dimensions, each ranging from 0 to 255, representing the proportions of red, green, and blue respectively. The embedding for the colour purple, for example, is (128, 0, 128).
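To make the RGB analogy concrete, here is a minimal sketch that treats colours as three-dimensional embeddings and measures the distance between them. Only the purple value comes from the text above; the other two colours and the helper name `distance` are illustrative assumptions.

```python
import math

# Each colour is a three-dimensional embedding: (red, green, blue).
# Only purple comes from the text above; the other two are illustrative.
colours = {
    "purple": (128, 0, 128),
    "blue_violet": (138, 43, 226),
    "yellow": (255, 255, 0),
}

def distance(a, b):
    """Euclidean (straight-line) distance between two embeddings."""
    return math.dist(a, b)

# Find the colour whose embedding lies closest to purple.
others = [name for name in colours if name != "purple"]
nearest = min(others, key=lambda name: distance(colours["purple"], colours[name]))
print(nearest)
```

In this toy space, blue-violet sits much closer to purple than yellow does, which mirrors how distance stands in for similarity in a real embedding space.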

A vector database, in turn, is a type of data store designed for the storage of vector embeddings. An embedding model maps each data point onto a continuous, high-dimensional vector space (also commonly referred to as an embedding space) in which the distance between embeddings represents their semantic similarity. The smaller the distance between the vector embeddings, the stronger the relationship between the data points they represent.

For instance, the embeddings for the words “cow” and “goat” should have a small distance between them, as both are animals – and, more specifically, farm animals. Similarly, the embeddings for “USA” and “Washington D.C.” should be close together to reflect their relationship as a country and its capital city. Conversely, the embeddings in the first pair wouldn’t be as close to those in the second pair, as farm animals and countries don’t have as strong a semantic relationship with each other.

How do Vector Databases Work?

Fundamentally, a vector database operates in the same way as other types of databases: data is stored within the database and is later retrieved according to a specific query. However, because vector databases are specifically designed to process and store embeddings, there are a few key differences in how they operate.

Vector Database Storage

First, let’s consider how a vector database stores data – for which the initial step is taking raw data and converting it into vector embeddings. This requires putting each data point through an embedding model that quantifies each of its attributes into a dimension. To achieve this, an embedding model must be trained on a large dataset to learn how to apply meaningful vector representations to a given data point and where to map them in a continuous vector space.

From there, the created embedding is inserted in the vector database alongside a reference to the data it represents. However, simply adding embeddings into the database means that the query data must be compared against every entry to determine their similarity. While this might prove sufficient for a relatively small dataset, this becomes increasingly computationally expensive (and, from a user’s perspective, slower) as the number of data points extends into the millions – or even into the billions.
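The exhaustive comparison described above can be sketched as follows. This is a simplified illustration, not how any particular database implements it; the two-dimensional embeddings, the reference names, and the function `brute_force_search` are all hypothetical.

```python
import math

def brute_force_search(query, entries, k=2):
    """Compare the query against every stored embedding and return the
    references of the k closest ones (by Euclidean distance)."""
    scored = [(math.dist(query, vector), ref) for ref, vector in entries.items()]
    scored.sort()  # smallest distance first
    return [ref for _, ref in scored[:k]]

# Hypothetical store: reference to the original data -> its embedding.
entries = {
    "doc-cow": (0.9, 0.1),
    "doc-goat": (0.8, 0.2),
    "doc-usa": (0.1, 0.9),
}

result = brute_force_search((0.88, 0.12), entries, k=2)
print(result)
```

Every query touches every entry, so the work grows linearly with the size of the database, which is the cost the paragraph above describes.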

The solution to this is creating a vector index, a separate data structure that pre-computes the relationships between embeddings and groups similar embeddings together. So, instead of comparing the query against every entry in the database, it’s compared against only the most promising candidates in the index to enable faster retrieval.

To index vector embeddings, vector databases typically use a combination of different types of approximate nearest neighbour (ANN) algorithms. Here are some of the frequently used ANN algorithms:

Hierarchical Navigable Small World (HNSW): a graph-based algorithm that organizes vectors into a multi-layered graph, where nodes represent vectors and edges connect vectors that are similar. This is the most commonly used ANN algorithm.
Locality-Sensitive Hashing (LSH): uses a hashing function to separate vectors into “buckets” according to their similarity.
Random Projection: a compression-based algorithm that reduces the dimensionality of vectors to make them easier to query.
Product Quantisation (PQ): another compression-based algorithm that breaks down vectors into chunks and replaces each chunk with a compact code.
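As an illustration of one of the algorithms listed above, the sketch below shows a common LSH variant based on random hyperplanes, in which each bit of the hash records which side of a hyperplane a vector falls on. It is a simplified example under stated assumptions: the function names, the seed, and the sample vectors are hypothetical, and production indexes typically use several hash tables and further tuning.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Generate random hyperplanes (as normal vectors) through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def lsh_bucket(vector, hyperplanes):
    """Hash a vector to a bit string: one bit per hyperplane, set by the
    side of the hyperplane on which the vector lies."""
    bits = []
    for plane in hyperplanes:
        side = sum(p * v for p, v in zip(plane, vector))
        bits.append("1" if side >= 0 else "0")
    return "".join(bits)

planes = make_hyperplanes(num_planes=8, dim=3)

# Hypothetical vectors: two pointing the same way, one pointing the opposite way.
vectors = {
    "a": [1.0, 2.0, 3.0],
    "a_scaled": [2.0, 4.0, 6.0],
    "a_flipped": [-1.0, -2.0, -3.0],
}

# The index maps each bucket to the references of the vectors hashed into it.
index = {}
for ref, vec in vectors.items():
    index.setdefault(lsh_bucket(vec, planes), []).append(ref)
```

At query time, only the vectors in the query’s bucket need to be compared, rather than the whole dataset. Vectors pointing in similar directions tend to share a bucket, while dissimilar ones tend to be separated.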

Querying and Retrieval

As with inputting information into a vector database, the first step in performing a query is to convert it into an embedding. The query embedding is then compared against those in the vector index, and the most similar entries are retrieved from the database. Similarity is defined by which stored vector embeddings are closest, i.e., have the smallest distance to the query vector.

Here are a few of the most commonly used methods for measuring similarity between vector embeddings:

Cosine Similarity: this measures the cosine of the angle between two vectors, which can range from -1 to 1. The closer the cosine similarity of two vectors is to 1, the greater their similarity (with a value of 1 meaning the vectors point in exactly the same direction, regardless of their magnitudes).
Euclidean Distance: this measures the straight-line distance between two vectors, which can be any value between 0 and infinity. The closer the Euclidean distance is to zero, the greater their similarity (with a value of 0 signifying identical vectors). Conversely, the larger the distance, the more dissimilar the vectors.
Dot Product: this is the product of the two vectors’ magnitudes (a magnitude being the square root of the sum of the squares of a vector’s dimensions) and the cosine of the angle between them. It ranges from -infinity to infinity: a positive value indicates similarity between vectors, while a negative value represents dissimilarity. The larger the value, the greater the extent of their similarity.
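The three measures above can be written out in a few lines. This is a minimal sketch using plain Python lists; the function names and sample vectors are illustrative, and real systems typically rely on optimised numerical libraries.

```python
import math

def dot_product(a, b):
    """Sum of the element-wise products of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def magnitude(a):
    """Square root of the sum of the squares of a vector's dimensions."""
    return math.sqrt(dot_product(a, a))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, from -1 to 1."""
    return dot_product(a, b) / (magnitude(a) * magnitude(b))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors, from 0 upwards."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]     # same direction as a, twice the length
c = [-1.0, -2.0, -2.0]  # opposite direction to a

print(cosine_similarity(a, b), euclidean_distance(a, b), dot_product(a, b))
print(cosine_similarity(a, c), euclidean_distance(a, c), dot_product(a, c))
```

Note that `a` and `b` have a cosine similarity of 1 despite a non-zero Euclidean distance: cosine similarity ignores magnitude, whereas Euclidean distance and the dot product do not.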

What are the Benefits of a Vector Database?

Let’s look at the main advantages of vector databases and why they’re well suited for use with AI applications.

Similarity Search Capabilities: the most significant advantage of vector databases is their ability to retrieve data based on semantic similarity, i.e., likeness, as opposed to the exact matching capabilities of conventional databases. This makes them ideally suited for use with generative AI models that must interpret the context of a given input, as opposed to just matching keywords, to provide the most relevant and accurate response.
Flexibility: vector databases are designed to process high-dimensional data, which makes them capable of storing text, images, audio, video, and other complex unstructured data. This makes them useful in a variety of use cases, including generative AI models and other modern applications that require large amounts of multi-modal data.
Speed: the ANN indexing algorithms employed by vector databases are specially designed for high-dimensional data and help optimise semantic similarity searches across increasingly large datasets – which could include billions of data points. However, the caveat here is that ANN algorithms (by definition) are designed to return the approximate nearest results, creating a slight trade-off in accuracy for increased efficiency and speed.
Scalability: vector databases are capable of being scaled both horizontally and vertically, making it easier and more cost-effective to accommodate the increasing data requirements of GenAI applications as they grow.
Cost: vector databases can be used as an external data source when implementing retrieval augmented generation (RAG). As RAG is a cost-effective alternative to fine-tuning, vector databases are instrumental in reducing the cost of tailoring foundational GenAI models to specific use cases.

Use of Vector Databases in GenAI

Having explored how vector databases work and the benefits they offer, let’s examine how they can be used to improve GenAI applications.

Retrieval Augmented Generation (RAG)

RAG is an architectural framework for LLMs that enables them to retrieve information from an external source and add it to a user’s input prompt to enhance the generated output. By retrieving additional contextual information from an external source, an LLM gains access to information it wouldn’t otherwise have and can provide more accurate, reliable, detailed, and current responses to prompts.

This offers several advantages when developing an LLM, such as:

Fewer Hallucinations: augmenting an LLM with an external data source provides it with additional data that it may not have had access to during its training. This reduces the chances of inaccurate output due to a lack of exposure to pertinent information during training. Just as importantly, it makes it easier to determine how the LLM formulated its given response, allowing users to track the provenance of an LLM’s output.
More Up-To-Date Information: the recency of the information in an LLM’s output is determined by that of the information it was trained on: the “training cut-off date”. If its corpus only goes up to a certain date (such as January 2022, in the case of ChatGPT 3.5), then it won’t have access to pertinent information after that date. With RAG, an LLM’s information is as current as that in the external data source it can retrieve data from – which can be updated more frequently.
More Specific Information: RAG allows for the use of information specific to an LLM’s intended use case. This could include domain-specific information (e.g., healthcare, finance, science, technology), proprietary information, and perhaps most importantly, sensitive data.
By-Passes the Need for Fine-Tuning: RAG offers a faster and cheaper alternative to fine-tuning a pre-trained LLM. Instead of training a base model on additional information, you can simply input it into an external data source and retrieve it at the time of inference.

As vector databases are specifically designed to store the same embeddings utilised by LLMs, they are ideally suited as an external data source within a RAG framework. Since they store vectorized high-dimensional data, vector databases can be updated efficiently, providing LLMs with up-to-date information more frequently. Additionally, a vector database’s efficient semantic search capabilities allow LLMs to retrieve appropriate data quickly for faster response times.
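The retrieve-then-augment flow of RAG can be sketched roughly as follows. Everything here is a simplified assumption: `toy_embed` is a word-counting stand-in for a real embedding model, the in-memory `store` list stands in for a vector database, and the documents and prompt template are invented for illustration. The final call to an LLM is deliberately left out.

```python
import math
import re

# Hypothetical vocabulary for the toy embedding below.
VOCAB = ["refund", "shipping", "days", "policy", "returns", "delivery"]

def toy_embed(text):
    """Stand-in for a real embedding model: counts vocabulary words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

# Stand-in for the vector database: (embedding, original text) pairs.
documents = [
    "Refund policy: returns are accepted within 30 days.",
    "Shipping: standard delivery takes 5 days.",
]
store = [(toy_embed(doc), doc) for doc in documents]

def build_prompt(question, top_k=1):
    """Retrieve the most similar documents and prepend them to the question."""
    query = toy_embed(question)
    ranked = sorted(store, key=lambda item: cosine(query, item[0]), reverse=True)
    context = "\n".join(doc for _, doc in ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("What is the refund policy?")
print(prompt)
```

The augmented prompt, rather than the bare question, is what would be sent to the LLM. Updating the model’s knowledge then only requires adding or replacing entries in the store.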

Long-Term Memory for LLMs

Vector databases help rectify one of the most significant drawbacks of the current generation of LLMs: the fact that they have no long-term memory. By default, LLMs are stateless and can only access the information contained within their limited context window. To retain previous context, the entire conversation history has to be re-entered in each subsequent prompt – which uses up the limited context window, increases the cost of generating answers, and increases response times.

A vector database solves this problem by allowing an LLM to save past queries and their associated responses. The database can then embed subsequent queries, compare them to those already in its index, and retrieve relevant information to provide the most comprehensive response. This is highly useful for mitigating context length limitations, which can cause an LLM to hallucinate if relevant information falls outside its context window (or in the case of models with extremely long context windows, even if that information falls in the middle of such a window).
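A rough sketch of this memory pattern is shown below. The class `ConversationMemory`, its methods, and the two-dimensional embeddings are hypothetical; in practice the embeddings would come from an embedding model and the entries would live in a vector database rather than a Python list.

```python
import math

class ConversationMemory:
    """Stores past queries and responses alongside their embeddings."""

    def __init__(self):
        self._entries = []  # (embedding, query, response)

    def remember(self, embedding, query, response):
        """Save a past exchange so it can be recalled later."""
        self._entries.append((embedding, query, response))

    def recall(self, embedding, k=1):
        """Return the k past exchanges closest to the new query embedding."""
        ranked = sorted(self._entries, key=lambda e: math.dist(embedding, e[0]))
        return [(query, response) for _, query, response in ranked[:k]]

memory = ConversationMemory()
# Hypothetical embeddings; a real system would compute these with a model.
memory.remember([0.9, 0.1], "What is my order number?", "Your order number is 1234.")
memory.remember([0.1, 0.9], "Which colours are available?", "Blue and green.")

# A new query whose embedding lies close to the first exchange.
recalled = memory.recall([0.8, 0.2], k=1)
print(recalled)
```

Only the recalled exchanges need to be added to the next prompt, instead of the full conversation history.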

Caching

Similar to their ability to provide long-term memory for LLMs, vector databases can be used to provide a shorter-term, high-speed cache. This can be achieved by setting up a smaller vector database for the most recent or frequent queries, which maintains a smaller index to compare query embeddings against.
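One way to picture such a cache is the sketch below, which returns a stored response only when a new query’s embedding is sufficiently similar to a cached one. The class `SemanticCache`, the 0.95 threshold, and the sample embeddings are assumptions chosen for illustration, not values from any particular product.

```python
import math

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

class SemanticCache:
    """Small cache keyed by embedding similarity rather than exact match."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed cut-off for a cache hit
        self._entries = []          # (embedding, response)

    def put(self, embedding, response):
        self._entries.append((embedding, response))

    def get(self, embedding):
        """Return the best cached response at or above the threshold, else None."""
        best_score, best_response = self.threshold, None
        for stored, response in self._entries:
            score = cosine(embedding, stored)
            if score >= best_score:
                best_score, best_response = score, response
        return best_response

cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")

hit = cache.get([0.99, 0.05])  # nearly the same direction: cache hit
miss = cache.get([0.0, 1.0])   # unrelated query: cache miss
print(hit, miss)
```

On a miss, the application would fall back to the full pipeline and then store the new response in the cache.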

GenAI Application Prototyping

Developing new GenAI applications, or adapting existing applications to fit specific use cases, typically requires several stages of prototyping to test new ideas and functionality. Vector databases can assist in streamlining this process in a number of ways:

Automatic Data Vectorization: they allow you to quickly create the embeddings required to train or fine-tune models.
Efficient Data Retrieval: as they are designed to retrieve data quickly, vector databases provide quick access to the data required during a GenAI model’s training stage.
Reduced Data Management: vector databases are capable of handling a variety of high-dimensional data types, reducing the need to pre-process data before its input into GenAI models. This reduces the data management overhead for GenAI developers and researchers, which allows them to concentrate their efforts on their models’ functionality and performance and accelerate iteration cycles.

How to Choose a Vector Database

Let’s turn our attention to what you need to consider when deciding which vector database to use.

Types of Vector Databases

Vector databases generally fall into two categories:

Dedicated vector databases
Databases that support vector search, i.e., vector-capable databases

Dedicated vector databases are generally more efficient because they’re optimized to handle embeddings. Prominent examples of dedicated vector databases include Pinecone, Weaviate, and ChromaDB.

Vector-capable databases are those originally developed to store various types of data, and later extended to support vector embeddings. Examples include:

MongoDB (Atlas Vector Search)
Elasticsearch (kNN search)
PostgreSQL (pgvector)
Redis
Neo4j
SingleStore

Performance Metrics

There are three metrics used to directly evaluate the capabilities of a vector database:

Queries Per Second (QPS): the number of queries the database can process per second.
Query Latency: the time it takes to execute a query and receive a response. The most commonly expressed latency metrics are P95 (95th percentile) and P99 (99th percentile), meaning that 95% or 99% of queries receive a response within the P95 or P99 value respectively.
Recall: the proportion of a query’s true n nearest neighbors that are actually returned, i.e., how accurately the database retrieves the most similar results for a given query.
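As a rough illustration of how the latency and recall metrics above are calculated, the sketch below computes P95 and P99 with the nearest-rank method and recall as the overlap between returned and true neighbours. The latency values and result lists are made-up sample data, and the function names are hypothetical.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct% of all values are less than or equal to it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def recall_at_k(retrieved, true_neighbours):
    """Fraction of the true nearest neighbours present in the results."""
    truth = set(true_neighbours)
    return len(truth & set(retrieved)) / len(truth)

# Made-up sample: 100 queries with latencies of 1 ms, 2 ms, ..., 100 ms.
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Made-up sample: the index returned "x" instead of the true neighbour "c".
recall = recall_at_k(["a", "b", "x"], ["a", "b", "c"])

print(p95, p99, recall)
```

Here two of the three true neighbours were returned, giving a recall of roughly 0.67, which reflects the accuracy trade-off that approximate indexes make for speed.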

Additional Criteria

Other aspects to consider when selecting a vector database include:

Open-Source: whether the vector database has an open-source license (Pinecone is notable among dedicated vector databases for being closed-source).
Ease of Local Usage: can the vector database be implemented on-prem as well as in a cloud environment?
Integration with IT Infrastructure: how well does a prospective solution fit into your existing IT ecosystem?
Managed Cloud: is it available as a managed cloud solution?
User Interface: does it have an intuitive user interface?
Fundraising: while not a direct indication of a database’s current capabilities, the amount of investment that it has managed to attract could be indicative of its potential, e.g., the amount of resources available for implementing new features, releasing regular updates, conducting research, etc.
Cost: how much the database costs to use (often expressed as cost per 100k vectors).

Conclusion

The ability to handle and process complex, high-dimensional data makes vector databases an indispensable part of the future of generative AI applications. And just as we’ve seen GenAI applications take significant strides in recent years, the increased interest that vector databases have received ensures that they will become more performant – further enhancing their utility and the number of use cases they can be applied to.

Kartik Talamadupula

Director of AI Research

Kartik Talamadupula is a research scientist who has spent over a decade applying AI techniques to business problems in automation, human-AI collaboration, and NLP.