Vector Databases

What is a Vector Database?

A Vector DB (VDB) stores high-dimensional tensors (arrays of floats) plus metadata. It is used in AI/ML workloads to store and retrieve embeddings.
Unlike SQL and NoSQL databases, which look records up by exact match, a Vector Database works on the concept of similarity search.

How is data stored in a Vector Database?

Let's say we have 3 log entries.
1. The Raw Data


Log_A: "John uploaded financial report to Gmail."
Log_B: "Sarah downloaded malicious payload from unknown domain."
Log_C: "Service account accessed internal S3 bucket."

2. The Transformation (Neural Net Forward Pass)
You pass these through an embedding model. Let's pretend the output is a 3-dimensional tensor (real embedding models typically output far more dimensions, e.g. 768 or 1536).


Log_A → Tensor A: [0.95, 0.20, 0.05]
Log_B → Tensor B: [0.10, 0.90, 0.85]
Log_C → Tensor C: [0.80, 0.15, 0.10]

3. Storage inside the Vector DB
Inside the VDB, the data is not organized as rows in a table. With an HNSW index, each vector is stored as a node in a graph.
The VDB draws edges between nearby vectors. Tensor A [0.95, 0.20, 0.05] and Tensor C [0.80, 0.15, 0.10] are mathematically close, so they become neighbors in the graph. Tensor B [0.10, 0.90, 0.85] is far away, on the other side of the graph.
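
To make "mathematically close" concrete, here is a minimal sketch that compares the three toy tensors from step 2. It assumes cosine similarity as the metric, which is one common choice; a real VDB may be configured for dot product or Euclidean distance instead.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

tensor_a = [0.95, 0.20, 0.05]  # Log_A
tensor_b = [0.10, 0.90, 0.85]  # Log_B
tensor_c = [0.80, 0.15, 0.10]  # Log_C

sim_ac = cosine_similarity(tensor_a, tensor_c)
sim_ab = cosine_similarity(tensor_a, tensor_b)

print(f"A vs C: {sim_ac:.3f}")  # close to 1, so A and C become neighbors
print(f"A vs B: {sim_ab:.3f}")  # much lower, so B sits far from A
```

A and C score close to 1 because both have most of their weight in the first dimension, while B has its weight in the other two.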


# Insert data into the vector DB
# (embedding_model and index are assumed to be initialized already)
# You compute the tensor first
vector = embedding_model.encode("John uploaded...")
# Then you call the API to insert: one (id, vector, metadata) tuple per record
index.upsert(
    vectors=[("chunk_42", vector, {"user": "John", "timestamp": "..."})]
)

The stored record looks roughly like this (the // comments are annotations, not valid JSON):
{
  "id": "chunk_42",
  "vector": [0.95, 0.20, 0.05],   // The actual Tensor
  "metadata": {                    // The payload (this IS like NoSQL/SQL)
      "source_log": "proxy_server_01",
      "timestamp": "2026-06-19T14:03:00Z",
      "user": "John",
      "action": "BLOCKED"
  },
  "text": "John uploaded financial report to Gmail." // Original text for LLM
}

Why do VDBs batch insertions?

SQL: When SQL receives an INSERT, it appends the row to disk and updates a B-Tree (cheap).
VDB: When the VDB receives an upsert, it has to update the graph index (HNSW). It inserts the new vector into the graph, searches for its nearest neighbors, and draws new edges to them. This is computationally heavier than a B-Tree update, which is why VDB insertions are often batched.
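
A minimal sketch of client-side batching, assuming an upsert call shaped like the one shown earlier. `InMemoryIndex` is a hypothetical stand-in that only counts calls, not a real VDB client, and the batch size of 100 is an arbitrary illustration.

```python
from itertools import islice

class InMemoryIndex:
    """Hypothetical stand-in for a VDB client; it only records upsert calls."""

    def __init__(self):
        self.records = {}
        self.upsert_calls = 0

    def upsert(self, vectors):
        # A real client would send one request and update its index here.
        self.upsert_calls += 1
        for record_id, vector, metadata in vectors:
            self.records[record_id] = (vector, metadata)

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

index = InMemoryIndex()
records = [
    (f"chunk_{i}", [0.1 * i, 0.2, 0.3], {"user": "John"})
    for i in range(250)
]

# 250 records in batches of 100 means 3 upsert calls instead of 250.
for batch in batched(records, batch_size=100):
    index.upsert(vectors=batch)

print(index.upsert_calls, len(index.records))
```

Sending records in groups amortizes the per-call overhead; the right batch size depends on the specific VDB and its request limits.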

Why is a VDB needed when SQL/NoSQL already exist?

Imagine you have 1 million security logs in your SQL database.
A user asks: "Find logs that look semantically like this new threat: 'exfiltration via encrypted tunnel'."

Using SQL:


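SELECT * FROM logs WHERE message LIKE '%exfiltration%' OR message LIKE '%encrypted%';   -- logs and message are placeholder names
-> No results. Logs that describe the same threat in other words (e.g. "Data Leak" or "TLS bypass") do not contain the literal keywords, so keyword matching misses them.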

OR

Pull all 1 million tensors out of storage into RAM.
Compute a dot product between the query tensor and each of the 1 million tensors.
O(n) work for every query.    // Too slow
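
The brute-force approach can be sketched as follows, using the three toy tensors from earlier. The query vector is a made-up embedding of the threat description, chosen only for illustration.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query: n dot products per query."""
    scored = ((dot(query, vec), vec_id) for vec_id, vec in vectors.items())
    return heapq.nlargest(k, scored)

stored = {
    "Log_A": [0.95, 0.20, 0.05],
    "Log_B": [0.10, 0.90, 0.85],
    "Log_C": [0.80, 0.15, 0.10],
}
query = [0.15, 0.85, 0.90]  # hypothetical embedding of the new threat

top_matches = brute_force_top_k(query, stored, k=2)
for score, vec_id in top_matches:
    print(vec_id, round(score, 3))
```

With 3 vectors this is instant, but the loop touches every stored vector, so the cost grows linearly with the number of logs.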

Using VDB:
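- It has pre-built the HNSW graph.
- It starts at a fixed entry point in the top layer of the graph and greedily hops toward the query tensor, descending one layer at a time.
- It only calculates ~100 dot products instead of 1 million, which is roughly O(log n) complexity. It finds the approximate Top-5 matches in ~10 milliseconds (illustrative figures; actual numbers depend on the index settings and hardware).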
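
The greedy hop can be sketched on a tiny hand-built graph. This is a deliberately simplified, single-layer illustration and not a full HNSW implementation: real HNSW builds its edges automatically, uses several layers, and tracks a list of candidates rather than a single current node. Log_D, Log_E, Log_F, the edges, and the query vector are hypothetical additions for the example.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def greedy_search(query, vectors, neighbors, entry_point):
    """Hop to the most similar neighbor until no neighbor beats the current node."""
    scores = {}  # cache, so each node is scored at most once

    def score(node_id):
        if node_id not in scores:
            scores[node_id] = cosine_similarity(query, vectors[node_id])
        return scores[node_id]

    current = entry_point
    while True:
        best = max(neighbors[current], key=score)
        if score(best) <= score(current):
            return current, len(scores)  # local best and number of nodes scored
        current = best

vectors = {
    "Log_A": [0.95, 0.20, 0.05],
    "Log_B": [0.10, 0.90, 0.85],
    "Log_C": [0.80, 0.15, 0.10],
    "Log_D": [0.50, 0.50, 0.40],  # hypothetical
    "Log_E": [0.20, 0.80, 0.70],  # hypothetical
    "Log_F": [0.90, 0.10, 0.20],  # hypothetical
}
# Hand-built edges; a real index derives these from nearest-neighbor searches.
neighbors = {
    "Log_A": ["Log_C", "Log_D"],
    "Log_C": ["Log_A", "Log_D", "Log_F"],
    "Log_D": ["Log_A", "Log_C", "Log_E"],
    "Log_E": ["Log_D", "Log_B"],
    "Log_B": ["Log_E"],
    "Log_F": ["Log_C"],
}
query = [0.15, 0.85, 0.90]  # hypothetical embedding of the new threat

match, nodes_scored = greedy_search(query, vectors, neighbors, entry_point="Log_A")
print(match, nodes_scored)
```

Starting from Log_A, the search walks A → D → E → B and never scores Log_F. On a graph this small the saving is negligible; the point is that the search only scores nodes along its path, which is what keeps the work far below n on large indexes. A purely greedy walk can also stop at a local best, which is one reason the results are approximate.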