LangChain Integration

arango-typed provides seamless integration with LangChain, enabling you to build powerful RAG (Retrieval-Augmented Generation) systems, vector stores, and AI applications using ArangoDB as your vector database backend.

What is LangChain Integration?

LangChain is a framework for developing applications powered by language models. arango-typed's LangChain integration provides:

  • VectorStore Implementation: Compatible with LangChain's VectorStore interface for storing and retrieving embeddings
  • RAG Support: Built-in RAG (Retrieval-Augmented Generation) implementation for context-aware AI applications
  • MCP Support: Model Context Protocol implementation for unified LLM interactions
  • Retriever Interface: LangChain-compatible retrievers for use in chains

Prerequisites

Before using LangChain integration, ensure you have:

  • arango-typed and arangojs installed
  • An embeddings provider (OpenAI, HuggingFace, local models, etc.)
  • LangChain core packages installed
  • ArangoDB instance running and connected

# Core packages
npm install arango-typed arangojs
npm install @langchain/core @langchain/textsplitters

# Optional: For OpenAI embeddings
npm install @langchain/openai

# Optional: For other embedding providers
npm install @langchain/community

Installation and Setup

First, connect to your ArangoDB database:

import { connect, getDatabase } from 'arango-typed';

// Connect to ArangoDB
await connect({
  url: 'http://localhost:8529',
  database: 'myapp',
  username: 'root',
  password: ''
});

const db = getDatabase();

ArangoLangChainStore - VectorStore Implementation

ArangoLangChainStore is a LangChain-compatible VectorStore implementation that uses ArangoDB for storing and retrieving document embeddings.

Creating a VectorStore

You can create a vector store in several ways:

import { ArangoLangChainStore } from 'arango-typed/integrations/langchain';
import { OpenAIEmbeddings } from '@langchain/openai';
import { getDatabase } from 'arango-typed';

const db = getDatabase();

// Option 1: Create from texts
const store = await ArangoLangChainStore.fromTexts(
  ['Document 1 text', 'Document 2 text'],
  [{ source: 'doc1' }, { source: 'doc2' }],
  new OpenAIEmbeddings({ openAIApiKey: 'your-key' }),
  { database: db, collectionName: 'documents' }
);

// Option 2: Create from documents
const documents = [
  { pageContent: 'Document 1', metadata: { source: 'doc1' } },
  { pageContent: 'Document 2', metadata: { source: 'doc2' } }
];
const store2 = await ArangoLangChainStore.fromDocuments(
  documents,
  new OpenAIEmbeddings({ openAIApiKey: 'your-key' }),
  { database: db, collectionName: 'documents' }
);

// Option 3: Create with custom options
const store3 = new ArangoLangChainStore(
  new OpenAIEmbeddings({ openAIApiKey: 'your-key' }),
  { database: db, collectionName: 'documents' },
  {
    vectorField: 'embedding',  // Field name for embeddings (default: 'embedding')
    textField: 'text',         // Field name for text content (default: 'text')
    metadataFields: ['source', 'author'] // Fields to preserve as metadata
  }
);

Adding Documents

Add documents to the vector store:

// Add documents (embeddings generated automatically)
const ids = await store.addDocuments([
  { pageContent: 'New document text', metadata: { source: 'new-doc' } }
]);

// Add vectors directly (if you already have embeddings)
const vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]];
const ids2 = await store.addVectors(vectors, [
  { pageContent: 'Document 1', metadata: {} },
  { pageContent: 'Document 2', metadata: {} }
]);

Similarity Search

Search for similar documents:

// Basic similarity search
const results = await store.similaritySearch('query text', 5);

// Similarity search with scores
const resultsWithScores = await store.similaritySearchWithScore('query text', 5);

// Similarity search with metadata filtering
const filteredResults = await store.similaritySearch(
  'query text',
  5,
  { source: 'doc1', category: 'tech' }
);

Using with Models

You can use arango-typed Models with the vector store for automatic validation and hooks:

import { model, Schema } from 'arango-typed';

const DocumentSchema = new Schema({
  text: { type: String, required: true },
  embedding: { type: Array, required: true },
  source: String,
  metadata: Object
});

const Document = model('documents', DocumentSchema);

const storeWithModel = new ArangoLangChainStore(
  new OpenAIEmbeddings({ openAIApiKey: 'your-key' }),
  { 
    database: db, 
    collectionName: 'documents',
    model: Document  // Use model for automatic validation
  }
);

ArangoRAG - RAG Implementation

ArangoRAG provides a complete RAG (Retrieval-Augmented Generation) implementation with support for reranking, hybrid search, and metadata filtering.

Creating a RAG Instance

import { ArangoRAG } from 'arango-typed/integrations/langchain';
import { OpenAIEmbeddings } from '@langchain/openai';

const rag = new ArangoRAG(
  new OpenAIEmbeddings({ openAIApiKey: 'your-key' }),
  db,
  {
    collectionName: 'documents',
    vectorField: 'embedding',
    textField: 'text',
    topK: 5,                    // Number of documents to retrieve
    scoreThreshold: 0.7,        // Minimum similarity score
    reranker: async (docs) => { // Optional reranker function
      // Custom reranking logic: return the documents in the order you
      // want them passed to the LLM (see the Reranking section below)
      return docs;
    }
  }
);

Retrieving Documents

// Basic retrieval
const documents = await rag.retrieve('user query');

// Retrieval with metadata filtering
const filteredDocs = await rag.retrieveWithMetadata(
  'user query',
  { category: 'tech', author: 'John' }
);

// Hybrid search (vector + keyword)
const hybridResults = await rag.hybridRetrieve(
  'user query',
  'keyword search terms',
  { category: 'tech' }
);

Creating a Retriever for LangChain Chains

// Create a retriever
const retriever = rag.createRetriever({ category: 'tech' });

// Use in a LangChain chain: the retriever fills in {context} before the
// prompt and model run
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { RunnableSequence } from '@langchain/core/runnables';
import { ChatOpenAI } from '@langchain/openai';

const prompt = ChatPromptTemplate.fromTemplate(`
Use the following context to answer the question:
{context}

Question: {question}
`);

const chain = RunnableSequence.from([
  {
    // Retrieve documents for the incoming question and join them into
    // a single context string for the prompt
    context: async (input: { question: string }) => {
      const docs = await retriever.getRelevantDocuments(input.question);
      return docs.map((d) => d.pageContent).join('\n\n');
    },
    question: (input: { question: string }) => input.question
  },
  prompt,
  new ChatOpenAI({ modelName: 'gpt-4' })
]);

const result = await chain.invoke({ question: 'What is TypeScript?' });

ArangoMCP - Model Context Protocol

ArangoMCP provides a unified interface for LLMs to interact with ArangoDB, combining vector search and graph capabilities.

Creating an MCP Instance

import { ArangoMCP } from 'arango-typed/integrations/langchain';

// Basic MCP (vector search only)
const mcp = new ArangoMCP(db);

// MCP with graph support
const mcpWithGraph = new ArangoMCP(db, 'myGraph');

Getting Context

// Get context with embeddings
const context = await mcp.getContext({
  query: 'user question',
  embeddings: [0.1, 0.2, 0.3, ...], // Query embeddings
  metadata: { category: 'tech' },
  graphTraversal: true  // Enable graph context if graph is configured
});

// Get context with graph paths
const pathContext = await mcpWithGraph.getContextWithPaths(
  'users/123',      // Start vertex
  'users/456',      // End vertex (optional)
  3                 // Max depth
);

Storing Context

// Store single context
const docId = await mcp.storeContext(
  'context_collection',
  'document text',
  [0.1, 0.2, 0.3, ...], // Embeddings
  { source: 'doc1', category: 'tech' }
);

// Batch store contexts
const docIds = await mcp.storeContexts(
  'context_collection',
  [
    { text: 'Doc 1', embeddings: [0.1, 0.2], metadata: { source: 'doc1' } },
    { text: 'Doc 2', embeddings: [0.3, 0.4], metadata: { source: 'doc2' } }
  ]
);

// Update context
await mcp.updateContext('context_collection', docId, {
  updatedAt: new Date(),
  category: 'updated-category'
});

// Delete context
await mcp.deleteContext('context_collection', docId);

Complete RAG Example

Here's a complete example of building a RAG system:

import { connect, getDatabase, model, Schema } from 'arango-typed';
import { ArangoRAG, ArangoLangChainStore } from 'arango-typed/integrations/langchain';
import { OpenAIEmbeddings, ChatOpenAI } from '@langchain/openai';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

// 1. Connect to database
await connect({
  url: 'http://localhost:8529',
  database: 'rag_app',
  username: 'root',
  password: ''
});

const db = getDatabase();

// 2. Create embeddings instance
const embeddings = new OpenAIEmbeddings({ openAIApiKey: process.env.OPENAI_API_KEY });

// 3. Create document schema
const DocumentSchema = new Schema({
  text: { type: String, required: true },
  embedding: { type: Array, required: true },
  source: String,
  metadata: Object,
  createdAt: { type: Date, default: Date.now }
});

const Document = model('documents', DocumentSchema);

// 4. Index documents
async function indexDocuments(texts: string[], metadata: Record<string, any>[]) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 150
  });

  // Split each text and carry its source metadata onto every chunk it produces
  const documents: { pageContent: string; metadata: Record<string, any> }[] = [];
  for (let i = 0; i < texts.length; i++) {
    const chunks = await splitter.splitText(texts[i]);
    for (const chunk of chunks) {
      documents.push({ pageContent: chunk, metadata: metadata[i] || {} });
    }
  }

  const store = await ArangoLangChainStore.fromDocuments(
    documents,
    embeddings,
    { database: db, collectionName: 'documents', model: Document }
  );

  return store;
}

// 5. Create RAG instance
const rag = new ArangoRAG(embeddings, db, {
  collectionName: 'documents',
  topK: 5,
  scoreThreshold: 0.7
});

// 6. Create LangChain chain
const prompt = ChatPromptTemplate.fromTemplate(`
You are a helpful assistant. Use the following context to answer the question.
If you don't know the answer, say so.

Context:
{context}

Question: {question}

Answer:
`);

const llm = new ChatOpenAI({ 
  modelName: 'gpt-4',
  temperature: 0.7 
});

const retriever = rag.createRetriever();

// 7. Query function
async function askQuestion(question: string) {
  // Retrieve relevant documents
  const docs = await retriever.getRelevantDocuments(question);
  const context = docs.map(d => d.pageContent).join('\n\n');

  // Generate answer
  const result = await prompt.pipe(llm).invoke({ context, question });
  return result.content;
}

// Usage
const answer = await askQuestion('What is TypeScript?');
console.log(answer);

Advanced Usage

Custom Embeddings Provider

You can use any embeddings provider that implements the LangChainEmbeddings interface:

interface LangChainEmbeddings {
  embedDocuments(texts: string[]): Promise<number[][]>;
  embedQuery(text: string): Promise<number[]>;
}

// Example: Custom embeddings
class CustomEmbeddings implements LangChainEmbeddings {
  async embedDocuments(texts: string[]): Promise<number[][]> {
    // Your embedding logic
    return texts.map(text => this.embed(text));
  }

  async embedQuery(text: string): Promise<number[]> {
    return this.embed(text);
  }

  private embed(text: string): number[] {
    // Your embedding implementation
    return [];
  }
}

const store = new ArangoLangChainStore(
  new CustomEmbeddings(),
  { database: db, collectionName: 'documents' }
);

Reranking

Implement custom reranking for better relevance:

const rag = new ArangoRAG(embeddings, db, {
  collectionName: 'documents',
  topK: 10,  // Retrieve more initially
  reranker: async (docs) => {
    // Custom reranking logic
    // Example: Boost documents with certain metadata
    return docs.sort((a, b) => {
      const scoreA = a.metadata.priority || 0;
      const scoreB = b.metadata.priority || 0;
      return scoreB - scoreA;
    }).slice(0, 5);  // Return top 5 after reranking
  }
});

Multi-Tenancy Support

Combine with multi-tenancy for isolated document storage:

import { tenantMiddleware } from 'arango-typed/integrations/express';

// Enable multi-tenancy
app.use(tenantMiddleware({ extractFrom: 'header' }));

const Document = model('documents', DocumentSchema, { tenantEnabled: true });

// Documents are automatically filtered by tenant
const store = new ArangoLangChainStore(
  embeddings,
  { database: db, collectionName: 'documents', model: Document }
);

// Retrieval automatically respects tenant context
const docs = await store.similaritySearch('query', 5);

Best Practices

  • Chunking Strategy: Use 800-1200 tokens per chunk with 10-20% overlap for optimal retrieval quality
  • Metadata Filtering: Add rich metadata (source, category, date, etc.) and use filters during retrieval for better relevance
  • Hybrid Search: Use hybridRetrieve when keyword precision is important alongside semantic similarity
  • Score Thresholds: Set appropriate scoreThreshold values to filter out low-relevance results
  • Reranking: Implement custom reranking for domain-specific relevance improvements
  • Performance: Enable precomputed magnitudes for vector search, index metadata fields, and use connection pooling
  • Freshness: Maintain updatedAt fields and periodically re-embed changed content
  • Error Handling: Always handle embedding generation failures and database connection errors gracefully; a minimal sketch follows this list
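
As a sketch of the error-handling advice above (assuming the rag instance created earlier in this guide), wrap retrieval calls in try/catch and fall back to an empty result rather than crashing the request:

// Minimal defensive wrapper around retrieval; embedding-provider and
// database errors both surface here
async function safeRetrieve(query: string) {
  try {
    return await rag.retrieve(query);
  } catch (err) {
    console.error('Retrieval failed:', err);
    return [];  // Degrade gracefully with an empty context
  }
}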

Common Use Cases

  • Document Q&A Systems: Build question-answering systems over large document collections
  • Code Search: Semantic search over codebases for finding similar code patterns
  • Customer Support: RAG-powered chatbots with knowledge base retrieval
  • Research Assistants: Academic paper search and summarization systems
  • Content Recommendations: Similar content discovery based on semantic similarity

Troubleshooting

Common Issues

  • Embedding Dimension Mismatch: Ensure all embeddings use the same dimension (e.g., 1536 for OpenAI); see the dimension-check sketch after this list
  • Collection Not Found: Make sure the collection exists or enable auto-creation in connection options
  • Low Retrieval Quality: Try adjusting chunk size, overlap, or implementing reranking
  • Performance Issues: Enable precomputed magnitudes, add indexes on metadata fields, and use connection pooling
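
To catch dimension mismatches early, validate vectors before inserting them. A minimal sketch, assuming the store and embeddings instances from earlier; EXPECTED_DIM is a placeholder for your provider's dimension:

// EXPECTED_DIM depends on your embeddings provider (1536 for OpenAI)
const EXPECTED_DIM = 1536;

function assertDimension(vectors: number[][]) {
  for (const v of vectors) {
    if (v.length !== EXPECTED_DIM) {
      throw new Error(`Embedding dimension ${v.length}, expected ${EXPECTED_DIM}`);
    }
  }
}

// Validate before handing vectors to the store
const vectors = await embeddings.embedDocuments(['some text']);
assertDimension(vectors);
await store.addVectors(vectors, [{ pageContent: 'some text', metadata: {} }]);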

API Reference

For detailed API reference, see LangChain Module API Reference.