RAG Enriched Prompt with SQL Table Data

RAG Enriched Prompt with SQL Table Data

Introduction

Recently, we've been working on several GenAI use cases and identified a need for parsing and storing relevant information from arbitrary description inputs (initially, this involves text, but it could extend to images or files in the future). In essence, we must process unstructured text provided by the user, retrieve custom business domain data from PostgreSQL, and map fields from the input to a JSON request payload - the payload must conform to the data schema to be compatible with the existing REST API on our platform.

Consider a hypothetical example where you receive user input describing a Video entity in natural language. This might include various attributes such as `title`, `frame_rate`, `file_size`, `duration`, etc. Let’s assume that the user will be sure to provide the fields marked as required in the backend, but they can use any phrasing they want. The task is to correctly parse the necessary information from the input and construct a request payload according to your latest backend schema.

We consider using a Foundation LLM Model (FM model); The attributes described so far, broadly speaking, can be picked up by the LLM and matched quite easily against the associated DB tables if fronted by readily available tooling such as RAG pipelines (we will get to these in more detail soon), and NLP2SQL packages.

To make things slightly more interesting, someone also asks you to add the ability to categorise the videos against a set of predefined tags / labels, for instance movie genres. There could be quite a few of these, say around 50 to start (this is also called Cardinality ⇒ higher number of distinct possibilities lead to higher cardinality, refer to an interesting post on this for more info). Let’s also assume for now that we don’t need to perform Semantic Searching, and that user input has high degree of similarity to the labels stored in the backend.

So all in all, the final create request for such a Video entity would at a very simple level look something like:

$ curl -X POST <http://example.com/api/v1/video> --json
{
   "title": "Independence Day",
   "duration": "2:25",
   "file_size": "12.4G",
   "file_format": "mkv",
   "frame_rate": 24,
   "tags": [ 
      {
         "id":1,
         "name": "<Genre_1>"
      },
      {
         "id":2,
         "name":"<Genre_2>"
      },
      {
         "id":3,
         "name":"<Genre_3>"
      },
      ...
   ]

}

One naive way to implement that is to pass the array of possible movie genres to the LLM via the system prompt, and that might be a good enough place to start testing on. But the solution will soon fall apart as these labels can be created, modified and deleted at source on a regular basis. It also doesn’t scale; larger number of possible options increase the prompt size which slows down the LLM processing and for smaller models might even exceed their context size eventually.

Implementation

We therefore need to decouple our NLP processing layer from the database schemas and their evolution, in other words we’re looking at a pretty good usecase for a simple RAG pipeline - we lookup the possible fields from the database and pass as enriching context to the LLM.

This is the first pass at this, we cover at the end how this could be further enhanced using Vector Embeddings to perform more optimized lookups and also achieve semantic searching to better take into account the user’s intent.

Now let’s diver deeper!

Tooling and infra:

  • Infrastructure provider: AWS Bedrock
  • Programming Language: Python 3.10
  • Foundation LLM: Anthropic Claude 3 Sonnet
  • LLM Orchestrator: LangChain
  • Database: Postgres
# pre-req dependencies
!pip install langchain-community boto3 langchain typing langchain_core

For this particular usecase, the input processing happens in an application that acted as almost a gateway and was separate from the backend processing API. We therefore had to create high level objects corresponding to the DTO’s first. In your case this can potentially be avoided if you put the NLP parsing logic within the same service that handles your API requests.

That was done as follows. We used Field to achieve some light validation of the attribute types.

from langchain_core.pydantic_v1 import BaseModel, Field
from typing import List, Optional

class Tag(BaseModel):
    id: Optional[int] = Field(description="Numeric ID of the tag")
    name: Optional[str] = Field(description="Name of a tag, only include alphabetical characters")

class Video(BaseModel):
    title: str = Field(description="Title of the Video")
    duration: str = Field(description="Duration of the video")
    file_size: int = Field(description="The size in Gigabytes of the video")
    file_format: str = Field(description="The file format extension must be 3 letters at most")
    frame_rate: int = Field(description="the framerate of the video, can be 24,48,60,120")
    tags: List[Tag] = Field(description="List of associated tags that describe the video")

Next we set up our LLM interaction handler as well as an orchestrator to help making the code more portable and production-ready. As mentioned above we’ve used Bedrock and LangChain but you can use plenty of other available technologies for each function (for instance LlamaIndex is another great orchestration platform).

LangChain provides components called Loaders that allow you to abstractly represent any document and load them for indexing. We found lots of different tutorials such as on PDF, YouTube, Video and Text loaders but took us slightly longer to come across a SQL one that could load rows based on specific queries ⇒ thankfully SQLDatabaseLoader and SQLDatabase from langchain_community came to the rescue!

from langchain_community.document_loaders.sql_database import SQLDatabaseLoader
from langchain_community.utilities import SQLDatabase
from langchain_community.chat_models import BedrockChat
import boto3

bedrock_client = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1",
    aws_access_key_id="<YOUR_ACCESS_KEY>",
    aws_secret_access_key="<YOUR_SECRET_KEY"
)

model_kwargs =  {
    "max_tokens": 10000,
    "temperature": 0.0,
    "top_k": 250,
}

chat = BedrockChat(
    client=bedrock_client,
    model_id="anthropic.claude-3-sonnet-20240229-v1:0",
    model_kwargs=model_kwargs,
)

db = SQLDatabase.from_uri( "postgresql+psycopg2://<db_conn_string>" )
tagLoader = SQLDatabaseLoader(
    query="SELECT id,name FROM tags",
    db= db
)

tagDocuments = tagLoader.load()
> print(tagDocuments)

[
    Document(page_content='id: 1\\nname: Science Fiction'),
    Document(page_content='id: 2\\nname: Alien Invasion'),
    Document(page_content='id: 3\\nname: Extraterrestrial Threat'),
    Document(page_content='id: 4\\nname: Action-Packed'),
    Document(page_content='id: 5\\nname: Patriotic'),
    Document(page_content='id: 6\\nname: Special Effects Spectacle'),
    Document(page_content='id: 7\\nname: Will Smith'),
    Document(page_content='id: 8\\nname: Bill Pullman'),
    Document(page_content='id: 9\\nname: Jeff Goldblum'),
    Document(page_content='id: 10\\nname: Saving Humanity'),
    Document(page_content='id: 11\\nname: Drama'),
    Document(page_content='id: 12\\nname: Shakespeare'),
    Document(page_content='id: 13\\nname: Comedy'),
    Document(page_content='id: 14\\nname: Horror'),
    
    ...

]

Next comes the formatting piece, we have to give the LLM instructions on how to format the output. The LangChain gods have provided us with a PydanticOutputParser that allows us to output the format of our Entity we defined previously so we can easily feed it into the prompt!

from langchain.output_parsers import PydanticOutputParser

pydantic_parser = PydanticOutputParser(pydantic_object=Video)
format_instructions = pydantic_parser.get_format_instructions()

And putting it all together in a nice PromptTemplate

from langchain.prompts import PromptTemplate

TEMPLATE = """
    Your goal is to understand and parse a video description in text to JSON,
    based on the formatting instructions.
    You are also tasked with extracting the various tags found in the
    description, match ONLY to the tags provided in the list,
    return only those you find with high degree of certainty.
    Do not try to guess, if you cannot match with a high degree of certainty,
    leave the field null
    The output should be a markdown code snippet formatted in the
    following schema, including the leading and trailing
    "```json" and "```":

    <format_instructions>
    {format_instructions}
    </format_instructions>

    Video Description:
    <description>
    {description}
    </description>

    <Tags>
    {tags}
    </Tags>

    """

prompt = PromptTemplate(
    template=TEMPLATE,
    input_variables=[
        "description"
        "tags",
        "format_instructions",
    ],
)

Now the final piece is the parser, as we’ve outlined, whatever output the LLM spits out, we’d like it to be in JSON Format, I’ve already given instructions on how to do so at the prompt level but that’s not enough, we need to run it through an output parser you can read more about it here, for our case it’s the [JSONOutputParser](<https://python.langchain.com/docs/modules/model_io/output_parsers/types/json/>) .

The job is nearly done, we now chain those interfaces together (which is another way of saying performing a sequence of interactions):

parser = JsonOutputParser(pydantic_object=Video)

chain = prompt | chat | parser

Let’s invoke it against one of the best action movies ever made, Independence Day!! (I may be biased here)

videoDescription = """
"Independence Day" is a 1996 science fiction film directed by Roland Emmerich.
The film follows humanity's struggle against a hostile alien invasion on July 4th, hence the title referencing the American holiday.
The frame rate, file format, and file size would depend on the specific version or release of the movie,
 but it's commonly available in formats like MPEG-4 (MP4) or Matroska (MKV) at a standard frame rate of 24 frames per second (fps).
The duration of the movie is approximately 2 hours and 25 minutes.
"""

response = chain.invoke({"description": videoDescription,
                         "format_instructions": format_instructions,
                         "tags": tagDocuments})
print(response)
{
   "title": "Independence Day",
   "duration": "2 hours and 25 minutes",
   "file_size": "None",
   "file_format": "MPEG-4 (MP4)",
   "frame_rate": 24,
   "tags": [
      {
         "id":1,
         "name":"Science Fiction"
      },
      {
         "id":2,
         "name":"Alien Invasion"
      },
      {
         "id":3,
         "name":"Extraterrestrial Threat"
      },
      {
         "id":4,
         "name":"Action-Packed"
      },
      {
         "id":6,
         "name":"Special Effects Spectacle"
      },
      {
         "id":7,
         "name":"Will Smith"
      },
      {
         "id":9,
         "name":"Jeff Goldblum"
      },
      {
         "id":10,
         "name":"Saving Humanity"
      }
   ]
}

As you can see, the mapping works pretty well already! It captures the more obvious genres and ignores the tags that are not related to the movie.

Future work

Above is a great initial result, but it’s not perfect. For instance the out of the box embeddings used by the LLM is not able to reason about “4th of July” and its possible correlation with the “Patriotic” genre.

So how could things be made better?

👉 Well… the first step would be to actually evaluate properly - It’s beyond the scope of this post to cover those but by now there are plenty of tools out there (including LangChain and Bedrock’s own utilities) that allow developers to measure accuracy and profile things such response / retrieval latency etc. and tweak various parameters to achieve more favourable results. Using orchestrators also makes it easier than ever to swap out the backend LLM and test with different models as they come out.

👉 Tuning the system prompt - better instructions, addressing potential edge cases and providing a good amount of few-shot examples.

👉 Embeddings, vector storage (and semantic search) - we’ve not touched this yet, but it’s an exciting potential next thing to go after. By using either more specialised embeddings such as HF and/or creating our own custom ones, we could enhance search operation performed by the workload and have more reliable matching of fields.

Text embedding models | 🦜️🔗 LangChain
Head to Integrations for documentation on built-in integrations with text embedding model providers.

👉 More complex workloads, multi-modality, Agentic approach - As described in the very beginning, soon we might want to also support parsing descriptions from various file formats (say users prefer to upload a text file instead). At that stage one could look at more complex multi-modal orchestration options as well as potentially usage of LangChain Agents. Agents will add a layer of reasoning on actions to take (compared to user’s predefined chains) but will add additional processing and might therefore be an overkill for use-cases that don’t need it. On the other hand, if your workloads involve taking actions against multiple possible external tools / integrations, then agents are a good option to consider.

How-to guides | 🦜️🔗 LangChain
Here you’ll find answers to “How do I….?” types of questions.

Nonetheless, the technology is quite powerful and flexible. You might run into certain issues regarding dependency compatibility and old documentation, but you will at least have a fun time getting there!

Written by
Daniel Tapia
Kubernetes Master and slave
Comments
Great! You’ve successfully signed up.
Welcome back! You've successfully signed in.
You've successfully subscribed to Binome.
Your link has expired.
Success! Check your email for magic link to sign-in.
Success! Your billing info has been updated.
Your billing was not updated.