LLM sql Agent

 In this tutorial we will be using OpenAI’s gpt-3.5-turbo model for our LLM model and Dataherald’s real_estate for our database. This tutorial will be using postgres as the sql dialect. If you are using a different sql dialect please check out the SQLAlchemy documentation on how to setup your database connection.

Let us first install the required packages. For postgres, make sure you have installed postgreSQL on your machine. To utilize OpenAI’s API, be sure to have an OpenAI account and its API key ready.

Install the following packages:

pip install langchain 
pip install openai
pip install psycopg2

Next create a python file called main.py and import the following:

from langchain.agents import create_sql_agent 
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain.llms.openai import OpenAI
from langchain.agents import AgentExecutor
from langchain.agents.agent_types import AgentType
from langchain.chat_models import ChatOpenAI

2. Connect the database

2. Connect the database

For postgres databases, use the following format string for the database URI and declare the required variables above:

pg_uri = f"postgresql+psycopg2://{username}:{password}@{host}:{port}/{mydatabase}"

Now let’s setup our database connection:

db = SQLDatabase.from_uri(pg_uri)

3. Setup LLM

Since we will be using OpenAI’s gpt-3.5-turbo model, let us use our OpenAI account’s API key:

OPENAI_API_KEY = "your OpenAI key"

Let’s define our LLM model:

gpt = OpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, model_name='gpt-3.5-turbo')

4. Setup Agent

Let us define our agent’s toolkit which will be used to answer the user question:

toolkit = SQLDatabaseToolkit(db=db, llm=gpt)

Then we can create our agent executor with agent type ZERO_SHOT_REACT_DESCRIPTION:

agent_executor = create_sql_agent(
llm=gpt,
toolkit=toolkit,
verbose=True,
agent_type=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
)

For more information on available agent types and their characteristics, please checkout LangChain’s documentation here.

5. Ask a query

Finally, let us define a question and execute the agent with the question:

question = "Average rent in Chicago from Oct 2022 till Dec 2022"
agent_executor.run(question)

Here are the results after we ran our program:

> Entering new  chain...
Action: sql_db_list_tables
Action Input: ""
Observation: redfin_homes_sold, zillow_home_value_index, redfin_median_list_price, redfin_median_ppsf, renthub_neighborhood_listings, zillow_observed_rent_index, zillow_rent, redfin_avg_sale_to_list, renthub_median_rent, zillow_home_value, redfin_inventory, redfin_new_listings, renthub_num_rental_listings, redfin_median_sale_price, redfin_pending_sales, renthub_average_rent
Thought:I can query the tables "renthub_average_rent" and "zillow_rent" to get the average rent price in Chicago. I will query the schema of these tables to determine the relevant columns.
Action: sql_db_schema
Action Input: "renthub_average_rent, zillow_rent"
Observation:
CREATE TABLE renthub_average_rent (
period_start DATE NOT NULL,
period_end DATE NOT NULL,
period_type TEXT NOT NULL,
...
)

/*
3 rows from renthub_average_rent table:
period_start period_end period_type geo_type property_type location_name dh_state_fips dh_state_name dh_state_abbr dh_county_fips dh_county_name dh_county_fullname dh_place_fips dh_place_name dh_place_fullname dh_city_name dh_city_fullname dh_neighborhood_id dh_neighborhood_name dh_zip_code is_seasonally_adjusted metric_value hg_date_updated
2022-02-01 2022-02-28 monthly neighborhood House Far North 48 Texas TX 085 Collin Collin County 19000 Dallas Dallas city Dallas Dallas city 273698 Far North - False 2543.752941176471 2023-05-30 14:57:38.812353
2020-12-01 2020-12-31 monthly neighborhood Apartment Building Aurora Highlands 08 Colorado CO 001 Adams Adams County 04000 Aurora Aurora city Aurora Aurora city 342766 Aurora Highlands - False 1306.857142857143 2023-05-30 14:57:38.812353
2021-12-01 2021-12-31 monthly neighborhood Apartment Building Penrose 51 Virginia VA 013 Arlington Arlington County 03000 Arlington Arlington CDP Arlington Arlington CDP 271765 Penrose - False 2009.9245283018868 2023-05-30 14:57:38.812353
*/


CREATE TABLE zillow_rent (
period_start DATE NOT NULL,
period_end DATE NOT NULL,
period_type TEXT NOT NULL,
...
)

/*
3 rows from zillow_rent table:
period_start period_end period_type geo_type is_seasonally_adjusted property_type location_name location_fullname dh_state_fips dh_state_name dh_state_abbr dh_county_fips dh_county_name dh_county_fullname dh_place_fips dh_place_name dh_place_fullname dh_city_name dh_city_fullname dh_zip_code metric_value hg_date_updated

*/
Thought:I can query the "renthub_average_rent" and "zillow_rent" tables to get the average rent price in Chicago. The relevant columns are "metric_value" in the "renthub_average_rent" table and "metric_value" in the "zillow_rent" table. I will write a query to calculate the average rent price from October 2022 to December 2022 in Chicago.
Action: sql_db_query_checker
Action Input: "SELECT AVG(metric_value) FROM renthub_average_rent WHERE location_name = 'Chicago' AND period_start >= '2022-10-01' AND period_end <= '2022-12-31'"
Observation: The original query does not contain any common mistakes. Here is the reproduced original query:

```sql
SELECT AVG(metric_value)
FROM renthub_average_rent
WHERE location_name = 'Chicago'
AND period_start >= '2022-10-01'
AND period_end <= '2022-12-31';
```
Thought:The query is correct and ready to be executed to get the average rent price in Chicago from October 2022 to December 2022.
Action: sql_db_query
Action Input: "SELECT AVG(metric_value) FROM renthub_average_rent WHERE location_name = 'Chicago' AND period_start >= '2022-10-01' AND period_end <= '2022-12-31'"
Observation: [(2344.4993233104487,)]
Thought:The average rent price in Chicago from October 2022 to December 2022 is $2344.50.
Final Answer: $2344.50

voilà

Discussion

During our tests, we ran multiple questions on both SQLChain and SQLAgent using GPT-3.5 and compared their respective results. Our findings revealed that SQLAgent outperformed SQLChain by answering a greater number of questions. For instance, when asked “Is the average rent price in Chicago increasing from June 2022 till March 2023”, the output we got from SQLChain is:

There doesn't seem to be a table specifically for average rent in Chicago. Since I don't have access to the relevant data, I don't know the answer to the question.

and the output for SQLAgent is:

Final Answer: The average rent price in Chicago is increasing from June 2022 till March 2023.

For accuracy, however, our findings indicate a higher incidence of incorrect responses from SQLAgent. Besides the general shortcomings (see conclusion in How to — LangChain SQLChain) of using LLM to query database, we hypothesize that SQLAgent will occasionally make its best attempt to answer a question even when concrete results cannot be obtained from the SQL query. For example, when asked the question: “Are home prices increasing in San Diego from Nov 2022 till Dec 2022?”, the SQLAgent produces the following chain of thought and action:

Action: list_tables_sql_db
Action Input: ""

Thought:I can query the tables "zillow_home_value_index" and "zillow_home_value" to get information about home prices. I should query the schema of these tables to see what columns are available.
Action: schema_sql_db
Action Input: "zillow_home_value_index, zillow_home_value"

Thought:I can query the tables "zillow_home_value_index" and "zillow_home_value" to get information about home prices. The relevant columns are "period_start", "period_end", "location_name", and "metric_value". I can filter the results based on the location name "San Diego" and the period between November 2022 and December 2022. I should use the "zillow_home_value" table because it has more specific property types.
Action: query_sql_db
Action Input: "SELECT period_start, period_end, location_name, metric_value FROM zillow_home_value WHERE location_name = 'San Diego' AND period_start >= '2022-11-01' AND period_end <= '2022-12-31' LIMIT 10"
Observation: []
Thought:There are no results for the query. This means that there is no data available for home prices in San Diego from November 2022 till December 2022.
Final Answer: No, home prices are not increasing in San Diego from Nov 2022 till Dec 2022.

As we can see, SQLAgent did not observe any data from the sql query but chose to provide an answer anyways.

Hence, if you find yourself in need of SQLAgent’s capability to construct intricate chains of calls to language models and other tools, it would be prudent to formulate well-defined and precise questions while also cross-verifying the results with your database.

Conclusion

LangChain SQLAgent is a powerful tool that creates complex LLM chain calls for answering user questions. Although it return a response for most relevant questions, it fails to prevent LLM hallucination.

If you haven’t read the blog about how to use LangChain SQLChain, please check it out here. We will also be covering how to use LlamaIndex for connecting LLMs to databases later this month so if you’re interested in learning more about using LLM to answer database questions please consider subscribing, thank you.



Comments

Popular posts from this blog

Flutter for Single-Page Scrollable Websites with Navigator 2.0

A Data Science Portfolio is More Valuable than a Resume

Better File Storage in Oracle Cloud