Psycopg3: Separate connections vs Connection Pool
Overview
I compared the performance of single connection objects against a connection pool in Psycopg3 for database insert operations. Throughout this post, I am using the players data returned by the Fantasy Premier League API.
Single Connection
First, I created a function db_connection that uses psycopg's connect() function to open a new database session and return an instance of the Connection class.
import os
import psycopg

def db_connection():
    connection = psycopg.connect(
        dbname=os.getenv("DB_NAME", "test_db"),
        host=os.getenv("DB_HOST", "localhost"),
        user=os.getenv("DB_USER", "postgres"),
        password=os.getenv("DB_PASSWORD", "postgres"),
        port=os.getenv("DB_PORT", "5432"),
    )
    return connection
In a separate file, I created another function that uses db_connection() in a context manager and inserts all the data into the players table.
import time

def insert_players(players):
    inserted_count = 0
    start_time = time.time()
    with db_connection() as conn:
        for player in players:
            columns = ', '.join(player.keys())
            placeholders = ', '.join(['%s'] * len(player))
            values = tuple(player.values())
            query = f"INSERT INTO players ({columns}) VALUES ({placeholders})"
            conn.execute(query, values)
            inserted_count += 1
    end_time = time.time()
    print(f'{inserted_count} players in {end_time - start_time} seconds')
Finally, I fetched the players from the remote API and validated them with Pydantic before passing them into the function above for insertion.
def get_players():
    players = call_bootstrap_api()
    filtered_players_data = validate_required_attributes(players)
    insert_players(filtered_players_data)

if __name__ == "__main__":
    get_players()
# Output:
661 players in 0.1283118724822998 seconds
That is perfectly fast enough for my usage. Now on to the connection pool.
Connection Pool
I installed it with pip install psycopg[pool] and followed the psycopg_pool documentation to create a pool object.
import os
from psycopg_pool import ConnectionPool

DATABASE_URL = os.getenv('DATABASE_URL', "postgresql://postgres:postgres@localhost:5432/test_db")
pool = ConnectionPool(conninfo=DATABASE_URL)
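The pool size can also be tuned at construction time. Here is a minimal sketch assuming the same DATABASE_URL; the min_size and max_size values are picked purely for illustration (my tests used the defaults).

# Sketch: a pool with explicit sizing (illustrative values, not what I benchmarked)
pool = ConnectionPool(
    conninfo=DATABASE_URL,
    min_size=2,  # connections opened up front and kept alive
    max_size=4,  # upper bound on connections the pool will open on demand
)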
To simulate concurrent use of the connection pool, I followed the pool usage example in the official psycopg documentation. First, I created a function to split the players data into two batches.
from concurrent.futures import ThreadPoolExecutor

def batch_insert(players):
    mid_point = len(players) // 2
    first_half = players[:mid_point]
    second_half = players[mid_point:]
    start_time = time.time()
    with ThreadPoolExecutor() as executor:
        inserts_batch_1 = executor.submit(insert_players, first_half)
        inserts_batch_2 = executor.submit(insert_players, second_half)
        inserts_batch_1.result()
        inserts_batch_2.result()
    end_time = time.time()
    print(f"Inserted players in two batches in {end_time - start_time} seconds")
I then updated the insert_players function above to take a connection from the pool instead of creating a Connection object, and ran the same flow.
def insert_players(players):
    inserted_count = 0
    start_time = time.time()
    # Changed db_connection() to a connection taken from the pool
    with pool.connection() as conn:
        for player in players:
            columns = ', '.join(player.keys())
            placeholders = ', '.join(['%s'] * len(player))
            values = tuple(player.values())
            query = f"INSERT INTO players ({columns}) VALUES ({placeholders})"
            conn.execute(query, values)
            inserted_count += 1
    end_time = time.time()
    print(f'{inserted_count} players in {end_time - start_time} seconds')
def get_players():
    players = call_bootstrap_api()
    filtered_players_data = validate_required_attributes(players)
    batch_insert(filtered_players_data)

if __name__ == "__main__":
    get_players()
# Output with pool:
Inserted players in two batches in 0.07747578620910645 seconds
Comparing the outputs, pooling was noticeably faster (relatively speaking), but that could have been down to the batching rather than the pool itself. To test this, I kept the batching but swapped the pool connection for the normal connection object.
# Instead of
with pool.connection() as conn:
# I used the single connection object
with db_connection() as conn:
# Output
Inserted players in two batches in 0.09993529319763184 seconds
Although inserting the data from multiple threads with normal connection objects was faster than inserting from a single thread, using the connection pool was faster still. My guess is that the overhead of opening and closing two separate connections outweighed the overhead of managing the pool for a dataset of this size.
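To get a rough feel for that overhead, something like the sketch below could compare opening a fresh connection against borrowing one from the already-warm pool. It reuses the db_connection and pool defined earlier; the numbers it prints are illustrative and not part of my results.

import time

def time_fresh_connection():
    # Opens a brand-new connection (TCP + authentication) and runs a trivial query
    start = time.time()
    with db_connection() as conn:
        conn.execute("SELECT 1")
    return time.time() - start

def time_pooled_connection():
    # Borrows an already-open connection from the pool instead
    start = time.time()
    with pool.connection() as conn:
        conn.execute("SELECT 1")
    return time.time() - start

print(f"fresh connection: {time_fresh_connection():.4f} seconds")
print(f"pooled connection: {time_pooled_connection():.4f} seconds")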
However, before drawing any conclusions I wanted to push the tests (and my laptop's resources) as far as possible with a larger dataset.
So I changed my validation function to duplicate the same data multiple times. This let me increase the number of records by looping over the same payload repeatedly and only changing the player_id (the primary key).
def validate_required_attributes(players_data):
    filtered_players = []
    total_records = 20000
    data_length = len(players_data)
    repeats = total_records // data_length + (total_records % data_length > 0)
    for repeat_index in range(repeats):
        for player in players_data:
            validated_player = PlayerCompleteSchema(**player).model_dump()
            # Update the player_id based on the current index
            new_player_id = repeat_index * data_length + len(filtered_players) + 1
            validated_player['player_id'] = new_player_id
            filtered_players.append(validated_player)
            # Break the loop once total_records is reached
            if len(filtered_players) >= total_records:
                break
    return filtered_players[:total_records]
Using the function above, I varied total_records and inserted thousands of rows into the database with both a single connection object and the connection pool. Here are some of my findings:
# 50k rows, no pool, 2 batches: 5.062958002090454 seconds
# 50k rows, pool, 2 batches: 5.0906898975372314 seconds
# 100k rows, no pool, 2 batches: 9.828521013259888 seconds
# 100k rows, pool, 2 batches: 10.176682949066162 seconds
# 50k rows, no pool, 4 batches: 5.622668743133545 seconds
# 50k rows, pool, 4 batches: 5.833988666534424 seconds
# 100k rows, no pool, 4 batches: 11.731716871261597 seconds
# 100k rows, pool, 4 batches: 11.783261060714722 seconds
To my surprise, creating a separate connection object in each thread was slightly faster than using the connection pool once the number of records increased. I had assumed that creating and destroying connection objects would carry more overhead; instead, managing the pool turned out to be slightly more expensive.
Conclusion
I started this test assuming that a connection pool would result in faster write operations than creating and destroying separate connection objects. But the difference was never significant, even when I increased the number of rows from roughly 600 to 100k.
Combining my tests with this section of the documentation, I have concluded that unless the Postgres server's performance starts to suffer, or the number of connections gets close to the server's max_connections setting, using connection objects liberally is perfectly fine.
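For reference, that limit can be checked directly against the server; a minimal sketch using the db_connection helper from earlier:

# Sketch: check the server's connection limit and how many connections are open
with db_connection() as conn:
    max_conn = conn.execute("SHOW max_connections").fetchone()[0]
    open_conn = conn.execute("SELECT count(*) FROM pg_stat_activity").fetchone()[0]
    print(f"max_connections: {max_conn}, currently open: {open_conn}")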
Obviously, this test is too basic for most use cases, and results could differ depending on many factors. I want to test async connection objects as well to get a better understanding. And because Connection objects are thread-safe, I also want to try re-using the same connection object across threads to see the performance difference.
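For the async experiment, I expect the insert to look roughly like the sketch below, using psycopg's AsyncConnection with the same DATABASE_URL and players table. This is untested on my side and only illustrates the shape of the code.

import asyncio
import psycopg

async def insert_players_async(players):
    # Async variant of the insert loop (sketch only, not benchmarked)
    async with await psycopg.AsyncConnection.connect(DATABASE_URL) as conn:
        for player in players:
            columns = ', '.join(player.keys())
            placeholders = ', '.join(['%s'] * len(player))
            query = f"INSERT INTO players ({columns}) VALUES ({placeholders})"
            await conn.execute(query, tuple(player.values()))

# asyncio.run(insert_players_async(filtered_players_data))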
Overall, I feel like I understand database operations a bit better after this exercise which was the main point.
(This post is a small part of my FPL Dashboard project.)