As a Senior Software Engineer at a company building a messaging platform for the healthcare industry, I am responsible, among other duties, for the performance of our application. We develop a fairly standard web service: a Ruby on Rails application for the business logic and API, a React + Redux single-page application for users, and PostgreSQL as the database. The most common cause of performance problems in stacks like this is heavy database queries, and I would like to tell the story of how we applied non-standard but fairly simple optimisations to improve performance.
Our business operates in the US, so we have to be HIPAA compliant and follow certain security policies; a security audit is something we are always prepared for. To reduce risks and costs we rely on a specialised cloud provider to run our applications and databases, very similar to what Heroku does. On the one hand this allows us to focus on building our platform, but on the other hand it adds a limitation to our infrastructure: we cannot scale up infinitely. As a successful startup we double the number of users every few months, and one day our monitoring told us that we were exceeding the disk IO quota on the database server. The underlying AWS storage started throttling, which resulted in significant performance degradation. The Ruby application could not serve all incoming traffic because Unicorn workers were spending too much time waiting for the database to respond, and customers were unhappy.
Standard solutions
At the beginning of the article I used the phrase "non-standard optimisations" because all the low-hanging fruit had already been picked:
- we removed all N + 1 queries; the Ruby gem Bullet was our primary tool
- all needed database indexes were added and all unneeded ones were removed, thanks to pg_stat_statements
- some queries with multiple joins were rewritten for better efficiency
- we separated the queries that fetch paginated collections from the decoration queries. For example, we initially added a message counter per dialog by joining tables, but replaced that with an additional query that augments the results (see the Ruby sketch after this list). The following query does an Index Only Scan and is really cheap:
SELECT COUNT(1), dialog_id
FROM messages
WHERE dialog_id IN (1, 2, 3, 4)
GROUP BY dialog_id;
- we added a few caches. Honestly, this did not work well: as a messaging application we have many updates, so cached data gets invalidated all the time
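To make the decoration approach concrete, here is a rough Ruby sketch of the pagination/decoration split (Dialog, Message, messages_count and the Kaminari-style page/per calls are illustrative, not our exact code):

# One page of dialogs first (Kaminari-style pagination assumed)...
dialogs = Dialog.order(updated_at: :desc).page(params[:page]).per(20)

# ...then a single cheap grouped query instead of joining messages into the
# paginated query; it returns a hash of { dialog_id => count }.
counts = Message.where(dialog_id: dialogs.map(&:id)).group(:dialog_id).count

# messages_count here is a plain attr_accessor consumed by the serializer.
dialogs.each { |dialog| dialog.messages_count = counts.fetch(dialog.id, 0) }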
All these tricks did a great job for a couple of months, until we faced the same performance problem again: more users, higher load. We were looking for something else.
Advanced solutions
We did not want to bring out the heavy artillery and implement denormalisation or partitioning, because these solutions require deep database expertise, shift the team's focus from implementing features to maintenance, and in the end we wanted to avoid complexity in our application. On top of that, we were on PostgreSQL 9.3, where partitioning is built on triggers, with all their costs. The KISS principle in action.
Custom solutions
Compress data
We decided to focus on the main symptom: disk IO. The less data we store, the less IO capacity we need; that was the primary idea. We started looking for opportunities to compress data, and the first candidates were columns like user_type that ActiveRecord adds for polymorphic associations. In the application we use modules heavily, which leaves us with long strings like Module::Submodule::ModelName in polymorphic columns. What we did was convert all these columns from varchar to ENUM. The Rails migration looks like this:
class AddUserEnumType < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def up
    ActiveRecord::Base.connection.execute <<~SQL
      CREATE TYPE user_type_enum AS ENUM (
        'Module::Submodule::UserModel1',
        'Module::Submodule::UserModel2',
        'Module::Submodule::UserModel3'
      );
    SQL

    add_column :messages, :sender_type_temp, :user_type_enum

    Message
      .in_batches(of: 10000)
      .update_all('sender_type_temp = sender_type::user_type_enum')

    safety_assured do
      rename_column :messages, :sender_type, :sender_type_old
      rename_column :messages, :sender_type_temp, :sender_type
    end
  end
end
A few notes about this migration for people who are not familiar with Rails:
- disable_ddl_transaction! turns off the transactional migration. This is risky, but we wanted to avoid one long transaction around the whole backfill. Please make sure you never disable transactions in a migration without a real need for it.
- the first step creates a new ENUM data type in PostgreSQL. The best feature of ENUM is its size: each value takes 4 bytes on disk, which is really small compared to a varchar holding a long class name (see the comparison right after these notes). ENUM has some difficulties with adding new values, but we don't add new user types often.
- add a new column sender_type_temp of the user_type_enum type
- fill the new column in_batches to avoid a long lock on the messages table
- the last step swaps the old column with the new one. This is the most dangerous step: if sender_type were renamed to sender_type_old but sender_type_temp failed to become sender_type, we would be in a lot of trouble.
- safety_assured comes from the strong_migrations gem, which helps to avoid mistakes when writing migrations. Renaming a column is not a safe operation, so we had to confirm that we understood what we were doing. There is actually a safer but longer approach that involves multiple deployments.
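If you want to check the size difference yourself, pg_column_size shows it directly (this assumes the user_type_enum type created above already exists):

SELECT
  pg_column_size('Module::Submodule::UserModel1'::varchar)        AS varchar_bytes,
  pg_column_size('Module::Submodule::UserModel1'::user_type_enum) AS enum_bytes;
-- varchar_bytes is about 30 for a class name of this length, enum_bytes is always 4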
Needless to say, we ran all migrations like this during the lowest-activity periods, with proper testing.
We converted all polymorphic columns to ENUM, dropped the old columns after a few days of monitoring and finally ran VACUUM to reduce fragmentation. This saved us roughly 10% of total disk space, but some narrow tables with only a few columns shrank to half their size! Even more importantly, some tables now fit into PostgreSQL's in-memory cache (remember, we cannot easily add more RAM), and that dramatically reduced the required disk IO.
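The later cleanup migration was trivial; roughly like this (a sketch, the class name is made up):

class DropOldSenderType < ActiveRecord::Migration[5.2]
  def up
    # strong_migrations flags remove_column as unsafe because running code may
    # still reference the column, so we confirm explicitly after the monitoring period.
    safety_assured { remove_column :messages, :sender_type_old }
  end
end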
Don't trust your service provider
We found another improvement thanks to the article How a single PostgreSQL config change improved slow query performance by 50x: our PostgreSQL provider configures the server automatically based on the requested amount of RAM, disk and CPU, but for whatever reason they left the parameter random_page_cost at its default value of 4, which is optimised for spinning disks. They charge us for running our databases on SSDs, yet they did not configure PostgreSQL accordingly. After contacting them, we got much better execution plans:
EXPLAIN ANALYSE SELECT COUNT(*) AS count_dialog_id, dialog_id as dialog_id
FROM messages
WHERE sender_type = 'Module::Submodule::UserModel1'
AND sender_id = 1234
GROUP BY dialog_id;
db=# SET random_page_cost = 4;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=76405.45..76411.92 rows=647 width=12) (actual time=2428.412..2428.420 rows=12 loops=1)
Group Key: dialog_id
-> Bitmap Heap Scan on messages (cost=605.90..76287.72 rows=23545 width=4) (actual time=82.442..2376.033 rows=79466 loops=1)
Recheck Cond: ((sender_id = 1234) AND (sender_type = 'Module::Submodule::UserModel1'::user_type_enum))
Heap Blocks: exact=23672
-> Bitmap Index Scan on index_messages_on_sender_id_and_sender_type_and_message_type (cost=0.00..600.01 rows=23545 width=0) (actual time=76.067..76.068 rows=79466 loops=1)
Index Cond: ((sender_id = 1234) AND (sender_type = 'Module::Submodule::UserModel1'::user_type_enum))
Planning time: 3.849 ms
Execution time: 2428.691 ms
(9 rows)
db=# SET random_page_cost = 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=21359.54..21366.01 rows=647 width=12) (actual time=97.905..97.913 rows=12 loops=1)
Group Key: dialog_id
-> Index Scan using index_messages_on_sender_id_and_sender_type_and_message_type on messages (cost=0.56..21241.81 rows=23545 width=4) (actual time=0.058..60.444 rows=79466 loops=1)
Index Cond: ((sender_id = 1234) AND (sender_type = 'Module::Submodule::UserModel1'::user_type_enum))
Planning time: 0.277 ms
Execution time: 98.070 ms
(6 rows)
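The change itself is a one-liner. Our provider had to apply it for us, but if you manage your own instance it looks roughly like this (the plans above use a value of 1; anything around 1.0-1.5 is a typical recommendation for SSDs):

-- PostgreSQL 9.4+: persist the new value and reload the configuration.
-- On 9.3, which we were running, set random_page_cost = 1 in postgresql.conf
-- and reload instead.
ALTER SYSTEM SET random_page_cost = 1;
SELECT pg_reload_conf();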
Move data away
We also moved a huge table into another database. By law we have to keep an audit trail of every change in the system, and this requirement is implemented with the PaperTrail gem. The library creates a table in the production database where every change to the monitored objects is recorded. We use the multiverse library to plug a second database instance into our Rails app; by the way, multiple databases are going to be a standard feature of Rails 6. The configuration looks like this:
Describe the connection in config/database.yml:
external_default: &external_default
  url: "<%= ENV['AUDIT_DATABASE_URL'] %>"

external_development:
  <<: *external_default
The base class for ActiveRecord models stored in the other database:
class ExternalRecord < ActiveRecord::Base
  self.abstract_class = true
  establish_connection :"external_#{Rails.env}"
end
The model which implements PaperTrail versions:
class ExternalVersion < ExternalRecord
  include PaperTrail::VersionConcern
end
And its usage in a model under audit:
class Message < ActiveRecord::Base
  has_paper_trail class_name: "ExternalVersion"
end
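Nothing changes from the application's point of view: versions are written to and read from the audit database transparently. A quick sanity check in the Rails console could look like this (the body column is just an example):

message = Message.last
message.update!(body: "edited")

message.versions.last.class                  # => ExternalVersion
ExternalVersion.connection.current_database  # the audit database, not the primary one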
Summary
We did finally add more RAM to our PostgreSQL instance, and today we consume only about 10% of the available disk IO. We survived until that point because of a few tricks: we compressed the data in our production database, corrected the configuration and moved less relevant data away. These exact steps may not help in your particular case, but I hope the article gives you some ideas for simple, custom optimisations. And of course, don't forget to go through the checklist of standard problems listed at the beginning.
P.S.: I highly recommend a great DBA add-on for psql.