How We Migrated a Large Complex Database from MySQL to PostgreSQL at Gotedo

Recently at Gotedo Church Management Software, we were faced with a decision: continue using MySQL and implement hierarchical data completely from scratch, or migrate to PostgreSQL and take advantage of the ltree data type for easy handling of hierarchical data. Other reasons included taking advantage of PostgreSQL's advanced full-text search capabilities and positioning the software for future multi-tenancy requirements with PostgreSQL's multi-schema database design. It would also be easier to perform this kind of huge database-engine migration now, while the software still has few active users. The MySQL database had 114 tables with delicate foreign key relationships.

We use AdonisJS for the Gotedo backend, so the examples are written in JavaScript with the AdonisJS Lucid ORM. You can easily adapt them to suit your language/framework. Also, at the time of writing, Gotedo is still on AdonisJS 4.x, so the syntax of the examples will be a bit different from that of AdonisJS 5.x.

Migration Steps

Table Migrations

The first step was refactoring the existing table migration scripts and making them compatible with PostgreSQL. Since AdonisJS uses Knex.js underneath the Lucid ORM, there was minimal refactoring of the migration scripts as Knex.js ensures compatibility with MySQL and PostgreSQL out of the box. However, there were some raw queries in the table migration scripts which needed to be adapted for PostgreSQL.

Generally, the following changes were made to the table migration scripts:

1. No AFTER or BEFORE While Creating Columns

PostgreSQL does not support specifying the position of a new column, so any AFTER or BEFORE clauses in the raw MySQL statements had to be removed.
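
For example, a hypothetical ALTER statement (the column and table names below are illustrative, not from the actual Gotedo schema) only needs the positional clause dropped:

/** @type {import('@adonisjs/lucid/src/Schema')} */
const Schema = use('Schema');

class NicknameSchema extends Schema {
  up() {
    // In MySQL this could have been "... VARCHAR(50) AFTER last_name";
    // PostgreSQL simply appends the new column at the end of the table.
    this.raw('ALTER TABLE user_profiles ADD COLUMN nickname VARCHAR(50)');
  }
}

module.exports = NicknameSchema;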

2. Creation of Full-text Indexes

A PostgreSQL GIN index was used to replace the MySQL FULLTEXT index. Learn more here.

/** @type {import('@adonisjs/lucid/src/Schema')} */
const Schema = use('Schema');

class UserProfileSchema extends Schema {
  up() {
-    this.raw('ALTER TABLE user_profiles ADD FULLTEXT (full_name)');
+    this.raw(`CREATE INDEX user_profiles_full_name_fulltext_idx ON user_profiles USING GIN (to_tsvector('english', full_name));`);

  }
}

module.exports = UserProfileSchema;
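
For reference, below is a minimal sketch (not part of the original migration) of how the GIN index above could be queried from Lucid. The searchProfilesByName function and the searchTerm parameter are illustrative.

const Database = use('Database');

// Full-text search against the expression index created above.
// The to_tsvector configuration ('english') must match the one used in the
// index expression, otherwise PostgreSQL cannot use the GIN index.
async function searchProfilesByName(searchTerm) {
  return Database.from('user_profiles').whereRaw(
    "to_tsvector('english', full_name) @@ plainto_tsquery('english', ?)",
    [searchTerm]
  );
}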

3. Creation of BTREE indexes

/** @type {import('@adonisjs/lucid/src/Schema')} */
const Schema = use('Schema');

class ServiceSchema extends Schema {
  up() {
-    this.raw('ALTER TABLE services ADD INDEX (service_year)');
-    this.raw('ALTER TABLE services ADD INDEX (service_month)');
-    this.raw('ALTER TABLE services ADD INDEX (service_day)');
+    this.raw('CREATE INDEX IF NOT EXISTS services_service_year_idx ON services (service_year)');
+    this.raw('CREATE INDEX IF NOT EXISTS services_service_month_idx ON services (service_month)');
+    this.raw('CREATE INDEX IF NOT EXISTS services_service_day_idx ON services (service_day)');
  }
}

module.exports = ServiceSchema;

4. Generated Columns

PostgreSQL is very strict about the expressions allowed for generated columns. Most expressions that succeed in MySQL will fail in PostgreSQL, either with the error generation expression is not immutable or because the function does not exist in PostgreSQL.

In the case of generating substrings from a date data type, the SUBSTRING function could not be used due to the immutable-expression error. So, I opted to use the DATE_PART function, which I must admit was very elegant.

/** @type {import('@adonisjs/lucid/src/Schema')} */
const Schema = use('Schema');

class ServiceSchema extends Schema {
  up() {
-    this.raw(
-      'ALTER TABLE services ADD COLUMN service_year INT(6) GENERATED ALWAYS AS (SUBSTRING(service_date,1,4)*1) STORED AFTER service_date'
-    );
+    this.raw("ALTER TABLE services ADD COLUMN service_year INTEGER GENERATED ALWAYS AS (DATE_PART('year', service_date)) STORED");

-    this.raw(
-      'ALTER TABLE services ADD COLUMN service_month INT(4) GENERATED ALWAYS AS (SUBSTRING(service_date,6,2)*1) STORED AFTER service_year'
-    );
+    this.raw("ALTER TABLE services ADD COLUMN service_month INTEGER GENERATED ALWAYS AS (DATE_PART('month', service_date)) STORED");

-    this.raw(
-      'ALTER TABLE services ADD COLUMN service_day INT(4) GENERATED ALWAYS AS (SUBSTRING(service_date,9,2)*1) STORED AFTER service_month'
-    );
+    this.raw("ALTER TABLE services ADD COLUMN service_day INTEGER GENERATED ALWAYS AS (DATE_PART('day', service_date)) STORED");

  }
}

module.exports = ServiceSchema;

In the case of generating a stored full_name column from the first_name, last_name, and middle_name columns, the IF, CONCAT, and CONCAT_WS functions could not be used. I opted for a CASE expression plus PostgreSQL's string concatenation operator (||). See the example below.

'use strict';

/** @type {import('@adonisjs/lucid/src/Schema')} */
const Schema = use('Schema');

class UserProfileSchema extends Schema {
  up() {
    // Using CONCAT_WS causes error: generation expression is not immutable
    this.raw(
-      "ALTER TABLE user_profiles ADD COLUMN full_name VARCHAR(100) GENERATED ALWAYS AS (IF(middle_name IS NULL, CONCAT(first_name,' ', last_name), CONCAT(first_name,' ',middle_name,' ', last_name))) STORED AFTER last_name"
+      "ALTER TABLE user_profiles ADD COLUMN full_name VARCHAR(100) GENERATED ALWAYS AS (CASE WHEN middle_name IS NULL THEN first_name || ' ' || last_name ELSE first_name || ' ' || middle_name || ' ' || last_name END) STORED"
    );
  }
}

module.exports = UserProfileSchema;

MySQL to PostgreSQL Data Migration with pgloader

Once the table migrations were successful, it was time to transfer existing data from MySQL to PostgreSQL.

The internet is filled with recommendations for pgloader, but this article IS NOT a recommendation for pgloader. As a matter of fact, if pgloader had worked for me, I might not have had any need to document this process. However, it could work for others, so I will share what I learnt while struggling to make pgloader work and the challenges I faced.

You can find installation instructions here. For my MacBook M1 laptop, I had to build pgloader from source for it to work. The build steps are:

  1. Download the latest release from github.com/dimitri/pgloader/tags.
  2. Extract the compressed file.
  3. Change into the extracted directory, e.g. cd pgloader-3.6.3.
  4. Run the make command and wait for the build process to complete.
  5. Now, you can run pgloader with ./build/bin/pgloader.
  6. Test the programme with ./build/bin/pgloader --help.

To get pgloader to work, I spent hours, which ran into days, tweaking its configuration file, and eventually arrived at the configuration below. With a configuration file, you can run pgloader with:

./build/bin/pgloader ~/Documents/Mysql_Psql_Migration.txt

Where ~/Documents/Mysql_Psql_Migration.txt is the path to the configuration file.

In the configuration file below, uppercase words wrapped in % signs indicate variables to be changed by you. Explanations for all the options used in the configuration can be found on this doc page: Migrating a MySQL Database to PostgreSQL.

LOAD DATABASE
     FROM mysql://%MYSQL_USERNAME%:%MYSQL_PASSWORD%@%MYSQL_HOST%/%MYSQL_DB_NAME%
     INTO postgresql://%PG_USERNAME%:%PG_PASSWORD%@%PG_HOST%/%PG_DB_NAME%

WITH disable triggers, create tables, create indexes, foreign keys, uniquify index names, reset sequences, drop indexes, drop schema

SET PostgreSQL PARAMETERS
   session_replication_role to 'replica'

SET MySQL PARAMETERS
   net_read_timeout = '500', net_write_timeout = '500'

ALTER schema '%DB_NAME%' rename to 'public'

BEFORE LOAD DO
   $$ CREATE EXTENSION IF NOT EXISTS "uuid-ossp"; $$

AFTER LOAD DO
   $$ SET session_replication_role = 'origin'; $$;

pgloader stalled each time I attempted to migrate data with it. It never successfully completed the migration. At the point where it stalled, a visual inspection of the tables showed that the foreign keys had not been created even though all the data had been inserted. What is a relational database without referential integrity!
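
One way to confirm this, shown below as a minimal sketch, is to count the foreign-key constraints in the target schema. It relies on the pg connection configured later in this article; the function name is illustrative.

const Database = use('Database');

// Count foreign-key constraints in the public schema. After a complete
// migration of a schema with 114 interrelated tables, this should be
// well above zero.
async function countForeignKeys() {
  const result = await Database.connection('pg').raw(
    `SELECT COUNT(*) AS fk_count
     FROM information_schema.table_constraints
     WHERE constraint_type = 'FOREIGN KEY' AND table_schema = 'public'`
  );
  return result.rows[0].fk_count;
}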

So, I decided to face my fears and write a custom data migration script with an AdonisJS Ace command. It turned out to be the best decision. As expected, that also had its own challenges, but since I was the one writing the data migration script, I had control over every aspect and could fix any errors that surfaced. I will share these experiences below.

Custom Data Migration with AdonisJS

The data migration script is written for AdonisJS 4.x.

Step 1: Create the PostgreSQL Database

This database creation step is very important because you need to define the encoding and collation, which affect how PostgreSQL handles sorting, capitalisation, and character sets.

If you need to drop the PostgreSQL database used for the data migration, use the following command. Be careful not to drop another important database.

psql -U DB_USER -c "DROP DATABASE IF EXISTS DB_NAME WITH (FORCE);"

Where DB_USER and DB_NAME are variables to be changed. The FORCE option closes any active connections to the database and allows it to be dropped (it is available from PostgreSQL 13 onward). Read more here.

The command for creating a database is:

psql -U DB_USER -c "CREATE DATABASE DB_NAME OWNER DB_USER ENCODING 'UTF-8' LC_COLLATE 'en_US.UTF-8' LC_CTYPE 'en_US.UTF-8' template template0;"

The portion of the above psql command wrapped with double quotes is the actual CREATE statement. Learn more about the CREATE statement here.

There are two options passed to the psql command above:

  • -U: for specifying the authenticating user for the command.
  • -c: for specifying the actual SQL statement to be run, e.g. CREATE DATABASE DB_NAME OWNER DB_USER ENCODING 'UTF-8' LC_COLLATE 'en_US.UTF-8' LC_CTYPE 'en_US.UTF-8' template template0;

  • ENCODING specifies the character set the database supports. Using UTF-8 ensures that the database supports all possible character sets. Learn more here.

  • LC_COLLATE specifies the collation which controls how query results are sorted.
  • LC_CTYPE controls how capitalisation is handled within your database.
  • template specifies which database template should be used as the base/reference point for creating the new database. If template is not provided, PostgreSQL will use template1 by default (which might not have the correct collation and encoding settings we want).

Step 2: Prepare the PostgreSQL environmental variables and configuration

For AdonisJS, you will need to add some environment variables for the PostgreSQL database into your .env file. Make sure you set DB_CONNECTION to pg.

DB_CONNECTION=pg
PG_DB_HOST=127.0.0.1
PG_DB_PORT=5432
PG_DB_USER=postgres
PG_DB_PASSWORD=postgres
PG_DB_DATABASE=DB_NAME

Update your config/database.js file to include the PG variables:

'use strict';

/** @type {import('@adonisjs/framework/src/Env')} */
const Env = use('Env');

/** @type {import('@adonisjs/ignitor/src/Helpers')} */
const Helpers = use('Helpers');

module.exports = {
  /*
  |--------------------------------------------------------------------------
  | Default Connection
  |--------------------------------------------------------------------------
  |
  | Connection defines the default connection settings to be used while
  | interacting with SQL databases.
  |
  */
  connection: Env.get('DB_CONNECTION', 'sqlite'),

  /*
  |--------------------------------------------------------------------------
  | MySQL
  |--------------------------------------------------------------------------
  |
  | Here we define connection settings for MySQL database.
  |
  | npm i --save mysql
  |
  */
  mysql: {
    client: 'mysql',
    connection: {
      host: Env.get('DB_HOST', 'localhost'),
      port: Env.get('DB_PORT', ''),
      user: Env.get('DB_USER', 'root'),
      password: Env.get('DB_PASSWORD', ''),
      database: Env.get('DB_DATABASE', 'adonis'),
    },
    debug: Env.get('DB_DEBUG', 'false') === 'true',
  },

  /*
  |--------------------------------------------------------------------------
  | PostgreSQL
  |--------------------------------------------------------------------------
  |
  | Here we define connection settings for PostgreSQL database.
  |
  | npm i --save pg
  |
  */
  pg: {
    client: 'pg',
    connection: {
      host: Env.get('PG_DB_HOST', 'localhost'),
      port: Env.get('PG_DB_PORT', ''),
      user: Env.get('PG_DB_USER', 'root'),
      password: Env.get('PG_DB_PASSWORD', ''),
      database: Env.get('PG_DB_DATABASE', 'adonis'),
    },
    debug: Env.get('DB_DEBUG', 'false') === 'true',
  },
};

Step 3: Migrate the tables

Now, you are ready to run the table migrations against the newly created PostgreSQL database. Change into the directory of your AdonisJS application and run:

./node_modules/.bin/adonis migration:refresh

Step 4: Create the Migration Command

Here, we implement the custom data migration script as shown below. The migration script is based on an AdonisJS Ace command.

  1. Create the command:
./node_modules/.bin/adonis make:command PostgresqlMigrator
  2. Register the command in start/app.js:
/*
|--------------------------------------------------------------------------
| Commands
|--------------------------------------------------------------------------
|
| Here you store ace commands for your package
|
*/
+ const commands = ['App/Commands/PostgresqlMigrator'];

Step 5: The MySQL to PostgreSQL Migration Script

Open app/Commands/PostgresqlMigrator.js and override the file with the following:

'use strict';

const Config = use('Config');
const Database = use('Database');
const lodash = require('lodash');
const { Command } = require('@adonisjs/ace');

class PostgresqlMigrator extends Command {
  static get signature() {
    return 'postgresql:migrator';
  }

  static get description() {
    return 'Migrate MySQL data to PostgreSQL';
  }

  async handle(args, options) {
    const pg = Config.get('database.pg');
    // Disable foreign key checks
    await Database.connection('pg').raw(`SET session_replication_role = 'replica'`);
    // Create a transaction
    const trx = await Database.connection('pg').beginTransaction();
    // Get all tables on the database
    let allTables = await Database.connection('mysql').raw(`SHOW TABLES FROM ${pg.connection.database}`);

    allTables = JSON.parse(JSON.stringify(allTables))[0];
    allTables = allTables.map(t => Object.values(t)[0]).filter(t => t !== 'adonis_schema');

    const generatedColumns = [
      { tableName: 'financial_budget_items', columns: ['projected_total_cost'] },
      { tableName: 'financial_request_items', columns: ['total_cost'] },
      { tableName: 'services', columns: ['service_day', 'service_month', 'service_year'] },
      { tableName: 'service_attendance', columns: ['service_year', 'service_month', 'service_day'] },
      { tableName: 'personal_profiles', columns: ['full_name', 'date_of_birth_year', 'date_of_birth_month', 'date_of_birth_day'] },
      { tableName: 'user_profiles', columns: ['full_name', 'date_of_birth_month', 'date_of_birth_day', 'date_of_birth_year'] },
    ];

    try {
      // Transfer data into all tables
      for (let i = 0; i < allTables.length; i++) {
        const tableName = allTables[i];
        let existingData = await Database.connection('mysql').from(tableName).select('*');
        existingData = JSON.parse(JSON.stringify(existingData));

        const hasGeneratedColumns = generatedColumns.find(e => e.tableName === tableName);
        if (hasGeneratedColumns) {
          const newExistingData = [];
          for (let j = 0; j < existingData.length; j++) {
            const data = existingData[j];
            newExistingData.push(lodash.omit(data, hasGeneratedColumns.columns));
          }
          existingData = newExistingData;
        }
        //console.log(existingData);
        // Insert records
        await trx.batchInsert(tableName, existingData, 100);

        // Set auto-increments
        // First exclude all uuid id columns
        let dataType = await trx.raw(
          `SELECT table_name, column_name, data_type 
          from information_schema.columns 
          where table_catalog = :db
          and table_schema = 'public'
          and table_name = :table limit 1;`,
          { db: pg.connection.database, table: tableName }
        );
        dataType = JSON.parse(JSON.stringify(dataType));
        dataType = dataType ? dataType.rows[0] : null;
        console.log(dataType);

        if (dataType && dataType.data_type !== 'uuid') {
          console.log('this will run');

          // Set the auto-increment value
          await trx.raw(
            `SELECT SETVAL(
                (SELECT PG_GET_SERIAL_SEQUENCE(:table, 'id')),
                (SELECT MAX("id") FROM :table:));
          `,
            { table: tableName }
          );
        }
      }

      await trx.commit();
    } catch (error) {
      console.log(error);
      await trx.rollback();
    }

    await Database.connection('pg').raw(`SET session_replication_role = 'origin'`);

    // Without the following line, the command will not exit!
    Database.close();
  }
}

module.exports = PostgresqlMigrator;
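
Once the command is registered, the data migration can be started from the project root using the signature defined above:

./node_modules/.bin/adonis postgresql:migrator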

Here are some explanations of what is going on within the migration script.

  1. The handle method is the entry point into the command.
  2. const pg = Config.get('database.pg');: here, we start by reading the pg database settings in config/database.js.
  3. await Database.connection('pg').raw(`SET session_replication_role = 'replica'`);: here, we disable foreign key checks during the migration process. This is important because of how intertwined the foreign relationships are. For example, the organisations table references the users table at the column created_by while the users table references the organisations table at the column organisation_creator. No matter which table you migrate first, there will always be a referential integrity error.

    Without disabling foreign key checks, you will get the following error:

    {
    length: 296,
    severity: 'ERROR',
    code: '23503',
    detail: 'Key (org_size)=(3) is not present in table "organisation_sizes".',
    hint: undefined,
    position: undefined,
    internalPosition: undefined,
    internalQuery: undefined,
    where: undefined,
    schema: 'public',
    table: 'organisations',
    column: undefined,
    dataType: undefined,
    constraint: 'organisations_org_size_foreign',
    file: 'ri_triggers.c',
    line: '2539',
    routine: 'ri_ReportViolation'
    }
    
  4. const trx = await Database.connection('pg').beginTransaction();: here, we connect to the PostgreSQL database via the pg connection and begin a new database transaction. The transaction will be used to rollback the migrated data if any error occurs. With this, you do not need to drop and re-create the database after each failure. Since we are merely copying over data with the ids preserved, referential integrity will be maintained after the migration.

  5. Below, we read all the tables available on the MySQL database via the mysql connection into the allTables variable, then map over the array to get only the table names, excluding the adonis_schema table, which is already populated with the table migration status information.

    let allTables = await Database.connection('mysql').raw(`SHOW TABLES FROM ${pg.connection.database}`);
    
     allTables = JSON.parse(JSON.stringify(allTables))[0];
     allTables = allTables.map(t => Object.values(t)[0]).filter(t => t !== 'adonis_schema');
    
  6. Below, we define an array which contains all tables with generated columns and the column names of the generated columns. This is used to skip the migration of such columns.

     const generatedColumns = [
       { tableName: 'financial_budget_items', columns: ['projected_total_cost'] },
       { tableName: 'financial_request_items', columns: ['total_cost'] },
       { tableName: 'services', columns: ['service_day', 'service_month', 'service_year'] },
       { tableName: 'service_attendance', columns: ['service_year', 'service_month', 'service_day'] },
       { tableName: 'personal_profiles', columns: ['full_name', 'date_of_birth_year', 'date_of_birth_month', 'date_of_birth_day'] },
       { tableName: 'user_profiles', columns: ['full_name', 'date_of_birth_month', 'date_of_birth_day', 'date_of_birth_year'] },
     ];
    
  7. Within the try...catch block, we loop through the allTables array of table names.

    At the statement:

    let existingData = await Database.connection('mysql').from(tableName).select('*');
    

    We select all rows of the current table from the MySQL connection/database.

  8. We check if the table contains generated columns, then omit the generated columns if true.

         if (hasGeneratedColumns) {
           const newExistingData = [];
           for (let j = 0; j < existingData.length; j++) {
             const data = existingData[j];
             newExistingData.push(lodash.omit(data, hasGeneratedColumns.columns));
           }
           existingData = newExistingData;
         }
    

    Without the above check for generated columns, you will get the following error:

    {
    length: 172,
    severity: 'ERROR',
    code: '428C9',
    detail: 'Column "full_name" is a generated column.',
    hint: undefined,
    position: undefined,
    internalPosition: undefined,
    internalQuery: undefined,
    where: undefined,
    schema: undefined,
    table: undefined,
    column: undefined,
    dataType: undefined,
    constraint: undefined,
    file: 'rewriteHandler.c',
    line: '908',
    routine: 'rewriteTargetListIU'
    }
    
  9. We insert the MySQL data into the PostgreSQL database via the transaction reference. The Knex.js batchInsert function is used to control the number of rows (100 in this case) inserted per batch.

    await trx.batchInsert(tableName, existingData, 100);
    
  10. Since we are inserting the data with the ids from MySQL, we need to set the PostgreSQL sequence (or auto-increment) value to the maximum id value for each migrated table. However, most id columns in the database use the uuid data type, and trying to set the sequence value on a uuid column will result in an error. So, we have to skip the uuid columns as well. The information_schema.columns table holds information about the columns of all tables in the database.

  11. Below, for each table, we read the corresponding data type of the id column from the information_schema.columns table.

        let dataType = await trx.raw(
          `SELECT table_name, column_name, data_type 
          from information_schema.columns 
          where table_catalog = :db
          and table_schema = 'public'
          and table_name = :table limit 1;`,
          { db: pg.connection.database, table: tableName }
        );
        dataType = JSON.parse(JSON.stringify(dataType));
        dataType = dataType ? dataType.rows[0] : null;
    
  12. If the id column is not of type uuid, we go ahead and set the value of the sequence/auto-increment for the column. Read more about the PG_GET_SERIAL_SEQUENCE function here.

    
        if (dataType && dataType.data_type !== 'uuid') {
          console.log('this will run');
    
          // Set the auto-increment value
          await trx.raw(
            `SELECT SETVAL(
                (SELECT PG_GET_SERIAL_SEQUENCE(:table, 'id')),
                (SELECT MAX("id") FROM :table:));
          `,
            { table: tableName }
          );
        }
    

    Without the above check, you might get an error like function max(uuid) does not exist:

    {
    length: 203,
    severity: 'ERROR',
    code: '42883',
    detail: undefined,
    hint: 'No function matches the given name and argument types. You might need to add explicit type casts.',
    position: '109',
    internalPosition: undefined,
    internalQuery: undefined,
    where: undefined,
    schema: undefined,
    table: undefined,
    column: undefined,
    dataType: undefined,
    constraint: undefined,
    file: 'parse_func.c',
    line: '636',
    routine: 'ParseFuncOrColumn'
    }
    
  13. If there are no errors, the transaction will be committed at: await trx.commit();.

  14. Foreign key checks are restored via await Database.connection('pg').raw(`SET session_replication_role = 'origin'`);.

  15. Close all opened database connections: Database.close();
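
After the command completes successfully, a quick sanity check, sketched below and not part of the original script, is to compare row counts between the MySQL and PostgreSQL databases for each table:

const Database = use('Database');

// Compare row counts between the source and target databases for one table.
// Table names would come from the same allTables list used in the script above.
async function compareRowCounts(tableName) {
  const [mysqlRows, pgRows] = await Promise.all([
    Database.connection('mysql').from(tableName).count('* as total'),
    Database.connection('pg').from(tableName).count('* as total'),
  ]);

  console.log(tableName, mysqlRows[0].total, pgRows[0].total);
}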

Another possible error which could be encountered is:

{
  length: 136,
  severity: 'ERROR',
  code: '22P02',
  detail: undefined,
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: "unnamed portal parameter $152 = '...'",
  schema: undefined,
  table: undefined,
  column: undefined,
  dataType: undefined,
  constraint: undefined,
  file: 'uuid.c',
  line: '137',
  routine: 'string_to_uuid'
}

In this case, check that you are not trying to insert a value which is not a valid UUID into a uuid column. If that is the case, change the column definition to a string type and re-migrate the tables.
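
For illustration, here is a hypothetical table migration (the table and column names are made up) that defines the offending column as a string instead of a uuid:

/** @type {import('@adonisjs/lucid/src/Schema')} */
const Schema = use('Schema');

class ExternalRefsSchema extends Schema {
  up() {
    this.create('external_refs', table => {
      table.increments();
      // Previously table.uuid('external_id'); stored as a plain string instead,
      // since the source data is not guaranteed to be a valid UUID.
      table.string('external_id', 64).index();
      table.timestamps();
    });
  }

  down() {
    this.drop('external_refs');
  }
}

module.exports = ExternalRefsSchema;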

I hope this helps someone out there. I would appreciate your feedback if you found this article helpful.