
Exporting a Postgres Schema to Parquet on S3 with LakeXpress

François Pacull, ARPE.IO Developer
2026-02-18 · 7 min · Tutorial · LakeXpress

LakeXpress exports database tables to Parquet files, uploads them to cloud storage, and can publish them to data lakes.

In this post, I'll walk through a minimal LakeXpress workflow: create a sync configuration, then run it. By the end, a full Postgres schema will be exported as Parquet files on S3 (no data lake publishing in this post).

LakeXpress architecture: PostgreSQL schema exported via LakeXpress + FastBCP to S3 bucket, with a Log DB for orchestration

Prerequisites

  • The LakeXpress binary (Linux in this example)
  • The FastBCP binary, used internally by LakeXpress for parallel table extraction
  • A logging/orchestration database (SQL Server, Postgres, MySQL, SQLite, or DuckDB; this example uses a local SQL Server instance)
  • A running Postgres database containing the TPC-H dataset in a schema named tpch_1 (scale factor 1)
  • An S3 bucket and an AWS CLI profile configured
  • A credentials file (ds_credentials.json) -- more on this below

The credentials file

LakeXpress reads connection details from a single JSON file. Each entry is identified by a key that you reference later on the command line.

{
  "ds_pg_01": {
    "ds_type": "postgres",
    "auth_mode": "classic",
    "info": {
      "username": "$env{PG_USER}",
      "password": "$env{PG_PASSWORD}",
      "server": "localhost",
      "port": 5432,
      "database": "tpch"
    }
  },
  "log_db_ms": {
    "ds_type": "mssql",
    "auth_mode": "classic",
    "info": {
      "username": "$env{LOG_DB_USER}",
      "password": "$env{LOG_DB_PASSWORD}",
      "server": "localhost",
      "port": 1433,
      "database": "lakexpress"
    }
  },
  "aws_s3_01": {
    "ds_type": "s3",
    "auth_mode": "profile",
    "info": {
      "directory": "s3://mybucket/myproject/",
      "profile": "lakexpress"
    }
  }
}

Three entries:

  • ds_pg_01 -- the source Postgres database (TPC-H data lives here)
  • log_db_ms -- the logging and orchestration database (SQL Server here, but Postgres also works)
  • aws_s3_01 -- the S3 target, configured with an AWS CLI profile name and a base directory
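Since a typo in this file only surfaces when LakeXpress first reads it, it can be worth validating the JSON syntax up front. A minimal check using Python's bundled json.tool (the path data/ds_credentials.json matches the commands later in this post):

```shell
# Validate JSON syntax before handing the file to LakeXpress.
python3 -m json.tool data/ds_credentials.json > /dev/null \
  && echo "credentials file: valid JSON" \
  || echo "credentials file: syntax error"
```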

Sensitive values like usernames and passwords can be supplied as environment variables using the $env{VARIABLE_NAME} syntax.
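For example, the entries above expect four variables to be set in the shell that launches LakeXpress (the values below are placeholders, not real credentials):

```shell
# Placeholder credentials -- replace with your own before running LakeXpress.
export PG_USER="tpch_reader"
export PG_PASSWORD="change-me"
export LOG_DB_USER="sa"
export LOG_DB_PASSWORD="change-me-too"
```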

The log database persists sync configurations, run history, job statuses, file manifests, and table-level metrics. Because state is tracked per table, runs are resumable -- if an export is interrupted, LakeXpress picks up where it left off without re-exporting completed tables. The stored history also makes runs auditable: you can query past runs to see timings, row counts, and any errors.

The source data

The tpch_1 schema contains the eight standard TPC-H tables. A quick check with INFORMATION_SCHEMA:

SELECT table_name
FROM information_schema.tables
WHERE table_catalog = 'tpch'
  AND table_schema = 'tpch_1';
 table_name
------------
nation
part
partsupp
orders
lineitem
customer
region
supplier
(8 rows)

Scale factor 1 means roughly 6 million rows in lineitem, 1.4 million in orders, and smaller counts for the remaining tables. Not a huge dataset, but enough to see how LakeXpress handles parallel exports.

Step 1 -- Create the sync configuration

The config create command registers a named sync configuration. It tells LakeXpress where to read from, where to write, and how much parallelism to use.

./LakeXpress config create \
-a data/ds_credentials.json \
--log_db_auth_id log_db_ms \
--source_db_auth_id ds_pg_01 \
--source_db_name tpch \
--source_schema_name tpch_1 \
--fastbcp_dir_path /home/francois/Workspace/FastBCP_exe/latest/ \
--fastbcp_p 2 \
--n_jobs 4 \
--target_storage_id aws_s3_01 \
--generate_metadata

What each argument does:

  • -a data/ds_credentials.json -- path to the credentials JSON file
  • --log_db_auth_id log_db_ms -- credential entry to use for the logging database
  • --source_db_auth_id ds_pg_01 -- credential entry to use for the source Postgres
  • --source_db_name tpch -- the Postgres database name
  • --source_schema_name tpch_1 -- schema to export (supports SQL LIKE patterns, e.g. tpch_%)
  • --fastbcp_dir_path /home/.../FastBCP_exe/latest/ -- path to the FastBCP executable directory
  • --fastbcp_p 2 -- parallelism within each table export: 2 concurrent readers per table
  • --n_jobs 4 -- number of tables exported concurrently
  • --target_storage_id aws_s3_01 -- credential entry to use for the S3 target
  • --generate_metadata -- flag: produce CDM metadata files alongside the Parquet data

LakeXpress assigns a unique sync_id and saves the configuration:

[...] config create: Creating sync configuration...
[...] Created sync configuration: sync-20260218-f97f78
[...] Saved sync to local registry: ~/.lakexpress/syncs.json
[...]
[...] ============================================================
[...] Sync configuration created successfully!
[...]
[...] sync_id: sync-20260218-f97f78
[...]
[...] Use this sync_id with the following commands:
[...] ./LakeXpress sync --sync_id sync-20260218-f97f78
[...] ./LakeXpress 'sync[export]' --sync_id sync-20260218-f97f78
[...] ./LakeXpress 'sync[publish]' --sync_id sync-20260218-f97f78
[...] ============================================================

The output shows the exact commands you can run next. sync runs the full pipeline (export + publish). sync[export] and sync[publish] let you run each phase independently.

Step 2 -- Run the sync

Now we execute the sync using the sync_id from the previous step:

./LakeXpress sync --sync_id sync-20260218-f97f78

sync is the subcommand; --sync_id identifies which saved configuration to run.

The output starts with a configuration summary:

[...] ============================================================
[...] Executing sync: sync-20260218-f97f78
[...] ============================================================
[...]
[...] Source database: ds_pg_01
[...] Source database name: tpch
[...] Source schemas: tpch_1
[...] Target storage: aws_s3_01
[...] FastBCP path: /home/francois/Workspace/FastBCP_exe/latest/
[...] Parallel degree: 2
[...] Concurrent jobs: 4

LakeXpress then discovers the tables in the schema, sorts them by estimated row count (largest first), and starts exporting:

[...] Discovered 8 tables in schema tpch_1
[...] Tables sorted by estimated row count (largest first) across all schemas:
[...] 1. tpch_1.lineitem (5,680,427 rows)
[...] 2. tpch_1.orders (1,410,012 rows)
[...] 3. tpch_1.part (200,000 rows)
[...] 4. tpch_1.customer (150,000 rows)
[...] 5. tpch_1.supplier (10,000 rows)
[...] 6. tpch_1.region (5 rows)
[...] 7. tpch_1.nation (0 rows)
[...] 8. tpch_1.partsupp (0 rows)

Scheduling the largest tables first is deliberate -- it avoids the classic tail-wait problem where a single slow table stalls the end of the run.

With --n_jobs 4, four tables are exported concurrently. Each table export uses --fastbcp_p 2 parallel readers, so up to 8 concurrent streams are hitting Postgres at peak. A progress bar tracks overall completion:

[...] Export: CELLS: [####################] 100.0%  TABLES: [####################] 100.0% (8/8) - 01:21
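That peak number is simply the product of the two parallelism settings -- useful to keep in mind when checking that the source database's max_connections leaves enough headroom (on top of whatever control connections LakeXpress itself opens):

```shell
n_jobs=4      # tables exported at the same time (--n_jobs)
fastbcp_p=2   # parallel readers per table (--fastbcp_p)
echo "peak concurrent readers: $((n_jobs * fastbcp_p))"
```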

After the export, LakeXpress generates CDM metadata files (because we passed --generate_metadata). These JSON files describe the schema structure for downstream catalog tools.

The run summary confirms everything went through:

[...] ============================================================
[...] RUN SUMMARY
[...] ============================================================
[...] Export: 8/8 tables succeeded
[...] ============================================================
[...] Total - time (s) : 85.330

8 tables, 85 seconds total.

You can also inspect the saved configuration in the log database. Every parameter passed to config create is persisted, giving you a full record of each sync's configuration:

SELECT *
FROM dbo.export_syncs
WHERE sync_id = 'sync-20260218-f97f78';
 sync_id               | sync-20260218-f97f78
 env_name              | default
 source_db_auth_id     | ds_pg_01
 source_database       | tpch
 source_schemas        | ["tpch_1"]
 target_storage_id     | aws_s3_01
 fastbcp_dir_path      | /home/francois/Workspace/FastBCP_exe/latest/
 fastbcp_p             | 2
 n_jobs                | 4
 compression_type      | Zstd
 large_table_threshold | 100000
 generate_metadata     | 1
 error_action          | fail
 is_enabled            | 1
 lakexpress_version    | 0.2.7
 created_at            | 2026-02-18 15:51:16.220

Inspecting the results on S3

Let's see what landed in the bucket:

aws s3 ls s3://mybucket/myproject/tpch_1/ --recursive --profile lakexpress
2026-02-18 16:53:21       8386 myproject/tpch_1/customer.cdm.json
2026-02-18 16:51:58    4039874 myproject/tpch_1/customer/customer_chunk_000.parquet
2026-02-18 16:51:58    4041112 myproject/tpch_1/customer/customer_chunk_001.parquet
2026-02-18 16:53:21      14991 myproject/tpch_1/lineitem.cdm.json
2026-02-18 16:51:58   92809238 myproject/tpch_1/lineitem/lineitem_chunk_000.parquet
2026-02-18 16:51:58   92786196 myproject/tpch_1/lineitem/lineitem_chunk_001.parquet
2026-02-18 16:53:20       1669 myproject/tpch_1/manifest.json
2026-02-18 16:53:20      61471 myproject/tpch_1/model.json
2026-02-18 16:53:21       4752 myproject/tpch_1/nation.cdm.json
2026-02-18 16:52:14       2509 myproject/tpch_1/nation/nation.parquet
2026-02-18 16:53:21       9052 myproject/tpch_1/orders.cdm.json
2026-02-18 16:51:58   19938611 myproject/tpch_1/orders/orders_chunk_000.parquet
2026-02-18 16:51:58   19944629 myproject/tpch_1/orders/orders_chunk_001.parquet
2026-02-18 16:53:21       9263 myproject/tpch_1/part.cdm.json
2026-02-18 16:51:58    2067577 myproject/tpch_1/part/part_chunk_000.parquet
2026-02-18 16:51:58    2070774 myproject/tpch_1/part/part_chunk_001.parquet
2026-02-18 16:53:21       5708 myproject/tpch_1/partsupp.cdm.json
2026-02-18 16:52:16        244 myproject/tpch_1/partsupp/partsupp.parquet
2026-02-18 16:53:21       3918 myproject/tpch_1/region.cdm.json
2026-02-18 16:52:09       1413 myproject/tpch_1/region/region.parquet
2026-02-18 16:53:21       7515 myproject/tpch_1/supplier.cdm.json
2026-02-18 16:52:05     528903 myproject/tpch_1/supplier/supplier.parquet

Each table gets its own directory. Large tables (lineitem, orders, customer, part) are automatically split into chunks (_chunk_000, _chunk_001) based on the --fastbcp_p 2 parallelism setting. Smaller tables get a single file.

CDM metadata files sit at the schema level: one .cdm.json per table, plus a model.json (full schema description) and a manifest.json (list of all data files).
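If you pull one of these files down locally (aws s3 cp works fine), a quick sanity check that it really is Parquet, without any Parquet tooling installed, is to look for the PAR1 magic bytes that the format places at both ends of every file (the filename here is just an example):

```shell
f="lineitem_chunk_000.parquet"
if [ "$(head -c 4 "$f" 2>/dev/null)" = "PAR1" ] && [ "$(tail -c 4 "$f" 2>/dev/null)" = "PAR1" ]; then
  echo "$f: Parquet magic bytes present"
else
  echo "$f: missing Parquet magic bytes"
fi
```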

Zstd compression is effective -- the 6M-row lineitem table comes in at about 177 MiB across two Parquet files.
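That figure can be verified directly from the chunk sizes in the listing (bytes, divided down to MiB with awk):

```shell
# lineitem chunk sizes in bytes, taken from the S3 listing
awk 'BEGIN { printf "%.1f MiB\n", (92809238 + 92786196) / 1024 / 1024 }'
# prints 177.0 MiB
```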

Wrap-up

Two commands to go from a Postgres schema to production-ready Parquet files on S3:

  1. config create -- define what to export and where to land it
  2. sync -- run the export with built-in parallelism, compression, and metadata generation

LakeXpress takes care of the rest: discovering tables, extracting in parallel, chunking, compressing, and generating metadata. For the full CLI reference and more advanced features (incremental syncs, publishing to Snowflake, table-level configuration), see the documentation.