Notes from Postgres Conference Silicon Valley 2015

These are my quick jottings during the talks at PGConf SV today:

Citus DB (distributed postgresql) will be open sourced

citusdb is going open source as a PostgreSQL extension #pgconfsv – Josh Berkus

First applause of day as @umurc announces CitusDB is going open source. #PGConfSV – merv

Everybody loves Kafka

Lots of Kafka love here at #pgconfsv Seems like Postgres + Kafka is a love match right now … – Josh Berkus

Hasura says JSON > SQL

Intriguing consulting company from India, although I didn’t get a chance to talk to them, the gist is that they provide a MongoDB-like JSON querying interface on top of RDBMS databases.

Update: There’s also PostgREST which is an open source project in Haskell that is similar (via the awesome-postgres list).

TripAdvisor runs on Postgresql

Matthew Kelly of TripAdvisor.

4 datacenters. 100 dedicated Postgres servers. 768 GB RAM. Multi-terabyte databases. 315 million unique visitors per month.

Switching from DRBD to streaming replication.

Switching Collation: utf.en-us -> C because glibc keeps changing character sorting and affects indexes

Switching Hardware: RAM -> SSD

Cross datacenter replication is done by custom trigger-based replication.

Hopes to see BDR in core.

Active/Passive model of sites – two fully functional sites, keep flipping active role. Secondary site used for disaster recovery, load testing, hardware upgrades, etc.

Development environments – weekly dump restores of all schema and all non-PII (?) data into 3 mini sites – dev, prerelease and test lab. 36+ hour process that completes every weekend.

System Tuning:

  • Always separate your WAL, data and temp partitions onto different disks, even on SSDs.
  • Make sure your kernel thinks your SSD array isn’t a spinning disk array.

Cache Statements:

  • 60% CPU savings by properly caching prepared statements.

Cascading Failures:

  • Statement timeout is a must
  • Separating read and write threadpools

Standard Hardware:

  • From 256-768 GB RAM & 15K spinning drives to 256GB RAM & enterprise-grade SSDs
  • Next bottleneck
    • Kernel version – requires Puppet upgrade + moving to systemd
    • 1 Gbps networking isn’t enough

Prestogres – connecting presto query engine via postgresql protocol to visualization tools

Sadayuki Furuhashi of Treasure Data. Also created MessagePack and Fluentd.

Before: HDFC -> Hive daily/hourly batch -> Postgresql -> Dashboard / Interactive query Now: HDFC -> Presto -> Dashboard

Presto distributed query engine from Facebook. Connects to Cassandra, Hive, JDBC, Postgres, Kafka, etc.

Why Presto? Because elastic. Adding a server improves performance instantly. Scale performance when we need. Separate computation engine from storage engine.

Why Presto over MapReduce? Because:

  • memory-to-memory data transfer
    • no disk IO
    • data chunk must fit in memory
  • all stages are pipelined
    • no wait time
    • no fault tolerance

Writing connectors for data visualization & business intelligence tools to talk to Presto would be a lot of work, so why not create a Postgresql protocol adapter for Presto.

Other possible designs were:

  • MySQL protocol + libdrizzle : But Presto has syntax differences with MySQL
  • Postgresql + Foreign Data Wrapper : JOIN and aggregation pushdown is not available yet

Difficulties to implement Postgres protocol:

  • Emulating system catalogs : pg_class, pg_namespace, pg_proc, etc.
  • Rewriting transactions (BEGIN, COMMIT) since Presto doesn’t support transactions

Prestogres design: pgpool-II + postgresql + PL/Python. Basic idea is rewrite queries at pgpool-II and run presto queries using PL/Python.

Uses a patched pgpool-II which creates & runs functions in the postgresql instance that will create system tables & records, and queries will be translated via PL/Python into Presto queries.

Heap Analytics uses Citus DB

Dan Robinson, Heap Inc.

Store every event, analyze retroactively. Challenges:

  • 95% of data is never used.
  • Funnels, retention, behavioral cohorts, grouping, filtering, etc. can’t pre-aggregate.
  • As real-time as possible, within minutes.

5000 customers. 60 TB on disk. 80 billion events. 2 billion users. 2.4 billion events last week. Can’t scale vertically. So Citus DB.


users – customer id bigint, user id bigint, data jsonb. events – customer id foreign key, user id foreign key, event jsonb.

Basic Query:

select count(*) from users where customer_id = 123 group by properties ->> 'ab_test_grp' 

Complex queries with joins, group by, etc. done real-time via Citus DB. Citus DB parallelizes the queries among the individual postgres (shard) instances and aggregates them on the master node.

Making use of postgresql partial indexes (indexes on WHERE queries) when customer creates the query, for performance. This works well because data is sparse.

Make use of user-defined functions (UDFs), e.g. to analyze whether a user matches a funnel.

Where does data live before it gets into the Citus DB cluster? -> Use Kafka as a short-term commit log.

Kafka consumers make use of Postgres UDFs to make writes commutative and idempotent. Makes use of user exists checks, upserts, updates, etc.

Sharding by user, not time range. All shards written to all the time. How do we move shards, split shards, rehydrate new replicas, etc.? Use Kafka commit number to replicate the data and replay data after that commit number.

Future Work:

  • Majority of queries touch only last 2 weeks of data – can we split out recent data onto nicer hardware?
  • Numerical analysis beyonds counts – min, max, averages, histograms
  • Richer analysis, more behavioral cohorting, data pivoting, etc.
  • Live updates

How real-time is it? Events are ingested within minutes.

MixRank on Terabyte Postgresql

Scott Milliken, founder of MixRank.

Low maintenance thanks to Postgresql, compared to war stories with newer big data solutions.

Vacuum can change query plans and cause regressions in production.

In low digit percentages of queries, cannot predict query planner, so try them all. Use CTEs (Common Table Expressions) to force different plans, race them, kill the losers. Ugly but surprisingly effective. Implemented generically using our own higher-level query planner. Why CTEs? Because they are an optimization boundary.

Use SQLAlchemy. We don’t use the ORM parts, we use it as a DSL on top of SQL. So dynamically introspect the queries and do permutations to generate the different plans. Don’t try to generate different query plans by hand, that will be hard to maintain. One way to do this is to query the pg_class table to figure out which indexes are present, and generate permutations to use different indexes.

Comment from audience: You can write your own C module and override postgresql to use your own query planner.

Batch update, insert, delete queries are a great substitute for Hadoop (for us). But correct results can lag and performance can suffer.

Schedule pg_repack to run periodically, not vacuum full.

You can scale a single postgres pretty far, more than you think. We have 1 (good dedicated hardware) box with 3.7 GB/s. Performance on a good dedicated hardware over others is 10-100 times, i.e. 1-2 orders of magnitude.

Using lz4 encoding for ZFS compression results in 43% lesser data size.

Amazon RDS for PostgreSQL : Lessons learned and deep dive on new features

Grant McAlister, Senior Principal Engineer, AWS RDS.

What’s new in storage:

  • From 3TB limit to 6TB
    • PIOPS limit is still 30K
  • Encryption at rest
    • Uses AWS Key Management Service (KMS), part of AWS IAM
    • Includes all data files, log files, log backups, and snapshots
    • Low performance overhead, 5-10% overhead on heavy writes
      • Will reduce over time because Intel CPUs are getting better on offloading encryption
      • Unencrypted snapshot sharing, even share to public

Major version upgrade to Postgresql 9.4, uses pg_upgrade. Recommendation: Test first with a copy instance. Will also help you figure out how much downtime to expect.

Use rds_superuser_reserved_connections to reserve connections for admin purposes.

Use pg_buffercache to determine working set memory size.

Use AWS Database Migration Service (DMS) to move data to same or different database engine. From customer premises to AWS RDS. 11 hours for a terabyte, depending on your network speed. At least version 9.4 for Postgresql because using logical decoding feature. In Preview release now.

Use AWS Schema Conversion Tool (SCT) to migrate stored procedures, etc.

Scale and Availability:

  • select sql query will check for buffer in shared_buffers, if not load from pagecache/disk, if not load, load from EBS.
    • shared buffers = working set size
  • Have replicas in different availability zones, i.e. multi-AZ
  • Use DNS CNAMEs for failover, takes 65 seconds
  • Read replicas = availability

Burst mode: GP2 & T2

  • Earn credits when performance below base
  • If < 10,000 transactions per second, using burst mode will cost much lesser than PIOPS

Cross-region replication is being planned. Currently, you can copy snapshots across regions.

Published by swaroop