These are my quick jottings during the talks at PGConf SV today:
Citus DB (distributed postgresql) will be open sourced
citusdb is going open source as a PostgreSQL extension #pgconfsv – Josh Berkus
First applause of day as @umurc announces CitusDB is going open source. #PGConfSV – merv
Everybody loves Kafka
Lots of Kafka love here at #pgconfsv Seems like Postgres + Kafka is a love match right now … – Josh Berkus
Hasura says JSON > SQL
Intriguing consulting company from India, although I didn’t get a chance to talk to them, the gist is that they provide a MongoDB-like JSON querying interface on top of RDBMS databases.
TripAdvisor runs on Postgresql
Matthew Kelly of TripAdvisor.
4 datacenters. 100 dedicated Postgres servers. 768 GB RAM. Multi-terabyte databases. 315 million unique visitors per month.
Switching from DRBD to streaming replication.
Switching Collation: utf.en-us -> C because glibc keeps changing character sorting and affects indexes
Switching Hardware: RAM -> SSD
Cross datacenter replication is done by custom trigger-based replication.
Hopes to see BDR in core.
Active/Passive model of sites – two fully functional sites, keep flipping active role. Secondary site used for disaster recovery, load testing, hardware upgrades, etc.
Development environments – weekly dump restores of all schema and all non-PII (?) data into 3 mini sites – dev, prerelease and test lab. 36+ hour process that completes every weekend.
- Always separate your WAL, data and temp partitions onto different disks, even on SSDs.
- Make sure your kernel thinks your SSD array isn’t a spinning disk array.
- 60% CPU savings by properly caching prepared statements.
- Statement timeout is a must
- Separating read and write threadpools
- From 256-768 GB RAM & 15K spinning drives to 256GB RAM & enterprise-grade SSDs
- Next bottleneck
- Kernel version – requires Puppet upgrade + moving to systemd
- 1 Gbps networking isn’t enough
Prestogres – connecting presto query engine via postgresql protocol to visualization tools
Sadayuki Furuhashi of Treasure Data. Also created MessagePack and Fluentd.
Before: HDFC -> Hive daily/hourly batch -> Postgresql -> Dashboard / Interactive query Now: HDFC -> Presto -> Dashboard
Presto distributed query engine from Facebook. Connects to Cassandra, Hive, JDBC, Postgres, Kafka, etc.
Why Presto? Because elastic. Adding a server improves performance instantly. Scale performance when we need. Separate computation engine from storage engine.
Why Presto over MapReduce? Because:
- memory-to-memory data transfer
- no disk IO
- data chunk must fit in memory
- all stages are pipelined
- no wait time
- no fault tolerance
Writing connectors for data visualization & business intelligence tools to talk to Presto would be a lot of work, so why not create a Postgresql protocol adapter for Presto.
Other possible designs were:
- MySQL protocol + libdrizzle : But Presto has syntax differences with MySQL
- Postgresql + Foreign Data Wrapper : JOIN and aggregation pushdown is not available yet
Difficulties to implement Postgres protocol:
- Emulating system catalogs :
- Rewriting transactions (BEGIN, COMMIT) since Presto doesn’t support transactions
Prestogres design: pgpool-II + postgresql + PL/Python. Basic idea is rewrite queries at pgpool-II and run presto queries using PL/Python.
Uses a patched pgpool-II which creates & runs functions in the postgresql instance that will create system tables & records, and queries will be translated via PL/Python into Presto queries.
Heap Analytics uses Citus DB
Dan Robinson, Heap Inc.
Store every event, analyze retroactively. Challenges:
- 95% of data is never used.
- Funnels, retention, behavioral cohorts, grouping, filtering, etc. can’t pre-aggregate.
- As real-time as possible, within minutes.
5000 customers. 60 TB on disk. 80 billion events. 2 billion users. 2.4 billion events last week. Can’t scale vertically. So Citus DB.
users – customer id bigint, user id bigint, data jsonb. events – customer id foreign key, user id foreign key, event jsonb.
select count(*) from users where customer_id = 123 group by properties ->> 'ab_test_grp'
Complex queries with joins, group by, etc. done real-time via Citus DB. Citus DB parallelizes the queries among the individual postgres (shard) instances and aggregates them on the master node.
Making use of postgresql partial indexes (indexes on WHERE queries) when customer creates the query, for performance. This works well because data is sparse.
Make use of user-defined functions (UDFs), e.g. to analyze whether a user matches a funnel.
Where does data live before it gets into the Citus DB cluster? -> Use Kafka as a short-term commit log.
Kafka consumers make use of Postgres UDFs to make writes commutative and idempotent. Makes use of user exists checks, upserts, updates, etc.
Sharding by user, not time range. All shards written to all the time. How do we move shards, split shards, rehydrate new replicas, etc.? Use Kafka commit number to replicate the data and replay data after that commit number.
- Majority of queries touch only last 2 weeks of data – can we split out recent data onto nicer hardware?
- Numerical analysis beyonds counts – min, max, averages, histograms
- Richer analysis, more behavioral cohorting, data pivoting, etc.
- Live updates
How real-time is it? Events are ingested within minutes.
MixRank on Terabyte Postgresql
Scott Milliken, founder of MixRank.
Low maintenance thanks to Postgresql, compared to war stories with newer big data solutions.
Vacuum can change query plans and cause regressions in production.
In low digit percentages of queries, cannot predict query planner, so try them all. Use CTEs (Common Table Expressions) to force different plans, race them, kill the losers. Ugly but surprisingly effective. Implemented generically using our own higher-level query planner. Why CTEs? Because they are an optimization boundary.
Use SQLAlchemy. We don’t use the ORM parts, we use it as a DSL on top of SQL. So dynamically introspect the queries and do permutations to generate the different plans. Don’t try to generate different query plans by hand, that will be hard to maintain. One way to do this is to query the
pg_class table to figure out which indexes are present, and generate permutations to use different indexes.
Comment from audience: You can write your own C module and override postgresql to use your own query planner.
Batch update, insert, delete queries are a great substitute for Hadoop (for us). But correct results can lag and performance can suffer.
pg_repack to run periodically, not
You can scale a single postgres pretty far, more than you think. We have 1 (good dedicated hardware) box with 3.7 GB/s. Performance on a good dedicated hardware over others is 10-100 times, i.e. 1-2 orders of magnitude.
Using lz4 encoding for ZFS compression results in 43% lesser data size.
Amazon RDS for PostgreSQL : Lessons learned and deep dive on new features
Grant McAlister, Senior Principal Engineer, AWS RDS.
What’s new in storage:
- From 3TB limit to 6TB
- PIOPS limit is still 30K
- Encryption at rest
- Uses AWS Key Management Service (KMS), part of AWS IAM
- Includes all data files, log files, log backups, and snapshots
- Low performance overhead, 5-10% overhead on heavy writes
- Will reduce over time because Intel CPUs are getting better on offloading encryption
- Unencrypted snapshot sharing, even share to public
Major version upgrade to Postgresql 9.4, uses
pg_upgrade. Recommendation: Test first with a copy instance. Will also help you figure out how much downtime to expect.
rds_superuser_reserved_connections to reserve connections for admin purposes.
pg_buffercache to determine working set memory size.
Use AWS Database Migration Service (DMS) to move data to same or different database engine. From customer premises to AWS RDS. 11 hours for a terabyte, depending on your network speed. At least version 9.4 for Postgresql because using logical decoding feature. In Preview release now.
Use AWS Schema Conversion Tool (SCT) to migrate stored procedures, etc.
Scale and Availability:
- select sql query will check for buffer in
shared_buffers, if not load from pagecache/disk, if not load, load from EBS.
- shared buffers = working set size
- Have replicas in different availability zones, i.e. multi-AZ
- Use DNS CNAMEs for failover, takes 65 seconds
- Read replicas = availability
Burst mode: GP2 & T2
- Earn credits when performance below base
- If < 10,000 transactions per second, using burst mode will cost much lesser than PIOPS
Cross-region replication is being planned. Currently, you can copy snapshots across regions.