Four years ago, I migrated this blog from WordPress to Jekyll, with the intention of using whatever format I want to use inside Emacs… Subsequently, my posting rate dropped drastically to just 13 posts in 4 years!

I don’t think that was a coincidence. Tools matter.

I believe the speed and ease of writing dropped drastically. Even simple steps like using photos in a post meant using a separate tool such as Finder.app (on macOS) or the command line to move the file to the right directory and then linking to it from the main post. In WordPress, that’s one drag-and-drop and done.

Similarly, having no comments was demotivating. While there tends to be more nitpicking these days, I would still like to benefit from the wisdom of the crowds.

So now I have migrated back to WordPress. Let’s see how this goes.

 

Background

A few months ago, Mayank convinced me to get some Ether (the Ethereum cryptocurrency) because it was going to go on a bull run, thanks to high-profile companies backing Ethereum by joining the Enterprise Ethereum Alliance (EEA). So I did. And that event did happen – the members included Microsoft, Intel, MasterCard, Cisco, JP Morgan and the State of Andhra Pradesh – and yes, Ethereum went through a bull run (to $336 per ETH, as of this writing).

That’s when I started going down the rabbit hole of the cryptocurrency space 😬

What is blockchain, cryptocurrency and Ethereum?

The way I understand it is that cryptocurrency is digital money. So why is it different from PayPal or Paytm? Because this is not a national currency like rupees or dollars, this is a currency “for the people, by the people, of the people”. No government has sanctioned it or vouches for it. Sounds nuts, right?

But that’s what is so exciting. Think of how people tinkering with technology can start a transformation, like Steve Wozniak designing the first Apple computers or Tim Berners-Lee creating the World Wide Web. People are now tinkering with creating a virtual currency that nobody can control, except by the participants agreeing to make changes, which makes it democratic and chaotic at the same time.

A good introduction to blockchain is this video by Gavin Wood, one of the cofounders of Ethereum.

For a visual introduction to the parts of a cryptocurrency, see this video by 3Blue1Brown:

To learn what Ethereum is, see “WTF is Ethereum?”

Blockchain @ Berkeley

So all this got me curious about things at an implementation level (yep, it’s an ongoing theme with me). So, again, via Twitter, when I read that Blockchain @ Berkeley was hosting an Ethereum dev bootcamp, I signed up!

Note that I could probably have learned the same material online, for example by going through Blockchain @ Berkeley’s DeCal videos. I just preferred a 2-day immersion, so I went to the in-person course.

The first day was an introduction, followed by tons of questions from the audience on everything from architecture to economics and incentives to security. Then we got an introduction to the Solidity language and used the Truffle framework to practice writing a simplistic e-commerce shop smart contract.

It is scary that Ethereum-based software, i.e. software that is also money and a financial system, is being built on JavaScript. No wonder our instructor called it “the worst language you’ll ever use”. If you thought the JavaScript ecosystem was wild, Ethereum gets even wilder. There is some hope in the form of best practices codified as a library and other such wonders of open source code communities.

Also, the Ethereum founders are shifting focus from the Solidity language, which is JavaScript-y, to a new language called Viper (which also runs on the Ethereum Virtual Machine), which is Python-y. Maybe there’s a moral in there somewhere…

The second day was an overview of oracles, web3.js, MetaMask, security (how not to ICO) and authentication. There was so much to absorb here.

Special thanks to the instructors Ali Mousa and Collin Chin for a useful course. In fact, they had just finished a smart contract project on an internal supply-chain system for Airbus, and had plenty of practical advice to offer.

Dangers

There are many dangers lurking, such as cryptos being disinflationary, so be careful with investing in ICOs.

Also, question the value of building something on the blockchain. Maybe the only advantage of something being a decentralized app is the lack of censorship.

What does it all mean

The idealist in me really wonders if all of this is really happening. People are actually working to decentralize the web and, on top of that, raising more money democratically via Initial Coin Offerings (ICOs) than via traditional venture capital. They are even creating new kinds of venture enablers. But I do wonder about actual user adoption. I guess this is a “build it and they will come” excitement.

There’s still a long way to go to make the development tools and the ecosystem better and safer though. Every podcast I’ve heard describes the current state as the “dialup days of decentralized web (web3)”.

Even then, all the nerds are excited. Why? We are used to accessing databases like Facebook or Google via the Internet, but this is the first time that we have a database built as a protocol on top of the Internet, and hence it is decentralized. This database can act as money and a financial system, which means money can be democratized, which has never happened before. There’s a reason why kings and governments are the only ones who can print money – because it means power.

Now take a decentralized database and decentralized money and put decentralized smart contracts on top of them (via Ethereum) and you can get two parties to do business with each other without the need for trusted third parties, like banks! Smart contracts will destroy the current idea of a legal system, the current idea of a law firm and of a lawyer. Take it one step further and you can run entire companies on Ethereum – everything from cap table, governance, fundraising, payroll and accounting to bylaws and running entire communities. Maybe someday we can replace “don’t be evil” with “can’t be evil”. Consider me mind-blown. The proof of the pudding is that right now you can work with a freelancer via an Ethereum-based platform.

In short, the blockchain will replace networks with markets and the arc of the internet is bending back towards decentralization.

If you don’t know what machine learning is, just know this from the book “Deep Learning with Python” by Francois Chollet (creator of Keras):

Classical programming vs. Machine learning

After attending the AI Frontiers conference at the beginning of this year, I was amazed, fascinated and befuddled about what machine learning, deep learning and all of the associated buzzwords actually are at an implementation level. I wanted to learn more about this. So, on a whim, I downloaded the TWiML podcast to listen to during my commute and happened to be listening to an interview with Siraj Raval. Next thing you know, I checked out Siraj’s YouTube channel and followed him on Twitter. On Twitter, he kept talking about big news coming up in a few days, and it turned out that he was co-creating Udacity’s Deep Learning Foundation course (a MOOC). I was excited by Siraj’s and Mat’s intro video, and I immediately signed up and waited in trepidation.

The good part about the course was that there is a weekly schedule of lessons and projects. As I keep saying to friends and colleagues, nothing in the modern world ever gets done without a deadline (don’t tell your boss that).

The bad part was that the course was literally being built while we were enrolled, so every week we would see a mad rush by the instructors to create the content for the upcoming week. That was okay by me: getting introduced to a topic that has only become feasible in recent years, in a way that is accessible to people who don’t have a Ph.D. in machine learning, was exciting and I was grateful.

In the first few weeks, we dove into Anaconda (I’ve been doing Python for 10 years and had never heard of it), Jupyter notebooks (again, I had never paid attention to or used them before), and started learning about perceptrons and neural networks. I was lost in the first few weeks. The course was advertised as 3 hours/week, which was clearly insufficient; I had to spend ~15 hours/week to catch up on the course and make sense of it all.

Like a tortoise, I slowly caught up, and by reading Andrew Trask’s brilliant introductory book, which was the course’s prescribed textbook, I started understanding a little. We started off mostly with supervised learning, where we provide the training data set and the expected output. The lessons got into higher gear with learning convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The way I understood it, CNNs are useful for working on the full input, such as individual images, because you’re extracting and condensing patterns with several layers and getting a condensed representation of the full input. RNNs are useful for sequences where there is a dependency, such as text, where a sentence can depend on a previous sentence.

Whenever motivation was low, Siraj’s videos kept the enthusiasm and fascination flowing!

The projects also kept me going throughout the course because that’s where the understanding is really put to the test. Since I was taking copious notes during the course, I was forced to pay attention to the details, and that helped a lot during the project.

The lessons combined with the great idea of using a forum and dedicated forum mentors who guide you on questions that you have, about both lessons and projects, was just a perfect learning environment. I can’t thank the forum mentors enough.

The last topic of the course was generative adversarial networks (GANs), a type of unsupervised learning and a relatively recent concept: the paper came out in 2014! It applies game theory to make two neural networks compete with each other. The generator creates new patterns and the discriminator (trained on real data) decides whether they are realistic enough or not, forcing the generator to create realistic data after sufficient training.

Unfortunately, life happened, and I was delayed by a month in working on the last project, so it took immense effort to get back into the groove. The project was to generate faces! Imagine that! That invigorated me, and I was so glad to finally see this screen:

Graduated!

There were plenty of other concepts we learned along the way, such as autoencoders and reinforcement learning; it would take an entire article to list all the concepts we encountered.

I’m thankful to Udacity for this course. I could see that not all students were satisfied with it, but it was oh so worth it for me. Getting introduced to data science, machine learning and deep learning in a few months has been a gruelling and happy experience.

I signed off in the course Slack community with this:

Slack sign-off

I’m not confident enough yet to create my own projects (e.g. data preparation) or compete in Kaggle, but I hope this is just the beginning. After all, it’s a brave new world of machine learning!

NOTE: This story is my personal perspective; it does not represent the views of my employer.

6 years ago, I worked with Thejo Kote on NextDrop.

5 years ago, I joined Thejo on his (then) next adventure, Automatic, which launched 4 years ago; that story is here. The premise sounded interesting – what can you do when you tap into the data generated by your car? The vision was “owning a car can be safer, cheaper, and smarter”.

Two years ago, we had a real API, an events platform and mobile apps that our customers are happy with. Customers especially use them with the IFTTT integration to do things like log their business trips to a spreadsheet for expense reporting, send SMS messages to friends or family, switch their thermostat at home on or off, and so on.

Last year, we launched our 3G version of the device. I personally built our core ingest servers that take in all the real-time data being uploaded from our connected devices plugged into cars, massage that data and send it down to all the internal microservices, and we’re talking lots of different types of data and interaction models. That core ingest server is now the foundation of all our products. It was a fun and challenging project.

Along with the tech sector funding slowdown, the past year was also a tough phase at Automatic, including layoffs and Thejo stepping down as CEO.

Automatic then bounced back with partnerships, such as the one with American Family Insurance to take usage-based insurance forward.

Today, the exciting news is that Sirius XM has acquired Automatic for over $100M to take the product forward in a far bigger way than was possible for a startup! And already our customers love it.

What makes Sirius XM interesting?

  1. Sirius XM is a public company (the stock ticker symbol is $SIRI).
  2. Did you know that 3 out of 4 new cars sold in the USA have Sirius XM satellite radio installed? So, while yes, Sirius XM is a “content and music” company, it is equal parts a “car chip and entertainment system” and “satellite technology” company.
  3. They have scale: 30+ million paying customers
  4. They have a consistently well-performing business – growing between 9-13% in each of the past five years and $1.5 billion in free cash flow last year.
  5. Warren Buffett has placed his faith in Sirius XM’s growth by buying 3.5% of Sirius XM shares a few months ago.
  6. A kick-ass founder – Martine Rothblatt
    • How can one single person be so brilliant that their career spans from law to entrepreneurship to satellite radios to mathematically proving electric-powered helicopters to producing movies to learning biochemistry to create a biotech firm to cure her child’s illness to a Ph.D. in medical ethics to creating lungs from pig genes to cybernetic companions!?
    • Check out her Wikipedia profile and her TED talk (via this tweetstorm)

It has been a privilege to work in the trenches with Thejo (the visionary, the deal maker), Dr. Jerry (putting the science in data), Ljuba (how to do UX right), Ram J (the original 10x engineer), and several other brilliant folks.

I’m glad the Automatic story continues, and strongly. To the future, the connected car!


An overview of these companies:

About Automatic. Correction: Automatic was founded in 2011.

About SiriusXM

Update on Nov 7, 2018: Back to OrgMode. Again.


This is a quick note on why I have started using the Bear notes app:

Screenshot of Bear notes app

What I want

  • Notes app needed.
  • Must support images and attachments.
  • Mobile-first. Absolutely need a notes app that syncs across computer and phone. That’s just how I function.
  • Ideally, there should be a backup option that keeps my notes unlocked if/when the app starts degrading a few years from now.
  • OrgMode is ideal, but images cause Emacs scrolling to be wonky on the desktop, and recreating agenda mode on mobile would be a challenge. But I’m glad to see apps like Orgzly take it on.
  • Since my blog and books are already in Markdown, it would make sense to just stick to Markdown for notes as well.

What I tried before

  • Evernote sync was so unreliable that I had stopped using it.
  • OneNote wouldn’t even let me create an account (it would complain about the password, regardless of how short or long, how simple or complicated a password I tried).
  • Quiver seemed promising, but the iOS app is still in beta and currently only provides a read-only view to the notes. And it does not support iCloud sync, only third-party sync mechanisms, which is strange for a Mac+iOS app combination.
  • Currently using Apple Notes. The downside is that exporting notes for publishing / sharing is a pain. For example, I can’t copy a note for sharing as text on messaging platforms like Slack, because it loses all the links and formatting.

Why Bear app

Why not Bear app

  • No web version, esp. to access from my Linux laptop. They are working on it.
  • Long-term availability? I’m glad they have a subscription model, so that they are encouraged to maintain the app, instead of creating an upgrade treadmill (I’m looking at you, Alfred app [1]). Worst case, they have a really good backup feature that also exports attachments.
    • Last time I checked, Evernote does a bad job at this. The “export” menu command only exports the text of the notes as an XML file. How can you not include attachments in the backup?
  • No Siri integration. Not sure if Apple has provided a Siri “intent” for note-taking though.

[1] Alfred now has a Mega Supporter License with lifetime free upgrades.

 

My soothsayer friend BG told me last year that “deep learning is the next big thing”. I didn’t know what that meant. A few days ago, I attended the AIFrontiers conference in Santa Clara, California. Now I have a glimpse of what he meant :-)

What is Intelligence?

In this context, I interpret “intelligence” as “smart”. Yes, we have smart phones, smart TVs, and smart speakers. But imagine far smarter software and devices… like self-driving cars!

Note that Artificial Intelligence is about understanding intelligence, while Machine Learning is a “brute force” data-driven approach to simulating intelligence. They are related but not the same thing. There are many areas that will lead to Artificial General Intelligence (AGI), which means “software that can do any task”, as opposed to Machine Learning, which creates software that can do specific tasks. This conference was about Machine Learning, and specifically Deep Learning.

To summarize the scope of the areas, Artificial Intelligence > Machine Learning > Deep Learning.

From Analog to Digital to Intelligence

The mantra at this conference was that we will move from a software stack to an intelligence stack to solve future engineering challenges.

This was best explained by the legendary Jeff Dean in his keynote speech, talking about how many products at Google use deep learning:

Deep Learning at Google

What is Machine Learning?

Machine learning is one technique to achieve intelligence.

What is machine learning? My understanding is: it is about making computer programs whose behavior is learned from data instead of solely based on lines of code written by humans. Think spam filters – whenever we click on “Spam” or “Not Spam” buttons, the spam filtering system learns from this and the behavior changes over time to reflect that, without somebody explicitly writing code for every single email. On top of this idea, design the system to learn by itself, and it can learn and improve orders of magnitude faster.
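To make the spam filter idea concrete, here is a toy sketch in Python. It is only an illustration of "behavior learned from data": the class name, the word-counting scheme and the sample messages are all made up for this example, and real spam filters are far more sophisticated.

```python
from collections import Counter


class ToySpamFilter:
    """Learns word counts from user feedback instead of hand-written rules."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}

    def learn(self, text, label):
        # Called whenever a user clicks "Spam" or "Not Spam" (label "ham").
        self.word_counts[label].update(text.lower().split())

    def is_spam(self, text):
        # Each known word votes for the label it was seen with more often.
        score = 0
        for word in text.lower().split():
            score += self.word_counts["spam"][word] - self.word_counts["ham"][word]
        return score > 0


spam_filter = ToySpamFilter()
spam_filter.learn("win free money now", "spam")
spam_filter.learn("free prize claim now", "spam")
spam_filter.learn("lunch meeting at noon", "ham")
spam_filter.learn("project meeting notes", "ham")

print(spam_filter.is_spam("claim your free money"))   # True
print(spam_filter.is_spam("notes from the meeting"))  # False
```

Notice that nobody wrote a rule saying "free money is spam"; the behavior came entirely from the labelled examples, and it keeps changing as more feedback arrives.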

What makes Machine Learning special? The system is now learning behaviors that are more accurate for the task and can handle more situations than the algorithms we humans could have imagined! Think converting sentences from one human language to another, self-driving cars, etc. Think of all the situations that such systems need to handle. We could not have written code to handle every situation.

Why now? Because machine learning requires:

  1. Lots of data – which we have now thanks to (a) so many people buying mobile phones, (b) mobile phone sensors and apps generating so much data.
  2. Lots of computers – which we have now thanks to cloud computing.
  3. Lots of parallel processing power (think matrix multiplications) – which we have now thanks to Graphics Processing Units (GPUs).

What is Deep Learning?

What is deep learning? It is a machine learning technique that is based on “layers of neurons”, i.e. think of the billions of neurons in your human brain that work together to understand, perceive, store knowledge… deep learning tries to simulate your brain. At least, that’s the way I understood it.
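As a rough sketch of what “layers of neurons” means in code, here is a tiny network in plain Python. The weights below are hand-picked, hypothetical numbers; in real deep learning they are learned from data, and frameworks like TensorFlow do this math on GPUs at a vastly larger scale.

```python
import math


def sigmoid(x):
    # Squashes any number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    # One output per neuron: weighted sum of the inputs, plus a bias, squashed.
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    # The output of each layer becomes the input of the next one.
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations


# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
network = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[0.7, -0.5, 0.2]], [0.05]),
]
output = forward([1.0, 2.0], network)
print(output)
```

“Deep” simply means stacking many such layers; training is the process of nudging the weights until the outputs match the expected answers.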

Jeff Dean explains deep learning

What do you want in a Machine Learning System?

Jeff Dean talked about their first internal machine learning system, the problems they faced, and what they ideally wanted:

What do you want in a Machine Learning System?
Computation Time and Research Productivity

And eventually they designed TensorFlow to achieve those desirable features.

He went on to mention the algorithms they use for different products, which I found interesting, not because I understood what they meant, but because they are pointers in case you want to learn more. After all, the whole point of attending conferences and meetups is to know what is happening out there.

Speech Recognition
Google Photos Search
Google Search
Language Translation

Some of these models can be found at https://github.com/tensorflow/models.

Jeff Dean also mentioned the kind of impact they have had on products, esp. converting April Fool’s Day jokes into reality:

Google Inbox Smart Reply
Algorithms behind Google Inbox Smart Reply

Jeff Dean expects more reuse of machine learning-developed models across different tasks, described as zero-shot learning:

Zero-shot learning

And more compute-based model generation:

More compute

Jeff Dean also gave a glimpse of what kind of queries they hope to achieve in the future:

Google Search queries of the future

Autonomous Driving

There was a lot of info throughout the day, so I’ll only post the topics / slides that I found interesting in the discussions:

Speakers were from Waymo (Google), Tesla Motors (not in official capacity), Baidu Autonomous Driving Unit.
Google / Waymo designing a car specifically for autonomous driving

Baidu also played videos of their self-driving cars in China, so this is not just a USA-only phenomenon. China, indeed, may have an edge in AI.

Big Data and Machine Learning in the car

This is a reason why I feel C++, the beast, is making a comeback: performance and efficient hardware usage are important again, because we now have to run a lot of processing on the Internet of Things, especially self-driving cars. And because it’s C++, correctness becomes a new risk. This might give a clue as to why Tesla Motors attracted Chris Lattner, the creator of the LLVM compiler. Speculation is that Tesla Motors wants to build an integrated autopilot system from chip to compiler.

Computer Chips specifically for autonomous driving

From Google creating custom chips called Tensor Processing Units (“TPUs”) for machine learning model generation in the cloud, to NVidia making chips for self-driving cars, to Intel releasing its Go platform containing 5G modems and chips for self-driving cars, efficient and performant chips for machine learning have become important. This explains why NVidia’s shares went up 225% in 2016.

The car is one node of the Internet of Things. It will connect and interact with the cloud.

This is very familiar to me because that is what we do at Automatic.

Speech-Enabled Assistants

Speakers were from Microsoft, Baidu, Amazon Alexa.

Microsoft:

Speech is not the same as text processing; there are more nuances.
Types of chatbots

Baidu:

Why deep learning
Handle issues such as background noise and multiple people speaking
Handle issues such as person speaking from other end of room
They converted existing voice recordings to far-field and used that to train models
How much compute power, you ask?
GPUs to the rescue
Deep Speech works for Mandarin
Deep Speech works for multiple languages
Why focus on speech? More inclusive and faster than typing.
Speech recognition can be more accurate than typing for non-technical people
Try the TalkType app for Android
Baidu’s Goal is AI for 100 million people

Amazon Alexa:

Speech recognition process
‘LSTM’ technique

See Wikipedia entry on Long short-term memory.

More techniques

Natural Language Processing

Speaker was from Google Brain

He talked about how deep learning has dramatically changed the field of NLP. Focused on “end-to-end” deep learning methods.

Computer Vision (Perception)

Speakers were from OpenCV, Bosch and Google

An example of using computer vision is from Jeff Dean’s keynote speech – https://www.google.com/get/sunroof – enter your address, it will tell you how much roof area you have and how much money you can save by switching to solar energy!

OpenCV is a popular open source computer vision library:

OpenCV 3
Deep Learning comes to OpenCV

Google:

Street View to Vision processing to Local Business discovery, cars, cameras, vision, and maps – all in one sentence
New machine learning techniques, better data and compute, you get the idea.
Future of Perception

Impact of AI on jobs

Speaker was from McKinsey
McKinsey study focus
Based on current AI/ML capabilities: Few jobs will be fully automatable. Most jobs will only be partially automatable. That’s a relief!

Internet of Things

Speakers were from Bosch, Nervana (Intel) and Vion

Vion Vision was the most interesting. They are deploying machine learning models to devices like cameras. They demonstrated their bus-counting cameras that help bus operators get real-time traffic so that they can deploy more buses on high-traffic routes, etc. They even had a demo of public-area cameras that auto-detect a crowd beating up a person and send an alert to the local police station.

Vion Vision cameras
Camera counting
Custom chip for deep learning

Deep Learning Frameworks

Speakers were from Google, Facebook and Amazon

This was an amazing session where creators or prominent members of each Deep Learning Framework came up and talked about their thoughts on the framework status and future.

Rethinking slow float-based computation
Math Challenges
Unframework?
MAPS
  • Scalability – How do I train on multiple GPUs and CPUs? OpenMPI, NCCL, ZeroMQ, etc.
  • Portability – Cloud, Mobile, IoT, cars, drones, coffee makers. Constraints – limited computation, battery life, models maybe luxurious, ecosystem less developed
  • Augmented Computation Patterns – more than float dense math – quantized computation, sparse math libs, model compression, rethinking existing ops (ResNEXT)
  • Augmented Math Challenges
  • Modularity – reusability
No silver bullet

Amazon MXNet:

Why another framework?
Core philosophy of MXNet
Current state of industry
Future direction
Torch next generation
Another vote for sharing components

Thank You AIFrontiers Organizers

It was an excellent conference, with well-chosen topics and the best speakers imaginable – the platform creators themselves. People who were expecting deep-dives or technical details were disappointed, but it was a great “state of the industry” conference for people like me who know nothing about the topic.

Thank you to the conference organizers, the Silicon Valley AI and Big Data Association and all the sponsors.

Ending Note

Geoffrey Moore (author of “Crossing The Chasm”) says:

In the coming decade all global enterprises, both private and public, will target the trapped value in their ineffective and inefficient outward-facing relationships with their targeted constituencies, be they consumers, clients, customers, patients, students, or citizens. Authentic sustainable engagement will become the new scarce ingredient. The as-a-service model will expand from commodity transactions to incorporate more significant life interests as well—education, health, personal development, family relationships, wealth management, safety and security, and the like. Machine learning and artificial intelligence will be the new keys to the kingdom, enabling institutions to operate at global scale with unprecedented speed, relevance, and accuracy. Operating models will prioritize customer relationship effectiveness over the supply chain efficiency, causing CRM to displace ERP as the most prominent information system, and the hot expertise will lie in user experience design, data analytics, machine learning, and artificial intelligence.

Thank you Mo Lun for creating a brand new Chinese translation of the latest version of the A Byte of Python book!

In Mo Lun’s words:

I am a common journalism student from CYU, Beijing. And actually, I am an absolute newbie in Python programming when I start to translate this book. Initially, it was just a whim, but when I done this work, I realized that a decision triggered by interest had prompted me to go so far. With the help of my predecessors’ translations and the vast amount of information provided by the developed Internet, and with the help of my friends, I prudently presented this translation edition. I just hope my translation work will help other newcomers in learning Python. At the same time, I am always waiting for my translation of the comments and suggestions, and ready to change or improve this superficial work.


Note that the full translations list is at https://python.swaroopch.com/translations.html and you can read how to create a new translation at https://python.swaroopch.com/translation_howto.html.

These are my quick jottings during the talks at PGConf SV today:

Citus DB (distributed PostgreSQL) will be open sourced

citusdb is going open source as a PostgreSQL extension #pgconfsv – Josh Berkus

First applause of day as @umurc announces CitusDB is going open source. #PGConfSV – merv

Everybody loves Kafka

Lots of Kafka love here at #pgconfsv Seems like Postgres + Kafka is a love match right now … – Josh Berkus

Hasura says JSON > SQL

An intriguing consulting company from India. Although I didn’t get a chance to talk to them, the gist is that they provide a MongoDB-like JSON querying interface on top of RDBMS databases.

Update: There’s also PostgREST which is an open source project in Haskell that is similar (via the awesome-postgres list).

TripAdvisor runs on PostgreSQL

Matthew Kelly of TripAdvisor.

4 datacenters. 100 dedicated Postgres servers. 768 GB RAM. Multi-terabyte databases. 315 million unique visitors per month.

Switching from DRBD to streaming replication.

Switching Collation: utf.en-us -> C, because glibc keeps changing character sorting, which affects indexes

Switching Hardware: RAM -> SSD

Cross datacenter replication is done by custom trigger-based replication.

Hopes to see BDR in core.

Active/Passive model of sites – two fully functional sites, keep flipping active role. Secondary site used for disaster recovery, load testing, hardware upgrades, etc.

Development environments – weekly dump restores of all schema and all non-PII (personally identifiable information) data into 3 mini sites – dev, prerelease and test lab. 36+ hour process that completes every weekend.

System Tuning:

  • Always separate your WAL, data and temp partitions onto different disks, even on SSDs.
  • Make sure your kernel thinks your SSD array isn’t a spinning disk array.

Cache Statements:

  • 60% CPU savings by properly caching prepared statements.

Cascading Failures:

  • Statement timeout is a must
  • Separating read and write threadpools

Standard Hardware:

  • From 256-768 GB RAM & 15K spinning drives to 256GB RAM & enterprise-grade SSDs
  • Next bottleneck
    • Kernel version – requires Puppet upgrade + moving to systemd
    • 1 Gbps networking isn’t enough

Prestogres – connecting the Presto query engine via the PostgreSQL protocol to visualization tools

Sadayuki Furuhashi of Treasure Data. Also created MessagePack and Fluentd.

Before: HDFS -> Hive daily/hourly batch -> PostgreSQL -> Dashboard / Interactive query

Now: HDFS -> Presto -> Dashboard

Presto is a distributed query engine from Facebook. It connects to Cassandra, Hive, JDBC, Postgres, Kafka, etc.

Why Presto? Because elastic. Adding a server improves performance instantly. Scale performance when needed. Separate computation engine from storage engine.

Why Presto over MapReduce? Because:

  • memory-to-memory data transfer
    • no disk IO
    • data chunk must fit in memory
  • all stages are pipelined
    • no wait time
    • no fault tolerance
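To get a feel for “all stages are pipelined”, here is a loose analogy using Python generators. This is not how Presto is implemented; the stage names and sample rows are invented for illustration. Each row flows from one stage to the next in memory, and nothing is written out between stages.

```python
def scan(rows):
    # Source stage: hands out rows one at a time.
    for row in rows:
        yield row


def filter_stage(rows, predicate):
    # Starts working as soon as the first row arrives from upstream.
    for row in rows:
        if predicate(row):
            yield row


def count_by(rows, key):
    # Final stage: aggregates rows as they stream in.
    counts = {}
    for row in rows:
        counts[row[key]] = counts.get(row[key], 0) + 1
    return counts


events = [
    {"country": "IN", "status": 200},
    {"country": "US", "status": 500},
    {"country": "IN", "status": 200},
    {"country": "US", "status": 200},
]

pipeline = filter_stage(scan(events), lambda row: row["status"] == 200)
result = count_by(pipeline, "country")
print(result)  # {'IN': 2, 'US': 1}
```

The trade-off from the list above shows up here too: there are no intermediate results saved anywhere, so if a stage fails midway, the whole query has to be run again.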

Writing connectors for data visualization & business intelligence tools to talk to Presto would be a lot of work, so why not create a PostgreSQL protocol adapter for Presto?

Other possible designs were:

  • MySQL protocol + libdrizzle : But Presto has syntax differences with MySQL
  • Postgresql + Foreign Data Wrapper : JOIN and aggregation pushdown is not available yet

Difficulties in implementing the Postgres protocol:

  • Emulating system catalogs : pg_class, pg_namespace, pg_proc, etc.
  • Rewriting transactions (BEGIN, COMMIT) since Presto doesn’t support transactions

Prestogres design: pgpool-II + postgresql + PL/Python. Basic idea is rewrite queries at pgpool-II and run presto queries using PL/Python.

Uses a patched pgpool-II which creates & runs functions in the postgresql instance that will create system tables & records, and queries will be translated via PL/Python into Presto queries.
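
The transaction-rewriting difficulty mentioned earlier can be sketched as follows. This is a hypothetical illustration in Python, not Prestogres's actual code: the `rewrite` function and its return values are made up. Transaction control statements are answered locally as no-ops, and everything else is forwarded.

```python
import re

# Statements that only make sense if the backend supports transactions.
_TXN = re.compile(
    r"^\s*(begin|start\s+transaction|commit|end|rollback)\b", re.IGNORECASE
)


def rewrite(sql):
    """Return ("noop", TAG) for transaction control, else ("forward", sql)."""
    match = _TXN.match(sql)
    if match:
        # Acknowledge the statement without sending it to the query engine.
        return ("noop", match.group(1).split()[0].upper())
    return ("forward", sql)
```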

Heap Analytics uses Citus DB

Dan Robinson, Heap Inc.

Store every event, analyze retroactively. Challenges:

  • 95% of data is never used.
  • Funnels, retention, behavioral cohorts, grouping, filtering, etc. can’t pre-aggregate.
  • As real-time as possible, within minutes.

5000 customers. 60 TB on disk. 80 billion events. 2 billion users. 2.4 billion events last week. Can’t scale vertically. So Citus DB.

Schema:

  • users – customer id bigint, user id bigint, data jsonb.
  • events – customer id foreign key, user id foreign key, event jsonb.

Basic Query:

select count(*) from users where customer_id = 123 group by properties ->> 'ab_test_grp' 

Complex queries with joins, group by, etc. done real-time via Citus DB. Citus DB parallelizes the queries among the individual postgres (shard) instances and aggregates them on the master node.
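
The parallelize-then-aggregate idea can be sketched in a few lines. This is only an illustration of the scatter-gather pattern, not Citus DB's implementation; the in-memory `SHARDS` and the function names are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each one holds a subset of the users.
SHARDS = [
    [{"customer_id": 123, "ab_test_grp": "A"}, {"customer_id": 123, "ab_test_grp": "B"}],
    [{"customer_id": 123, "ab_test_grp": "A"}, {"customer_id": 456, "ab_test_grp": "A"}],
]


def count_on_shard(shard, customer_id):
    """Shard side: partial GROUP BY count over one shard."""
    return Counter(
        user["ab_test_grp"] for user in shard if user["customer_id"] == customer_id
    )


def count_by_group(customer_id):
    """Master side: fan the query out, then merge the partial counts."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda shard: count_on_shard(shard, customer_id), SHARDS)
        for partial in partials:
            total.update(partial)
    return dict(total)
```

Counts merge cleanly because addition is associative; other aggregates, such as averages, need the shards to return sums and counts separately.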

Making use of postgresql partial indexes (indexes on WHERE queries) when customer creates the query, for performance. This works well because data is sparse.
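
A small runnable illustration of a partial index. It uses SQLite through Python's `sqlite3` module purely as a stand-in, since SQLite also supports `CREATE INDEX ... WHERE`. The table layout and index name are assumptions, not Heap's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id INTEGER, user_id INTEGER, type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(123, 1, "pageview"), (123, 1, "signup"), (123, 2, "pageview"), (456, 3, "signup")],
)

# Partial index: only rows matching the WHERE clause are indexed, so the
# index stays small when the matching rows are sparse.
conn.execute("CREATE INDEX signup_events ON events (customer_id) WHERE type = 'signup'")

# A query whose WHERE clause implies the index condition can use the index.
signups = conn.execute(
    "SELECT count(*) FROM events WHERE customer_id = 123 AND type = 'signup'"
).fetchone()[0]
```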

Make use of user-defined functions (UDFs), e.g. to analyze whether a user matches a funnel.

Where does data live before it gets into the Citus DB cluster? -> Use Kafka as a short-term commit log.

Kafka consumers make use of Postgres UDFs to make writes commutative and idempotent. Makes use of user exists checks, upserts, updates, etc.
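
What commutative and idempotent means here, as a rough sketch. Heap does this inside Postgres UDFs; the Python below only illustrates the property, and the event fields (`event_id`, `user_id`, `time`) are assumptions.

```python
def apply_event(users, event):
    """Apply one event so that replays and reordering give the same state."""
    # Upsert: create the user record if it does not exist yet.
    user = users.setdefault(event["user_id"], {"seen": set(), "last_seen": 0})
    if event["event_id"] in user["seen"]:
        return  # duplicate delivery: idempotent no-op
    user["seen"].add(event["event_id"])
    # max() is commutative, so the arrival order does not matter.
    user["last_seen"] = max(user["last_seen"], event["time"])
```

With writes like this, a consumer can safely re-read part of the log after a crash without corrupting the data.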

Sharding by user, not time range. All shards written to all the time. How do we move shards, split shards, rehydrate new replicas, etc.? Use Kafka commit number to replicate the data and replay data after that commit number.
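
The replay idea, sketched with an in-memory list standing in for the Kafka log. The tuple layout and the `replay` function are hypothetical; the point is that a copy taken at a known commit number can be brought up to date by applying only the later entries.

```python
# Each log entry is (offset, user, increment).
LOG = [(0, "u1", 1), (1, "u2", 1), (2, "u1", 1), (3, "u1", 1)]


def replay(state, log, after_offset):
    """Apply every log entry with an offset greater than `after_offset`."""
    for offset, user, increment in log:
        if offset > after_offset:
            state[user] = state.get(user, 0) + increment
    return state


# A copy of the shard taken when offset 1 had just been applied.
snapshot = {"u1": 1, "u2": 1}
snapshot_offset = 1

replica = replay(dict(snapshot), LOG, snapshot_offset)
```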

Future Work:

  • Majority of queries touch only last 2 weeks of data – can we split out recent data onto nicer hardware?
  • Numerical analysis beyond counts – min, max, averages, histograms
  • Richer analysis, more behavioral cohorting, data pivoting, etc.
  • Live updates

How real-time is it? Events are ingested within minutes.

MixRank on Terabyte Postgresql

Scott Milliken, founder of MixRank.

Low maintenance thanks to Postgresql, compared to war stories with newer big data solutions.

Vacuum can change query plans and cause regressions in production.

For a low single-digit percentage of queries, we cannot predict what the query planner will do, so we try all the plans. Use CTEs (Common Table Expressions) to force different plans, race them, kill the losers. Ugly but surprisingly effective. Implemented generically using our own higher-level query planner. Why CTEs? Because they are an optimization boundary.
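
A sketch of the race, assuming each variant runs on its own thread. The sleeps stand in for query execution, and the variant names and timings are invented. Against a real database, cancelling the losers would go through something like PostgreSQL's `pg_cancel_backend`; here a `threading.Event` plays that role.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_variant(name, seconds, cancel):
    """Stand-in for executing one query variant; returns None if cancelled."""
    if cancel.wait(timeout=seconds):  # woken early: this variant lost
        return None
    return name


def race(variants):
    """Run all variants at once, keep the first to finish, cancel the rest."""
    cancel = threading.Event()
    with ThreadPoolExecutor(max_workers=len(variants)) as pool:
        futures = [pool.submit(run_variant, n, s, cancel) for n, s in variants]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        cancel.set()  # "kill the losers"
        return next(iter(done)).result()


winner = race([("seq_scan", 0.5), ("index_a", 0.01), ("index_b", 0.3)])
```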

Use SQLAlchemy. We don’t use the ORM parts, we use it as a DSL on top of SQL. So dynamically introspect the queries and do permutations to generate the different plans. Don’t try to generate different query plans by hand, that will be hard to maintain. One way to do this is to query the pg_class table to figure out which indexes are present, and generate permutations to use different indexes.
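
A sketch of generating the variants mechanically. MixRank does this with SQLAlchemy; plain string building is used here to keep the example dependency-free, and the `plan_variants` function, the `ads` table and its columns are all hypothetical. Each variant isolates one predicate inside a CTE so that the index serving that predicate drives the plan.

```python
def plan_variants(table, predicates):
    """Yield one query per predicate, with that predicate isolated in a CTE."""
    for i, first in enumerate(predicates):
        rest = predicates[:i] + predicates[i + 1:]
        sql = f"WITH c AS (SELECT * FROM {table} WHERE {first}) SELECT * FROM c"
        if rest:
            # The remaining predicates are applied to the CTE's output.
            sql += " WHERE " + " AND ".join(rest)
        yield sql


variants = list(plan_variants("ads", ["advertiser_id = 7", "seen_on >= '2015-01-01'"]))
```

Note that since PostgreSQL 12, a CTE is an optimization boundary only when written as `AS MATERIALIZED`; before that, every CTE behaved this way.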

Comment from audience: You can write your own C module and override postgresql to use your own query planner.

Batch update, insert, delete queries are a great substitute for Hadoop (for us). But correct results can lag and performance can suffer.

Schedule pg_repack to run periodically, not vacuum full.

You can scale a single postgres pretty far, more than you think. We have 1 box (good dedicated hardware) with 3.7 GB/s. Good dedicated hardware performs 10-100 times better than the alternatives, i.e. 1-2 orders of magnitude.

Using lz4 encoding for ZFS compression results in 43% smaller data size.

Amazon RDS for PostgreSQL : Lessons learned and deep dive on new features

Grant McAlister, Senior Principal Engineer, AWS RDS.

What’s new in storage:

  • From 3TB limit to 6TB
    • PIOPS limit is still 30K
  • Encryption at rest
    • Uses AWS Key Management Service (KMS), part of AWS IAM
    • Includes all data files, log files, log backups, and snapshots
    • Low performance overhead, 5-10% overhead on heavy writes
      • Will reduce over time because Intel CPUs are getting better on offloading encryption
      • Unencrypted snapshot sharing, even share to public

Major version upgrade to Postgresql 9.4, uses pg_upgrade. Recommendation: Test first with a copy instance. Will also help you figure out how much downtime to expect.

Use rds_superuser_reserved_connections to reserve connections for admin purposes.

Use pg_buffercache to determine working set memory size.

Use AWS Database Migration Service (DMS) to move data to the same or a different database engine, from customer premises to AWS RDS. 11 hours for a terabyte, depending on your network speed. Requires at least Postgresql 9.4 because it uses the logical decoding feature. In Preview release now.

Use AWS Schema Conversion Tool (SCT) to migrate stored procedures, etc.

Scale and Availability:

  • A select query checks for the page in shared_buffers; if not found, it loads from pagecache/disk; if not found there, it loads from EBS.
    • shared buffers = working set size
  • Have replicas in different availability zones, i.e. multi-AZ
  • Use DNS CNAMEs for failover, takes 65 seconds
  • Read replicas = availability
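
The read path in the first bullet above can be sketched as a tiered lookup. The dictionaries stand in for the three tiers and the `read_page` function is hypothetical; a real buffer manager also evicts pages, which is left out here.

```python
def read_page(page_id, shared_buffers, page_cache, ebs):
    """Return (page, tier), looking in the fastest tier first."""
    if page_id in shared_buffers:
        return shared_buffers[page_id], "shared_buffers"
    if page_id in page_cache:
        page = page_cache[page_id]
        tier = "page cache"
    else:
        page = ebs[page_id]  # slowest tier: network-attached storage
        page_cache[page_id] = page
        tier = "EBS"
    shared_buffers[page_id] = page  # promote so the next read is a hit
    return page, tier
```

This is why the working set should fit in shared buffers: every miss falls through to a slower tier.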

Burst mode: GP2 & T2

  • Earn credits when performance below base
  • If < 10,000 transactions per second, using burst mode will cost much less than PIOPS
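
How earning and spending credits works, as a simplified per-second model. The `simulate` function and all the numbers in the example are made up for illustration and are not AWS's actual limits or accounting.

```python
def simulate(demand, base, burst, credits, max_credits):
    """Return the IOPS actually served each second under a credit bucket."""
    served = []
    for want in demand:
        # You can go above base only as far as the credits allow.
        allowed = min(burst, base + credits)
        got = min(want, allowed)
        # Below base: earn the difference. Above base: spend the difference.
        credits = min(max_credits, credits + base - got)
        served.append(got)
    return served


# Two quiet seconds earn credits that are spent in the third second.
example = simulate([50, 50, 300, 300, 300], base=100, burst=300, credits=0, max_credits=1000)
```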

Cross-region replication is being planned. Currently, you can copy snapshots across regions.

In the past couple of months, I’ve been using and, a first for me, regularly contributing to this open source project called Spacemacs.

Spacemacs is a new distribution of Emacs. Think what Ubuntu did for GNU/Linux – Spacemacs is doing the same for GNU Emacs. It combines all the existing great pieces and provides an easy-to-use, good-looking package.

Spacemacs

I used to use my own emacs configuration and then switched to Prelude for its neat Clojure integration because Bozhidar Batsov wrote CIDER (the Clojure-Emacs package) as well. This was mostly helpful when I was working at Helpshift.

It all started when I was watching Sacha Chua’s Emacs Hangout video and Howard Abrams mentioned Spacemacs, and for a change, I immediately jumped into using it.

What attracted me to Spacemacs was that it was initially based on evil-mode, a full vi emulation layer inside Emacs. This was great because I was indeed having an Emacs pinky problem. And then the sane key binding hierarchy combined with guide-key for visually seeing that hierarchy was icing on the cake.

I first tweeted questioning whether it’ll be difficult to integrate the rest of the Emacs ecosystem, then took it as a challenge and added a layer for ERC (IRC package in Emacs). I was impressed with the layer system of Spacemacs and I was hooked. Then, I added org-pomodoro, org-present, etc.

I talked about Spacemacs at Emacs SF meetup, and I sometimes help and ask questions myself in the Gitter chatroom via the IRC bridge.

Minor annoyances led me to submit pull requests upstream – git-link, evil-org-mode, projectile, etc.

Elisp hacking has been fun.

Things that I’d like to see improved in Spacemacs:

  1. The default configuration file should be empty. Spacemacs installs a ~/.spacemacs file which is already full of stuff and this confuses new users. New users expect to just copy/paste snippets and have them work; they will not take the time to read a large config file. For example, there is a layers config variable where users are supposed to add names of the layers they want to use; users should instead be able to copy/paste (enable-layer 'org) and have it work equivalently.
  2. The holy-mode should enable normal Emacs usage. For example, I cannot use meta-shift-right in OrgMode (used for indenting a heading and its content) in holy-mode, whereas there is a key > in evil-mode. Spacemacs needs to make up its mind on whether it’ll fully support vanilla Emacs key bindings in holy-mode, and I hope it does.
  3. The develop branch moves fast and many early adopters are using that whereas newbies are using the master branch, and there is often confusion in the chat room when someone asks for help. I wish there was a command in spacemacs that will generate the useful information as text (which operating system, which emacs version, which layers are enabled, holy-mode or evil-mode, etc.) which can be pasted into the chat room and will assist others to offer advice much faster. Update: I contributed a change to make this happen, and I’m happy to see its adoption, both in the chat as well as the default issue template.

It’s funny how I was using XEmacs a decade ago, then dived fully into Vim (even wrote a book on it) and now I’m back in Emacs land.

On the same note, I am fascinated with newer editors like GitHub’s Atom, which is gaining traction and has a good package management system, a UI built with HTML/CSS that makes for easy extensibility (a hallmark of a great editor), and fascinating new possibilities such as integrating IPython/Jupyter in a Light-Table inspired way. My curiosity about Atom was first piqued because Electron, the core of the Atom editor, is being used by the Slack desktop apps, Microsoft Visual Studio Code for Mac and Linux, etc.

I don’t know if / when / how I’ll make a switch to Atom, but until then I’m happy with Spacemacs.

Comments

@phoe6 says:

I am going to try this out. did not not about it’s existence. I have aliased vim to emacs, just to break my nature.

@phoe6 says:

I use IntellIiJ for packaged things, but vim, subl and emacs for one off. But like to use the powerful features of emacs more.

@spradnyesh says:

impressed (especially because of evil-mode). going to try it! @swaroopch thanks for sharing, and contributing!

@frankiesardo says:

Great post about spacemacs. Love the editor but I’m struggling with evil-lisp-state. How do you find CLJ develpment on it?

@k4rtik says:

I should try Spacemacs sometime. @swaroopch writes about it … Didn’t know about evil mode before, cc: @shrayasr