Storm & RedStorm

Distributed Realtime Computation in Ruby

Jason Morrison

December 11, 2012


Presenter Notes

Data processing systems

Presenter Notes

  • design of data processing systems
  • interesting case: one processing machine is not sufficient
  • use cases like web/engagement analytics, app logging and performance analytics logging, financial/ecommerce information, physical sensor data
  • could be for many reasons: high ingest volume, expensive computation, or tight latency requirements
  • even if your processing system isn't large, Storm does a very good job of documenting and exposing the clean primitives and abstractions at play in these systems, which are valuable to understand

DFS + map-reduce

  • Queries are functions over all data
  • Queries typically executed in batch mode, offline
  • Results are high-latency

Presenter Notes

  • One popular approach for large-scale data processing systems is to dump the incoming data into a distributed filesystem like HDFS, run offline map-reduce queries over it, and place the results into a data store so your apps can read from it.

  • If you need to store all of your data and you need to execute queries which span large time-frames, or you don't know the queries up front, then batch mode is a great fit.
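As an illustration of this batch model, here is a word count expressed as map and reduce phases in plain Ruby (a sketch of the idea only; a real deployment would run over HDFS via Hadoop rather than an in-memory array):

```ruby
# A batch query: a function over *all* the data, run offline.
docs = ["the quick brown fox", "the lazy dog"]

# Map phase: emit a (word, 1) pair for every word in every document.
pairs = docs.flat_map { |doc| doc.split.map { |word| [word, 1] } }

# Reduce phase: group pairs by word and sum the counts.
counts = pairs.group_by(&:first)
              .transform_values { |ps| ps.sum(&:last) }

counts["the"]  # => 2
```

The whole dataset is visited on every run, which is why results are high-latency but ad hoc queries come for free.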

However, there are plenty of use cases where these parameters don't quite fit.

Design considerations

  • Value of low latency
  • Data retention needs
    • Ad hoc vs. known queries
    • Timeframe of queries

Presenter Notes

Consider, when:

  • low latency is valuable
  • you know queries ahead of time
  • query domain covers small time windows

then a stream processing model can get you your answers in a much faster and cheaper way.

  • Instead of storing data and then executing batch queries over it, what if we persist the query and run the data through it?
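That inversion can be sketched in plain Ruby: instead of a function run over stored data, the query becomes a long-lived object that the data flows through (illustrative only; names here are made up for the sketch):

```ruby
# A "persisted query": the state lives in the query object,
# and the data streams past it.
class RunningCount
  def initialize
    @counts = Hash.new(0)
  end

  # Each incoming event updates the answer immediately: low latency,
  # but the query must be known up front.
  def process(word)
    @counts[word] += 1
  end

  def count(word)
    @counts[word]
  end
end

query = RunningCount.new
%w[storm ruby storm].each { |w| query.process(w) }
query.count("storm")  # => 2
```

The trade-off from the previous slide is visible here: the answer is always current, but only for the queries you decided to persist.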

Queues & workers


Presenter Notes

  • First, I want to examine a typical approach to assembling a realtime system:

Hands up, ever written a system made of queues+workers to process incoming data?

  • architecture here is
    • an incoming stream of data, from a web frontend or work queue or subscription to streaming API
    • cluster of queues persisting that data as jobs
    • cluster of workers taking jobs off the queue, working them, maybe persisting out to some store, and then emitting their processing results forward into the next queue layer
    • then more workers, maybe more persistence

Then you'll know there are some challenges...

  • data processing guarantees & fault tolerance
  • queues impose latency between layers
    • when one worker sends to another, the queue imposes a third party between the worker layers, where the message has to go to disk
  • without another layer of abstraction, the coding is tedious - you spend your time thinking about where to write messages to, where to read messages from, how to serialize messages, etc.
  • this is the kind of system that was in place in the product (Backtype) that became Twitter analytics
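The queues-and-workers pipeline described above can be sketched in miniature with Ruby's built-in Thread and Queue (a toy model; a real system would put something like Redis or RabbitMQ between separate worker processes):

```ruby
ingest  = Queue.new   # incoming stream of data, persisted as jobs
results = Queue.new   # the next queue layer

# A worker: take jobs off the queue, work them, emit results forward.
worker = Thread.new do
  while (job = ingest.pop)
    break if job == :done
    results << job.upcase   # stand-in for the real computation
  end
end

%w[a b c].each { |j| ingest << j }
ingest << :done
worker.join

out = 3.times.map { results.pop }  # => ["A", "B", "C"]
```

Even this toy version shows the tedium: the worker code is mostly plumbing (popping, pushing, shutdown signaling) rather than computation.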

Twitter analytics


Presenter Notes


storm on github

Presenter Notes

  • released September 2011 at Strange Loop
  • 4,000 stars, >500 forks. Most starred Java project on GitHub.

  • Used by: Groupon, The Weather Channel, Alibaba, Taobao (an eBay-alike, Alexa #11)

Design goals. Storm should:

  • Guarantee data processing
  • Provide for horizontal scalability
  • Be fault tolerant
  • Be easy to use

Presenter Notes

  • Guaranteed data processing: Choose whether, in the face of failure, data can be lost, must be processed at least once, or must be processed exactly once
  • Horizontal scalability: Distribute your processing across a cluster, tweak parallelization of computations and the allocation of cluster resources to match your workload
  • Fault tolerance: If errors occur or nodes fail, the system should handle it gracefully and keep running
  • Easy to use: As the application developer, focus on writing your computation, not infrastructure like message serialization, routing, and retrying.
  • So, how is it put together?
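The first of those goals, at-least-once processing, can be pictured as: hold on to each tuple until it is acknowledged, and replay it on failure. Here is a toy model of that idea in plain Ruby (this is not Storm's actual acker algorithm, and the names and tuple shape are invented for the sketch):

```ruby
# Toy at-least-once delivery: retry each tuple until it succeeds
# (is "acked") or the retry budget is exhausted.
def process_at_least_once(tuples, max_retries: 3)
  processed = []
  tuples.each do |t|
    attempts = 0
    begin
      attempts += 1
      # Simulate a transient failure for the first `fail_times` attempts.
      raise "transient failure" if t[:fail_times] >= attempts
      processed << t[:value]   # success: the tuple is acked
    rescue
      retry if attempts < max_retries
      raise
    end
  end
  processed
end

# The second tuple fails once and is replayed; it is delivered
# at least once, possibly via duplicate attempts.
tuples = [{ value: "a", fail_times: 0 }, { value: "b", fail_times: 1 }]
process_at_least_once(tuples)  # => ["a", "b"]
```

This is why the guarantee is "at least once" rather than "exactly once": a replayed tuple may have been partially processed before the failure.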

Storm primitives

  • Streams
  • Components
    • Spouts
    • Bolts
  • Topologies

Presenter Notes



Streams

Presenter Notes

  • core abstraction of Storm
  • unbounded sequences of tuples
  • tuples are named lists of values
  • dynamically typed; use any type by providing a serializer
  • streams are how the parts of your computation talk to one another; they are the message passing
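In plain Ruby terms, a tuple can be pictured as a named list of values, and a stream as an unbounded sequence of them (illustrative only; Storm tuples are JVM objects, and a lazy enumerator stands in for "unbounded" here):

```ruby
# A tuple: a named list of values. The fields are declared once per
# stream; the values themselves are dynamically typed.
Tweet = Struct.new(:user, :text)

# A stream: a sequence of such tuples. Enumerator::Lazy lets this
# model a source that never ends.
stream = [["alice", "hello"], ["bob", "storm!"]]
           .lazy
           .map { |user, text| Tweet.new(user, text) }

first = stream.first
first.user  # => "alice"
```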



Spouts

Presenter Notes

  • sources of streams
  • typically:
    • read from a pub/sub, Kestrel, or Kafka queue
    • connect to a streaming API
  • emits any number of output streams
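A spout, reduced to its essentials in plain Ruby (a sketch of the concept, not RedStorm's actual DSL; the class and the in-memory queue are invented for illustration, though Storm really does poll spouts for their next tuple):

```ruby
# A spout: a source of streams. Here an in-memory array stands in for
# a Kestrel/Kafka queue; each call emits one tuple, or nil if nothing
# is ready.
class QueueSpout
  def initialize(queue)
    @queue = queue
  end

  # Called repeatedly by the framework to pull tuples into the topology.
  def next_tuple
    @queue.shift
  end
end

spout = QueueSpout.new([{ word: "hello" }, { word: "world" }])
spout.next_tuple[:word]  # => "hello"
```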



Bolts

Presenter Notes

  • process any number of input streams
  • produce any number of output streams
  • where computation happens
    • functions, filters, aggregations, streaming joins, talk to DBs...
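And a bolt in the same spirit: consume input tuples, do the computation, emit output tuples (again a plain-Ruby sketch rather than RedStorm's actual API; the class name and tuple shape are invented):

```ruby
# A bolt: where computation happens. This one is a streaming
# aggregation: it counts words, emitting a running (word, count)
# tuple for each input tuple.
class WordCountBolt
  def initialize
    @counts = Hash.new(0)
  end

  # Called once per incoming tuple; the return value stands in for
  # emitting a tuple downstream.
  def execute(tuple)
    word = tuple[:word]
    @counts[word] += 1
    { word: word, count: @counts[word] }
  end
end

bolt = WordCountBolt.new
bolt.execute(word: "hi")[:count]  # => 1
bolt.execute(word: "hi")[:count]  # => 2
```

Wiring spouts to bolts like this one into a graph is exactly what the next primitive, the topology, describes.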