One popular approach for large-scale data processing systems is to dump the incoming data into a distributed filesystem like HDFS, run offline map-reduce queries over it, and place the results into a data store so your apps can read from it.
If you need to store all of your data and run queries that span large time-frames, or you don't know the queries up front, then batch mode is a great fit.
However, there are plenty of use cases where those conditions don't hold, and in those cases a stream processing model can get you your answers in a much faster and cheaper way.
Hands up: ever written a system made of queues + workers to process incoming data?
Then you'll know there are some challenges...
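As an illustrative sketch (the names and code here are hypothetical, not from the talk), a hand-rolled queues + workers pipeline in plain Java might look like the following; partitioning, retries, rebalancing, and fault tolerance are all left for you to build by hand, which is exactly the pain being described:

// Illustrative only: a hand-rolled queues + workers pipeline in plain Java.
// All of the operational concerns (partitioning, retries, failover) are missing.
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueWorkerPipeline {
    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> incoming = new ArrayBlockingQueue<>(10_000);

        // Worker threads pull messages off the queue and process them.
        // Scaling out, routing related messages to the same worker, and
        // re-processing after a crash are all ad-hoc, manual decisions.
        for (int i = 0; i < 4; i++) {
            Thread worker = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String message = incoming.take();
                        process(message); // no ack/retry: a failure here loses the message
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }

        // Producer side: dump incoming data onto the queue.
        for (int i = 0; i < 100; i++) {
            incoming.put("event-" + i);
        }
        Thread.sleep(1000); // crude: give the daemon workers time before the JVM exits
    }

    private static void process(String message) {
        System.out.println("processed " + message);
    }
}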
4,000 stars, >500 forks: the most starred Java project on GitHub.
A stream grouping decides how to partition the stream of messages among the tasks of the consuming bolt.
let's take a look at some code...
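As a minimal sketch against Storm's 0.8-era Java API (the backtype.storm packages, matching the 0.8.1 release used in the deployment commands below), here is a word-count topology; the WordCount bolt is illustrative, and TestWordSpout is the test spout bundled with Storm:

// A minimal word-count topology sketch against Storm's 0.8-era Java API.
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.HashMap;
import java.util.Map;

public class WordCountTopology {

    // Counts occurrences of each word; BaseBasicBolt acks input tuples automatically.
    public static class WordCount extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getString(0);
            Integer count = counts.get(word);
            count = (count == null) ? 1 : count + 1;
            counts.put(word, count);
            collector.emit(new Values(word, count));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // A test spout that emits a stream of random words.
        builder.setSpout("words", new TestWordSpout(), 2);

        // fieldsGrouping sends the same word to the same counter task every time,
        // which is how the grouping "decides how to partition the stream".
        builder.setBolt("count", new WordCount(), 4)
               .fieldsGrouping("words", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(true);

        // Run in-process; on a real cluster you would use StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word_count", conf, builder.createTopology());
        Thread.sleep(10000);
        cluster.shutdown();
    }
}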
Tuple tree: every tuple a bolt emits while processing a spout tuple is anchored into a tree rooted at that spout tuple, and the message only counts as fully processed once the whole tree has been acked.
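As a hedged sketch (same 0.8-era Java API assumption as above), here is how a bolt participates in the tuple tree: emitting with the input tuple as the anchor links the new tuples into the tree, and acking the input marks this node as done. BaseBasicBolt, used in the earlier sketch, does this bookkeeping for you.

// Sketch of explicit anchoring and acking with BaseRichBolt.
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Map;

public class SplitSentenceBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getString(0).split(" ")) {
            // Anchoring on `input` makes each emitted tuple a child in the tuple tree,
            // so a failure downstream will cause the original spout tuple to be replayed.
            collector.emit(input, new Values(word));
        }
        // Ack the input once all of its children have been emitted.
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}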
Supervisor nodes: each worker machine runs a supervisor daemon that starts and stops worker processes as directed by Nimbus, with coordination handled through ZooKeeper.
visibility: storm-ui gives detailed statistics about each topology
deployment: storm-deploy automates deployment and provisioning on EC2
# /path/to/storm-deploy/conf/clusters.yaml
nimbus.image: "us-east-1/ami-08f40561" # 64-bit ubuntu
nimbus.hardware: "m1.large"

supervisor.count: 2
supervisor.image: "us-east-1/ami-08f40561" # 64-bit ubuntu
supervisor.hardware: "c1.xlarge"
#supervisor.spot.price: 1.60

zookeeper.count: 1
zookeeper.image: "us-east-1/ami-08f40561" # 64-bit ubuntu
zookeeper.hardware: "m1.large"
; ~/.pallet/config.clj
(defpallet
  :services
  {:default
   {:blobstore-provider "aws-s3"
    :provider "aws-ec2"
    :environment {:user {:username "storm"
                         :private-key-path "~/.ec2/k.pem"
                         :public-key-path "~/.ec2/k.pem.pub"}
                  :aws-user-id "1234-5678-9999"}
    :identity "AKIA1111111111111111"
    :credential "abCDEFghijklmnpOPQRSTuvWXyz1234567890123"
    :jclouds.regions "us-east-1"}})
# start cluster
$ lein run :deploy --start --name mycluster --release 0.8.1

# attach to the cluster
$ lein run :deploy --attach --name mycluster

# stop cluster
$ lein run :deploy --stop --name mycluster
# deploy your topology
$ redstorm jar examples/simple
$ redstorm cluster --1.9 examples/simple/word_count_topology.rb

# monitor with storm-ui and ganglia
$ open http://{nimbus-ip}:8080
$ open http://{nimbus-ip}/ganglia

# kill topology
$ storm kill word_count
Runaway complexity in Big Data
Diagrams from Nathan Marz' "Storm: Distributed and fault-tolerant realtime computation" talk.