On December 11, 2012 I gave a talk at Boston.rb about
writing distributed realtime computations in Ruby using
Storm by Nathan Marz and
RedStorm by Colin Surprenant.
Basically, Storm provides a framework for building streaming/realtime
computations (like log analysis, for example) and distributed RPC for running
large ad hoc computations on a cluster. RedStorm is a JRuby-based adapter for
writing these computations and assembling them into topologies (workflows) in
Ruby.
At Strange Loop in October 2012, the talk Runaway complexity in Big Data discussed “Common sources of complexity in data systems and a design for a fundamentally better data system”.
Book
Big Data is an early access book by Nathan Marz which covers “Principles and best practices of scalable realtime data systems”.
Other ESP/CEP resources
Storm lives in a space that’s often referred to as ESP (“Event Stream Processing”) or CEP (“Complex Event Processing”):
CLAHub is a small side project I cooked up a few
months ago, and just got around to open-sourcing. The goal is to remove the
friction of Contributor License Agreements
for contributors and maintainers alike. It’s not done yet, but I’m curious to
hear what people think.
What is it?
The general idea with CLAs is this: contributors grant the maintainer a license
to distribute their code, and state that they’re legally able to do so. A
fair number of projects have a CLA in place, including
jQuery,
Node.js,
Django, and
Chef. In the best
cases the CLA is signed via electronic signature, like
Node.js does with a Google Form. In the worst
cases you have to print, sign, and fax the agreement. In all cases,
maintainers are responsible for cross-referencing contributions and signatures
to make sure all contributions have a corresponding signature.
With CLAHub and an open source project on GitHub
you can:
Sign in with GitHub and create a CLA for your project.
Ask contributors to sign in with GitHub to electronically sign the CLA.
See on each pull request whether the contributors have all signed your CLA.
This uses the handy Commit Status API,
similar to what CI tools do.
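As a rough sketch of the pull request check described above (not CLAHub's actual code), the decision boils down to comparing commit authors against signers. The helper name `cla_status` and the description strings are my own assumptions for illustration; `success` and `failure` are states the Commit Status API accepts.

```ruby
# Hypothetical helper: decide which commit status to report for a pull request.
# commit_authors: GitHub logins found on the pull request's commits.
# signers:        GitHub logins that have signed the project's CLA.
def cla_status(commit_authors, signers)
  missing = commit_authors.uniq - signers

  if missing.empty?
    { state: "success", description: "All contributors have signed the CLA." }
  else
    { state: "failure",
      description: "Missing CLA signatures: #{missing.sort.join(', ')}" }
  end
end

# Example: dave has commits on the pull request but has not signed.
cla_status(%w[alice dave], %w[alice bob])
```

The resulting hash would then be posted to GitHub for the head commit of the pull request; that HTTP call is left out here.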
Do you use a CLA for your project(s)? Would this encourage you to add a CLA if
you don’t have one already? (That’s not really my goal - just to reduce
friction where CLAs are already valuable.) If you have a CLA, would you use
something like this to reduce the barrier to entry and your overhead? What
kinds of features would be useful?
If you regularly read scholarly papers, you likely use a
reference manager to
maintain your personal library. Papernaut
connects to your library to find online coverage and discussion of your papers
in blogs, forums, and mainstream media. My hope is that these discussions can
provide broader perspective on research and, in some cases, be the spark that
starts a new collaboration.
Here’s a very quick video demo. We start with a Zotero
library that includes a paper from Science on the
effect of pesticides on honey bees.
We then connect to Papernaut, and find several discussions and articles,
including one in The Guardian:
I’ve been working on Papernaut in my spare time for a few months, and I’m happy
to say that it’s now open source. The project comes in two parts, and the
source is on GitHub:
If you are interested in how the application is put together, the rest of this
article is a technical overview of the moving parts and how they interact.
Overview: A simple example
Let’s walk through a simplified example. Say I have only one paper in my
reference manager – that paper from earlier, about the effect of pesticides on
honey bees:
We’ll look at how the engine crawls and indexes this source feed. Then,
we’ll see how the frontend pulls the paper from my reference manager and
asks the engine for relevant discussions.
Papernaut-engine: Loading content and identifying papers
The goal of the engine is to produce a collection of Discussion records, each
of which links to several Identifier records, representing journal papers
that are referenced from the Discussion. In our example, the Discussion is
the article in The Guardian, and the relevant Identifier is
DOI:10.1126/science.1215039. There are also intermediate objects, Page and
Link which connect Discussions to Identifiers.
The engine consists of two main parts: loaders (which are Ruby classes), and the
query API (a Rails app). For loading, it also depends on an external running instance
of the Zotero translation-server.
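To make the relationships concrete, here is a simplified, in-memory sketch of the data model. It is an assumption-laden stand-in: the real records are database-backed, I have folded Link into a plain array of pages, and the page URLs below are hypothetical.

```ruby
# Simplified stand-ins for the engine's records (Link is folded into :pages).
Identifier = Struct.new(:body)
Page       = Struct.new(:url, :identifiers)
Discussion = Struct.new(:url, :title, :pages)

# Collect the distinct paper identifiers reachable from a discussion's
# outbound pages.
def identifiers_for(discussion)
  discussion.pages.flat_map(&:identifiers).map(&:body).uniq
end

paper_page = Page.new("http://journal.example.com/paper", # hypothetical URL
                      [Identifier.new("DOI:10.1126/science.1215039")])
other_page = Page.new("http://example.com/not-a-paper", []) # hypothetical URL

guardian = Discussion.new(
  "http://www.guardian.co.uk/science/grrlscientist/2012/may/08/1",
  "Bee deaths linked to common pesticides",
  [paper_page, other_page]
)

identifiers_for(guardian)
```

A page with no identifiers simply contributes nothing, which is how non-paper outbound links drop out of the results.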
Loading content by crawling feeds
The loaders load discussion candidates from feeds and archives, extract
outbound links, and store these in the database.
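The link extraction step might look something like the sketch below. This is not the loaders' actual code: the method name `outbound_links` is made up, and I use a regular expression with Ruby's standard `uri` library to keep the example self-contained, where a real HTML parser would be more robust.

```ruby
require "uri"

# Hypothetical helper: return the absolute http(s) links in `html` that point
# away from the host of `base_url`.
def outbound_links(html, base_url)
  base_host = URI.parse(base_url).host

  hrefs = html.scan(/<a\s[^>]*href=["']([^"']+)["']/i).flatten

  hrefs.filter_map do |href|
    begin
      uri = URI.join(base_url, href) # resolves relative links
    rescue URI::Error
      next # skip anything that is not a parseable URI
    end

    next unless uri.is_a?(URI::HTTP) # also true for https; drops mailto: etc.
    next if uri.host == base_host    # same-site links are not outbound

    uri.to_s
  end.uniq
end
```

Each surviving URL would become a Page candidate attached to the Discussion it was found in.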
In the first step, I invoke the ResearchBlogging.org
loader to crawl and index the most recent 100 pages of their archives:
This will load a large number of Discussion entries into the database, with
zero or more Page entries for each Discussion, corresponding to outbound
links.
At this point, the engine database contains the Discussion:
#<Discussion id: 3424,
url: "http://www.guardian.co.uk/science/grrlscientist/2012/may/08/1",
title: " Bee deaths linked to common pesticides | video | G...", ...>
Identifying papers via the Zotero translation-server
The engine determines which outbound links (or Pages) are academic papers by
issuing calls to the
Zotero translation-server HTTP API.
The translation-server is a third-party project from open-source reference
manager Zotero. It examines a given URL and, if
that page contains an academic paper, it returns common publication identifiers such as
DOI or
PMID.
The translation-server wraps the
Zotero translators,
a set of JavaScript scripts that do the heavy lifting of parsing a webpage and
attempting to identify it as one or more academic publications. These
translators are
maintained by the community,
keeping them fairly up-to-date with publishers. The translation-server uses
XULRunner to run these
scripts in a Gecko environment, and makes them available through a simple HTTP
API:
Two of the four pages were identified (7531 and 7533), and both of those
pages received two identifiers apiece. This means that the Guardian Discussion
actually referenced two different papers, not just the one we’re interested in.
Now that there is a link between the paper in question and this discussion page,
we are ready to visit the frontend.
The frontend works in two distinct phases: first, it helps you import papers
from your reference manager. Second, it shows you discussions for those papers.
Alternatively, you can import papers from most reference management software by
exporting and uploading a .bibtex
file. Papers and their identifiers are then extracted with the
BibtexImport
class.
Many papers will have multiple identifiers, and the frontend attempts to clean and validate
your papers’ identifiers as best it can in an attempt to find the best matches.
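As an illustration of what cleaning can involve, here is a minimal DOI normalizer. It is a sketch under my own assumptions rather than the frontend's actual code: the method name and the pattern are hypothetical, and the pattern is a loose shape check, not a full validation. Lowercasing is safe because DOIs are case-insensitive.

```ruby
# Loose shape check: "10.", a numeric registrant prefix, a slash, a suffix.
DOI_PATTERN = %r{\A10\.\d{4,9}/\S+\z}

# Hypothetical helper: turn the various ways a DOI shows up in a library
# export into one canonical "DOI:..." string, or nil if it does not look
# like a DOI at all.
def normalize_doi(raw)
  doi = raw.to_s.strip
  doi = doi.sub(%r{\Ahttps?://(dx\.)?doi\.org/}i, "") # resolver URL form
  doi = doi.sub(/\Adoi:\s*/i, "")                     # "doi:" prefix form
  doi = doi.downcase

  doi.match?(DOI_PATTERN) ? "DOI:#{doi}" : nil
end

normalize_doi(" http://dx.doi.org/10.1126/Science.1215039 ")
# => "DOI:10.1126/science.1215039"
```

Normalizing both sides to the same form before querying avoids missing a match purely because of formatting.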
So, in our example video above, we authenticate via Zotero and authorize
Papernaut’s API access via OAuth. The frontend extracts our library of papers
from Zotero and stores their Identifiers locally. It issues requests to the
engine’s query API for matching discussions, and displays those to the end
user:
Deployment
In production, the Papernaut engine and frontend are deployed to
Heroku. The translation-server is deployed to EC2.
I spin it up and run the loaders periodically, to reduce hosting overhead.
There is a DEPLOY.md file for both
the frontend
and the engine
that goes into further detail.
Next steps
I’m excited to see what kinds of results people get with Papernaut,
but it’s still very early software. I look forward to making a variety
of improvements.
I’d really like to add a bulk request API endpoint to the engine, so that the
frontend can discover discussions in a single HTTP request, rather than one
request per paper. That’s a big performance hit, and the user experience right
now for large libraries is that the frontend just hangs for a while.
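To sketch what the frontend side of a bulk request could look like: the endpoint path and the `identifiers[]` parameter name below are purely hypothetical, since this endpoint does not exist yet, and `PMID:123` is a placeholder value.

```ruby
require "uri"

# Hypothetical: build one query URL covering every identifier in a library,
# instead of issuing one request per paper.
def bulk_query_url(engine_base, identifiers)
  query = URI.encode_www_form("identifiers[]" => identifiers)
  "#{engine_base}/discussions.json?#{query}"
end

bulk_query_url("http://engine.example.com",
               ["DOI:10.1126/science.1215039", "PMID:123"])
```

For very large libraries a POST body would probably be a better fit than a long query string, but the idea is the same: one round trip for the whole library.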
On the engine side, I’d like to do a better job of culling false positives in
the matching engine, and of contributing to Zotero’s translators to improve the
match rate. I think the primary issue there is that the translation-server
actually only runs a subset of all the Zotero translators, as some declare that
they only work inside a real browser context
(see “browserSupport”).
I’d like to get a larger sample set of BibTeX files to try, as there are
probably edge cases and assumptions in the importer waiting to be hit.
I’d also like to background some of the tasks in the frontend’s import process;
validating DOIs is a big one there. Ideally, the whole library import would be
backgrounded, and the user interface would be notified when the import is
complete.
Currently, some matches are missed because the engine and frontend have
different identifiers for the same paper - say a DOI and a PMID. I also have
an experimental branch that cross-references papers with the
crossref.org API,
which yields more complete information. Ideally that would happen in the engine.
I’ve also seen some library management and import tools that use Google Scholar
to improve matching and identification.
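The DOI-versus-PMID mismatch can be sketched as a set problem. The code below is an illustration under assumptions, not the experimental branch: the `aliases` table stands in for whatever a cross-referencing service would return, and the PMID shown is a placeholder.

```ruby
require "set"

# Expand a list of identifiers with any known aliases for the same paper.
# `aliases` maps an identifier to the other identifiers known for that paper.
def expand_identifiers(identifiers, aliases)
  identifiers.flat_map { |id| [id, *aliases.fetch(id, [])] }.to_set
end

# Two records describe the same paper if their expanded identifier sets overlap.
def same_paper?(frontend_ids, engine_ids, aliases)
  expand_identifiers(frontend_ids, aliases)
    .intersect?(expand_identifiers(engine_ids, aliases))
end

# Placeholder cross-reference data: the PMID here is made up.
aliases = { "DOI:10.1126/science.1215039" => ["PMID:00000000"] }

same_paper?(["DOI:10.1126/science.1215039"], ["PMID:00000000"], aliases) # => true
same_paper?(["DOI:10.1126/science.1215039"], ["PMID:00000000"], {})      # => false
```

Without the alias table the two sides never meet, which is exactly the missed-match case described above.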
After that, I’d like loaders to run semi-continuously instead of manually, and
to have more robust infrastructure around paper identification.
In the long term, it would be interesting to try and bring the discussion
matching experience directly into reference managers. This is one reason why
I provide the engine query API separately from the frontend.
Conclusion
I’m most interested in hearing feedback from people. Is this useful to you?
If you use a reference manager,
give Papernaut a spin
and let me know how it goes.
On December 4, Lindsay and I returned to the US after a year of traveling
abroad. Lindsay diligently blogged our
experiences and her photos at cadeparade.com.
We have spent December visiting family and friends. On December 31, we
fly to San Francisco to start the next chapter of our lives.
It’s adventure time all over again
Time to sift through apartments and carefully consider our work, to reacquaint
ourselves with first world amenities and first world problems. To reunite with
family and friends, to fondly shuffle through our notes and photos, and to
reflect on our travel experiences and put them into context.
Also, to eat fajitas and burritos en masse, because let me tell you: Mexican
and Tex-Mex food outside the Americas just is not the same.
The first half in photos
All photos are by Lindsay Cade, and are from the first
half of the year.
During the first six months, December 2011 through May 2012, we traveled in
India, Thailand, Laos, Vietnam, Cambodia, and Burma (Myanmar).
We traveled to places beautiful and remote:
Ate incredible foods:
And some not-so-incredible ones:
We enjoyed amazing sunsets:
We ventured across deserts:
into backwaters:
through rivers and valleys:
We marvelled at constructions old and new:
During the second six months, June through November, we traveled in the Czech
Republic, Italy, Turkey, Germany, France, England, Thailand (again! we are
quite fond of it), South Korea, Malaysia, and Hong Kong. We ended the trip
where we began, returning to India for a month.
I cannot recommend this experience highly enough. My sense of perspective and
patience have been changed at a fundamental level. At the same time, I’m very
much ready for this return to the US, to be with friends and to focus on my
career, to do good in this world of which I’ve now seen a tiny slice more.
On November 28, my wife Lindsay and I are flying to India. We have
no return tickets, and little plan. I’m leaving a great job;
“professional ennui” is the furthest thing from my motivations. What’s going on?!
It’s adventure time!
If there’s one common lesson I could distill from my collegiate and
professional engagements, it would be the value of diverse experience, and the
difficulty of planning to build that experience. Sometimes you just gotta
jump in learning’s way.
We’re young, not tied down, and have seen like 0.0001% of the world. So,
earlier this year, after getting engaged, we decided: let’s hit the road! Our
plans are loose. As of now, we:
Have 1-way tickets to Delhi and 5-year visas to India. Many countries in
Asia have VOA (visa on arrival) for US citizens.
Got our arms jabbed (immunizations).
Are brandishing a fat sack of doxy and a veritable menagerie of antibiotics.
Booked two days at a hotel to buffer our jetlag.
Asked a friend-of-a-friend to find a short-term lease in Delhi.
Are super frigging pumped. I mean, come on!
I’ll miss the crap out of my friends here in the US. We’re flying around a bit
to visit folks before heading overseas - San Fran tomorrow through Wednesday,
then Buffalo, then Houston for Thanksgiving.
Then, on November 28, IAH-ORD-DEL.
Closing thoughts
Journeys are the midwives of thought. Few places are more conducive to internal
conversations than a moving plane, ship or train. There is an almost quaint
correlation between what is in front of our eyes and the thoughts we are able
to have in our heads: large thoughts at times requiring large views, new
thoughts new places. Introspective reflections which are liable to stall are
helped along by the flow of the landscape. The mind may be reluctant to think
properly when thinking is all it is supposed to do.
If we find poetry in the service station and motel, if we are drawn to the
airport or train carriage, it is perhaps because, in spite of their
architectural compromises and discomforts, in spite of their garish colours and
harsh lighting, we implicitly feel that these isolated places offer us a
material setting for an alternative to the selfish ease, the habits and
confinement of the ordinary, rooted world.
On Friday, October 14th, I hosted a two-hour live Q&A chat for purchasers of the thoughtbot Backbone.js on Rails eBook, which I am co-authoring. We invited people who have purchased the book to submit questions ahead of time, and then join me in chat for a discussion.
On Tuesday, September 20, I gave a talk at the
New Hampshire Ruby Users Group on Backbone.js on Rails.
I’ll be giving a very similar talk on Tuesday, October 11 at
boston.rb and a
version more targeted to front-end developers on Wednesday, October 26 at the
Boston Front End Developers meetup.
As an aside, I’m using landslide for the
slides - I love the resulting HTML and interface, though I’ve heard great things
about deck.js.
People found the resources sections useful. Many of the links are buried in the
presenter notes, so I’ll repeat them here. There are plenty more online, and
I’m sure I’m missing some content. Please link to any of your favorites in the
comments, and I’ll add them.
Yesterday, September 24, 2011, I had the pleasure of attending MIT’s 2011
Startup Bootcamp. In its third year, Startup
Bootcamp brought an inspiring and thoughtful collection of speakers who have had
a variety of startup successes.
The event hashtag #sb2011 is a stream
of reactions and pull-quotes from the event - mixed here and there with
excited anticipation for
a dance festival in Goa.
Ten speakers presented a variety of viewpoints, insight, and food for thought.
It was a mixed bag - yes, there was unnecessary focus on
vanity metrics
and the rah-rah of
startup theater.
Breathless celebration of hockeysticking uniques and of flying around to court
VCs makes for good TechCrunch articles. Like it or not, that’s an inculcated
part of startup culture.
But if you get past the Hollywooding and the Silicon Valley adulation, there
were gems of solid advice, grounded in experience, on hiring (Paul English of
Kayak), data-driven product development (Naveen Selvadurai of foursquare),
optimizing your life for personal growth (Drew Houston of Dropbox), identifying
underlying social and technological shifts that enable new products (Charlie
Cheever of Quora, Patrick Collison of Stripe), negotiation (Alex Polvi of
Cloudkick), the importance of on-the-ground and unscalable product development
tactics early on (Nathan Blecharczyk of Airbnb), earning and answering to the
responsibility of finding your own way in the world (Anthony Volodkin of
Hype Machine) and how important it is to empower yourself in perhaps the largest
disruptive theme of our time by learning to code (Patrick Collison of Stripe).
Paul spoke on three kinds of recruiting: companies recruiting new hires,
companies recruiting investors, and job-seekers recruiting companies.
When you’re recruiting, look for success, regardless of the kind. In fact, look
for a diversity of success. Paul once hired an Olympic rower and a chess
grandmaster, and couldn’t be happier with these decisions. Find people who
operate at the top levels of excellence.
Some companies have a “no assholes” rule - at Kayak, they have a policy of “no
neutrals”. Like Charlie Cheever, who later discussed the importance of hiring
people you have high-bandwidth communication with, Paul encouraged building a
team of people who are fully engaged: “intense and in-your-face - in a good way.”
Leah told a lighthearted and likeable story of her journey from big state
school CS major to Silicon Valley startup founder. Full of serendipity and
luck, she shared stories of driving a UHaul from her native Minnesota out to the
Bay Area (picked not primarily for its burgeoning tech scene, but for how much
better the weather is), getting started with Instructables, and bumping into
Pownce co-founders Kevin Rose and Daniel Burka at a party.
Have a good story to tell the press - you don’t have to tell people the ugly,
dirty truth.
Another of Leah’s pieces of advice was a common thread through the talks - that
of consistent applied effort. “Show up,” she said - in places with a critical
mass of startup people, such as Silicon Valley - and “say yes” to opportunities
that come your way.
I didn’t just rush it on my parents that I was leaving MIT. It took two whole weeks.
Andrew shared his story of inspiration for an online learning tool. When he
hacked together a prototype to help study for a French III class in high school
and subsequently aced the test, he knew he was onto something.
Andrew discouraged market research - “If I had googled for online flash cards, I
would have found other sites, that were not as good, and I wouldn’t have made
Quizlet. Now, we’re 10x the [volume] of our next competitor.”
This phrasing
raised some contention.
I would reframe his advice as: focus on your own products rather than on the
competition, and don’t be discouraged by incumbent players; rather, recognize
them as a validation of the market space, and proceed to out-execute them.
Naveen worked for Lucent and Sun in college. This was important - it was
real-world learning. Seeing engineering culture, doing code reviews, shipping
real products. Sun had an open culture of learning where you can dive into
other products. “How’d they build Solaris? File systems?” Just sign up for
the mailing list.
Naveen shared seven pieces of distilled advice:
Keep good company.
Make something that people want.
Build around an atomic action.
Seek mentors early.
At first, go with your hunch. Later, with data.
Balance unknowns with knowns.
Always be recruiting.
On the last point Naveen shared the four stages of foursquare’s hiring strategy:
Hire friends
Hire friends of friends
Use an external agency (but they didn’t find this valuable)
Hire an internal fulltime recruiter.
It needs to be someone’s job to think about recruiting, seven days a week.
Additionally, as a founder, you must always be recruiting.
Work with people you have really high-bandwidth communication with.
Understand how the other person is thinking.
Charlie shared great advice on early-stage tactics. Start with few users (Quora
started with fewer than fifty) and a low-cost MVP. Foster the community by
hand, be high-touch and, if your business builds on user-generated content, be
prepared at the beginning to build a lot of it by yourself. See how the
experiment goes, and then take the learning from that experience and apply it to
your MVP.
He shared the importance of collecting metrics early on. With Quora, they
actually stored the entire webpage for every visit for every customer, so that
they could go back later, having identified trends or formulated hypotheses, and
see the site as their users saw it.
They noticed a set of high-engagement users, looked at these users’ experiences,
and found that they had all used Facebook connect. Running with this, the team
spent time focusing on improving their social experience.
Charlie also left the audience with good food for thought:
What wave enables your product? Why is now the right time to build it?
For foursquare, it was GPS-enabled mobile phones. For Quora, it was that
“normal” people were comfortable sharing things online, and that the web was
turning into a mess; with Google turning up more content farm results, people
were moving onto safe harbors of organized information like IMDB and Wikipedia.
The timing was right.
Get out of your comfort zone. Learn a little about a lot.
“Everything big starts small” - Drew’s original perception of startups was that
of Tolkien’s Mount Doom. His original strategy to build a successful startup
was to be overwhelmingly prepared - nab an MIT CS degree, get a few years’
experience working for small companies and big companies alike, come back for a
PhD, maybe an MBA.
He then related a story from Dropbox’s origins: Drew had just settled into his
seat on a Chinatown bus from Boston, in which he could usually get in several
hours of undisturbed work. He popped open his laptop, and searched his pockets
for his ever-present USB thumb drive. “Shit.” Realization set in just as he
visualized, in his mind’s eye, the thumb drive sitting on his desk at home.
“Like any good engineer with a problem to solve, I opened my editor.” Drew then
wrote the first lines of what would eventually become Dropbox. Today, his
company has a multi-billion dollar valuation and “stores more files than Twitter
stores tweets.”
Drew exhorted the audience to learn about a broad variety of topics: sales,
marketing, finance, accounting, product design, psychology, influence,
negotiation, organizational design, management and leadership, business
strategy. Buy books (“today we have this amazing thing, Amazon”), dip in, find
mentors, and surround yourself with smart people.
Wrapping up, Drew shared his advice for success:
Take on more than you’re “ready for.”
Maximize how much you learn per unit time.
Stack the odds in your favor. Surround yourself with great people; you are the
average of your five closest friends.
The fastest way to learn about startups is to join one.
Starting a company is one of the best ways for engineers to change the world.
No matter what number they offer, pause, count to 10 in your head, and then act as disappointed as possible.
Alex spoke on negotiation, specifically about his experience of his company
Cloudkick being acquired by Rackspace.
If a VP of Corp Dev says “strategic” to you, they are talking about
acquisition.
Acquisitions are a bit like romantic relationships: you often get the most
attention when you’re looking for it the least. Once you are involved with one
party, others can sense it. You somehow become more desirable.
Once you have a term sheet from one prospective buyer, you have great
leverage. When others call you up, you can very quickly get to hard numbers.
The best negotiation position is one of truth. Build something of value that
people want, and your position is irrefutable.
Alex also discussed the importance of taking care of your team, and the people
around you. Upon acquisition, he fully accelerated all employees’ options -
whether they had been with Cloudkick for four years or four weeks, they were all
fully vested and could share in the company’s success. It was important that
the acquiring party, Rackspace, was on board with this - and they were.
Rackspace wanted the new team members to stick around not because they were
waiting to vest, but because they wanted to be there.
Venture Capital? You do not need anyone’s permission to make stuff.
Anthony shared the perspective that VC or angel investment can be very
important, but it’s not for everyone. “I don’t want to shut something off
because the math doesn’t work. For people to not remember it. That would make
me sad.”
Anthony’s vision was a question: while people with cool friends can get
interesting music recommendations from that network, what about people without
cool friends? He knew that there was great taste and insight being shared by
music bloggers online, and sought to aggregate and distill it. “I didn’t want
to miss anything.”
He started Hype Machine from his dorm room. He didn’t take investor money.
This gave Anthony and his team the freedom to run the company as they pleased.
“We wanted to travel,” he said - so they packed their bags and hung out in
Berlin for a month. It was cheaper than they would have thought, “about six
thousand dollars,” and incredibly fun. But if they’d had VC money? “No way,”
Anthony imagined an advisor’s response, “we thought you were, you know, going to
be working sixteen hour days. Now you want to go to Berlin and maybe work?”
YCombinator? TechStars? Just fucking make something.
Anthony exhorted: it’s okay to have a different process. Don’t discount
investment and the accompanying advisors, but don’t go blindly down that most
celebrated path. With a different process, it’s easier to stand out, to be
differentiated. You can always get money if you are making something great.
You have to have a vision, you have to be able to execute that vision.
Nathan shared a 2008 pitch deck for Airbnb (then AirBed&Breakfast) - the
first time this deck had ever seen the light of day.
Tiffany Kosolcharoen posted photos of the slides on her blog.
He highlighted its strengths - it had a problem statement, and had a bottom-up
business projection by analogy to CouchSurfing and Craigslist. He was also
quick to point out its weaknesses - it involved hand-wavy notions of unlikely
major player partnerships, and touted top down projections (“If we can capture
2% of the $1.9B travel booking market… imagine!”) that are quick to raise
doubt from savvy advisors or investors.
The company was accepted into Y Combinator’s Winter 2009 class. YC companies
are supposed to be heads-down; but at Paul Graham’s behest, the cofounders
zeroed in their market focus to just New York and hopped redeyes back and forth
every few weeks. They met with their initial supply-side renters in bars, and
chatted about how things were going. As the team refined the product and
identified sticking points, they could be on the ground to help optimize
listings. They’d go with people into their homes and take high-quality photos.
They found that the initial asking rates were a little too high, so they asked
their listers (after a few drinks) to lower their prices. Things clicked, and
soon they had handled $250,000 in bookings of which they collected 10%.
Fast-forward to the YC W09 Demo Day, and although at that point Airbnb had
already accepted Sequoia investment, they had prepared a Demo Day deck. Gone
was the hand-wavy top-down projection and partnership hopefulness, replaced with
a quarter million dollars of demonstrable traction, a tight initial market
focus, and a tight, clear problem statement.
Like many of the speakers, Nathan stressed the importance of finding quality
mentors.
It is impossible to motivate great people by something that is merely going to be profitable.
Patrick’s talk was an excellent finish to the day. He delivered an essay full of engaging stories - I sincerely hope it will be posted online in full.
Patrick’s story was of his trip from hardcore Lisp academic to startup founder. Along the way, he developed one of the first iPhone apps, an offline Wikipedia, before the SDK and App Store, by debugging ARM assembly. He shared the touching experience of getting emails from users whose lives he had changed; from bringing the world’s knowledge to villages in rural Peru and Ghana to delivering the freedom to browse Wikipedia without oversight to people behind the Great Firewall of China. At nineteen, he co-founded and sold an online auction tool, and is currently working on a new payment startup, Stripe.
The anthropological story of the last twenty years is that software is taking over the world. Even if you’re a traveling violinist, you should learn how to program. Do all you can to ensure code is not a foreign language.
I’ve written sporadically here for several years about programming and language theory, synthetic biology, amateur biology, running user groups and barcamps, multitouch and immersive interactions.
I’ve imported my old posts from WordPress into Octopress. That was – oh wait, I was about to write about that experience before I even began. I was going to say how buttercream-frosting-smooth it was, and that’s probably because I have a lot of confidence in exactly that, mostly due to their well-coiffed htmls. Update! Turns out they’re Jekyll migrations instead. Still easy-peasy.
I’ll be traveling extensively over the next year, and will be writing about that, too. But that’s a different post.
On tools
I wrote most of my previous posts in Mephisto, which was kind of janky after a while, and then switched to WordPress, which is totally not Ruby, and more or less means I have to run a VPS and make sure I don’t get chainsawed by spammers. Also, I’m interested in switching to a toolset more near and dear to my heart. Octopress fits the bill.
This also means I can write using vim and git, like a champ.
Whither lurk multitouch, tactile computing, and lab instrument hacking? I’ve started to work on the SmartLab project again. You can see what I am up to on the SmartLab tumblr.