05 · 16

this, that, and Backbone.js views

Just an hour or so ago I wrote a post on how to bind "this" when using Backbone. I read through the source, was very proud of myself, and wrote about something that didn't work. Haha.

Here is what it should be

The idea is that "this" should be bound up front to prevent the tons of binding bugs and issues that show up when the shorthand objects are constructed (var obj = {}).

05 · 03

AWS instance notes: Part 2 Large

Today I also had the pleasure of migrating a large instance over and running it at full capacity. Here are the metrics and dashboard output.

The large was a bit more erratic despite running the same kind of load through it.

  • Maximum connections, I believe, is around 640. (http://www.lecloud.net/post/3780041484/max-connections-on-rds-mysql)
  • Max data write speed is about 6000 OPS (single row operations, no batches), although in my experience it usually sits in the mid 5000s.
  • Max data write rate peaked at about 400 KiB per second.
  • CPU utilization was much lower than on a small. Perhaps the bottleneck was actually in my import script?
  • Max latency was 10ms.

These tests were not optimized. They simply took a mysql dump of large tables with 10-15 integer and string columns and imported them directly via SQL commands. (RDS doesn't support INFILES or mysqlimport as far as I could tell.)

You pay ~4X more for a large than a small. In these tests it seems like you get roughly a 2X or 3X performance boost with simple methods. The latency did decrease 4X though.
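As a rough sanity check on the "2X or 3X" estimate, the sketch below divides the large instance figures above by the small instance figures from the 05 · 01 post. The dictionary keys and the PRICE_RATIO name are made up for illustration, and the ~4X price ratio is the approximate figure quoted above, not an exact AWS price.

```python
# Figures as reported in the two posts (small: 05 · 01, large: 05 · 03).
small = {"connections": 125, "write_ops": 2000, "write_kib_s": 200, "latency_ms": 40}
large = {"connections": 640, "write_ops": 6000, "write_kib_s": 400, "latency_ms": 10}

PRICE_RATIO = 4.0  # assumption: "~4X more for a large than a small"


def improvement(metric):
    """Return how many times better the large is for one metric."""
    if metric == "latency_ms":
        # Lower latency is better, so the ratio is inverted.
        return small[metric] / large[metric]
    return large[metric] / small[metric]


ratios = {metric: improvement(metric) for metric in small}

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.2f}x better, {ratio / PRICE_RATIO:.2f}x per dollar")
```

Anything below 1.00x per dollar means the large did worse than the small for the money on that metric in these unoptimized tests.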

 

Large_performance

05 · 01

AWS instance notes

At all the companies I have worked for we have had a small RDS MySQL instance. Tonight I was doing a couple of larger data migrations at full speed and noted the stats on Amazon's dashboard. This is useful considering there are few benchmarks on upper bounds for AWS RDS instances.

I am noting the maximum possible performance for a transfer into a small instance from one AWS account to another (it would be similar for Heroku machines).

  • Maximum connections to a small instance is around 125 (each connection uses a nontrivial amount of memory).
  • Maximum data write speed is about 2000 OPS (single row operations, no batches).
  • Max data write rate is just a hair under 200 KiB (1 KiB = 1024 bytes) per second.
  • CPU utilization was near 100% (seemingly a CPU-bound write).
  • Max latency per op was 40ms.

Dashboard

04 · 26

Email Tracking: Lessons learned

Tracking email open rates is a dubious endeavor as the data is known to be unreliable. For exactly this reason Google Analytics shut down its email impression tracking service in 2006 and hasn't brought it back since.

Open rate tracking is suspect because most email clients block images by default. The user has to opt in to seeing images, and with that to being tracked. Swings in open rate data can be large and wholly dependent on how motivated the user is to unblock images (for example, an email made completely of images may get a higher "impression rate" than its more textual competitor even though the subject line resonated with fewer people).

One of the core tenets of data driven decision making is to make sure you have objective, reliable data. The old saying is that it's better to have no data than bad data leading your decisions, because at least then you can acknowledge just what you don't know.

If you want, you can engineer it yourself or pay Litmus some cash to give you all kinds of unreliable but interesting data about your email contacts, including browsers, time on email, open rate, and email clients. All of the special analytics used by Litmus still rely on the tracking pixel, which because of default image blocking leads to questionable data across the board.

My current conclusion is to use a strict testing methodology to get objective data using only link/click tracking with Google Analytics. By varying the subject lines first with a constant message you can reasonably attribute increased click through to a more effective subject line (assuming all other variables stayed constant as well). Picking the winning subject line then allows you to test content independently and produce a best in show email campaign. The only problem is that testing in stages is slower and requires significant volume to get statistically significant click through results to base good decisions on.
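To give a feel for how much volume "significant" takes, here is a minimal sketch of a two-proportion z-test on click through counts. The test itself is a standard one that I am swapping in for illustration (the post does not name a method), and the send and click numbers are entirely hypothetical.

```python
import math


def two_proportion_z_test(clicks_a, sends_a, clicks_b, sends_b):
    """Two-sided z-test for a difference in click through rate between A and B."""
    rate_a = clicks_a / sends_a
    rate_b = clicks_b / sends_b
    # Pooled rate under the assumption that both subject lines perform the same.
    pooled = (clicks_a + clicks_b) / (sends_a + sends_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Hypothetical campaigns: the same 2% vs 3% click rates at two different volumes.
z_small, p_small = two_proportion_z_test(20, 1000, 30, 1000)
z_large, p_large = two_proportion_z_test(200, 10000, 300, 10000)

print(f"1,000 sends per subject line:  z={z_small:.2f}, p={p_small:.3f}")
print(f"10,000 sends per subject line: z={z_large:.2f}, p={p_large:.6f}")
```

With these made-up numbers the same difference in click rate is not significant at the usual 0.05 level with 1,000 sends per subject line but is with 10,000, which is the volume problem in a nutshell.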

04 · 25

On the anatomy of a friend picker widget Part2

After deliberating with a couple of friends of mine I have come to a pattern that cleanly abstracts widget internals from other glue code on the page. It's funny because it's the very same pattern I used before with a much more complex widgeting system, but instead of html -> js compilation I just use Rails partials, use ids, and manually insert them into a holding bay (hidden div) for instantiation.

The pattern is derived from MVVM, or model view view-model (just like what I did at Groupit a year ago). It is a pattern used in Backbone and spinejs, and its primary purpose is to encapsulate all display logic for a given set of dom elements. Backbone also thinks that the view-models should control communication with the server, whereas I believe that the view-model should expose callbacks where a persistence call would be executed, for added flexibility and reuse.

Just like in my Groupit days, one of the changes is using classes for most selectors inside the widgets, because widgets can be instantiated and their dom prototypes cloned for multiple instances of the same widget (for example a progress bar might be above and below a certain interface).

It isn't all peachy though. Groupit suffered from a huge widget tree, no jQuery (so a larger library to maintain), event swarm debugging, and a library that limited hiring opportunities and elongated hiring cycles because very few frontend programmers had exposure to OOP.

I believe the situation I am in today significantly reduces the risk, with the addition of jQuery as a core library (removing 40-50% of the library size and the cross browser issues), fewer or no nested widgets to prevent event swarming, and little to no hiring needed.

 

 

04 · 23

On the anatomy of a friend picker widget

Writing a good friend picker is a rather simple exercise for most programmers. Write some html, throw a couple borders and colors up there and tie it all together with some event handlers. A good picker will have a fast search, will retain order moving friends from selected to unselected, and use some kind of client persistence to prevent heavy database writes. Again all simple. However once you get done with your battle hardened widget you would like to pack it into a library for use again and this is where it all goes downhill. 

The most obvious question is where does the html live (as a js string, a separate file, perhaps dynamically created via dom element creation methods)? What happens if you need to slightly modify the html? Access to the html as a js encoded string blows, and mentally parsing sets of dom creation functions also blows, leaving us with the separate file. Perhaps you could store the html as a separate file and compile it down to a js string? (As much as coffeescript provides precedent for js cross compilation, that approach is overkill for a single widget.) UGH. Why does this suck so much?!

The practical thought (given you don't have enough time to build an html to js string compiler, or write scripts to automate that process, or the fact that testing widget changes becomes highly unintuitive to most everyone on the planet except you, etc) is to try and avoid packaging the html and create an abstract logical framework in javascript that allows reuse of the "harder" logic. Unfortunately the bulk of the javascript really isn't "harder" processing at all. Instead it's mostly presentational event handlers, api calls and the glue that holds this ball of crap together. The code that cleanly factors out into a library is about 40 lines, 2-3 functions worth, which frankly isn't worth factoring out to begin with.

Now, given that you can't cleanly factor out anything truly reusable you can do like one of my friends did and encapsulate all kinds of very specific but related logic into various namespaces and files in a gallant effort to do something to organize this mess. Unfortunately forcing encapsulation where it doesn't belong just makes simple code harder to understand, maintain, iterate on and read (jumping from file to file, with jquery css selectors pointing to elements in other html files). 

Lots of people point to backbone, knockout.js, sencha, jquery-ui etc as a panacea, but really the libraries don't solve my problem. They provide easier ways of doing persistence, provide frameworks for client models/controllers, make html5 mobile programming faster, or provide a static library of widgets that are fairly difficult to extend. My problem with javascript widgets isn't that I need help creating MVC organization or a theme roller, but that I need a lightweight way to organize and (if possible) reuse my widgets very, very fast and easily.

I am coming to the potentially mundane position that the optimal solution today is just well documented, segmented, linear script blocks that are copy pasted from iteration to iteration, project to project, with a canonical pattern derived from numerous copy/pastes of the same core code that hasn't changed much over time. It's very fast and easy to understand (all in one place and broken down somewhat logically). It's very easy to extend as all the code is visible and right next to the html (as html, not as a js string). The only problem is that it can get daunting as the widget gets more complex.

At best you could put all this into a hybrid partial (rails term for a sub-template), maybe in a directory called "widgets", that had all the css, js and html inline so that you could copy and modify clones as needed. You could also move as much pure logic as possible into libraries while leaving the bulk of the event handling, selector garbage and glue in the blocks. I think that's very practical, especially for medium sized widgets you want to iterate on. It's just not what I would expect given all the tools we have with the language and various frameworks.

 

 

04 · 01

Weight loss strategy

Today marks the first day of a month of intensive weight loss. 

I have been told by the USDA that a pound of fat is 3500 calories. I also know that a slightly sedentary 22 year old who is 5'11'' and 215 lbs burns roughly 2400 calories a day just sitting. I probably suck more than others so I am going to lowball that and say 2200 per day for me.

My goal is to lose 10 pounds in April. Very aggressive, I know. I will lose muscle doing it but frankly benching 300lbs never did much for me.

To achieve that the math says I would need to attain a calorie deficit of 35,000 calories in 30 days. That's quite a lot!

Luckily, if I run 1 hour a day at 5.5mph (30 min in the morning and 30 min at night), at my weight that speed/time combo burns 914 calories. (If I slack and only run 5.0mph it's 718.) I can also reasonably take 400 calories out of my diet.

(900 + 400) * 30 = 39,000 > 35,000 calories

(700 + 400) * 30 = 33,000 ~ 35,000 calories
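The two lines above round the running figures to 900 and 700. A small sketch with the unrounded 914 and 718 gives slightly better numbers; the constant and function names here are made up for illustration.

```python
CALORIES_PER_POUND = 3500  # the USDA figure quoted above
DAYS = 30
DIET_CUT = 400  # calories removed from the diet per day


def projected_loss(run_calories_per_day):
    """Return (total calorie deficit, pounds lost) over the 30 days."""
    deficit = (run_calories_per_day + DIET_CUT) * DAYS
    return deficit, deficit / CALORIES_PER_POUND


fast_deficit, fast_pounds = projected_loss(914)  # one hour a day at 5.5mph
slow_deficit, slow_pounds = projected_loss(718)  # one hour a day at 5.0mph

print(f"5.5mph: {fast_deficit} calories, about {fast_pounds:.1f} lbs")
print(f"5.0mph: {slow_deficit} calories, about {slow_pounds:.1f} lbs")
```

So the 5.5mph plan clears the 10 pound goal with a bit of margin, and the 5.0mph plan falls just short of it.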

So there it is. 

  • Run 30 minutes in the morning at 5.5mph
  • Run 30 minutes when i get home at 5.5mph
  • Intake no more than 1800 calories per day
  • Do it for 30 days straight.

Let's see where I am in 30 days. :)

 

03 · 07

More than one way to id a database

In coming up with our horizontal sharding technique I did a large amount of research, reading through a ton of the best practices out there.

Ultimately the base goals are

  1. Provide a mechanism for creating unique ids across all shards (so when you rebalance you don't have to rekey all your data)
  2. Provide a lookup table mapping rows to shards for fast lookups on at least one id.

There are bonus goals like:

  • Getting time sorted results
  • Keys that fit into a 64 bit space (for redis and index sizes) 

The properties that make a solution great in my mind are simplicity of implementation, ease of maintenance, and resilience to errors. Performance always takes a back seat to our ability to work reliably and understand what we are working on.

Ways to generate ids: (Heavily helped by the Instagram blog post below.)

Approach 1: Ticket Server (like flickr)

Create a dedicated service that generates ids on request. Flickr uses an odd-even scheme to provide high availability, but the idea can scale out to any number of ticket servers with some initial coordination. For example, three servers issuing 1+3n, 2+3n, and 3+3n (where n counts the requests each server has handled) would give 3 sources of unique autoincremented ids.

Pros:

  • Simple to understand.

Cons:

  • External service: potential for latency, maintenance or bottlenecks
  • Ids are not strictly time sortable
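A minimal sketch of the 1+3n, 2+3n, 3+3n idea, using in-process counters as stand-ins for real ticket servers. The ticket_server function is hypothetical and only illustrates the arithmetic, not Flickr's actual implementation.

```python
import itertools


def ticket_server(offset, server_count):
    """Yield offset, offset + server_count, offset + 2 * server_count, ..."""
    return itertools.count(start=offset, step=server_count)


# Three independent "servers" that never need to talk to each other.
a, b, c = (ticket_server(offset, 3) for offset in (1, 2, 3))

# Requests arrive in this order; server a happens to be the busiest.
issued = [next(a), next(a), next(a), next(b), next(c), next(a)]
print(issued)  # [1, 4, 7, 2, 3, 10]
```

The ids never collide, but as the output shows they are not issued in increasing order, which is the "not strictly time sortable" con above.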

Approach 2: Timestamp-shardId-modulo(autoincrement) aka Instagram sharding

See the Instagram blog link above.

Pros: 

  • No external service required (no latency, maintenance or bottlenecks)

Cons:

  • Relies on timekeeping being accurate across shards
  • Has known limits (a death clock, number of writes per second per shard), though all can be made reasonably high.
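A rough sketch of how such an id can be packed into 64 bits. The 41/13/10 bit split follows the layout described in the Instagram post as I remember it; the epoch value and the make_id and split_id names are assumptions for illustration, and in practice this would more likely live in the database than in application code.

```python
import time

EPOCH_MS = 1_293_840_000_000  # hypothetical custom epoch: 2011-01-01 UTC, in ms
TIME_BITS, SHARD_BITS, SEQ_BITS = 41, 13, 10


def make_id(shard_id, sequence, now_ms=None):
    """Pack milliseconds since the epoch, shard id and sequence into one integer."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    if not 0 <= shard_id < (1 << SHARD_BITS):
        raise ValueError("shard_id out of range")
    elapsed = now_ms - EPOCH_MS
    if not 0 <= elapsed < (1 << TIME_BITS):
        raise ValueError("timestamp is outside the lifetime of this id format")
    # The per-shard autoincrement is taken modulo 2**SEQ_BITS.
    seq = sequence % (1 << SEQ_BITS)
    return (elapsed << (SHARD_BITS + SEQ_BITS)) | (shard_id << SEQ_BITS) | seq


def split_id(value):
    """Recover (timestamp in ms, shard id, sequence) from an id."""
    seq = value & ((1 << SEQ_BITS) - 1)
    shard_id = (value >> SEQ_BITS) & ((1 << SHARD_BITS) - 1)
    elapsed = value >> (SHARD_BITS + SEQ_BITS)
    return elapsed + EPOCH_MS, shard_id, seq


earlier = make_id(shard_id=5, sequence=1030, now_ms=EPOCH_MS + 1000)
later = make_id(shard_id=2, sequence=0, now_ms=EPOCH_MS + 1001)
print(split_id(earlier), earlier < later)
```

With this split the death clock is 2^41 milliseconds (roughly 69 years after the chosen epoch) and each shard can hand out 1024 ids per millisecond before the sequence wraps. Because the timestamp sits in the high bits, ids sort by creation time regardless of shard, to the extent the shard clocks agree.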

Approach 3: Guids

Automatically generated ids that are, for practical purposes, guaranteed to be unique.

Pros:

  • No external service required (no latency, maintenance or bottlenecks)
  • Extremely easy to generate and understand

Cons:

Approach 4: Preallocation registry and allocation

Ahead of time, allocate a set of id-chunks in batches of 10K. Take each batch and assign it to a given shard. The shard then manages its chunk with its own internal counter (autoincrement) plus the batch start point. As batches are used up, new batches are allocated on demand and shards continue generating ids but with a new offset.

Pros:

  • Solves both id allocation and row mapping
  • Lends itself to mysql database-per-shard best practices

Cons:

  • Requires an allocation service (potential for latency, maintenance and bottlenecks)
  • Slightly complex when orchestrating block allocation and id counters. 
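A minimal in-memory sketch of this approach. ChunkRegistry and Shard are hypothetical names, the registry stands in for the allocation service, and the counter stands in for the shard's own autoincrement. The demo uses a chunk size of 3 instead of 10K so the rollover is easy to see.

```python
class ChunkRegistry:
    """Central registry that hands out non-overlapping id chunks to shards."""

    def __init__(self, chunk_size=10_000):
        self.chunk_size = chunk_size
        self.owners = []  # owners[i] is the name of the shard that owns chunk i

    def allocate(self, shard_name):
        """Assign the next free chunk to a shard and return its first id."""
        self.owners.append(shard_name)
        return (len(self.owners) - 1) * self.chunk_size + 1

    def shard_for(self, row_id):
        """Row-to-shard lookup: which shard was given the chunk holding this id?"""
        return self.owners[(row_id - 1) // self.chunk_size]


class Shard:
    def __init__(self, name, registry):
        self.name = name
        self.registry = registry
        self.start = registry.allocate(name)
        self.counter = 0  # ids used so far in the current chunk

    def next_id(self):
        if self.counter == self.registry.chunk_size:
            # Chunk exhausted: ask for a new one and continue from its offset.
            self.start = self.registry.allocate(self.name)
            self.counter = 0
        new_id = self.start + self.counter
        self.counter += 1
        return new_id


registry = ChunkRegistry(chunk_size=3)
shard_a = Shard("a", registry)
shard_b = Shard("b", registry)
ids_a = [shard_a.next_id() for _ in range(4)]
ids_b = [shard_b.next_id() for _ in range(2)]
print(ids_a, ids_b)  # [1, 2, 3, 7] [4, 5]
```

The same registry answers both questions from the goals list: it keeps ids unique across shards and it maps any id back to its shard.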

Approach 4.1 Preallocation registry and fixed size shards

Again you generate a set of id-chunks in batches of 100K or so. Take each batch and assign it to a shard. The shard starts its autoincrementing at that offset but simply registers back as being full when it exhausts its allocation. The application layer would then need to get the next shard for new writes.

Pros:

  • Extremely simple to implement on the db.
  • Lends itself to best practices with mysql database-per-shard techniques.
  • Solves the row to shard mapping at the same time.
  • Doesn't require an external service, as the allocation can be coded in a lookup table or configuration file on deploy.
  • You get 64 bit ids as long as your id ranges stay below 2^64 unsigned.

Cons:

  • Just a bit more complex at the application level, with insert error checking and failover into a new shard. (This can be avoided for a long time with a uniform distribution of writes across all shards and a large set of shards.)
  • Don't get a natural sort by time created across all shards.
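A sketch of what the application layer might look like under this scheme. The shard names, ranges and classes are all hypothetical, the lookup table stands in for a config file shipped on deploy, and ShardFull stands in for whatever error the database raises when a shard's allocation is exhausted.

```python
import bisect

SHARD_SIZE = 100_000
# Hypothetical static allocation: first id of each shard's range.
SHARD_STARTS = [1, 100_001, 200_001]
SHARD_NAMES = ["shard_0", "shard_1", "shard_2"]


def shard_for(row_id):
    """Map an id to its shard with a binary search over the range starts."""
    index = bisect.bisect_right(SHARD_STARTS, row_id) - 1
    if index < 0 or row_id >= SHARD_STARTS[index] + SHARD_SIZE:
        raise KeyError(f"id {row_id} is not in any allocated range")
    return SHARD_NAMES[index]


class ShardFull(Exception):
    pass


class FixedShard:
    def __init__(self, name, start, size=SHARD_SIZE):
        self.name = name
        self.next_value = start
        self.end = start + size  # first id that no longer belongs to this shard

    def insert(self):
        if self.next_value >= self.end:
            raise ShardFull(self.name)
        new_id = self.next_value
        self.next_value += 1
        return new_id


def insert_with_failover(shards):
    """Try each shard in turn, skipping the ones that report being full."""
    for shard in shards:
        try:
            return shard.insert()
        except ShardFull:
            continue
    raise RuntimeError("all shards are full")


print(shard_for(1), shard_for(100_000), shard_for(100_001))
```

A real application would spread writes across shards rather than always filling the first one, as the con above suggests; the loop here only shows the error check and failover.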

Knowledge needed to make this work:

You will need to set two autoincrementing NOT NULL fields: one as the primary key id and another as the table size limit counter. (Note that MySQL allows only one AUTO_INCREMENT column per table, so the second counter has to be advanced some other way.) You will need to offset the ids appropriately. For the primary key id you will want to offset to the starting block for the db. For the table size limit you will want to offset to 100K less than the max for a medium int (it's different for signed and unsigned). If MySQL enforced the CHECK constraint we would use that instead.
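The offsets themselves are simple arithmetic. The sketch below works them out for MySQL's MEDIUMINT limits, assuming a capacity of 100K rows per shard; the function names are made up, and the limit counter start is placed so that exactly that many values fit before the column overflows.

```python
# MEDIUMINT is a 3 byte integer in MySQL.
MEDIUMINT_MAX = {"signed": 2**23 - 1, "unsigned": 2**24 - 1}
SHARD_CAPACITY = 100_000


def limit_counter_start(signedness, capacity=SHARD_CAPACITY):
    """First counter value such that exactly `capacity` values fit below the max."""
    return MEDIUMINT_MAX[signedness] - capacity + 1


def primary_key_start(shard_index, capacity=SHARD_CAPACITY):
    """First primary key id for the shard holding block number `shard_index`."""
    return shard_index * capacity + 1


print(limit_counter_start("signed"))    # 8288608
print(limit_counter_start("unsigned"))  # 16677216
print(primary_key_start(2))             # 200001
```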

Error produced on autoincrement overflow

How to set a starting point for an autoincremented id

Ways to map rows to shards

Pretty much by necessity this must be a table or hash of some sort.

The approaches pretty much boil down to either a remote service for the lookup or maintaining a small enough and constant enough table to include as part of the application layer. Depending on how you picked ids above your choice will be picked for you.

 

03 · 06

Trying a traditional mysql horizontal sharding technique

After two weeks with mongo and high processing loads we have experienced multiple seg faults and server crashes. Luckily replication and journaling have prevented large scale data loss, but overall MongoDB has just not been consistent.

MongoDB touts the exact features we needed for our growth, indexing, and write speeds. Unfortunately it wasn't reliable as a sharded datastore. While the community is vibrant, the client tools abundant, and the database relatively easy to use, it just doesn't work reliably for long periods of time. I was able to get an amazing 10,000 ops/s on a relatively small cluster for a matter of hours before a full cluster crash occurred.

As my previous post detailed, I have a really bad taste in my mouth from all the relatively immature NoSQL solutions. Many seem less established than MongoDB, and even Amazon's DynamoDB didn't make the cut in a production environment with Ruby (I blame the SDK).

To solve our problems I have reverted to less "magical" application layer RDBMS sharding techniques. This route definitely seems like a step in the wrong direction (why is a startup not interested in databases going to try and solve horizontal scaling for themselves when so many other talented people are working on it?), but I have seen that it is actually the road taken most often to scale reliably (even at larger companies like Facebook and Google, where manpower is in excess compared to a small startup).

The trade off is that you have to manage the complexity of sharding directly. No abstractions, automation or rebalancing are provided for free. In exchange you regain your familiarity with a proven system and a near guarantee that the bugs you come across will be in your code and not deep in either the database code or drivers. 

I will detail our solution in my next blog post. Right now I am benchmarking and trying to work around limitations inherent in this strategy.

03 · 03

NoSql/NewSql/Sharding ... just a mess

NoSql Options: Immature db or client libraries, tradeoff gotchas can cripple app.

  • Riak: Very little community, new, doesn't seem to support mapping large sets of data efficiently. More reliable than mongo, uses lucene search, limited secondary indexes.
  • Dynamodb: Client libraries in ruby are complete garbage, scaling still very slow through the admin interface, overall very disappointed.
  • Mongodb: Very fast writes, sharding support still in its infancy, segfaults regularly under load (similar stories from around the internet and among close friends).
  • Cassandra: Too large to set up, and while many large companies are starting to use it, it is still immature and many, including twitter and fb, still use sql as their primary store.
  • Couch: Used a hosted solution but it really didn't have as much write throughput as a sharded solution should. Too expensive to continue above 200GB.
  • Redis: Amazing speed but not a reasonable solution for truly large datasets that don't fit into ram.
  • Hbase: Too complicated to set up, and apparently complicated to run as well. Also very expensive due to large memory requirements.
  • Neo4j: Not relevant for the data we are structuring, also very new.

 

NewSql options: All way too expensive to get started

  • Greenplum: Won't tell you how expensive it is. Runs a distributed postgres. Large setup, still early.
  • Clustrix: 100k to get started...
  • Xeround: Limited to 50gb (@7K a month)
  • MySql Cluster: Needs huge databases with large amounts of memory to get large storage (>1TB) up and running. It's expensive.
  • VoltDb: In memory RDBMS. Pretty cool, still very expensive for massive data sets. Unless you can find memory on the cheap.
  • Others: Similarly priced...

 

Rails enabled sharding solutions:

  • Mysql Proxy/spockproxy: Very early (still in alpha). Not reliable for production yet.
  • ScaleBase: Startup, uses proxy solution and a regression analysis to decide how to shard. It can't magically nail a growing set perfectly but might get close. Resharding can happen.
  • Datafabric gem: Seemingly falling out of favor, with maintainers switching constantly.
  • Db_charmer gem: Not compatible with recent rails versions (3.1 and 3.2 support not present).

 

How others are doing it:

 

Conclusion: The world of big data solutions for RoR (and other stacks) is pretty hideous. Nothing really works well out of the box for a reasonable price, and the gems for the more traditional sql sharding techniques are out of date.

Timothy Cardenas

Cofounder at davia.com