To Start-Up or Not To Start-Up

May 10, 2012
Members of the TellApart team profiled about working at a start-up

Three of our engineers – Trevor Reid, Sophia Cui, and Jian Wei Gan – are profiled in the Spring 2012 issue of Threads, the newsletter of the Duke University Department of Computer Science. In the article, they share why they enjoy working at a start-up, including that:

“the work is engaging – with a chance to focus more on writing software than attending numerous meetings – and the learning is constant”

Now is a great time to consider working at a start-up for a variety of reasons, and we are certainly glad that these software engineers felt that way. The trio are currently among the youngest employees at TellApart, but each has a significant impact on the engineering team – and on both the performance and culture of the company as a whole.

To read the full article, please visit the Spring 2012 issue of Threads

 

 

 

Tabā: Low Latency Event Aggregation

March 20, 2012

Introduction

At TellApart we love metrics, and certainly have lots to measure. We track data for nearly 10,000 different event types across all parts of our stack (and that’s just the real-time data!). This information is used for all sorts of things, like monitoring, performance metrics, segmentation information, and feedback loops. In order to manage all this data, we created the Tabā service (the name is derived from the Japanese for bundle or flux (束)).

In designing Tabā, we had a number of goals:

  • Low latency: An event should be visible within seconds of occurring. This enables responsive monitoring, tight feedback loops, and other near-line applications.
  • Low impact: CPU and memory usage within the client applications should be minimized.
  • Durability: Events should be reasonably durable, so that a Tabā “Tab” can be used as a basis of important services like monitoring.
  • Scalability: All components of the Tabā service should be horizontally scalable to keep up with the applications it tracks.

Different Types

Each event type, or “Tab”, being tracked is identified by a Name, and a Type that maps to a Handler class. The core Tabā service doesn’t implement data manipulation operations – all schemas and transformations are left to the Handlers. By separating the schemas from the rest of the service, Tabs that behave differently can easily co-exist, and implementing new types of aggregation is simple.

Data Model 

Data life-cycle of a Tabā Tab

The fundamental element of data in the Tabā service is the Event. Events are composed of three pieces of data: the Tab, the time, and a value. Events are gathered to a central server (more on this later), and combined into State objects which are persisted in a database. The motivation behind persisting States instead of the individual Events is two-fold: (1) storing all events would consume far too much space (timestamps alone would consume gigabytes per hour), and (2) pre-calculated State objects mean faster query response times.

When the Tabā service is queried, States are converted into Projections and/or Aggregates. A Projection is a reduction of the State into a dictionary, and an Aggregate is a combination of Projections. Depending on the query, a Projection or an Aggregate will be returned, optionally rendered by the Handler into a human-readable format.
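The Event, State, Projection, and Aggregate life-cycle described above can be sketched roughly as follows. This is a minimal illustration, not Tabā's actual code: the `Event` dataclass, the `CounterHandler` class, and its method names are all hypothetical, chosen only to mirror the terms used in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One measurement: the Tab it belongs to, when it happened, and a value."""
    tab: str
    time: float
    value: float


class CounterHandler:
    """Hypothetical Handler for a counter-style Tab (not Tabā's real API)."""

    def new_state(self):
        return {"count": 0, "total": 0.0}

    def fold_events(self, state, events):
        # Fold a batch of Events into the State; the Events themselves are not stored.
        for event in events:
            state["count"] += 1
            state["total"] += event.value
        return state

    def project(self, state):
        # A Projection reduces the State to a plain dictionary.
        count = state["count"]
        average = state["total"] / count if count else 0.0
        return {"count": count, "total": state["total"], "average": average}

    def aggregate(self, projections):
        # An Aggregate combines several Projections (for example, one per Client).
        count = sum(p["count"] for p in projections)
        total = sum(p["total"] for p in projections)
        average = total / count if count else 0.0
        return {"count": count, "total": total, "average": average}


handler = CounterHandler()
state_a = handler.fold_events(
    handler.new_state(),
    [Event("page_load_ms", 1.0, 120.0), Event("page_load_ms", 2.0, 80.0)],
)
state_b = handler.fold_events(handler.new_state(), [Event("page_load_ms", 3.0, 100.0)])
combined = handler.aggregate([handler.project(state_a), handler.project(state_b)])
print(combined)
```

Because only the running count and total are kept, the State stays the same size no matter how many Events are folded into it, which is the space and query-time argument made above.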

Architecture

Tabā System Architecture 

Applications generate Events by embedding a Tabā Client. The Client briefly buffers Events and posts them to a Tabā Agent over an HTTP-based protocol. An Agent receives Events from multiple Clients (usually on the same machine), briefly buffers them, and posts them to the Tabā Server using the same protocol. The Client + Agent scheme allows for a very simple and resource-light Client, putting more complex buffering and durability functions in a separate process. The Agent also helps performance by batching requests to the Server.
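As a rough illustration of the "briefly buffer, then post" behavior, here is a minimal sketch of a buffering client. The class name, its parameters, and the `post_batch` callable are assumptions made for this example; the real Client posts over HTTP to an Agent, which is replaced here by a plain function so the sketch stays self-contained.

```python
import time


class BufferingClient:
    """Hypothetical Client: buffers Events briefly, then posts them as one batch."""

    def __init__(self, post_batch, max_events=100, max_age_seconds=1.0,
                 clock=time.monotonic):
        self._post_batch = post_batch  # callable that delivers a list of events
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer = []
        self._oldest = None

    def record(self, tab, value):
        now = self._clock()
        if not self._buffer:
            self._oldest = now
        self._buffer.append((tab, now, value))
        # Flush when the buffer is full or its oldest event has waited long enough.
        if len(self._buffer) >= self._max_events or now - self._oldest >= self._max_age:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._post_batch(batch)


posted = []  # stands in for the Agent: each item is one posted batch
client = BufferingClient(posted.append, max_events=3, max_age_seconds=60.0)
for _ in range(7):
    client.record("bids_considered", 1)
client.flush()
batch_sizes = [len(batch) for batch in posted]
print(batch_sizes)
```

Note that this sketch only checks the age limit when a new event arrives; a production client would presumably also flush on a timer so that a quiet application does not hold events indefinitely.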

The Server is where the real magic happens. Like TellApart’s TAFE server, it’s a distributed Python service sitting behind an Nginx reverse proxy, and uses gevent for simple and fast co-operative concurrency. The Server itself is stateless, meaning you can launch as many as needed; the Agents will distribute requests across Servers. The Server receives Events and invokes the appropriate Handler to fold them into States, or generate Projections and Aggregates. A State object is maintained for each Client and Tab combination, so that the status of any Client can be queried.

We use Redis for the database, but with our own transparent sharding layer on top to overcome single-process limitations. Sharding is based on virtual buckets and supports a subset of Redis data types, as well as transactions/locking. Like the Server, as many Redis processes as needed can be run, and re-sharding without downtime is possible.
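The general idea behind virtual-bucket sharding can be sketched as below. This is not our sharding layer, only an illustration of the technique: the bucket count, the CRC32 hash, the round-robin assignment, and the shard names are all assumptions made for the example.

```python
import zlib

NUM_BUCKETS = 1024  # fixed number of virtual buckets (assumed value)


def bucket_for(key):
    # A stable hash, so every Server maps a given key to the same bucket.
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def build_bucket_map(shards):
    # Assign buckets to shards round-robin. Keys never map to shards directly,
    # so this table is the only thing that changes when re-sharding.
    return {bucket: shards[bucket % len(shards)] for bucket in range(NUM_BUCKETS)}


def shard_for(key, bucket_map):
    return bucket_map[bucket_for(key)]


bucket_map = build_bucket_map(["redis-0", "redis-1"])
key = "client-17:page_load_ms"
before = shard_for(key, bucket_map)

# Re-sharding: once a bucket's data has been copied, point the bucket at the new process.
bucket_map[bucket_for(key)] = "redis-2"
after = shard_for(key, bucket_map)
print(before, "->", after)
```

Because the key-to-bucket mapping never changes, moving data means moving whole buckets, one at a time, rather than rehashing every key.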

In Production

Currently, we have the Tabā service deployed in each region as a cluster of 2 Server instances and 1 Database instance, each running 8 processes. Each cluster handles over 10,000,000 events per minute, from 300 individual Clients across 10,000 different Tabs, with an average latency of 30 seconds. The Tabā architecture and data model have allowed us to use it for fine-grained monitoring, real-time feedback systems, and internal dashboards. We’ve only started to integrate its full functionality, and already it has provided a deeper insight into TellApart’s system.
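These figures also give a feel for the earlier claim that storing raw timestamps alone would consume gigabytes per hour. The back-of-the-envelope calculation below assumes one 8-byte timestamp per event, which is an assumption for illustration rather than a detail of our storage format.

```python
EVENTS_PER_MINUTE = 10_000_000  # per cluster, from the figures above
TIMESTAMP_BYTES = 8             # assumption: one 64-bit timestamp per event

events_per_second = EVENTS_PER_MINUTE / 60
timestamp_bytes_per_hour = EVENTS_PER_MINUTE * 60 * TIMESTAMP_BYTES

print(f"{events_per_second:,.0f} events/second")
print(f"{timestamp_bytes_per_hour / 1e9:.1f} GB of timestamps per hour")
```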

Kevin Ballard is a Software Engineer at TellApart.

TellApart Loves December

January 11, 2012

Ah, distinctly I remember, it was in the peak December
Season that shoppers ventured forth to many a retail store.

With apologies to Edgar Allan Poe, it’s safe to say that December is, by far, the busiest, craziest, and most exciting month at TellApart. Consumers make the rounds for their holiday shopping, and we get flooded with traffic and data like at no other time of year.

December 2011 was a record month for us — here are a few stats to help tell the story:

  • We handled 9.5 billion web requests (at peak traffic, over 10,000 requests per second).
  • Our products reached 23 million unique users (that’s about 2 million users per TellApart engineer).
  • A full 9 out of every 100 users who saw TellApart ads clicked, an order of magnitude better than the industry standard for display ads.
  • If you were to virtually “cut out” each of the ads we showed from a typical computer monitor and sew them into a quilt, the quilt would cover over 430 acres — that’s 4x the size of Vatican City!
  • If you were to lay the same ads end-to-end, they would form a line of nearly 17,000 miles, extending 2/3 of the way around the globe. That’s like flying from Iceland to the southernmost tip of South America by heading north.

We’re grateful to our customers, team, friends, and colleagues for making it all possible. And of course, now that December is behind us, our New Year’s resolution is to leave these numbers in the dust in 2012.

Come build the technology and products that power this growth — join us!

Measuring Page Latency

October 31, 2011

In my final interview at TellApart, I asked the co-founders if they’d describe the company as data driven. They proceeded to show me an intricate set of dashboards that monitored everything about the business and our fleet of servers in real time. The dashboards ingested Ganglia and Amazon CloudWatch data, but also homegrown metrics for things like:

  • The number of third-party ad network bids considered in the last minute.
  • The distribution of product prices across retail sites.
  • The number of clicks that have occurred since March.

The best teams I’ve worked with see data visibility as an uncompromising necessity from Day 1, and I immediately jumped at the chance to join up. Now, nearly a year later, I found something missing from our internal measurements – web page latency. Fortunately, we found an ally to help us secure this missing piece of data.

First, some background. At TellApart, our two biggest concerns are tag latency and global thermonuclear war (to paraphrase Austin Powers). Our servers handle hundreds of millions of requests per day, many of which originate from end users browsing the web. But while we can easily measure the latency of these requests, we needed a way to measure the total, end-to-end impact of TellApart tags, including the third-party tags we’re responsible for.

There are many different web performance monitoring services out there, and we tried a few of them, but only Catchpoint lived up to our high expectations. Catchpoint has a lightweight, modern, self-explanatory user interface. Many stats are quickly broken down and summarized on waterfall reports that can jumpstart performance investigations. You can also see which page elements triggered third-party requests, giving you a view into who is responsible for different elements on the page.

A few of the services we tried prior to Catchpoint made it difficult to export data into a usable format. Dashboards can be attractive, but not if you want to reformat graphs or provide your own data analysis. Catchpoint offers well-formatted CSV exports that open cleanly in Excel and other statistical analysis suites. Test maintenance is also minimized with a strong inheritance system for setting defaults and an instant test feature that can verify parameters prior to starting long-running tests.

Fast testing and well-formatted reports help us maintain a proactive approach with our clients. We catch anomalies faster, minimizing our impact on the user experience and optimizing campaign performance.

Ned Rockson is a software engineer at TellApart (with Ben Snider, technical solutions engineer).

Interning at TellApart

September 26, 2011

My summer this year was unique in many ways. It was the first time I ever saw the Pacific, the North American West Coast, California, and the San Francisco Bay Area. It was also the coldest summer I ever experienced, having to wear a sweater on most nights. Apparently, this is the “normal” San Francisco summer weather. More important, though, was my internship at TellApart.

TellApart is the fourth company I joined for an internship and, over the course of sixteen weeks, my time there grew to become the most rewarding and memorable internship I have experienced so far.

One of my main goals in joining TellApart was to experience the startup culture in Silicon Valley. TellApart has a very open and transparent culture; every member of the team knows about everything that goes on in the company. The team, aside from being super-talented, is always willing to listen to ideas, comments, and suggestions (even from the intern!). A big portion of my internship was spent working on ideas that I proposed myself, and I started working on them within a week of proposing them!

Because TellApart is a small company, it was very insightful to watch it evolve over a period of 16 weeks. To highlight just a few of the events that happened during my time there:

  • TellApart closed its Series B funding round of $13 million. In the process, and on occasions thereafter, I had the chance to meet and get to know top-tier venture capitalists.
  • Amazon Web Services blogged about TellApart’s technology and how it leverages Amazon’s infrastructure. A picture of the entire team (including me in my boxing gloves) was included!
  • TellApart was selected in VentureWire’s FASTech 50 most innovative startups for “paving a unique path technologically”.
  • TellApart hosted a “TellAparty”, where it invited techies from Silicon Valley over to the office to get to know more about the company. John Lilly, the former CEO of Mozilla, was also invited to come and speak about his experiences.

Company culture aside, TellApart works on really interesting technical challenges in machine learning, data analytics, and in scalability. Because we were such a small team, everyone got a big slice of the pie when it came to dealing with all these technicalities. As an intern, my work was no less relevant than anyone else’s work and I had a major impact on the company’s performance and revenue. Some of the technical challenges we faced include:

  • Responding to thousands of bid requests per second while keeping the response time for each request under 120ms.
  • Predicting how likely users are to make purchases given the data that we have about them.
  • Serving ads customized for each of TellApart’s millions of users.
  • Running experiments against millions of users to measure the performance of certain features.

Looking back, I can’t thank the TellApart team enough for their mentorship and support. I went to TellApart for the startup experience, and they certainly delivered the full-blown version of it.

Islam El-Ashi is an Engineering Intern emeritus at TellApart. He’s also a seasoned traveler, amateur boxer, and general man-about-town.

Serving Up a Storm

June 20, 2011

When putting together a high-performance web serving framework, a number of different software and network components typically have to be considered. For starters, there are CDNs, load balancing, reverse proxies, the application web server, and backend components like the database. In this post, we’ll start by delving into our app server software – how TellApart handles a dynamic request with extremely low latency.

Most of our incoming traffic is made up of a large number of concurrent, short-lived requests, each of which must complete in tens of milliseconds. Our initial design was based on a well-worn server configuration that performed reasonably well right out of the box: the venerable Apache web server, with mod_wsgi for running Python application code.

When we first fired up the web server and unleashed some live traffic, we noticed something curious in the performance data. Here’s how this configuration fared at handling a representative I/O-bound request.

Apache/mod_wsgi Request Times

Whoa! What’s up, 99th percentile?! In Apache/mod_wsgi, each request is handled in its own system thread. For instance, if three requests need to be handled concurrently, the OS is responsible for switching between them. So what’s the problem?

It turns out that in CPython, multithreading is subject to the limitations of the GIL. Request-handling threads occasionally enter “GIL battles” with other request-handling threads, burning CPU cycles and slowing everything down. David Beazley sets the record straight in a great series of presentations. Most Python web servers will have no trouble handling dozens of requests per second, but thread-based servers will strain when attempting to handle hundreds or thousands.

We evaluated several alternatives that did not depend on threads for concurrency and eventually chose to replace Apache/mod_wsgi with a custom web server built around the excellent Gevent coroutine networking library. We call this server TAFE (“taffy”), the TellApart front end. Gevent handles each request using a lightweight thread-like structure called a greenlet (essentially, a coroutine). Unlike threads, greenlets must cooperatively yield control flow over to other greenlets, which frees the system from erratic GIL issues at the cost of a bit of increased development complexity.
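To make "cooperatively yield" concrete, here is a small sketch of cooperative scheduling. It deliberately uses plain Python generators and a toy round-robin scheduler rather than Gevent's API, so it illustrates the concept only; the function names and the simulated I/O waits are made up for the example.

```python
from collections import deque


def handle_request(name, io_waits, log):
    # Each `yield` marks a point where the handler would wait on I/O
    # and voluntarily hands control back to the scheduler.
    for step in range(io_waits):
        log.append(f"{name}: waiting on I/O ({step + 1}/{io_waits})")
        yield
    log.append(f"{name}: done")


def run_cooperatively(tasks):
    # Round-robin scheduler: a task runs until it yields; nothing preempts it.
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, do not reschedule it
        queue.append(task)


log = []
run_cooperatively([handle_request("A", 2, log), handle_request("B", 1, log)])
print("\n".join(log))
```

The flip side is visible in the scheduler: a handler that never reaches a `yield` would keep every other request waiting, which is the kind of code a cooperative server has to hunt down.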

Here’s how TAFE performed on the same workload:

TAFE Request Times

Much better. So, was it smooth sailing from here? Well, TAFE is definitely a big improvement, but giving up on system threads for concurrency means spending more time discovering code that doesn’t cooperatively yield. We wrote a Gevent Request Profiler to help us do just that. More on that next time.

Mark Ayzenshtat is TellApart’s CTO.

New Amazon Case Study Showcases TellApart Architecture

June 6, 2011

A big part of what sets TellApart apart is our technology. When we got started, we suspected that applying the right technical power tools to our clients’ large datasets would make all the difference. But we also knew that developing a clever algorithm or building an elegant system wouldn’t amount to much if we couldn’t put them into production, quickly, and at large scale. To that end, we quickly became enthusiastic proponents of Amazon Web Services — and last December, we were honored to be runner up (2nd of 1,500 entrants) in the 2010 AWS Startup Challenge.

Today, Amazon has released a new case study showcasing some of the cool things we’ve built atop AWS. Like a real-time bidding system that serves tens of thousands of requests per second to hundreds of millions of end users. Or Tubes, our cross-region data replication system powered by SQS. These days, we use a veritable alphabet soup of AWS services and are always looking for new ways to integrate AWS into our stack.

The case study gives a good overview, but if you’re hungry for details, fret not — we’ll share much more about our architecture with readers of this blog in posts to follow. And if you like what you read and find these kinds of big data problems interesting…we’re hiring!

Mr. Jerb: How We Launch mrjob at TellApart

June 1, 2011

At TellApart, we’ve become fans of Mrjob, an open source Python package that provides much-needed hooks into Hadoop’s streaming framework. It allows for development of a job without needing to worry about infrastructure, the format of the input, or code-base-bloating shell scripts. Writing a streaming job is as easy as writing a generator in Python.
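To show what "writing a generator" means here, the sketch below is a word count written as a generator-style mapper and reducer. It is plain Python with a tiny local runner standing in for Hadoop streaming, not Mrjob's own API; in Mrjob the mapper and reducer would be generator methods on a job class, and the `run_locally` helper is purely hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    # Emit (key, value) pairs for one input line.
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    # Receives one key and an iterable of every value emitted for it.
    yield word, sum(counts)


def run_locally(lines):
    # Stand-in for Hadoop streaming: map, sort by key (the "shuffle"), then reduce.
    mapped = sorted(
        (pair for line in lines for pair in mapper(line)), key=itemgetter(0)
    )
    for key, group in groupby(mapped, key=itemgetter(0)):
        yield from reducer(key, (value for _, value in group))


result = dict(run_locally(["the quick fox", "The lazy dog"]))
print(result)
```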

Prior to Mrjob, the thought of writing a streaming MapReduce job made me shudder. Simply adding a mapper required adding the Python module, a test, and a shell script (with the path to the mapper hardcoded) to invoke the module with different command line parameters than the test used. It was almost enough to make me avoid writing jobs that had a reduce phase, as it meant doubling the time I’d have to put in.

Mrjob mostly does the trick, but we prefer to invoke it with a launcher that we call MrJerb. MrJerb makes launching mrjob jobs as simple as writing a generator or two. Note that we use command line parameters to launch the streaming job because that’s more consistent with our own code base, but your mileage may vary. (I recommend using version >=0.2.2 of Mrjob, or expect to pull your hair out while trying to work with hadoop_extra_args.)

Here are a few other things it simplifies:

  • Adding jar files and partitioners.
  • Setting the equivalent of --python_archive (set to be released in mrjob 0.2.7).
  • Setting output protocol and other job attributes (name, number of reduce tasks).
  • Setting cleanup options for the job.
  • Optionally deleting the output path.

Check out MrJerb on GitHub.

Ned Rockson is a software engineer at TellApart.

 

TellApart Engineers Sanjay & Dmitry Discuss Why Engineering at TellApart is a Unique Opportunity

May 23, 2011

TellApart Engineers Sanjay & Dmitry recently headed over to the UC Berkeley Startup Fair to share with prospective hires what it’s like to work at TellApart. It was a great opportunity for us to meet and spend time with undergraduate & graduate students in computer science, electrical engineering and mathematics. 

In this 90-second clip, they convey both the breadth of technical challenges that we tackle on a daily basis and a little about our company culture and exciting growth. We may request that they pitch in soon on some marketing collateral as well…

Recruiting stars from top universities is a big focus for us now, as we are meaningfully expanding our engineering team over the next few quarters. If you’re in an academic environment and are looking to make a move into a fast-paced startup, take a look through our open positions (recently updated) and drop us a line!

The 1st Tech TellAparty: Thursday, May 19th, 6-10pm

May 17, 2011

This Thursday, we’re excited to welcome current Greylock Partner and former Mozilla CEO, John Lilly, to our office to give a talk entitled:

“How to Build an Eclectic Career:”
Reflections of an Engineer -> CEO -> VC

This talk will be geared towards software engineers from a variety of backgrounds & will be held at the TellApart office in Burlingame, CA. 

There will be food, wine & beer, as well!

Doors will open at 6pm and John will start speaking at 7. We’ll have Q&A until about 8 & we’ll wrap up around 10pm for those who want to hang out afterwards.

If you plan on attending, please RSVP on our Facebook Event page by Wednesday at 10pm, so we can properly plan our numbers. Thanks!

Copyright ©2012 TellApart, Inc. All rights reserved.