Do you struggle keeping your web-based application snappy and performant, and keep development of new features moving forward?
Do you spend far more in operations costs than for servers and hosting?
So did we.
Citrusleaf brings internet-scale development to the masses. Citrusleaf’s philosophy combines low latency transactions, extraordinary manageability, and a great developer interface to create a data storage platform that allows you to build any application faster, better, and cheaper.
Here is Citrusleaf’s promise to you:
Key to today’s new web applications is personalization. In web application development, tracking an individual person is done through a session object, which links to information about the user.
A user’s session information is stored partially in an HTTP cookie, and partially in a persistent datastore such as a database. Information such as ‘last page seen’ would be placed in a cookie, with more persistent user information such as contact information stored in a server.
As an application evolves, functionality increases by storing more and more information on the server, thus available to all users, and available to analytics engines. In this programming model, it is common to do at least one database read and write for every user page view, creating grave limits on performance and pushing up the cost per page view.
Data use of this type is often difficult for today’s standard application environment which uses a cache layer for speed, and a master / slave replication model. There is no parallelism or scaling for write operations, so there is no ability to add hardware to improve the write performance. The master – and all the slaves – must be upgraded constantly as application load increases.
A common solution to user state information is to store is solely in a fast non-persistent store, such as memcache. Memcache’s extraordinary read and write performance allows costs to be kept low while implementing advance user personalization. Yet, memcache has serious limitations. Typically, only a small amount of user history (the current session) is kept, thus disallowing features such as recommending new activity based on the user’s activity of a previous – and possibly days old – session. Memcache also has no ability to iterate the data in the store, disallowing even the most basic form of analytics for better business intelligence analysis. As a non-persistent store, critical user information could be lost at any moment.
Sharding a database is a common approach to remove write bottlenecks, but sharding reduces the power of centralized user storage. Centralized user information storage enables features like social networking and deep analytics, where one user can see a friend’s recent behavior, and jobs can be run to easily find hot trends in your web site. Simple joins may no longer work, as the target data may be in another shard, leading to a sharding layer that becomes a critical path.
Citrusleaf’s fully distributed architecture and immediate consistency model makes rich user application development a snap.
Citrusleaf distributes write operations throughout the cluster, removing the single-point bottleneck. Writes are slower than reads – as the write is committed on multiple servers according to the configured replication count – but are still very, very fast. More importantly, Citrusleaf’s scaling architecture allows easy incremental addition of capacity. Instead of needing to back up your databases to add a shard, simply start a new Citrusleaf node.
Citrusleaf’s immediate consistency model promotes easy application development. Any application server can read from the cluster and be assured of current user state information. The “compare and swap” (CAS) API feature allows simple and safe “read-modify-write” semantics in the application, which are simple yet extraordinarily powerful for application developers.
Finally, Citrusleaf’s ability of distributed cluster scanning allows simple but powerful analytics both for business intelligence and advanced application features.
Company X found a new and interesting business – creating an advertising marketplace.
Placing advertising optimally is a good business. Effective advertising placement systems run internal algorithms which compete against each other in real time to find the best advertisement for a given user, on a given page, taking into account all possible information. This architecture – where multiple algorithms run in parallel using different information – is known as a bid architecture.
Company X buys ad space cheap, and invites bidders to bid for each ad on a real-time basis, each ad spot at a time. The bidders can run any algorithm they want, but need to run within Company X’s hosting infrastructure, so that the auction runs within a very quick period of time – 10 to 30 milliseconds.
High auction values result when bidders are given a maximum amount of information. Company X’s infrastructure is built to aggregate as much information as possible about the placing web site, the content near the ad, the geographical location of the viewer, and the history and demographic information of this particular viewer.
While demographic information can be gleaned and bought from a number of sources, the history of a viewer is a challenging real-time problem. At the beginning of every auction, the user’s past month of activity must be retrieved and sent to all bidders. The fact that this user viewed this page at this time must be recorded. Any interaction with the user (clicks) must be recorded.
The business of advertising – as Google has shown – is a matter of fractions of a penny per ad. The cost of the storage and application infrastructure is critical, and must scale evenly as the business expands. If the company can show a profit on each bid with linear-scale infrastructure, the business will be a success.
Company X’s business target was personal information for 1 billion people, at about 2K per person. This datastore would need to exist in two data centers, one on America’s east coast, and one on the west coast. The target number of auctions per second was in excess of 50,000 per second, each auction generating one read and one write, but the system needed to scale based on business demand.
Enterprise hard disks are rated at about 200 seeks per second. A standard database might stream the write load, but require seeks for every non-cached read, or the reverse. To reach the target with a standard database such as MySQL would require in excess of 300 spindles, and about 50 database servers, each acting as a master, and dedicated engineers and DBA staff to manage the sharding. The system would need overbuilding to handle load spikes. Price can be modeled, and compared against each ad placement. A system of this design was shown to lose money.
Intel X25-E and STEC Mach4 SSDs and FusionIO ioExtreme cards have the necessary IO throughput. A Citrusleaf system with only 4 nodes can be built which provides the level of throughput and storage required, at a fraction of the cost of a database solution. This solution would have IO capacity in excess of 100k transactions per second, thus able to easily handle load spikes.
Citrusleaf’s operations characteristics, such as no-down-time backup and automatic failover, would have been ideal for such a demanding application. Saving a single seat of operations personal has extreme cost savings as well.
As volume grows, the system is able to scale to meet any reasonable need.
Back to topImagine a social networking application which doesn’t just allow posting of 140 characters of text a few times a day, but automatically transmits the web pages you browse, as you browse them. Pages which you spend time on - say, scroll up and down on - might get flagged as more interesting. You would see your friend’s current browsing locations in real time. Of course, site blacklists and a privacy button would be required.
Unlike today’s Twitter which handles hundreds of tweets a second, the TwitURL service would handle writes every few seconds for each active user, resulting in a thousand-fold traffic increase. A trivial implementation would use read polling. Every few seconds, the current friend list would be read, then for each friend read a “last written” field, then fetch recent URLs for the active friends.
Traffic would be enormous.
And, after only a few weeks, the collected data volume would be extreme. If a standard database is used, the service would likely require huge funding just to determine whether the service is profitable.
Due to the extraordinary transactional capabilities of Citrusleaf, this application can be handled with very little hardware – and be poised to scale out when the service becomes popular.
In example implementation, each user would have their list of friends, and each friend would have the recent list of URLs, and a “recently written” timestamp. Thus, every active user would generate one write every few seconds as they browse, and stagnant users would generate reads equal to the number of friends every polling interval – say, every 10 seconds.
A user’s list of behavior can be appended and modified using a read-modify-write (CAS) cycle, which allows lock-free safe modification of complex data structures in your preferred application language. Using the CAS cycle, the program reads a complex data structure like an array, and while reading it receives a token. The application then modifies the data structure (such as deleting old elements and adding new elements), then attempts a write, using the token. If a different write has been applied, the write being attempted fails and the application can retry or disregard the input.
The standard approach to this problem using a database would be to write every behavior into a single table as they occur. A rather simple SQL query can be created to find all recent behaviors of all one’s friends, sorted by time, with a small limit similar to the display space in the user interface.
The key-value approach creates more data motion between the application server and the database, but the database becomes a more general purpose tool. The extraordinary transaction speed of Citrusleaf becomes the enabling technology.
This general pattern can be used for similarity calculations. Similarity metrics, like cosine, Jaccard, Dice, and Mountford, rely on quickly finding connections, and applying a running calculation during the exploration of a graph. In the example of TwitURL, and interesting calculation might look at the connected friends, then the connected websites (or URLs). Interesting similarity metrics down-weight popular answers, and provide a spread of different answers. Different metrics should be field-tested and tuned to maximize whatever business effect is desired.
As there are a variety of different calculations leading to different results, it is best to apply these calculations in a high-level programming language. A fast database like Citrusleaf, which can easily handle thousands and thousands of requests a second, can be the foundation for a calculation engine of this type.
Read a specific example of behavorial calcuation - with sample code -
on our wiki.
With the advent of iPhone and Android phones, the growth in mobile smart phones is about to become even more furious. In the last quarter, world-wide mobile smart phone shipments exceeded 40 million. This, coupled with the evolving network plans into flat rate pricing will result in enormous growth in mobile users that can use the internet. For example, the number of downloads from Apple App Store recently crossed the 2 billion mark with the last half billion in the last quarter alone. The explosive growth in the mobile market presents a huge opportunity for application developers but only if applications can be built rapidly and deployed in a frictionless manner. Citrusleaf is exactly the kind of platform that enables this. Let us consider a concrete example.
Typically, adapting an application to run on mobile devices and networks requires addressing key limitations of the mobile device and network. Let us consider the problem of sending extra long urls to mobile devices:
There is a simple solution possible here for doing a server side compression of the URL before it is sent to the device. The shortening, however, needs to be done efficiently and scalably. Citrusleaf can used t solve this problem as follows.

The above solution is applicable to many cases that happen in real systems (e.g., analytics gathering for a web site). Note that one can run queries on the URL set to learn about the run time behavior of the URL processor and the application with respect to the URL translation function.