Posted by tibbetts
Tue, 25 Sep 2007 10:53:46 GMT
The opening keynote for this year's VLDB was a great presentation by Amazon CTO Werner Vogels, describing their data management challenges. I particularly appreciated it because it echoed something I've been saying for a few years now: Web-scale companies have problems which cannot be managed with standard RDBMS, or with any common research systems, and so they have started to route-around the database research community.
Vogels presented lots of information. A big part was about the operational realities of a site as large as Amazon. This has been said before, but saying it to a room full of database researchers was a good thing:
- Incremental scalability, up or down, one node at a time, is a key requirement
- Adding resources for redundancy must not hurt performance (best if it helps).
- Failures are not uncorrelated, they are generally quite correlated.
- Systems that fail do not fail stop, they generally fail in some more complex way (eg, periodically coming back up and failing again, or outputting garbage).
Amazon's query workload shares a lot of characteristics with other web sites with which I am familiar, such as eBay, Google, Facebook, and LiveJournal. Amazon has lots of read-mostly workloads and lots of primary-key-lookup-only workloads (65% of all queries are by primary key). Only 5% of their queries need full RDBMS query functionality. Most of their writes are also by primary key, and many storage systems support only that kind of write.
When it comes to those writes:
- Only 10% of queries need strong consistency.
- "We consider strong consistency to be evil," because it is impossible to implement without the potential for downtime or failure.
- Most systems at Amazon are designed for eventual consistency.
- One interesting case is the need for always-writable data storage. One use case is ordering. "If the customer decides to give you his money, you must always take it, that's a business principle." Always writable storage obviously implies a conflict resolution scheme.
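To make the last point concrete, here is a minimal sketch of what a conflict resolution scheme for always-writable storage might look like. It is not Amazon's actual scheme, which the talk summary above does not describe. It assumes two replicas of a shopping cart each accepted writes while out of contact, and that the business rule is "never lose an item the customer added". The function name `merge_carts` and the max-quantity rule are both hypothetical choices made for illustration.

```python
def merge_carts(cart_a, cart_b):
    """Reconcile two divergent replicas of the same cart.

    Assumed rule (hypothetical): keep every item seen on either replica,
    and when both replicas have an item, keep the larger quantity.
    """
    merged = dict(cart_a)
    for item, quantity in cart_b.items():
        merged[item] = max(merged.get(item, 0), quantity)
    return merged


# Two replicas that both accepted writes while partitioned.
replica_1 = {"book": 1, "pen": 2}
replica_2 = {"book": 3, "lamp": 1}
reconciled = merge_carts(replica_1, replica_2)
```

Because the rule is symmetric, both replicas arrive at the same cart no matter which one initiates the merge. The cost is that a deliberate removal can be undone by the merge, which is the kind of trade-off an always-writable store forces you to choose explicitly.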
All this points to a traditional RDBMS being the wrong system. Interestingly, if you assume you need a single general purpose system, then RDBMS is the best available. But the real answer is to use a set of customized data management tools. This all echoes Stonebraker's claim that one size no longer fits all uses.
Where do we go from here? The real problem for the research community is not to help Amazon and Google build a 5% better system. The real problem is to package up (aka abstract) these tools in such a way that, like databases, average programmers can work with them. Because problems that PhDs at Amazon and Google solve today will soon be problems everyone needs to solve.
2 comments
Posted by tibbetts
Wed, 06 Jun 2007 11:52:00 GMT
A few weeks ago Bruce Schneier discovered a classic economics paper, "The Market for Lemons". The paper describes the behavior of markets where sellers have detailed information about the products, particularly the quality of the products, that buyers do not have. It uses the example of used cars.
In these markets, the price buyers are willing to pay is defined by the average quality of the goods on offer. Buyers lack information, so can only assume they are going to get a product of average value. Unfortunately, this lower price drives the best products out of the market, because sellers (who know they have the best goods) won't accept that price. When the best goods are removed from the market, the average quality drops, the price drops, and the next best goods are removed from the market. The conclusion is that in these markets quality falls until it matches the amount of information that buyers have.
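The unraveling described above is easy to see with a little arithmetic. The sketch below is a toy model rather than the paper's own formulation. It assumes each seller's reservation price equals the true quality of their good, and that buyers offer exactly the average quality of whatever is still on the market. The function name `unravel` and the ten example qualities are made up for illustration.

```python
def unravel(qualities, max_rounds=20):
    """Simulate a lemons market and return (price, goods_left) per round.

    Assumptions (hypothetical): buyers pay the mean quality of the goods
    still for sale, and a seller withdraws if the price is below quality.
    """
    market = sorted(qualities)
    history = []
    for _ in range(max_rounds):
        if not market:
            break
        price = sum(market) / len(market)
        history.append((price, len(market)))
        # Sellers whose goods are worth more than the price leave.
        remaining = [q for q in market if q <= price]
        if len(remaining) == len(market):
            break  # nobody else leaves; the market has settled
        market = remaining
    return history


history = unravel([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
for price, goods_left in history:
    print(f"price {price:.1f} with {goods_left} goods left")
```

With these numbers the price starts at 5.5 and falls each round as the better goods are withdrawn, until only the worst good is left for sale.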
Bruce applies this principle of economics to explain why there are so many bad security products. But others in the blogosphere have picked it up as a way to describe other parts of software. I first saw it at David Anderson's blog, where he talks about the lack of information when hiring software engineers. The most stark application to the job market comes in a Reddit comment through:
I just realized that it applies to the IT job market. Here the seller (the applicant) has all the info about himself, while the employer knows nothing. So what happens is that companies expect the average, and pay accordingly. That's why people who are smarter than average shouldn't go on job interviews, because they'll likely to get below what they are worth.
In the IT market it also helps to explain the pervasive use of certifications. Even if they don't indicate that a candidate is good, they do cut out the bottom of the distribution of candidates (the truly terrible sysadmins). Since top people will have already taken themselves out of the market for these jobs, it's ok to alienate them. Removing the bottom people pulls up the average, so prices (salaries) presumably rise.
As much as I like thinking about the talent market, my favorite application of this meme is Reg Braithwaite's The Not So Big Software Design where he applies it to tract housing (a pet issue of mine) and by metaphor to custom software development. The customers for both new houses and custom software are definitely ignorant, and they tend to buy based on what we in the software industry call buzzwords. For new home buyers, these are things like granite countertops, en-suite master bathrooms, the number of bedrooms, etc. Builders optimize for these easy-to-observe items, at the expense of things that can be critical to livability or maintainability of homes.
It's a good discussion of the realities of custom software. The metaphor does tragically fall down though. At least buyers of used cars and tract housing can resell them to the next ignorant buyer. Companies are generally stuck with their custom software.
Posted in Software Development, Business | 1 comment | no trackbacks
Posted by tibbetts
Mon, 26 Mar 2007 22:18:37 GMT
Just came across Phil Windley's summary of the tutorial Coder to CoFounder at ETech by Marc Hedlund. I didn't make it to ETech, so I can't give any first-hand account, but even Phil's summary is worth reading if you are interested in startups and trying to decide between being a founder and going to a more established startup.
My personal experience at StreamBase is that you get a phenomenal amount of education out of being a cofounder. Being involved in the early stage exposed me to all aspects of the business, from finance to marketing, sales to recruiting, all when I was just out of grad school. Of course, if you don't choose wisely (and get lucky) this can be an expensive lesson.
I think Paul Graham of Y Combinator had good advice when he spoke at MIT: Work at a later stage startup before you try to start one. It's a less expensive (they'll probably even pay you) way to learn some of the easy lessons, before you head for the real thing.
no comments
Posted by tibbetts
Mon, 26 Mar 2007 00:31:20 GMT
I was chatting last night with a Web 2.0-oriented friend of mine, and applications that support disconnected operation and synchronization came up. These have been in the news a bit lately, with the release of the Ruby on Rails based Joyent Slingshot and talk of the Flash based Adobe Apollo.
The cliche that those who forget history (or never learn it) are doomed to repeat it is apt. A lot of people thinking about and building these applications have never heard of Bayou, though they have generally heard of Xerox PARC. The project ended in 1997, but there is still a retrospective page with publications list. From the overview:
The Bayou system was designed to support collaboration among users who cannot be or choose not to be continuously connected. Network connections may at times be too slow, too expensive, or too faulty for users to effectively utilize, or it may not be possible to establish a network connection at all. [...] A major premise that distinguished the Bayou effort from previous work on replicated data algorithms is that disconnected operation by mobile users is a common, rather than exceptional, case.
Bayou identified and solved many of the hard problems of disconnected operation. Of course, the solutions are not web-oriented, and sometimes take the wrong approach. Bayou saw their goal as standardizing a protocol for communication between an application and a data store. In the modern world, this would most likely be done over HTTP/XML/AJAX. But many of the principles are the same. Anyone who is interested in disconnected operation and synchronization should familiarize themselves with Bayou.
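For readers who have not thought about the problem, here is a minimal sketch of disconnected operation with later synchronization. It is only loosely inspired by the idea of replicas exchanging logged writes and is not Bayou's actual protocol or API. The `Replica` class, its method names, and the rule of replaying writes in (counter, replica name) order are all assumptions made for illustration.

```python
class Replica:
    """A toy replica that accepts writes offline and syncs pairwise."""

    def __init__(self, name):
        self.name = name
        self.clock = 0
        self.log = {}  # (clock, replica name) -> (key, value)

    def write(self, key, value):
        # Always accepted locally, even with no network connection.
        self.clock += 1
        self.log[(self.clock, self.name)] = (key, value)

    def sync_from(self, other):
        # Pull any writes we have not seen, and advance our counter.
        self.log.update(other.log)
        self.clock = max(self.clock, other.clock)

    def state(self):
        # Replay every known write in one agreed order (assumed rule).
        result = {}
        for stamp in sorted(self.log):
            key, value = self.log[stamp]
            result[key] = value
        return result


laptop, phone = Replica("laptop"), Replica("phone")
laptop.write("title", "draft")   # both devices are offline here
phone.write("title", "final")
phone.write("owner", "bob")
laptop.sync_from(phone)          # later, they reconnect
phone.sync_from(laptop)
```

Once both replicas hold the same set of writes they replay them in the same order, so they converge to the same state. Everything interesting (detecting real conflicts, application-specific merges, trimming the log) is left out of this sketch.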
no comments
Posted by tibbetts
Mon, 26 Mar 2007 00:11:02 GMT
Two weeks ago I got a new laptop from work. After extensive hemming and hawing, I went with an Apple MacBook (the black one, cause it looks hotter^Wmore professional). Previous to this I had been running Ubuntu on a Thinkpad T40 bought around the founding of StreamBase. My goal for the change was to have a laptop that would "just work", and to stop having to administer my personal machine. I was last on a Mac when I was in grad school.
As far as that goal goes, I think the switch has been a rousing success. In non-development activities (eg, web, email, calendar, documents) it has been a significant improvement. The tools I'm finding myself using include:
- Firefox - Safari just isn't good enough, and on the Intel processor Firefox is plenty fast
- Terminal.app - In preference to X11.app and xterm, because it is better integrated with everything else
- Mail.app (aka Apple Mail) - Because as part of the switch I'm going to stop hacking my mail client and see how the other 90% of the population lives. Thus far, using a less featureful mail client has been a success for spending less time with email thanks to unsubscribing from things.
- Adium - This is the best graphical IM client I've ever seen.
- iCal - The Apple calendaring tool is adequate, though I think I may end up switching to something with Exchange support, as work moves in that direction.
- iTunes - Of course, this is a huge improvement over anything on Linux, particularly for synching with my iPod.
- Microsoft Office - I'd considered other alternatives, but the MacBook came with office preinstalled, and once I had it easily available (instead of in VMware) I couldn't say no.
- NetNewsWire Lite - A feed reader that is much better than bloglines. I haven't done much with the NewsGator integration, which might be interesting. And I haven't seen a reason to buy the non-lite version.
- OmniGraffle - This is the best diagramming program I've ever used. Vastly better than anything on Linux, and much better than Visio.
- Parallels Desktop - Much nicer than VMware workstation on Linux. Well polished, and the Coherence feature is pretty hot.
- Desktop Manager - Free tool to implement virtual desktops. Does everything I want in this space.
- Quicksilver - This is basically a graphical command line for the Mac, accessible from anywhere. It's very nice. Like screen, you have to try it to learn how much it will change your life.
- Visor - This is a cute hack that makes a Terminal only a keystroke away at any time. It's a good complement to Quicksilver.
- MailActOn - This is a little tool that lets you define keybindings in Mail.app, mostly to refile mail into folders with a few keystrokes.
- TextMate - This is trying to replace emacs in my life. It's a more mac-oriented text editor, with a pretty good feature set and good support for my emacs finger macros. But I may end up going back to emacs.
- Ecto - This is my latest addition. It's a blogging client that I'm using to write this post. I'm not sure I'm in love with it enough to pay for it, though it is a bit nicer than Performancing, the Firefox plugin I had been using.
That's about it for productivity tools. On development tools, I haven't had to install very much:
- Apple developer tools - This comes on the standard install media, and gets you gcc, autoconf, and all the other things you would expect.
- SSH Agent - This is a version of the standard ssh-agent which integrates with the account management on OSX, so that you can use the agent from any application/shell in your login.
- SVK and Subversion - These are special builds for OSX, they seem to work well.
- Eclipse - Standard Eclipse is available for OSX
- MacPorts - This is a package system for getting various free tools. I currently only use it to get Cocoa Emacs.
So, that's the enumeration of tools that I am using. Hopefully this is helpful to people. I may follow up on this with other posts about my experiences on OSX.
I will also shortly make a non-tools post.
Posted in Tools | 2 comments
Posted by tibbetts
Sat, 02 Dec 2006 01:10:19 GMT
I was at MIT today and so I ended up going to an invited talk on computer architecture, Subtle Semantics and Unrestricted Implementation of Transactional Memory. Transactional Memory is a very hot topic in systems and architecture. It is perceived to be a better model for programmers, so language designers like it. And there are a variety of options for pure-software and hardware-assisted implementations. And because it enables optimistic concurrency control, transactional memory can help make programs faster and more scalable on new multi-core architectures. There is every reason to believe that processor vendors will begin including some form of hardware support for transactional memory.
The focus of the talk was on the subtle issues that come from this. It dismissed a few of the more positive myths about transactional memory:
- Programs using locks cannot be trivially converted to use transactions. In fact, a correctly behaving program using locks to manage things like inter-thread communication can easily deadlock or livelock when using transactions.
- Transactions are not perfectly composable or nestable, as is often claimed. This means hierarchical or nested transactions, where there is a program-level transaction wrapping several library-level transactions, can lead to deadlock or livelock.
- Whether a transactional system is weak (non-transactional code executes concurrently with transactional code, with side effects visible between) or strong transactional (a transaction is atomic not just from the perspective of other transactions, but also to other non-transactional code) can have a significant effect on program behavior. Not only can a program designed for strong transactions have problems under weak transactions, but a program that behaves correctly in a weak transactional system may deadlock in a strong transactional system.
This leads me to two primary conclusions, with which I think the speaker would agree: First, transactional memory can have significant confusing side effects, just like locks, and so it is not a solution to the difficulties of multithreaded programming. Second, if processor vendors implement, and many programmers use, weak transactions then we may never get strong transactions.
I'd like to take this one step further. I think language designers and computer architects who are excited about transactional memory are missing the point. They are trying to create a single concurrency control mechanism that can solve everyone's problems. In fact, I think there are two separate classes of problem that are best addressed separately: Concurrency control for systems programming, and concurrency control for application programming.
Systems programmers are close to the hardware, inside or right on top of the operating system. They are the people who are most comfortable with the concurrency control mechanisms of today (pthreads and java.util.concurrent). Transactional Memory should be targeting these users. The goal is not to give them an easy way to do concurrency, but to make optimistic concurrency control possible and efficient. Continuing to discuss the merits of strong and weak transactions is reasonable, but it should be in the context of what most efficiently represents the capabilities of the hardware.
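For anyone unfamiliar with the term, optimistic concurrency control means doing the work without holding a lock, then checking at commit time whether anyone else changed the data, and retrying if they did. The sketch below shows the pattern in plain Python. It is not transactional memory, and it is not taken from the talk. The `VersionedCell` class and `atomically` helper are hypothetical names, and a small lock stands in for the atomic compare-and-swap that hardware would provide.

```python
import threading


class VersionedCell:
    """A value with a version number, committed only if unchanged."""

    def __init__(self, value):
        self._value = value
        self._version = 0
        # Stand-in for a hardware compare-and-swap (assumption).
        self._commit_lock = threading.Lock()

    def read(self):
        with self._commit_lock:
            return self._value, self._version

    def try_commit(self, expected_version, new_value):
        with self._commit_lock:
            if self._version != expected_version:
                return False  # someone else committed first
            self._value = new_value
            self._version += 1
            return True


def atomically(cell, update):
    """Apply update(value) optimistically; return the number of retries."""
    retries = 0
    while True:
        value, version = cell.read()
        if cell.try_commit(version, update(value)):
            return retries
        retries += 1


counter = VersionedCell(0)


def worker():
    for _ in range(500):
        atomically(counter, lambda v: v + 1)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final_value, final_version = counter.read()
```

The update function runs outside any lock, which is the whole point: contention costs a retry rather than a wait. It also hints at why side effects inside the update are trouble, since a retried update runs more than once.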
Application programmers generally work on top of a platform, and don't have to think about concurrency in the same way. If they don't work on a platform, then they should. For example, application programmers are often writing business applications as web services on top of J2EE. By using platforms, they can work with much higher level management and concurrency control. The most common idiom for concurrency control is the database transaction (or distributed transaction). It is able to handle most concurrency problems in traditional RPC-and-datastore systems.
To help application programmers, we need to develop additional idioms. Not all programs can utilize the J2EE style of platform. I expect to see a significant advance in scientific computing tools like MATLAB to take advantage of multiprocessor systems (in my ideal world it would be by vectorizing based on static and dynamic analysis, but I digress). Furthermore, there is a large set of event processing applications (stream processing, complex event processing) which are also not well served by J2EE-type platforms and database transactions. Which is why I work on StreamBase (and on StreamSQL), which is creating next generation programming models for those kinds of applications, to enable parallelization, manageability, and scalability.</shameless-plug>
Posted in Computer Science | no comments | no trackbacks
Posted by tibbetts
Sat, 18 Nov 2006 14:30:00 GMT
I find the economics of enterprise software to be quite interesting, but have never had a theory of pricing that I really felt comfortable with. I'd like to present a new (to me) explanation for how price is determined.
Many people are uncomfortable with the concept of "enterprise software" as somehow different from other software. To me, the important difference is that enterprise software is purchased by a large organization for use in a business-critical system that they will be stuck with for many years, possibly decades. The "enterprise" isn't so much in the software, as in the company that is selling it, from their sales and marketing through their services and support.
Enterprise software is also consistently very expensive. There are a couple of potential reasons for this. One is that the benefits of the software are large, because the company is large (value-based pricing). Another is that the software is built for a small market of large customers, so even though it is duplicable like most software, its development costs must be recouped from just a few sales (cost-based pricing). I don't think either of these is a good explanation. The benefits of enterprise software don't generally scale with organization size, since the software is frequently used by a single group or business unit which is no larger than many non-enterprise customers. And even products where the cost is easily recouped over many customers (Oracle, SAP) maintain high prices.
I submit that the price of enterprise software is high to provide incentive for companies to go through the enterprise acquisition process. Because the enterprise customer is going to be stuck with the software for many years, their costs will be dominated by their internal costs of support, maintenance, training, etc. To any given business unit, the value proposition for a piece of software might be clear, but organizations quickly learn that they must restrict purchasing authority, lest they end up with an unmanageable IT infrastructure. Enterprise customers establish processes for buying software with long timelines and significant technical and political hurdles. The result is a process so difficult companies must employ full time experts (enterprise software salespeople) to manage it. Some large organizations even go so far as to have their own internal consultants that work with new vendors, to help them through the process.
To a software vendor, which previously had an infinitely sharable good, this purchasing process greatly complicates their business model. They are limited in the number of sales opportunities they can pursue. They must compensate their sales organization. And they must expend significant effort before there is any guarantee of revenue. So, in order to incent companies to work through their process in the first place, large enterprises must pay a premium for their software.
One interesting corollary to this theory is the application to Free and Open Source Software. With free software, the common case is to treat the good as sharable, make it available online, and charge nothing. If cost of software were a significant part of large enterprise purchasing decisions, then free software would be doing really well. But, in fact, the important aspect of software sales is not price, it is having a professional push the deal along, assuring that your software meets with each of their objections and concerns. Free Software, without any such advocate, doesn't have a chance against commercial software in many enterprises.
Realizing this, many companies who already happened to have enterprise sales forces are developing techniques for selling free software to enterprises and claiming their prize at the end. Some startups, like MySQL and Redhat, have also gotten involved. However, the enterprise acquisition process is still slanted towards buying from traditional vendors who understand enterprise concerns. And unfortunately, with a free software product, a small company has little protection against an established enterprise vendor taking over their product, as Redhat is experiencing recently with Oracle.
If this is all true, then there is a different approach to be taken by Free Software: internal advocates. Rather than pay vendors significant sums of money to sell them a piece of free software, enterprise customers could develop a system of internal advocates, who filled the same role. Much like sales people, they would develop expertise in the acquisition process, and personal relationships with IT buyers throughout the organization. Unlike sales people, they could advocate for a variety of products, and they would be in a position to identify new free software products long before vendors like IBM, Oracle, and Novell. I don't know of any organization with such a structure, but it seems likely to emerge.
Posted in Business | no comments | no trackbacks
Posted by tibbetts
Sat, 18 Nov 2006 13:38:00 GMT
Had a few hours to kill in Midtown Manhattan today, and so took the opportunity to check out some retail that can only exist in New York.
First I went by the Apple Store on 5th Avenue, which I like to call the Mother Church (standard mall Apple Stores being known as the iShrine). I hadn't been there before. In person, it is quite impressive. The glass cube occupies the center of a previously open plaza. Reminds me of a NeXT box crossed with the Louvre, which is probably the point. Entering the cube, you are standing on glass over an open hole. In front of you is a cylindrical elevator, and around the elevator curves a glass spiral staircase. A worker (acolyte?) seems to be constantly cleaning the glass steps by hand. Descending the staircase, you enter the underground store.
The store is fairly large, probably 4 times the size of a standard Apple store. And it is completely full of people standing around large blond wood tables fondling iPods and laptops. In this way it is fairly similar to a regular Apple store. However, if you try to buy an iPod, a major difference becomes clear. There are no cashiers in evidence, certainly not between you and the door. But if you stand there long enough, a red-shirted Apple rep with a backpack and a handheld Symbol scanner comes by. He will then offer to sell you one of the iPods out of his backpack, swipe your credit card through his device, and you are on your way. A similar setup applies to laptops and presumably to desktop computers. The net effect is they can vary the number of "cashiers" in real time, deploy them around the store as appropriate, allow them to encourage customers to make a decision, and make the instant gratification of shopping at the Apple store just a little more instant.
Secondly, I went to Bloomingdales for the first time. I hadn't even been in a New York department store before, though I had been in a few in Chicago. Much more so than any of the stores in Chicago, Bloomingdales reminded me what a department store was supposed to be, before malls and before they got killed by boutiques with good supply chains and big box retail. The most important difference between the Bloomingdales on 59th Street and any department store I've visited in suburbia is not the size (it may actually be smaller); it is the salespeople. I spent just a short time, checking out possible presents and looking at laptop bags. In that time I was approached by half a dozen salespeople. And most interestingly, every one of them seemed intelligent and helpful, the kind of person who I would actually want to assist me with my shopping.
This is a level of service I have had trouble finding in Boston, though some of the stores at some of the high class malls in Newton can approach it. Obviously it is only possible due to the density of rich people in Manhattan. But it is nice to benefit from it. I need to plan more time in New York for shopping.
Posted in Business | no comments | no trackbacks
Posted by tibbetts
Fri, 10 Nov 2006 13:27:00 GMT
I got invited into the Beta at Me.dium, a new collaborative/social browsing system. It's not dissimilar from the third-party-comments system I was pondering back in the spring, if anyone remembers that. However, rather than being focussed on comments, it is also focussed on real-time browsing. You get a side-bar that shows you what people are looking at. It also has a facility for inviting your friends and sharing browser state with your friends. If anyone is interested in an invitation, let me know.
The model of being more-realtime has its problems. In order to use the tool, I need to expose everything I do in my browser (when I don't remember to turn off the tool). The upside is the Me.dium system gets more information, which may make it more valuable to other users. The downside is I have less control over what goes into it. The system I was conceiving of, which would be similar to Pearl Comments, would just be focussed on collecting comments that users actively gave.
Pearl Comments is actually an interesting bit of technology, which is being under-utilized in its current incarnation. They use it to moderate interactions between web designers and clients (it enables clients to privately comment on any page, and to see other comments from the developers or clients). I think a system like that, extended in a social way (so I put comments into groups, and my friends are members of groups, and I see all comments from groups I am a member of) would be better.
Let me know if you want a me.dium beta invite. It's not going to be very interesting unless a bunch of people use it.
Posted in Web | 1 comment | no trackbacks
Posted by tibbetts
Fri, 29 Sep 2006 02:51:17 GMT
Went to a talk at MIT tonight, by Stephen C Johnson, a researcher at The Mathworks. The talk was sponsored by the Greater Boston ACM. The paper, Algorithms for the 21st Century, was presented at Usenix 2006. The general thrust was how computer architecture has drifted from the idealized model used in most algorithms classes, in significant ways. The major focus was on the memory subsystem, and how caching can affect algorithmic performance. By looking at the behavior of very simple algorithms, we can see fairly complex variations in performance, by factors as high as 60 times worse. This in turn implies that the performance of complex algorithms will be even more difficult to characterize.
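To give a feel for how a trivial algorithm can behave so differently, here is a toy cache simulation. It does not reproduce the experiments from the talk. It assumes a tiny direct-mapped cache (16 lines of 8 elements each, both numbers invented) and counts misses when walking the same 64x64 array in row order and in column order. The function name `count_misses` is hypothetical.

```python
def count_misses(addresses, num_lines=16, line_size=8):
    """Count misses in a toy direct-mapped cache (sizes are assumptions)."""
    lines = [None] * num_lines  # which memory block each line holds
    misses = 0
    for addr in addresses:
        block = addr // line_size
        slot = block % num_lines
        if lines[slot] != block:
            misses += 1
            lines[slot] = block
    return misses


ROWS = COLS = 64
# The array is stored row by row, so element (r, c) lives at r * COLS + c.
row_order = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
col_order = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

row_misses = count_misses(row_order)
col_misses = count_misses(col_order)
print(row_misses, col_misses)
```

Both loops touch exactly the same 4096 elements, but in this model the row-order walk misses once per cache line while the column-order walk misses on every access, because each step down a column lands on a block that evicts the one needed two steps later. Real memory hierarchies have several levels, associativity, and prefetching, which is part of why real behavior is so much harder to predict than this.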
One big take-away is that the built-in cache management schemes of modern memory systems are pessimal for certain algorithms. This actually sounds quite a bit like the complaints against operating system disk buffer/virtual memory management by database authors. Database algorithms, because they often have large storage access requirements that can be characterized in advance, benefit from managing their own buffers. Similarly, it seems like scientific computing (eg, Matlab), which wants to manipulate gigabytes of memory in very structured ways, would benefit from direct management of caching. Of course, cache management needs to run at processor-speed, unlike disk buffer management, so writing heavyweight algorithms is inappropriate. Unfortunately, the talk did not discuss any recommendations for extensions to enable management. I wonder if something as simple as being able to mark a page as no-cache in the TLB would be sufficient.
My other take-away is that computer science researchers don't understand computers nearly enough. The presenter, while familiar with processor design and memory architecture, wasn't really able to explain the behavior of these simple algorithms in their entirety. I'm always bothered when I encounter this lack of understanding. On some level it seems that it should be possible to fully understand computer systems. After all, they are developed by humans, and to a first approximation their behavior is deterministic and well defined. Unfortunately, they change rapidly, and so any abstract models we develop are shortly invalidated. Fundamentally, this is what makes computer science difficult.
Posted in Computer Science | no comments | no trackbacks