kdb+ now available for free download

Posted by tibbetts Sat, 12 Apr 2008 14:05:24 GMT

If you've talked to me about programming languages and Wall Street in the last 4 years, I've probably mentioned kx. This is a company which makes a combination programming environment and database based on a language called q which is derived from APL. (Yes, APL, the language invented in 1957 before there was a computer to run it on.) And this environment is in turn used by many of the top quants on Wall Street (and other parts of the financial world) for both research and production systems. Becoming a kx programmer is a good way to double your salary and quadruple your job security.

Well, it's been going around my corner of the blogosphere that kdb+ is now free for personal use. I first heard about it from Marc Adler. You can go download it from the kx download page. This represents a big step towards openness, which I think will be good for everyone.

The q environment is impressive, you have to give them that. There is an emphasis on brevity; the OSX binary of kdb+ is only 227K. That's smaller than the ncurses library it ships with. And brevity doesn't stop there. Utterances in the language are well known for their complexity and impenetrable internal logic. A lot of q code makes obfuscated perl look clear and verbose. It doesn't help that the culture of kx programmers discourages commenting. Error handling is tricky at best, and modularity and maintainability are in short supply. For confusion, q adds a bunch of SQL keywords on top of the previous language, k, in an almost but not quite fully compatible way.

But for all the faults you can kind find some really interesting features in q. And if nothing else, it is an example of a novel programming language, tightly integrated with a data management system, finding commercial success, which is always nice to see. And if you learn it, you might find a sweet job on wall street as a result.

Posted in Computer Science | no comments

When neither a column store nor a row store is the answer

Posted by tibbetts Sat, 12 Apr 2008 02:34:00 GMT

A few days ago I found myself giving database advice to a friend with a new startup. His problem is a pretty common one: he has a very large corpus of data, over which he will run compute-intensive proprietary algorithms. Both the data and the computation will require a cluster of machines. He has a prototype based on Postgres. His question to me: should I continue to use a row store (Postgres) or should I use new technology, specifically a column store?

If you read the title of the post, you can already guess that my answer was neither.

Read more...

Posted in Computer Science | no comments

Google commences another assault on the traditional database community

Posted by tibbetts Tue, 08 Apr 2008 13:32:19 GMT

Many in the blogosphere noticed when Stonebraker and Dewitt at The Database Column took offense at the idea that map-reduce is the solution to many of life's problems. The idea that a simple idea, promoted by a services company, can blow away 20+ years of distributed database research, bothers them for some reason. Sure, it may not be the best solution to every problem, but it is a sufficient solution to many problems.

Google is quietly at it again, this time with the Google AppEngine.

Read more...

Posted in Computer Science | no comments

Transactional Memory Not Solution To All Problems

Posted by tibbetts Sat, 02 Dec 2006 01:10:19 GMT

I was at MIT today and so I ended up going to an invited talk on computer architecture, Subtle Semantics and Unrestricted Implementation of Transactional Memory. Transactional Memory is a very hot topic in systems and architecture. It is perceived to be a better model for programmers, so language designers like it. And there are a variety of options for pure-software and hardware-assisted implementations. And because it enables optimistic concurrency control, transactional memory can help make programs faster and more scalable on new multi-core architectures. There is every reason to believe that processor vendors will begin including some form of hardware support for transactional memory.

The focus of the talk was on the subtle issues that come from this. It dismissed a few of the more positive myths about transactional memory:
  • Programs using locks cannot be trivially converted to use transactions. In fact, a correctly behaving program using locks to manage things like inter-thread communication can easily deadlock or livelock when using transactions.
  • Transactions are not perfectly composable or nestable, as is often claimed. This means heirarchical or nested transactions, where there is a program-level transaction wrapping several library-level transactions, can lead to deadlock or livelock.
  • Whether a transactional system is weak (non-transactional code executes concurrently with transactional code, with side effects visible between) or strong transactional (a transaction is atomic not just from the perspective of other transactions, but also to other non-transactional code) can have a significant effect on program behavior. Not only can a program designed for strong transactions have problems under weak transactions, but a program that behaves correctly in a weak transactional system may deadlock in a strong transactional system.
This leads me to two primary conclusions, with which I think the speaker would agree: First, transactional memory can have significant confusing side effects, just like locks, and so it is not a solution to the difficulties of multithreaded programming. Second, if processor vendors implement, and many programmers use, weak transactions then we may never get strong transactions.

I'd like to take this one step further. I think language designers and computer architects who are excited about transactional memory are missing the point. They are trying to create a single concurrency control mechanism that can solve everyone's problems. In fact, I think there are two separate classes of problem that are best addressed separately: Concurrency control for systems programming, and concurrency control for application programming.

Systems programmers are close to the hardware, inside or right on top of the operating system. They are the people who are most comfortable with the concurrency control mechanisms of today (pthreads and java.util.concurrent). Transactional Memory should be targetting these users. The goal is not to give them an easy way to do concurrency, but to make optimistic concurrency control possible and efficient. Continuing to discuss the merits of strong and weak transactions is reasonable, but it should be in the context of what most efficiently represents the capabilities of the hardware.

Application programmers generally work on top of a platform, and don't have to think about concurrency in the same way. if they don't work on a platform then they should. For example, application programmers are often writing business applications as web services on top of J2EE. By using platforms, they can work with much higher level management and concurrency control. The most common idiom for concurrency control is the database transaction (or distributed transaction). It is able to handle most concurrency problems in traditional RPC-and-datastore systems.

To help application programmers, we need to develop additional idioms. Not all programs can utilize the J2EE style of platform. I expect to see a significant advance in scientific computing tools like MATLAB to take advantage of multiprocessor systems (in my ideal world it would be by vectorizing based on static and dynamic analysis, but I digress). Furthermore, there are a large set of event processing (stream processing, complex event processing) which are also not well served by J2EE-type platforms and database-transactions. Which is why I work on StreamBase (and on StreamSQL), which is creating next generation programming models for those kinds of applications, to enable parallelization, managability, and scalability.</shameless-plug>

Posted in Computer Science | no comments | no trackbacks

Algorithms for the 21st Century

Posted by tibbetts Fri, 29 Sep 2006 02:51:17 GMT

Went to a talk at MIT tonight, by Stephen C Johnson, a researcher at The Mathworks. The talk was sponsored by the Greater Boston ACM. The paper, Algorithms for the 21st Century, was presented at Usenix 2006. The general thrust was how computer architecture has drifted from the idealized model used in most algorithms classes, in significant ways. The major focus was on the memory subsystem, and how caching can effect algorithmic performance. By looking at the behavior of very simple algorithms, we can see fairly complex variations in performance, by factors as high as 60 times worse. This in turn implies that the performance of complex algorithms will be even more diffi
cult to characterize.
One big take-away is that the built-in cache management schemes of modern memory systems are pessimal for certain algorithms. This actually sounds quite a bit like the complaints against operating system disk buffer/virtual memory management by database authors. Database algorithms, because they often have large storage access requirements that can be characterized in advance, benefit from managing their own buffers. Similarly, it seems like scientific computing (eg, Matlab), which wants to manipulate gigabytes of memory in very structured ways, would benefit from direct management of caching. Of course, cache management needs to run at processor-speed, unlike disk buffer management, so writing heavyweight algorithms is inappropriate. Unfortunately, the talk did not discuss any recommendations for extentions to enable management. I wonder if something as simple as being able to mark a page as no-cache in the TLB would be sufficient.
My other take-away is that computer science researchers don't understand computers nearly enough. The presenter, while familiar with processor design and memory architecture, wasn't really able to explain the behavior of these simple algorithms in their entirety. I'm always bothered when I encounter this lack of understanding. On some level it seems that it should be possible to fully understand computer systems. After all, they are developed by humans, and to a first approximation their behavior is deterministic and well defined. Unfortunately, they change rapidly, and so any abstract models we develop are shortly invalidated. Fundamentally, this is what makes computer science difficult.

Posted in Computer Science | no comments | no trackbacks