Let's Build the Next Generation Database Systems Together

Smart people should build things. – Andrew Yang

I will start as an assistant professor of Computer Science at UC Santa Barbara in Fall 2020. I am actively looking for PhD students and postdocs who are interested in doing research at the intersection of (database) systems, formal methods, and applied crypto. Please apply to UCSB CS and also send me an email at shumo@ucsb.edu.

I am writing to convince you to join me on the journey of building the next generation of database systems.

The History

My research mission is to build the next generation of database systems. Before revealing my vision of what the next generation of data systems will look like, let’s take a glimpse at the history of database systems (I am using the term “database systems” in a broad sense, since in my view we should take a broad perspective instead of defining the area narrowly). In one sentence, database systems are an AMAZING achievement of computer science by whatever metric you may propose. I will list just two here:

Behind these numbers are tremendous technical advances that deeply affect our everyday lives. We use some of these advances every day and tend to take them for granted. But in retrospect, they are exhibitions of the highest level of human intellect. I will give two examples here.

The first example is the relational data model. The discovery of the relational data model, together with the notions of physical and logical data independence, is the cornerstone of modern database systems. Looking back, it is still a marvelous discovery. There are many mathematical abstractions that could be used to describe the organization of data. Why is this the one? I could write an essay on this topic alone. But in short, it is a beautiful abstraction that achieves three things at the same time: elegance, mathematical rigor, and practicality. That is almost a mission impossible. Not many things in the history of computer science can match it.

The second example is transactions. Again, we are so used to them that we usually overlook them. They are actually a non-trivial invention. The essence of transactions is to ask the database system to provide a fundamental semantic guarantee to end users and application developers, so that they don’t have to worry about nitty-gritty details such as concurrency, durability, and efficiency. The invention of transactions makes building reliable information systems much easier. I am not sure what you think, but I definitely don’t want to live in a world without database transactions.

Well, these are just two examples. Numerous extraordinary works have been done on many other topics, such as query languages, query processing and optimization, and system architecture.
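Going back to the transaction example for a moment, here is a minimal sketch of what that guarantee buys an application developer. It assumes Python's built-in sqlite3 module; the accounts table and the helper names (open_bank, transfer, balances) are hypothetical and exist only for illustration. It shows atomicity on a single connection only, not concurrency or durability.

```python
import sqlite3


def open_bank():
    """Create an in-memory database with a hypothetical `accounts` table."""
    conn = sqlite3.connect(":memory:")
    with conn:  # commit the setup as one transaction
        conn.execute(
            "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 20)]
        )
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst`: both updates happen, or neither does."""
    with conn:  # commits on normal exit, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        (remaining,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if remaining < 0:
            # The debit above is undone by the rollback; no manual cleanup needed.
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


if __name__ == "__main__":
    bank = open_bank()
    transfer(bank, "alice", "bob", 30)
    print(balances(bank))
    try:
        transfer(bank, "alice", "bob", 500)  # fails halfway through
    except ValueError as err:
        print("rejected:", err)
    print(balances(bank))  # same as before the failed transfer
```

The application code never has to undo the half-finished debit itself. The database rolls it back, which is exactly the kind of detail that transactions take off the developer's plate.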

The Mission: Not Just Fast Database Systems

One natural question to ask is, what next?

Well, if you are following the research trends, you may naturally come up with some buzzwords. I think the buzzwords right now are (database) systems for AI and AI for systems. Let’s look at them one by one. Sure, I fully agree that a database system is no longer a solution by itself but a part of the solution in a bigger system that incorporates AI components. However, as researchers, we should really ask ourselves: what value do database systems add to the bigger AI system stack? In my view, almost none, period. As for incorporating AI techniques into database systems, I do think there is much great work to be done, especially on workload-specific tuning of database systems. We definitely need to take over most (if not all) of a DBA’s job by doing it much better.

I have a different view of future database systems than these buzzwords suggest.

Looking at the big picture, over the last five decades, ever since the introduction of the relational data model in the 70s, the theme of database research has always been making database systems faster and more scalable. One question to ask is: is this the only property, or the only dimension of properties, that we care about?

Clearly not!

During my PhD, I spent 5 years working on the foundations (formal semantics) and tools for automated reasoning about database queries. To my surprise, in a field with such strong mathematical foundations, the metatheory and tools that would allow computer-aided reasoning about database queries were still lacking. To address this problem, my collaborators and I developed the axiomatic foundations for equational reasoning about database queries and built Cosette, the first automated reasoner for SQL queries. Cosette can help you verify the correctness of optimization rules, validate your query rewrites, and automate the grading of homework in database classes. While developing Cosette, we actually developed machine-checkable proofs for many query rewrite rules (including sophisticated ones such as magic set rewrites) from both research papers and production systems for the first time! All these authors should sleep more soundly :)

However, I am not done here. An even more ambitious goal is to build a high-performance and provably correct database system that has end-to-end correctness guarantees (i.e., machine-checkable proofs). A modern database system is really a beast of all kinds of complexity, both in its specification and in its implementation. Inspiring researchers have built a provably correct C compiler (CompCert), a crash-safe file system (FSCQ), operating systems (seL4, Hyperkernel), and an information flow control system (Nickel). It is time to build a high-performance, end-to-end verified database system.

Besides provable correctness, I would argue that many other important properties of database systems need to be provided by construction. For example, there has been huge interest recently in cryptocurrencies and blockchain systems. Looking closely, a blockchain system is just a distributed transactional database system with minimized trust. In fact, a blockchain system is orders of magnitude slower than a relational database system such as MySQL or Postgres. The first generation of blockchain systems, such as Ethereum or Bitcoin, is about 5 to 6 orders of magnitude slower than your MySQL. The newer generation of blockchain platforms, such as Algorand and Cardano, is already pushing the boundaries of what a consensus protocol can achieve (not far from the theoretical limits determined by network bandwidth and latency, see a detailed analysis here), yet is still 2 to 3 orders of magnitude slower than a centralized database system. However, they do provide a property that a centralized system could never provide: the freedom of not trusting any single entity. In retrospect, blockchain systems are a new kind of next generation database system that has already arrived.

Besides trust minimization, another important property that database systems need to provide by construction is data privacy, or more broadly speaking, respect for the ownership of data. In general, the user needs to be the owner of the data, rather than the database system administrators or any kind of service provider such as Spotify or Netflix. Any privacy guarantee or data permission policy needs to be enforced by construction rather than by relying on the service provider’s goodwill or a regulator’s audit. First, service providers are not incentivized to respect the privacy and ownership of data, since they can extract a lot of economic value from user data. Also, modern software systems are so complicated that it is almost impossible for regulators to audit them line by line. Last, even if audits are required, data leakage can still happen due to adversarial compromise of these software systems. This is really a tough problem to solve. I believe that we need to leverage advancements in both formal methods and cryptography to build the next generation of database systems, providing much better privacy and data ownership guarantees by construction.

The Qualities

An applicant with the following qualities would be appreciated (you don’t need to have them all, but ideally you should have at least one):

And yes, your GRE/TOEFL/IELTS score doesn’t matter. However, if you have a below-average GRE/TOEFL/IELTS score, please send me an email, just in case the system automatically rejects you because of the score.

The Culture

A PhD is hard. I think it is my responsibility to build a healthy group culture so that you can enjoy this journey. More concretely, these are the 3 things that I really care about: